WallStreetBets (WSB) is a subreddit where participants discuss stock and option trading. The subreddit's culture is notable for its profanity, calling its followers “autists,” “retards,” and “apes.” It is also a place where large losses and gains alike are applauded.
On January 22, 2021, WSB initiated a short squeeze on GameStop (GME) after noticing heavy short-selling activity in the stock. The sudden popularity of GME drove its price up significantly. A series of posts by user u/DeepFuckingValue on his weekly gains also helped generate interest in the stock.
In the following week, the short squeeze on GME came to be seen as a class war between the working class and hedge fund managers. Following tweets from Elon Musk, GME reached its all-time high of $345 on January 27th.
In January, I bought a few shares of GME after noticing its popularity on the WallStreetBets subreddit. After having taken a number of finance classes at Wharton, I was especially intrigued by the concept of market efficiency. Without a doubt, WallStreetBets has played a tremendous role in the historical event of GameStop's short squeeze. With my elementary knowledge of R, I decided to embark on a project that involves natural language processing and sentiment analysis to further understand this subreddit community and the role it played in driving GameStop's popularity.
- How can I write a post that is more likely to be popular on WallStreetBets?
- How does GameStop’s sentiment relate to its price?
- How has the sentiment towards GameStop changed over time?
- How is GameStop being discussed by news channels such as CNBC?
With my analysis, I hope to help others understand the common themes in popular WallStreetBets posts, and whether WallStreetBets is a reliable source of investment advice.
I decided to use a combination of datasets from Kaggle and scraped historical stock prices. In order to perform an LDA, I also scraped the most recent 150 articles from CNBC on GameStop.
Some variables in the Kaggle dataset:
- score: net votes on a post (upvotes minus downvotes) given by readers
- id: unique to each post
- timestamp: written in mdyhs format
I decided to scrape CNBC as the news source for GameStop articles because its format is consistent and easy to navigate. Ideally, I would also scrape other media sources to avoid the sentiment bias that comes from relying on a single news source.
- Data Sources: Reddit (via Kaggle), Yahoo Finance, CNBC
- Data Scraping: WebScraper, SelectorGadget
- Packages Used: readr, magrittr, quantmod, dygraphs, tidyverse, rvest, stringr, ggplot2, tm, wordcloud2, and more
With the data extracted from Reddit and CNBC, I needed to go a step further to refine the data in order to run an analysis. The steps I took to process them are as follows:
- remove numbers
- remove punctuation
- convert to lower case
In some cases, I also added custom stopwords such as “stock” and “GME.” These words appeared far more often than any others, and removing them gave me a better sense of the text surrounding them.
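The cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the exact pipeline used for this project (which relied on packages such as tm). The helper name clean_text, the sample titles, and the short stopword list are all assumptions made for the example.

```r
# Hypothetical helper: applies the cleaning steps listed above to a character vector.
clean_text <- function(x, extra_stopwords = character(0)) {
  x <- tolower(x)                       # convert to lower case
  x <- gsub("[0-9]+", " ", x)           # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)     # remove punctuation
  if (length(extra_stopwords) > 0) {
    # Drop whole-word matches of the custom stopwords.
    pattern <- paste0("\\b(", paste(tolower(extra_stopwords), collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x)
  }
  x <- gsub("\\s+", " ", x)             # collapse repeated whitespace
  trimws(x)
}

# Made-up titles, for illustration only.
titles <- c("GME to 1000!!! Buy and HOLD the stock", "I like the stock.")
cleaned <- clean_text(titles, extra_stopwords = c("stock", "gme"))
print(cleaned)
```

Lower-casing first means the custom stopwords only need to be matched in one case.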
Since the original timestamp on Reddit posts was in mdyhs, I also transformed it in a number of ways to see trends on a weekly and monthly basis.
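One way to derive monthly and weekly labels from such a timestamp is sketched below in base R. The exact timestamp layout is an assumption (month/day/year followed by hour:minute:second), and the sample values are made up for illustration.

```r
# Assumption: timestamps look like "01/28/2021 21:37:41".
raw_ts <- c("01/04/2021 09:15:00", "01/28/2021 21:37:41", "02/01/2021 14:02:10")

# Parse in UTC so results do not depend on the local time zone.
ts <- as.POSIXct(raw_ts, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Month label such as "2021-01".
month_label <- format(ts, "%Y-%m")

# Week of the year, counting days 1 to 7 of the year as week 1.
day_of_year <- as.POSIXlt(ts)$yday   # 0-based day of the year
week_num <- day_of_year %/% 7 + 1

print(data.frame(raw_ts, month_label, week_num))
```

Under this week definition, February 1, 2021 falls in week 5.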
1) Understanding the WallStreetBets Community
I wanted to start by examining the WallStreetBets community at a high level. A couple of questions I had in mind were:
- How has the popularity of the community changed over time?
- Which days are people most active on the subreddit?
- What are the most popular words used in WSB posts?
Popularity Over Time
The graph below shows the change in the number of posts over time. The dates with the highest number of posts coincide with the days on which we also observed a huge spike in GME prices.
This is in part due to the feedback cycle generated in the GME short squeeze. The cycle begins with retail investors sharing a “DD” (due diligence) post on why they believe a certain stock is undervalued. Bots and other methods of generating attention on WSB then encourage more retail investors to buy, and the rising demand leads to higher stock prices.
The trendline shows a negative relationship between the number of posts and the passage of time.
The graph below was made using the quantmod and dygraphs packages in R. As we can see, the trend in GME's stock price follows a shape similar to that of the number of posts on WallStreetBets over time. The spike observed on Jan. 29th, 2021 follows the highest number of posts on WSB, on Jan. 28th, 2021.
Days Most Active
Next, as part of my overall goal to understand how to create a popular post on WSB, I wanted to see which days are the most popular for posting on Reddit. My reasoning is that posts typically have the highest probability of becoming popular within the first day of posting. If I want to create a popular post, I would need to post on a day when most people are active, indicated by the day of the week with the most posts.
My graph shows that the most posts are made on Fridays. This makes intuitive sense since most earnings reports are announced on Fridays. Furthermore, Fridays are generally a common day for discussing whether to buy or sell stocks because there is so much volatility going into the weekend.
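Counting posts by day of the week can be sketched in base R as shown below. The dates are made up for illustration; in practice they would come from the parsed timestamp variable.

```r
# Toy post dates (hypothetical values).
post_dates <- as.Date(c("2021-01-25", "2021-01-27", "2021-01-29",
                        "2021-01-29", "2021-02-05", "2021-02-06"))

# POSIXlt$wday is 0 for Sunday through 6 for Saturday, independent of locale.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
weekday <- factor(day_names[as.POSIXlt(post_dates)$wday + 1], levels = day_names)

# Number of posts on each day of the week, including days with zero posts.
posts_per_day <- table(weekday)
busiest_day <- names(posts_per_day)[which.max(posts_per_day)]

print(posts_per_day)
print(busiest_day)
```

Using a factor with fixed levels keeps the days in calendar order rather than alphabetical order, which makes the resulting bar chart easier to read.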
The trend in sentiment score is harder to detect because there is so much variability across posts. However, it appears to decline over time overall, with the highest scores at the end of January or the beginning of February.
I created a word cloud of the most popular words used in WallStreetBets titles. Given the amount of amateur investment advice in this subreddit community, the words “buy” and “hold” have made their way to the top of the popularity list. Note: I removed words such as GME and stock because I wanted to see what other words are popular, as we already know from the earlier analysis that GME posts drove WSB to its all-time high user activity.
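The word counts behind a word cloud like this can be sketched in base R. The titles and the excluded-word list below are made up for illustration, and the titles are assumed to be cleaned already.

```r
# Toy titles (hypothetical), assumed to be lower-cased and stripped of punctuation.
titles <- c("buy and hold gme", "hold the line", "buy more gme stock", "hold")

# Words excluded so they do not dominate the counts (illustrative list).
excluded <- c("gme", "stock", "the", "and")

words <- unlist(strsplit(titles, " ", fixed = TRUE))
words <- words[!(words %in% excluded) & nzchar(words)]

# Frequency table sorted from most to least common.
word_freq <- sort(table(words), decreasing = TRUE)
freq_df <- data.frame(word = names(word_freq), freq = as.integer(word_freq))
print(freq_df)
```

A two-column data frame of words and frequencies like freq_df is the kind of input the wordcloud2 package expects.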
2) Visualizing the GameStop impact
This section is similar to the analysis of the WallStreetBets community, but this time I wanted to focus specifically on posts related to GameStop. To accomplish this, I kept only the posts that mentioned “GME” or “GameStop.” In this section of the analysis, some questions I hope to answer are:
- What are the words most used in positive sentiment posts? How are they different from those in negative sentiment posts?
- What are the words most positively correlated with the popularity of a post (measured by the score variable from Reddit)?
- How has the sentiment towards GameStop stock changed over time?
- Can we categorize GameStop-related articles into topics?
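The GameStop filter described at the start of this section can be sketched in base R as follows. The data frame and its title column are hypothetical stand-ins for the Kaggle data.

```r
# Toy data frame (hypothetical); the title column name is an assumption.
posts <- data.frame(
  id = c("a1", "a2", "a3", "a4"),
  title = c("GME to the moon", "AMC squeeze next?",
            "Why I hold gamestop", "Segment revenue thread"),
  stringsAsFactors = FALSE
)

# Case-insensitive, whole-word match on either term.
is_gme <- grepl("\\b(gme|gamestop)\\b", posts$title, ignore.case = TRUE)
gme_posts <- posts[is_gme, ]
print(gme_posts)
```

The word boundaries keep unrelated words that happen to contain the letters "gme", such as "segment", from being counted as GameStop mentions.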
Most used words by sentiment
Wow, GME! An interesting thing to note is how common phrases used in WSB can be pieced together by this word cloud. I’m thinking phrases like “apes strong,” “just hold,” and “we like the stock.”
The logical next step is to look at the most common words used in posts with negative sentiment. As expected, we see more profanity on this list. It's funny to see how Robinhood has made its way to the top of the list. This is likely due to the company's decision to restrict the trading of stocks such as GameStop, AMC, and BlackBerry on its platform in an effort to “protect” retail investors from the high volatility of these stocks.
Words most positively correlated with the popularity
If you're wondering which words are most positively correlated with the popularity of a post, here is a shortlist of the top 10:
 "yolo" "today" "gamestop" "like" "market" "squeeze"
 "time" "price" "bought" "buying"
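One way to rank words by their correlation with post score is sketched below in base R. This is an assumed approach for illustration, not necessarily the one used to produce the list above, and the titles and scores are made up.

```r
# Toy data (hypothetical): cleaned titles and their Reddit scores.
titles <- c("yolo gme today", "bought gme", "yolo squeeze",
            "sold gme", "market update", "yolo today")
score  <- c(900, 120, 750, 40, 60, 820)

tokens <- strsplit(titles, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Indicator matrix: one row per post, one column per word (1 if the word appears).
dtm <- sapply(vocab, function(w) as.integer(sapply(tokens, function(tk) w %in% tk)))

# Correlation of each word's presence with the post score, highest first.
word_cor <- sort(apply(dtm, 2, function(col) cor(col, score)), decreasing = TRUE)
print(head(word_cor, 3))
```

With real data, words that appear in only a handful of posts can produce unstable correlations, so it may help to keep only words above a minimum frequency before ranking.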
Now, just for fun, here is a title I would write which in theory would likely become one of the most popular posts on WSB!
“Yolo time- today I bought GameStop at market price. HOLD the squeeze”
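A quick linear regression shows that more positive sentiment is associated with more comments on a post. The estimated slope is small, about 0.057 additional comments per one-point increase in the sentiment score, but it is statistically significant.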
Call:
lm(formula = comms_num ~ emotion, data = reddit_wsb_sentiment_gme)

Residuals:
   Min     1Q Median     3Q    Max
 -4190    -51    -44    -15  44299

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.256479  13.911169   3.756 0.000175 ***
emotion      0.056600   0.001598  35.430  < 2e-16 ***
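For readers who want to reproduce this kind of regression, here is a minimal sketch on toy data. The variable names comms_num and emotion mirror the call above, but the data frame reddit_toy and its values are made up for illustration.

```r
# Toy data (hypothetical values); column names mirror the regression call above.
reddit_toy <- data.frame(
  emotion   = c(-200, -50, 0, 100, 400, 900, 1500),
  comms_num = c(30, 45, 60, 52, 80, 110, 140)
)

# Regress the number of comments on the sentiment score.
fit <- lm(comms_num ~ emotion, data = reddit_toy)

# Coefficient table with Estimate, Std. Error, t value, and Pr(>|t|) columns.
coefs <- coef(summary(fit))
print(coefs)

# Predicted comment count for a post with a sentiment score of 500.
pred <- predict(fit, newdata = data.frame(emotion = 500))
print(pred)
```

Applying the estimates reported above, a post with a sentiment score of 100 would be predicted to receive about 52.26 + 0.0566 × 100 ≈ 57.9 comments.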
Sentiment on GME changed over time
Beyond the word clouds, I wanted to see how sentiment on GME has changed over time. I converted the timestamp variable to a weekly basis, with week 1 being the first week of 2021. Similar to the number of posts and the stock price of GME, positive sentiment reached its peak in week 5 (the first week of February).
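The weekly aggregation can be sketched in base R as shown below. The data are made up, and averaging the sentiment score within each week is an assumption about how the weekly values are summarized.

```r
# Toy data (hypothetical): week number of each post and its sentiment score.
weekly <- data.frame(
  week    = c(3, 3, 4, 4, 4, 5, 5, 6),
  emotion = c(10, 20, 40, 60, 50, 90, 70, 30)
)

# Average sentiment per week.
mean_by_week <- tapply(weekly$emotion, weekly$week, mean)

# Week with the highest average sentiment.
peak_week <- as.integer(names(mean_by_week)[which.max(mean_by_week)])

print(mean_by_week)
print(peak_week)
```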
Article topics on GME
I scraped the URLs of the 150 most recent articles on CNBC related to GameStop. Then, I scraped the content of each article from its URL using R. With the content of the articles, I ran an LDA to divide the words into five topics. My suggested names for the topics are Meme Stocks, GameStop Corporate Finances, WallStreetBets Effect, Historical Performance, and GameStop Corporate Strategy.
      Topic 1     Topic 2     Topic 3    Topic 4    Topic 5
 [1,] "the"       "the"       "stock"    "said"     "said"
 [2,] "share"     "company"   "shares"   "the"      "social"
 [3,] "per"       "shares"    "million"  "stock"    "added"
 [4,] "premarket" "buy"       "said"     "year"     "clients"
 [5,] "revenue"   "premarket" "gamestop" "last"     "media"
 [6,] "bitcoin"   "million"   "trading"  "trading"  "new"
 [7,] "dogecoin"  "share"     "retail"   "gamestop" "first"
 [8,] "shares"    "stock"     "model"    "monday"   "investing"
 [9,] "also"      "rose"      "top"      "also"     "one"
[10,] "said"      "price"     "can"      "around"   "stock"
Through this analysis, we learned what drives the popularity of posts on WallStreetBets, from the day of the week to post to the common words to use and the general sentiment. We visualized similar patterns in GameStop's stock price, the number of posts on WSB, and the sentiment on GameStop on WSB. Through the sentiment analysis, we captured the words most used in positive and negative sentiment posts on Reddit, as well as the trend of decreasing sentiment scores on GameStop. Lastly, analyzing the CNBC article content revealed the topics that GameStop is most associated with.
A couple of extensions beyond this analysis:
- Visualizing the “Elon Musk” effect and incorporating Twitter data
- Applying statistical models such as the negative binomial distribution (NBD) to predict the number of posts
- Examining changes in related meme stocks such as AMC and Nokia
- Analyzing the emojis on WallStreetBets
What will be the next GameStop? Keep an eye out for any overly confident posts with excessive mention of buy and hold.
Lin Jia Chen is a student in OIDD 245, Analytics and the Digital Economy, with Professor Tambe. She is trying to make sense of this world with her limited knowledge of R.