What’s That YouTube Channel About? Find Out Using NLP

Miloni Mittal
5 min read · Jul 23, 2020

I love watching YouTube. If I like a particular YouTuber, I binge-watch all their videos, no matter what. But everyone hits that point where they want a break from the monotony and want to watch something new. The same happened to me, and that's when I realised: there are so many YouTubers out there, so how is it possible that I haven't found a new one? I then remembered that plenty of my friends had recommended channels. But I am a bit lazy; I didn't want to Google each of these YouTubers and then check whether people liked them or what their channels are about. Moreover, blogs about them would be biased. What's the one place where you can find the opinions of many viewers? The comments section!

Photo by NordWood Themes on Unsplash

The YouTube comments section is one place where people write their unfiltered opinions about a channel. There is a lot of discussion and fan-girling going on there. What if I wrote a program to extract these comments from videos and tried to understand what the channel is about without actually visiting it?

The first step was to extract the comments. I wanted to extract the first 100 comments from each of the latest 10 videos on a YouTuber's channel. For this, I used the YouTube API. I also extracted the number of subscribers, as I feel it is a good measure of whether people like the channel.
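
A minimal sketch of this step might look as follows, using the YouTube Data API v3 through the google-api-python-client package. The API key and channel ID here are placeholders, and the exact structure is an illustration of the approach rather than my exact code:

```python
# Sketch: fetch subscriber count, latest 10 video ids, and up to 100 top-level
# comments per video. YOUTUBE_API_KEY and channel_id are placeholders.
from googleapiclient.discovery import build

YOUTUBE_API_KEY = "YOUR_API_KEY"           # assumption: your own API key
channel_id = "UCxxxxxxxxxxxxxxxxxxxxxx"    # hypothetical channel id

youtube = build("youtube", "v3", developerKey=YOUTUBE_API_KEY)

# Subscriber count for the channel
channel = youtube.channels().list(part="statistics", id=channel_id).execute()
subscribers = channel["items"][0]["statistics"]["subscriberCount"]

# Latest 10 video ids from the channel
search = youtube.search().list(
    part="id", channelId=channel_id, order="date", type="video", maxResults=10
).execute()
video_ids = [item["id"]["videoId"] for item in search["items"]]

# First 100 top-level comments from each video
comments = []
for vid in video_ids:
    response = youtube.commentThreads().list(
        part="snippet", videoId=vid, maxResults=100, textFormat="plainText"
    ).execute()
    for item in response["items"]:
        comments.append(
            item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        )
```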

After this, we need to process all the comments to understand what people generally talk about in the comments section. I divided this task into three parts:
1. Listing out what positive or negative things people have to say
2. Listing out other things that people talk about. This would help in understanding what the most recent videos are about.
3. Building a word cloud for better interpretation of the data

Step 1: Pre-processing

The first step was to bring all the text to common ground. I performed the following steps on all the comments (a rough sketch of the pipeline follows the list):

  • Converted everything to lower case
  • Removed all punctuation
  • Removed all emojis
  • Since we are not analyzing each comment separately, there is no harm in combining all the comments into one piece of text (let's call it joined_comments)
  • Removed all stop words such as "the", "an", and "a", since they play no role in determining sentiments and topics of discussion
  • Retained only those words which are in the dictionary, because a lot of slang (such as "osum", "ur", "gud") is used in the comments section
  • Tokenized the words
  • One thing that I haven't done here (but is usually done in natural language processing) is lemmatization. This is because in a later section I use the same data to obtain bigram collocations, and if I lemmatized the words, some bigrams would not make as much sense as the original phrase. For example, "time management" versus "time manage", "machine learning" versus "machine learn", and so on.
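
Putting these steps together, a rough sketch of the pipeline could look like this, assuming NLTK with its punkt, stopwords, and words corpora downloaded; the ordering of tokenization and filtering here is just one reasonable choice:

```python
# Sketch of the pre-processing pipeline. One-time setup:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('words')
import string
from nltk.corpus import stopwords, words
from nltk.tokenize import word_tokenize

english_words = {w.lower() for w in words.words()}
stop_words = set(stopwords.words("english"))

def preprocess(comments):
    # Combine all comments into one piece of text and lower-case it
    joined_comments = " ".join(comments).lower()
    # Strip punctuation
    joined_comments = joined_comments.translate(
        str.maketrans("", "", string.punctuation)
    )
    # Drop emojis (and any other non-ASCII characters)
    joined_comments = joined_comments.encode("ascii", "ignore").decode()
    # Tokenize, then keep only dictionary words that are not stop words
    all_tokens = word_tokenize(joined_comments)
    return [t for t in all_tokens if t in english_words and t not in stop_words]

tokens = preprocess(comments)  # `comments` comes from the extraction step
```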

Step 2: Understanding the emotion behind the comments

For this, the crux of the problem is to understand whether the emotion behind a comment is positive or negative. I first created a list of the most common words in joined_comments; I decided to keep the upper quartile of words by frequency (an admittedly arbitrary cutoff).

I then used VADER polarity scores (a lexicon- and rule-based sentiment tool developed at Georgia Tech) to understand whether each word is positive or negative. The "positive" index measures how positive a word is: the higher the value, the more positive it is. The "negative" index works similarly. By trial and error, I settled on a threshold of 0.3 (on the compound score) for a word to count as either extreme rather than neutral. This way I created two lists of words: positive words and negative words.
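
A minimal sketch of this word-level split, assuming the vaderSentiment package and the pre-processed tokens from the previous step, might look like:

```python
# Sketch: take the upper quartile of words by frequency, then split them into
# positive and negative lists using VADER's compound score with a 0.3 threshold.
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

counts = Counter(tokens)
sorted_words = [w for w, _ in counts.most_common()]
common_words = sorted_words[: max(1, len(sorted_words) // 4)]  # upper quartile

positive_words, negative_words = [], []
for word in common_words:
    compound = analyzer.polarity_scores(word)["compound"]
    if compound >= 0.3:
        positive_words.append(word)
    elif compound <= -0.3:
        negative_words.append(word)
```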

If I want to show the user 10 words depicting the viewers' sentiments, it would be wrong to show 5 positive and 5 negative words. Why? Because any comments section will have both kinds of comments: a higher number of negative comments implies a generally negative opinion, and the same goes for positive comments. So it is important to preserve the ratio of positive to negative opinions rather than giving both types 50% weightage each. To do this, I selected random words from the two lists, with the number taken from each list proportional to that list's size. For example, if there were 80 words in the positive list and 20 in the negative list, I chose 8 random words from the positive list and 2 from the negative list.
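
A small sketch of this proportional sampling, assuming the positive_words and negative_words lists built in the previous step:

```python
# Sketch: pick k words whose positive/negative split mirrors the two list sizes.
import random

def sample_sentiment_words(positive_words, negative_words, k=10):
    total = len(positive_words) + len(negative_words)
    if total == 0:
        return []
    n_pos = round(k * len(positive_words) / total)
    n_neg = k - n_pos
    return (
        random.sample(positive_words, min(n_pos, len(positive_words)))
        + random.sample(negative_words, min(n_neg, len(negative_words)))
    )

print(sample_sentiment_words(positive_words, negative_words))
```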

Step 3: Understanding what the users are talking about

The approach I followed here was to extract the top bigram collocations. Bigrams are two words that appear side by side in a corpus, and bigram collocations are bigrams that occur together more often than chance and make sense as a phrase, for example "red wine", "Andhra Pradesh", "fast computers", and so on. Basically, our aim is to find out whether two words occur together more often than we would expect by chance. For this, I used hypothesis testing.

A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things.

I have used chi-square testing here. I will not delve too deep into how the test works in natural language processing, as the details are explained beautifully here. But I will point out one example that the book mentions: "new companies" occurs 8 times as a bigram, whereas "new" and "companies" occur independently 15,820 and 4,667 times respectively, which is very large compared with the number of times they occur together. Intuitively, we can conclude that "new companies" just happens to occur together a few times and is not a collocation. The chi-square test puts this intuition on a mathematical footing by providing a threshold value to compare against.

Source: Introduction to Information Retrieval by C. D. Manning, P. Raghavan and H. Schütze
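
A sketch of the collocation step using NLTK's collocation utilities with the chi-square association measure, assuming `tokens` is the pre-processed token list, could look like:

```python
# Sketch: rank bigrams by the chi-square statistic and keep the top 10.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# Ignore bigrams that appear fewer than 3 times
finder.apply_freq_filter(3)
# Top 10 bigram collocations by chi-square score
top_bigrams = finder.nbest(bigram_measures.chi_sq, 10)
print(top_bigrams)
```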

Step 4: Providing a visualization

Data visualization helps in gaining insights from the data and communicating them to the user more efficiently. I built a word cloud from the unigrams of joined_comments so that the user can grasp the comments section at a glance. A word cloud is essentially a cluster of words whose font size and weight are based on how frequently each word appears: the higher a word's frequency, the bigger and bolder it appears in the word cloud.
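
A minimal sketch of the word cloud, assuming the wordcloud and matplotlib packages and the pre-processed tokens from earlier:

```python
# Sketch: build a word cloud from the unigrams and display it.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(width=800, height=400, background_color="white").generate(
    " ".join(tokens)
)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```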

Results

I ran the program for a few YouTubers and here are the results:

Results obtained

The GitHub link is here. And this concludes it! :)

Hi there! Thank you so much for giving this blog a read. You can reach out to me via mail (miloni.mittal@gmail.com) for any queries or for a small chit-chat. :)
