Analysing YouTube Comments — Stuff Made Here
Introduction

Shane Wighton’s YouTube channel Stuff Made Here is one of my favourite channels. It is an engineering-focused channel, where he makes videos about various innovative inventions. I have been watching his videos since he started back in March 2020. If you haven’t yet, I definitely recommend checking out his content.

On November 26, 2020, Stuff Made Here uploaded a video titled Making an unpickable lock. Calling locksmiths. In that video, Shane built a lock using interesting techniques, which a local locksmith was unable to pick. His wife suggested that he send it to LockPickingLawyer (a YouTuber popular for picking numerous types of locks). LockPickingLawyer and Shane talked, and Shane decided to send an “improved” version. It took him around 6 months to improve the lock and send it, and as it was one of the most anticipated YouTube crossovers, everyone kept asking Shane about it.

When he finally uploaded the video about it, titled TWO Unpickable (?) Locks for Lock Picking Lawyer!, he talked about the attention the idea received. He mentioned that it was difficult to count the number of times LockPickingLawyer was mentioned in his comments. That gave me the idea to count exactly that.


Visualizations

Word-Count table from Stuff Made Here comments
Wordcloud from Stuff Made Here comments

Interesting Words

One of the things I noticed during the project is the number of ways people misspell ‘lockpickinglawyer’. Other than that, some interesting words were releaselplcut, teamlockpickinglawyer and unpicklockeble.
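If you want to hunt for these misspellings yourself, Python’s standard difflib can find near-matches of a target word. A minimal sketch (the words list here is a made-up sample for illustration, not the real comment data):

```python
import difflib

# Hypothetical sample of tokens; in practice these would come from
# the word counts collected from the comments.
words = ["lockpickinglawyer", "lockpickinlawyer", "lockpicklawyer",
         "teamlockpickinglawyer", "engineering", "magnets"]

# Words whose similarity ratio to the target is at least 0.8,
# ordered from most to least similar.
typos = difflib.get_close_matches("lockpickinglawyer", words, n=5, cutoff=0.8)
print(typos)
```

Tuning the cutoff trades precision for recall: lower values catch more creative misspellings but start pulling in unrelated words.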


Making the visualizations

I made all the visualizations using Python and various libraries. I used youtube-comment-downloader to fetch all the comments into JSON Lines files, and the Natural Language Toolkit (NLTK) to tokenize, filter and count the words.
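For context, the downloader writes one JSON object per line (the JSON Lines format), which is why the counting script below reads the files with json_lines. A minimal sketch of parsing that format with only the standard library (the sample string is made up; real files contain more fields than just "text"):

```python
import json

# A two-comment sample in the JSON Lines format: one JSON object per line.
sample = '{"text": "First comment"}\n{"text": "lockpickinglawyer!"}\n'

# Parse each non-empty line as its own JSON object.
comments = [json.loads(line) for line in sample.splitlines() if line.strip()]
texts = [c["text"] for c in comments]
print(texts)
```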

import json
import os

import json_lines
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from progress.bar import Bar

# Download the tokenizer models and stopword list on first run.
nltk.download('punkt')
nltk.download('stopwords')

files = os.listdir("rawdata")
stop_words = set(stopwords.words('english'))

bar = Bar('Progress: ', max=len(files))
data = {}
for file in files:
    with open('rawdata/' + file, 'r') as f:
        for comment in json_lines.reader(f):
            for word in word_tokenize(comment['text']):
                word = word.lower()
                # Keep only alphabetic words that are not stopwords.
                if word.isalpha() and word not in stop_words:
                    data[word] = data.get(word, 0) + 1
    bar.next()
bar.finish()

# Sort by count, descending, and save the result.
data = dict(sorted(data.items(), key=lambda item: item[1], reverse=True))
with open("wordcount.json", "w") as fp:
    json.dump(data, fp)

The wordcloud was also made in Python, using the wordcloud library.

import json

import numpy as np
from PIL import Image
from wordcloud import WordCloud

# Load the mask image (which shapes the cloud) and the word counts.
mask = np.array(Image.open("mask.png"))
with open("wordcount.json", "r") as f:
    data = json.load(f)

wc = WordCloud(width=3888, height=5180, background_color="white",
               max_words=6000, mask=mask, max_font_size=1000,
               random_state=32)
wc.generate_from_frequencies(data)
wc.to_file("cloud.png")

The other image (the Word-Count table) I typed manually in Google Docs.

The source code for the complete project is available on GitHub.
The data was fetched on June 4, 2021.