The Elder Scrolls V Skyrim Special Edition - Analysis of Dialogues

Background

After my last visualisation of Harry Potter data, I decided to use some other data to create word clouds. I am a huge Skyrim fan and have always wanted to learn how to use xEdit scripts. As a result, here I am with another Word Cloud.

Disclaimer: a lot of the code is shared between my previous project and this one.

Visualisations

WordCloud of Most Occurring Words in Dialogues of Skyrim Special Edition

Graph of DLC contribution towards dialogues

Getting the data

I don't like booting into my Windows installation (I have a dual-boot setup and mainly use Manjaro), so I looked around the internet for some sort of data dump of Skyrim dialogues. Unfortunately, I couldn't find any, so I decided to extract the data myself. I had recently formatted my Windows partition and had to reinstall the game, which also meant no mods would pollute the data (I had over 150 mods before the format). I downloaded the latest xEdit and used the Export dialogues.pas script that comes with it to export all the dialogues (the export took 22:05 minutes).
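
Assuming the export was saved as raw_data/data.csv (the path the scripts below use), a quick way to peek at what the tab-separated file actually contains is something like this:

# Quick sanity check of the xEdit export: print the header row and a few records.
import csv

with open("raw_data/data.csv", "r") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    print(header)  # column names, e.g. RESPONSE TEXT, TOPIC TEXT, PLUGIN, ...
    for _, row in zip(range(3), reader):
        print(row)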

I am going to look into other data I can extract this way, and maybe make some other stuff.

Processing the data

In the CSV, there were two columns I was interested in: RESPONSE TEXT & TOPIC TEXT. Response Text was the larger one, with over 40k unique dialogues. Topic Text had only around 5.5k unique dialogues and also needed some additional processing: it contained some game constants such as RoomCost, HorseCost, and other prices, which had to be filtered out. I did all that in csv_to_json.py. Here's the code for it:

# Python Script to extract RESPONSE TEXT & TOPIC TEXT from data.csv to output JSON files
import csv
import json
import re

import pandas as pd

# Matches <...> placeholder tags embedded in topic text
regex_string = r"<.+>"

def regex_filter(s):
    return re.sub(regex_string, '', s)

# The xEdit export is tab-separated
data = list(csv.reader(open('raw_data/data.csv', 'r'), delimiter='\t'))

# Keep only the first 19 header columns
data[0] = data[0][0:19]
df = pd.DataFrame(data[1:], columns=data[0])

# Count how often each unique response/topic line occurs
data_rt = df.groupby(["RESPONSE TEXT"]).size().sort_values().to_dict()
data_tt = df.groupby(["TOPIC TEXT"]).size().sort_values().to_dict()

# To filter out some constants and non-dialogues from TOPIC TEXT:
# strip the <...> tags, then drop keys with no spaces or sentence punctuation
data_tt = {regex_filter(k): v for k, v in data_tt.items()}
to_remove = [key for key in data_tt
             if " " not in key and "." not in key and "?" not in key]
for key in to_remove:
    del data_tt[key]

json.dump(data_rt, open("output_data/out_rt.json", "w"))
json.dump(data_tt, open("output_data/out_tt.json", "w"))
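
To make the filtering concrete, here is what regex_filter does to a couple of made-up topic strings; the tag format in the real export may differ, these are just illustrations:

# Hypothetical topic strings -- the tag format in the real export may differ.
import re

def regex_filter(s):  # same helper as in csv_to_json.py above
    return re.sub(r"<.+>", '', s)

print(regex_filter("I'd like to rent a room. (<RoomCost> gold)"))  # tag stripped, line kept
print(regex_filter("HorseCost"))  # unchanged; dropped later (no space, '.' or '?')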

Counting the Words

Like the previous visualisation, I used nltk's stopwords corpus, along with a modified version of Google's list of the 20k most common English words. Interestingly, the modifications I made for Harry Potter were valid for Skyrim as well, because there is no dialogue with names like Harry, Ron, or Arthur, and the two share words like vampires, magic, etc.
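
The counting script below only reads raw_data/custom_stopwords.txt; building that file is essentially just cleaning up a local copy of the common-words list and pulling out the setting-relevant words by hand. A minimal sketch, where raw_data/20k.txt and the kept words are placeholders rather than the exact files and edits I used:

# Sketch of producing raw_data/custom_stopwords.txt from the common-words list.
# "raw_data/20k.txt" and the keep set are placeholders, not the exact files/words used.
common = [line.strip() for line in open("raw_data/20k.txt") if line.strip()]
keep = {"magic", "vampires"}  # setting-relevant words that should stay in the cloud
with open("raw_data/custom_stopwords.txt", "w") as f:
    f.write("\n".join(word for word in common if word not in keep))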

I counted the RESPONSE TEXT & TOPIC TEXT data separately and then merged the counts into a single file, count.json.

Additional tip: progress is a great Python package for showing progress bars in your scripts.
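
If you haven't used it before, the basic pattern is just three calls; the label and count below are arbitrary:

# Minimal progress bar: create it with a total, call next() per item, finish() at the end
from progress.bar import Bar

bar = Bar('Processing', max=100)
for _ in range(100):
    bar.next()
bar.finish()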

Here's the code I used for counting:

# Python Script to count & filter the words in the JSON files output by csv_to_json.py
import json
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from progress.bar import Bar

# Combine nltk's English stopwords with the custom stopword list
gfile = open("raw_data/custom_stopwords.txt", "r")
to_remove = [l.strip() for l in gfile.readlines()]
stopwords = stopwords.words('english')
stopwords.extend(to_remove)

data_rt = json.load(open("output_data/out_rt.json", "r"))
data_tt = json.load(open("output_data/out_tt.json", "r"))

data = {}

# Tokenize each unique response line and add its occurrence count to every
# non-stopword, alphabetic token it contains
bar = Bar('Counting Response Text', max=len(data_rt))
for key in data_rt:
    for word in word_tokenize(key):
        if word.lower() not in stopwords and word.isalpha():
            if word in data:
                data[word] += data_rt[key]
            else:
                data[word] = data_rt[key]
    bar.next()
bar.finish()

# Same again for the topic text, merging into the same dictionary
bar = Bar('Counting Topic Text', max=len(data_tt))
for key in data_tt:
    for word in word_tokenize(key):
        if word.lower() not in stopwords and word.isalpha():
            if word in data:
                data[word] += data_tt[key]
            else:
                data[word] = data_tt[key]
    bar.next()
bar.finish()

# Sort by count (descending) and write the merged counts
data = {k: v for k, v in sorted(data.items(), key=lambda item: item[1], reverse=True)}
json.dump(data, open("output_data/count.json", "w"))

Making the WordCloud

I used pretty much the same process as the last visualisation. I changed the maximum font size to depict the variation properly and used a custom font this time.

To make the WordCloud, I used the wordcloud package. For the mask, I used the Skyrim Logo Vector, and for the font, the Sovngarde font.

Here's the code for the wordcloud:

# Python Script to make WordClouds
import os
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import json

mask = np.array(Image.open("masks/Skyrim Logo.png"))
data = json.load(open("output_data/count.json","r"))

print(f"Diffrent Words: {len(data.keys())} | Total Words: {sum(data.values())}")
wc = WordCloud(width=512,height=512,background_color="white", max_words=6000,mask=mask,
               max_font_size=250, random_state=42,contour_width=1, font_path="fonts/SovngardeLight.ttf")

wc.generate_from_frequencies(data)
wc.to_file("out/cloud.png")
s = wc.to_svg()
print(s,file=open("out/cloud.svg","w"))


Making the Graph

I initially planned on making a set of graphs from the data, but wasn't able to, for two reasons:

  1. Some of the data was weird. Arngeir had the highest dialogue count because the dialogue of many other NPCs (including General Tullius, I think) is assigned to him (see the quick check sketched after this list).
  2. Some of the data doesn't produce interesting visualisations. Nords have the highest dialogue count, and the gap between the first few races and the rest is so large that most races are barely visible on a graph.
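
For anyone who wants to verify this themselves, a rough check of both points against the export looks like the snippet below; note that the NPC and RACE column names are assumptions on my part, so confirm them against the header row of data.csv first.

# Quick look at the skew behind the two points above.
# "NPC" and "RACE" are assumed column names -- confirm against the header row of data.csv.
import csv
import pandas as pd

data = list(csv.reader(open('raw_data/data.csv', 'r'), delimiter='\t'))
data[0] = data[0][0:19]
df = pd.DataFrame(data[1:], columns=data[0])

print(df.groupby(["NPC"]).size().sort_values(ascending=False).head(10))   # Arngeir far ahead
print(df.groupby(["RACE"]).size().sort_values(ascending=False).head(10))  # Nords dominate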

Since I had already made the DLC graph, I thought I'd share it here, in case someone is interested in the image or its code.

# Python Script to make graphs
import numpy as np
import csv
import pandas as pd
import matplotlib.pyplot as plt

data=list(csv.reader(open('raw_data/data.csv','r'),delimiter='\t'))

data[0] = data[0][0:19]
df = pd.DataFrame(data[1:], columns =data[0])  

# Count dialogues per plugin and strip the ".esm" suffix from the labels
plugin_data = df.groupby(["PLUGIN"]).size().sort_values()
plugin_data = plugin_data.rename(lambda x: x.replace(".esm", ""))

plt.figure(figsize=(14, 10))
ax = plugin_data.plot.bar(title="No. of Dialogues per DLC", cmap="jet")
ax.set(xlabel="DLC", ylabel="Dialogues")
plt.savefig("out/graph.eps", format="eps")

Future Plans

I will look into writing custom xEdit scripts (if someone already has some, do share them with me) to extract other interesting data from Skyrim and see what I can do with it.
