## Introduction

Recently, I’ve had a chance to play with word embedding models. A word embedding model takes a text corpus and generates a vector representation for each word in it. These models have many uses, such as computing similarities between words (usually via the cosine similarity between their vectors) and detecting analogies between words (king is to queen as man is to woman). In this post I will compare two methods of generating word embeddings on a well-known text: the Sherlock Holmes short story ‘A Scandal in Bohemia’.
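
To make the similarity and analogy ideas concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented purely for illustration (real embeddings have hundreds of dimensions and are learned from data), and they are contrived so that the analogy works out exactly.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not the output of any trained model
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
}

# Similarity between two words
print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))

# Analogy: king - man + woman should land close to queen
analogy = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]
print(cosine_similarity(analogy, toy_vectors["queen"]))
```

With trained embeddings the analogy vector will not match ‘queen’ exactly, so in practice you look for the vocabulary word whose vector has the highest cosine similarity to it.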

The standard method for generating word embeddings is Word2Vec, a shallow neural network that learns word embeddings from training data by using the words that surround each word (called context words). I will leave technical details about these models for future posts. Given suitable training data, Word2Vec usually works very well. My use case was generating synonyms for a given word and using the set of synonyms to search the entire corpus.

However, Word2Vec has one flaw. If you give it a word which is not in the vocabulary used to train the model, it cannot give you similar words. It’s a sensible limitation – how is it meant to tell you about a word which doesn’t exist as far as it’s concerned?

This is where Fasttext comes in. Fasttext is a word embedding model from Facebook AI Research that learns vectors not just for the words in the vocabulary but also for substrings (character n-grams) of those words. As a result, if you give Fasttext a word it has not been trained on, it breaks the word into substrings and builds a vector from the ones it saw during training. Either method will perform poorly if your training set is small or if you input words it doesn’t recognise, but Fasttext has some contingency built in.
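
To make the subword idea concrete, here is a minimal sketch of splitting a word into character n-grams. It is a simplification and not Fasttext’s actual implementation (which, among other things, hashes the n-grams into buckets). The `<` and `>` characters mark the start and end of a word, the default lengths match the `minn` and `maxn` values used in the code further down, and the vocabulary and the unseen word ‘women’ are made up for illustration.

```python
def char_ngrams(word, minn=2, maxn=5):
    """Return the set of character n-grams of `word`, with < and > as boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    }

# Hypothetical training vocabulary and a word that is not in it
vocabulary = ["woman", "gentleman", "window"]
unseen = "women"

unseen_grams = char_ngrams(unseen)
for known in vocabulary:
    # Subwords the unseen word has in common with a word seen in training
    shared = unseen_grams & char_ngrams(known)
    print(known, sorted(shared))
```

The more subwords an unseen word shares with words from the training data, the more information the model has to build a vector for it.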

For today’s post I decided to take a text I love, ‘A Scandal in Bohemia’ from the Sherlock Holmes stories, and generate word embeddings for it using Word2Vec and Fasttext. Then we can look at the similar words generated by each method to see which performs ‘better’. Ultimately the objective of this post is to provide some quick start code in Python for anyone interested in building these models for the first time. I found the official Fasttext quick start tutorial (in references) a little difficult to follow, chiefly because I didn’t know what format the input data should be in and I had to use the command line to download the data (this was enough to deter me). So I will walk through my code in an effort to make setting up as easy as possible.

A caveat: this comparison is by no means extensive or reliable! I only check similarities for one word, the model tuning is arbitrary, and I used no statistical measures to check the quality of the different word embedding models (I don’t even know what they are yet). I will be looking at these subjects in more detail in the coming weeks so I can learn the proper statistical procedures for comparison, and in future posts I may come back to this example and refine it. For now, it’s a simple, rough quick start to both models. If anyone has any reading materials or suggestions for how to make such a comparison halfway decent, please drop them in the comments.

## Code

# Install packages (-y skips the confirmation prompt, which a notebook cell cannot answer)
!conda install -y gensim
!conda install -y nltk
!conda install -y spacy
!pip install fasttext

# Import packages
import fasttext
import string
import re
import unicodedata
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import spacy
import gensim
from gensim import models
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary
from collections import Counter
from datetime import datetime, timedelta
from numpy import asarray

# English stop words from NLTK (run nltk.download('stopwords') once if the corpus is missing)
stop_words = set(stopwords.words('english'))

## Scandal in Bohemia sentences (one sentence per line)
with open("scandal_in_bohemia_sentences.txt", "r") as f:
    scandal_in_bohemia_sentences = f.read().splitlines()

with open("scandal_in_bohemia_sentences_no_stopwords.txt", "r") as f:
    scandal_in_bohemia_sentences_no_stopwords = f.read().splitlines()

# Helper functions
## NLP Functions
def strip_accents(text):
    try:
        text = unicode(text, 'utf-8')
    except NameError:
        pass
    text = unicodedata.normalize('NFD', text)\
        .encode('ascii', 'ignore')\
        .decode("utf-8")
    return str(text)

def preprocess(text, remove_accents=False, lower=True, remove_less_than=0, remove_more_than=100, remove_punct=True,
               remove_alpha=False, remove_stopwords=True, add_custom_stopwords=(), lemma=False, stem=False, remove_url=True):
    '''Tokenises and preprocesses text.
    Parameters:
    text (string): a string of text
    remove_accents (boolean): removes accents
    lower (boolean): lowercases text
    remove_less_than (int): removes words less than X letters
    remove_more_than (int): removes words more than X letters
    remove_punct (boolean): removes punctuation
    remove_alpha (boolean): removes non-alphabetic tokens
    remove_stopwords (boolean): removes stopwords
    add_custom_stopwords (iterable): extra stopwords to remove alongside the NLTK list
    lemma (boolean): lemmatises tokens
    stem (boolean): stems tokens using the Porter Stemmer
    remove_url (boolean): removes URLs starting with http
    Output:
    tokens (list): a list of cleaned tokens
    '''
    if remove_accents:
        text = strip_accents(text)
    if lower:
        text = text.lower()
    if remove_url:
        text = re.sub(r'http\S+', '', text)
    #tokens = simple_preprocess(text, deacc=remove_accents, min_len=remove_less_than, max_len=remove_more_than)
    tokens = text.split()
    if remove_punct:
        tokens = [token.translate(str.maketrans('', '', string.punctuation)) for token in tokens]
    if remove_alpha:
        tokens = [token for token in tokens if token.isalpha()]
    if remove_stopwords:
        all_stop_words = stop_words.union(add_custom_stopwords)
        tokens = [token for token in tokens if token not in all_stop_words]
    tokens = [i for i in tokens if remove_less_than <= len(i) <= remove_more_than]
    if lemma:
        tokens = [WordNetLemmatizer().lemmatize(token) for token in tokens]
    if stem:
        tokens = [PorterStemmer().stem(token) for token in tokens]
    # Return the cleaned tokens so callers can build the training corpus
    return tokens

# Build a word2vec model
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

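# For word2vec, we need a list of token lists (one list of tokens per sentence)
scandal_in_bohemia_tokens = [preprocess(i) for i in scandal_in_bohemia_sentences]
# Note: gensim 4.0 and later renamed the `size` argument to `vector_size`
w2v_model = gensim.models.Word2Vec(scandal_in_bohemia_tokens, size = 500, window = 10, min_count=1, workers = 4)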

# Build two fasttext models - one with subwords and one without
# Fasttext takes the path to a txt file on disk as input, so we refer to the raw file rather than the object loaded in our Python session
# The input file here is the same as the list of sentences we have already seen but with stop words removed

ft_model = fasttext.train_unsupervised('scandal_in_bohemia_sentences_no_stopwords.txt', minn = 2, maxn = 5, dim = 500)
ft_model_wo_subwords = fasttext.train_unsupervised('scandal_in_bohemia_sentences_no_stopwords.txt', maxn = 0, dim = 500)

# Comparing the outputs from each model
w2v_model.wv.most_similar('woman', topn = 20)
ft_model.get_nearest_neighbors('woman', k = 20)
ft_model_wo_subwords.get_nearest_neighbors('woman', k = 20)


From the references you can clone my GitHub repo to get the data and the Python file above. The input file ‘scandal_in_bohemia_sentences.txt’ is a txt file where the story has been separated into sentences. Every new line in the file is a sentence, and I’ve removed punctuation and converted it all to lower case. Word2Vec and Fasttext take the input data in different formats, which you should be able to see if you follow along with the Python in your own notebook or IDE. Word2Vec takes a nested list of tokens, while Fasttext reads a text file containing one sentence per line.

Suppose we had the following text corpus: ‘The apple is red. The lemon is yellow. The lime is green.’ The format Word2Vec would like is [[‘the’, ‘apple’, ‘is’, ‘red’], [‘the’, ‘lemon’, ‘is’, ‘yellow’], [‘the’, ‘lime’, ‘is’, ‘green’]] and the format Fasttext would like is the sentences ‘the apple is red’, ‘the lemon is yellow’ and ‘the lime is green’, each on its own line of a text file. These are not strict rules but the conventions used in the docs for each model.
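
As a rough sketch, the toy corpus above could be converted into both formats as follows. The naive split on full stops only works for this example, and ‘toy_corpus.txt’ is a hypothetical file name that I write to the temporary directory.

```python
import tempfile
from pathlib import Path

corpus = "The apple is red. The lemon is yellow. The lime is green."

# Naive sentence split on full stops, good enough for this toy corpus only
sentences = [s.strip().lower() for s in corpus.split(".") if s.strip()]

# Word2Vec-style input: a nested list, one list of tokens per sentence
word2vec_input = [sentence.split() for sentence in sentences]

# Fasttext-style input: a text file with one sentence per line
# (toy_corpus.txt is a hypothetical file name in the temp directory)
fasttext_path = Path(tempfile.gettempdir()) / "toy_corpus.txt"
fasttext_path.write_text("\n".join(sentences) + "\n", encoding="utf-8")

print(word2vec_input)
print(fasttext_path.read_text(encoding="utf-8"))
```

The nested list can then be handed to Word2Vec, and the file path to Fasttext, in the same way as in the code above.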

Let’s briefly comment on the main model parameters. For Word2Vec, we have size and window: size is the dimension of the word embedding vector and window is the maximum distance between a word and the context words used for it in the model. For Fasttext we have minn, maxn and dim. The first two relate to the subwords, which are all the substrings of a word whose length lies between the minimum size (minn) and the maximum size (maxn). Dim is the dimension of the word embedding vector and is equivalent to the size parameter in Word2Vec. I have built two Fasttext models, one using subwords and one without (maxn = 0), to compare the differences in results.
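
To illustrate what the window parameter controls, here is a simplified sketch that lists the (word, context word) pairs a window of a given size produces for one sentence. The function name `context_pairs` is my own, and real implementations add refinements that this sketch ignores.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs for each token and its neighbours within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "apple", "is", "red"]

# A small window only pairs each word with its immediate neighbours
print(context_pairs(sentence, window=1))

# A window larger than the sentence pairs every word with every other word
print(len(context_pairs(sentence, window=10)))
```

Since most sentences in a short story are shorter than 10 tokens after stop word removal, a window of 10 will often treat the whole sentence as context.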

## Results

Searching for words similar to ‘woman’ gives the following results:

Word2Vec:

[('hand', 0.14927232265472412),
('remove', 0.1452692747116089),
('peculiar', 0.1419636309146881),
('intention', 0.14183956384658813),
('godfrey', 0.13988855481147766),
('fro', 0.1376987099647522),
('driver', 0.13480930030345917),
('came', 0.13391801714897156),
('strange', 0.12853017449378967),
('coming', 0.1273207664489746),
('anything', 0.1268802285194397),
('black', 0.12454517185688019),
('numerous', 0.1207168847322464),
('annoyance', 0.11609496176242828),
('hot', 0.11296975612640381),
('fortunate', 0.11278606951236725),
('half', 0.11249668151140213),
('waving', 0.10978461802005768),
('country', 0.10923135280609131)]

Fasttext (with subwords):

[(0.3996190130710602, 'german'),
(0.37522637844085693, 'gentleman'),
(0.36704689264297485, 'clergyman'),
(0.3200114369392395, '</s>'),
(0.2853983938694, 'hand'),
(0.272637277841568, 'lodge'),
(0.26899453997612, 'window'),
(0.2592756748199463, 'norton'),
(0.2512776553630829, 'note'),
(0.239231139421463, 'fire'),
(0.2387862354516983, 'understand'),
(0.2316322773694992, 'corner'),
(0.22655518352985382, 'irene'),
(0.22640398144721985, 'matter'),
(0.2216232419013977, 'found'),
(0.22161732614040375, 'street'),
(0.21747052669525146, 'moment'),
(0.20840883255004883, 'love'),
(0.20640386641025543, 'serpentine'),
(0.20150649547576904, 'adler')]

Fasttext (without subwords):

[(0.15502291917800903, 'cried'),
(0.08252811431884766, 'baker'),
(0.05087444186210632, 'soul'),
(0.05002214387059212, 'sitting'),
(0.04498187080025673, 'watson'),
(0.04359939694404602, 'hand'),
(0.03805050253868103, 'dear'),
(0.03730574622750282, 'bohemia'),
(0.036167293787002563, 'heard'),
(0.036113884299993515, 'understand'),
(0.034881722182035446, 'client'),
(0.034294936805963516, 'hands'),
(0.026312250643968582, 'visitor'),
(0.0246898103505373, 'clock'),
(0.024095812812447548, 'clergyman'),
(0.022810054942965508, 'german'),
(0.01706317439675331, 'matter')]

At first glance, the words similar to ‘woman’ don’t make sense in the general use of the term. Within the context of the story, they make more sense. The Fasttext model using subwords gives the ‘best’ subjective results because it returns terms like ‘Irene’, ‘Adler’, ‘clergyman’, ‘fire’ and ‘window’. These terms make sense if you recall that in the story Holmes refers to Irene Adler as ‘the woman’, that Holmes dressed as a clergyman to get into her home, and that a smoke rocket was thrown through the window of Adler’s home and a false cry of ‘fire’ raised, which enabled Holmes to see where she hid the photograph. Given the context, it’s quite impressive.

The model performance could be improved via better preprocessing, a larger text corpus (the story is quite short) and better tuning of the model parameters. However, I hope this post provides an easy quick start for those interested in these models.

## References

1. Fasttext docs: https://fasttext.cc/docs/en/unsupervised-tutorial.html