Recently, I’ve had a chance to play with word embedding models. Word embedding models take a text corpus and generate vector representations for the words in that corpus. These models have many uses, such as computing similarities between words (usually via cosine similarity between the vector representations) and detecting analogies between words (king is to queen as man is to woman). In this post I will compare two methods of generating word embeddings on a well-known text: the Sherlock Holmes short story A Scandal in Bohemia.
The standard method for generating word embeddings is a model called Word2Vec, a shallow neural network that learns word embeddings from training data by incorporating surrounding words (called context words). I will leave technical details about these models for future posts. In most cases, provided the training data is adequate, Word2Vec works very well. My use case was generating synonyms for a given word and using that set of synonyms to search the entire corpus.
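To make the similarity and analogy ideas concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented purely for illustration (real embeddings have hundreds of dimensions and are learned from data), and the `analogy` helper is my own hypothetical function, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for this example only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.2, 0.3, 0.2],
}

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so they cannot be returned as the answer
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print(analogy("man", "king", "woman", toy_vectors))
```

With these made-up numbers the analogy lands on ‘queen’, but only because I chose the vectors that way; on a real, small corpus the result can be much noisier.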
However, Word2Vec has one flaw. If you give it a word which is not in the vocabulary used to train the model, it cannot give you similar words. It’s a sensible limitation – how is it meant to tell you about a word which doesn’t exist as far as it’s concerned?
This is where Fasttext comes in. Fasttext is a word embedding model developed by Facebook AI Research which uses not just the words in the vocabulary but also substrings (character n-grams) of these words. As a result, if you feed Fasttext a word that it has not been trained on, it can still build a vector for that word from whichever of its substrings it did see in the training corpus. With either method you’re going to suffer from poor performance if your training set is small or if you input words it doesn’t recognise, but Fasttext has some contingency built in.
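A rough sketch of the subword idea: the function below lists the character n-grams of a word, wrapping it in ‘<’ and ‘>’ to mark the word boundaries as Fasttext does. The function name and defaults are my own; the real library also hashes n-grams into buckets and keeps the whole word as an extra token, which this simplification ignores.

```python
def char_ngrams(word, minn=2, maxn=5):
    """Character n-grams of a word, with '<' and '>' marking its boundaries."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Pretend 'woman' was in the training data and 'women' was not
seen = set(char_ngrams("woman"))
unseen = set(char_ngrams("women"))

# The n-grams both words share are what let a subword model
# say something about the unseen word
shared = sorted(seen & unseen)
print(shared)
```

Even though ‘women’ never appeared in this pretend training set, it shares several n-grams with ‘woman’, so a subword model has something to work with.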
For today’s post I decided to take a text I love, the Sherlock Holmes short story A Scandal in Bohemia, and generate word embeddings for it using Word2Vec and Fasttext. Then we can look at the similar words generated by each method to see which performs ‘better’. Ultimately the objective of this post is to provide some quick start code in Python for anyone interested in building these models for the first time. I found the official Fasttext quick start tutorial (in references) a little difficult to follow, chiefly because I didn’t know what format the input data should be in and I had to use the command line to download the data (this was enough to deter me). So I will walk through my code in an effort to make setting up as easy as possible.
A caveat: this comparison is by no means extensive or reliable! I only check similarities for one word, the model tuning is arbitrary, and I used no statistical measures to check the quality of the different word embedding models; I don’t even know what they are yet. I will be looking at these subjects in more detail in the coming weeks so I can learn the proper statistical procedures for comparison, and in future posts I may come back to this example and refine it. For now, it’s a simple, rough quick start to both models. If anyone has any reading materials or suggestions for how to make such a comparison halfway decent, please drop them in the comments.
    # Install packages (shell commands run from a notebook cell)
    !conda install gensim
    !conda install nltk
    !conda install spacy
    !git clone https://github.com/facebookresearch/fastText.git
    !pip install fasttext

    # Import packages
    import re
    import string
    import unicodedata
    from collections import Counter
    from datetime import datetime, timedelta

    import fasttext
    import gensim
    import nltk
    import spacy
    from gensim import models
    from gensim.corpora.dictionary import Dictionary
    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser
    from gensim.test.utils import common_texts, get_tmpfile
    from gensim.utils import simple_preprocess
    from nltk.corpus import stopwords
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from nltk.stem import WordNetLemmatizer
    from nltk.stem.porter import PorterStemmer
    from nltk.tokenize import sent_tokenize, word_tokenize
    from numpy import asarray

    nltk.download('punkt')
    nltk.download('vader_lexicon')
    nltk.download('stopwords')
    nltk.download('wordnet')

    stop_words = set(stopwords.words('english'))

    # Load Files
    ## Scandal in Bohemia sentences (one sentence per line)
    with open("scandal_in_bohemia_sentences.txt", "r") as f:
        scandal_in_bohemia_sentences = f.readlines()

    with open("scandal_in_bohemia_sentences_no_stopwords.txt", "r") as f:
        scandal_in_bohemia_sentences_no_stopwords = f.readlines()

    # Helper functions
    ## NLP Functions
    def strip_accents(text):
        try:
            # Python 2 only; on Python 3 `unicode` does not exist and we skip this
            text = unicode(text, 'utf-8')
        except NameError:
            pass
        text = unicodedata.normalize('NFD', text)\
            .encode('ascii', 'ignore')\
            .decode("utf-8")
        return str(text)

    # Note: the default for add_custom_stopwords was missing in the original;
    # None (no extra stop words) is assumed here.
    def preprocess(text, remove_accents=False, lower=True, remove_less_than=0,
                   remove_more_than=100, remove_punct=True, remove_alpha=False,
                   remove_stopwords=True, add_custom_stopwords=None, lemma=False,
                   stem=False, remove_url=True):
        '''Tokenises and preprocesses text.

        Parameters:
        text (string): a string of text
        remove_accents (boolean): removes accents
        lower (boolean): lowercases text
        remove_less_than (int): removes words less than X letters
        remove_more_than (int): removes words more than X letters
        remove_punct (boolean): removes punctuation
        remove_alpha (boolean): removes non-alphabetic tokens
        remove_stopwords (boolean): removes stopwords
        add_custom_stopwords (list): adds custom stopwords
        lemma (boolean): lemmatises tokens
        stem (boolean): stems tokens using the Porter Stemmer
        remove_url (boolean): removes URLs

        Output:
        tokens (list): a list of cleaned tokens
        '''
        if remove_accents:
            text = strip_accents(text)
        if lower:
            text = text.lower()
        if remove_url:
            text = re.sub(r'http\S+', '', text)

        #tokens = simple_preprocess(text, deacc=remove_accents, min_len=remove_less_than, max_len=remove_more_than)
        tokens = text.split()

        if remove_punct:
            tokens = [token.translate(str.maketrans('', '', string.punctuation)) for token in tokens]
        if remove_alpha:
            tokens = [token for token in tokens if token.isalpha()]
        if remove_stopwords:
            # Combine the NLTK stop words with any custom ones
            all_stop_words = stop_words.union(add_custom_stopwords or [])
            tokens = [token for token in tokens if token not in all_stop_words]

        # Keep only tokens whose length is within the allowed range
        tokens = [token for token in tokens if remove_less_than <= len(token) <= remove_more_than]

        if lemma:
            lemmatizer = WordNetLemmatizer()
            tokens = [lemmatizer.lemmatize(token) for token in tokens]
        if stem:
            stemmer = PorterStemmer()
            tokens = [stemmer.stem(token) for token in tokens]

        return tokens

    # Build a word2vec model
    # For word2vec, we need a list of token lists (one list per sentence)
    scandal_in_bohemia_tokens = [preprocess(i) for i in scandal_in_bohemia_sentences]

    # Note: in gensim 4.0 and later the `size` argument is called `vector_size`
    w2v_model = gensim.models.Word2Vec(scandal_in_bohemia_tokens, size=500, window=10, min_count=1, workers=4)

    # Build a fasttext model - one with and without subwords
    # Fasttext takes a txt file on disk as input, so we refer to the raw file
    # and not the list we loaded into our Python environment.
    # The input file here is the same as the list of sentences we have already
    # seen but with stop words removed.
    ft_model = fasttext.train_unsupervised('scandal_in_bohemia_sentences_no_stopwords.txt', minn=2, maxn=5, dim=500)
    ft_model_wo_subwords = fasttext.train_unsupervised('scandal_in_bohemia_sentences_no_stopwords.txt', maxn=0, dim=500)

    # Comparing the outputs from each model
    w2v_model.wv.most_similar('woman', topn=20)
    ft_model.get_nearest_neighbors('woman', k=20)
    ft_model_wo_subwords.get_nearest_neighbors('woman', k=20)
From the references you can clone my Github repo to get the data and the Python file above. The input file ‘scandal_in_bohemia_sentences.txt’ is a txt file where the story has been separated into sentences. Every new line in the file is a sentence; I’ve removed punctuation and converted it all to lower case. Word2Vec and Fasttext take the input data in different formats, which you should be able to see if you follow along with the Python in your own notebook/IDE. Word2Vec takes a nested list of tokens in Python, while Fasttext is given the path to a text file with one sentence per line.
Suppose we had the following text corpus: ‘The apple is red. The lemon is yellow. The lime is green.’ The format Word2Vec would like is [[‘the’, ‘apple’, ‘is’, ‘red’], [‘the’, ‘lemon’, ‘is’, ‘yellow’], [‘the’, ‘lime’, ‘is’, ‘green’]], and Fasttext would like a text file whose lines are ‘the apple is red’, ‘the lemon is yellow’ and ‘the lime is green’. These are not strict rules but the conventions used in the docs for each model.
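As a small sketch of the two formats, the snippet below turns the toy corpus into a nested token list and into a one-sentence-per-line text file. The sentence splitting is deliberately naive and the file name ‘toy_corpus.txt’ is hypothetical; a real pipeline would use a proper sentence tokeniser.

```python
import os
import tempfile

corpus = "The apple is red. The lemon is yellow. The lime is green."

# Naive sentence split on full stops, lowercased
sentences = [s.strip().lower() for s in corpus.split(".") if s.strip()]

# Word2Vec-style input: one list of tokens per sentence
nested_tokens = [sentence.split() for sentence in sentences]

# Fasttext-style input: a text file with one sentence per line
# (written to a temporary directory; the file name is made up)
path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")

print(nested_tokens)
with open(path, encoding="utf-8") as f:
    print(f.read())
```

The nested list is what you would hand to Word2Vec, and the file path is what you would hand to Fasttext’s training function.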
Let’s briefly comment on the main model parameters. For Word2Vec, we have size and window; size is the dimension of the word embedding vector and window is the maximum distance between a word and the context words considered around it. For Fasttext we have minn, maxn and dim; minn and maxn relate to the subwords, which are all the character substrings of a word whose length lies between the minimum size (minn) and the maximum size (maxn). Setting maxn to 0 switches subwords off. Dim is the dimension of the word embedding vector and is equivalent to the size parameter in Word2Vec. I have built two Fasttext models, one using subwords and one without, to compare the differences in results.
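To illustrate what the window parameter controls, here is a simplified sketch that lists the (target, context) pairs a window of a given size would produce for one sentence. The function is my own illustration; real implementations add details such as randomly shrinking the window during training, which are ignored here.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "apple", "is", "red"]
for pair in context_pairs(sentence, window=1):
    print(pair)
```

A larger window pulls in more distant words as context, which tends to capture broader topical similarity at the cost of more noise on a small corpus like this one.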
Looking for words similar to ‘woman’ in each model, we find:
Word2Vec:
[('hand', 0.14927232265472412), ('remove', 0.1452692747116089), ('peculiar', 0.1419636309146881), ('intention', 0.14183956384658813), ('godfrey', 0.13988855481147766), ('fro', 0.1376987099647522), ('driver', 0.13480930030345917), ('came', 0.13391801714897156), ('strange', 0.12853017449378967), ('coming', 0.1273207664489746), ('anything', 0.1268802285194397), ('read', 0.1251574605703354), ('black', 0.12454517185688019), ('numerous', 0.1207168847322464), ('annoyance', 0.11609496176242828), ('hot', 0.11296975612640381), ('fortunate', 0.11278606951236725), ('half', 0.11249668151140213), ('waving', 0.10978461802005768), ('country', 0.10923135280609131)]
Fasttext (with subwords):
[(0.3996190130710602, 'german'), (0.37522637844085693, 'gentleman'), (0.36704689264297485, 'clergyman'), (0.3200114369392395, '</s>'), (0.2853983938694, 'hand'), (0.272637277841568, 'lodge'), (0.26899453997612, 'window'), (0.2592756748199463, 'norton'), (0.2512776553630829, 'note'), (0.239231139421463, 'fire'), (0.2387862354516983, 'understand'), (0.2316322773694992, 'corner'), (0.22655518352985382, 'irene'), (0.22640398144721985, 'matter'), (0.2216232419013977, 'found'), (0.22161732614040375, 'street'), (0.21747052669525146, 'moment'), (0.20840883255004883, 'love'), (0.20640386641025543, 'serpentine'), (0.20150649547576904, 'adler')]
Fasttext (without subwords):
[(0.15502291917800903, 'cried'), (0.08252811431884766, 'baker'), (0.06099469214677811, 'address'), (0.05087444186210632, 'soul'), (0.05002214387059212, 'sitting'), (0.04498187080025673, 'watson'), (0.04359939694404602, 'hand'), (0.03805050253868103, 'dear'), (0.03730574622750282, 'bohemia'), (0.036167293787002563, 'heard'), (0.036113884299993515, 'understand'), (0.034881722182035446, 'client'), (0.034294936805963516, 'hands'), (0.02885563112795353, 'lady'), (0.026660935953259468, 'mask'), (0.026312250643968582, 'visitor'), (0.0246898103505373, 'clock'), (0.024095812812447548, 'clergyman'), (0.022810054942965508, 'german'), (0.01706317439675331, 'matter')]
At first glance, the words similar to ‘woman’ don’t make sense in the general use of the term, but within the context of the story they make more sense. The Fasttext model using subwords gives the ‘best’ subjective results because it returns terms like ‘Irene’, ‘Adler’, ‘clergyman’, ‘fire’ and ‘window’. These terms make sense if you recall that, in the story, Holmes refers to Irene Adler as ‘the woman’, that Holmes disguises himself as a clergyman to get into her home, and that a smoke rocket thrown through her window, together with a false cry of fire, lets Holmes see where she hides the photograph. Given the context, it’s quite impressive.
The model performance could be improved via better preprocessing, a larger text corpus (the story is quite short) and better tuning of the model parameters. However, I hope this post provides an easy quick start for those interested in these models.
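One practical wrinkle when comparing the outputs: gensim returns (word, score) pairs while Fasttext returns (score, word) pairs, and Fasttext’s list can include the ‘</s>’ end-of-sentence token. The sketch below uses the top five rows from the Word2Vec and Fasttext (with subwords) results above; the helper names are my own, and this is only a crude overlap check, not a proper evaluation.

```python
# Top five neighbours copied from the results above
w2v_top = [
    ('hand', 0.14927232265472412),
    ('remove', 0.1452692747116089),
    ('peculiar', 0.1419636309146881),
    ('intention', 0.14183956384658813),
    ('godfrey', 0.13988855481147766),
]
ft_top = [
    (0.3996190130710602, 'german'),
    (0.37522637844085693, 'gentleman'),
    (0.36704689264297485, 'clergyman'),
    (0.3200114369392395, '</s>'),
    (0.2853983938694, 'hand'),
]

def normalise_fasttext(neighbours):
    """Flip (score, word) pairs to (word, score) and drop the '</s>' token."""
    return [(word, score) for score, word in neighbours if word != "</s>"]

def shared_words(a, b):
    """Words that appear in both lists of (word, score) pairs."""
    return {word for word, _ in a} & {word for word, _ in b}

ft_norm = normalise_fasttext(ft_top)
print(ft_norm)
print(shared_words(w2v_top, ft_norm))
```

Putting both outputs in the same shape makes it easier to eyeball agreements between models before moving on to anything more rigorous.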
- Fasttext docs: https://fasttext.cc/docs/en/unsupervised-tutorial.html
- Gensim docs: https://radimrehurek.com/gensim/models/word2vec.html
- Github page: https://github.com/JunaidMB/playing_with_fasttext
- A Scandal in Bohemia dramatisation: https://www.youtube.com/watch?v=ZaDfTP7zohQ