THE BLOG Part 2: Understanding words

Jean Billa
3 min read · Dec 5, 2020


Photo by Brett Jordan on Unsplash

Natural Language Processing uses different techniques to convert words into numerical inputs that machines can understand. Over the years, multiple models have been created to accomplish this task.

Preprocessing

When we want to feed text to a natural language model, we first need to “clean” it. This is done through a succession of processing steps.

1-Tokenization

Tokenization is the process of splitting a text into individual words, or tokens.

Ex: “I will go to the beach today” will be tokenized into {‘I’, ‘will’, ‘go’, ‘to’, ‘the’, ‘beach’, ‘today’}.
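As a quick illustration, here is a minimal tokenization sketch in Python using NLTK (assuming the library and its ‘punkt’ tokenizer data are installed; a plain `split()` would also work for simple sentences):

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, fetched once

sentence = "I will go to the beach today"
tokens = nltk.word_tokenize(sentence)
print(tokens)  # ['I', 'will', 'go', 'to', 'the', 'beach', 'today']
```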

2- Stop-word removal

Stop-word removal is the process of removing words that do not really contribute to the meaning of the sentence as a whole. These are common words such as ‘the’ or ‘a’.

Ex: Our previous sentence will look like this after stop-word removal: {‘I’, ‘will’, ‘go’, ‘beach’, ‘today’}. We can see that even after removing the stop-words, we still have a reasonable understanding of the sentence.
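Here is one way to do this in Python with NLTK’s built-in English stop-word list. Note that exact lists vary by library; NLTK’s list, for instance, also happens to include ‘i’ and ‘will’:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

tokens = ['I', 'will', 'go', 'to', 'the', 'beach', 'today']
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['go', 'beach', 'today'] with NLTK's list
```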

3-Stemming/Lemmatization

Stemming and lemmatization both reduce words to their roots: stemming chops off word endings with simple heuristic rules, while lemmatization maps each word to its dictionary form.

Ex: Words like ‘building’ and ‘fishing’ will be stemmed into ‘build’ and ‘fish’.

This process can be problematic because words such as ‘building’ would lose their nature: the noun becomes the verb ‘build’. This can make it harder for the machine to understand the sentence.
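A minimal sketch of both techniques using NLTK’s Porter stemmer and WordNet lemmatizer (assuming the ‘wordnet’ data is installed):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["building", "fishing"]:
    # The lemmatizer needs a part-of-speech hint; "v" treats the word as a verb.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```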

From words to numbers

Different models are used to transform text input into numerical values.

1- One-hot word encoding

One-hot encoding represents each word as a vector that has a 1 in the column that matches the word and 0s everywhere else.

[Image: example of one-hot word encoding]

This sequence of ones and zeros is a vector that represents the word. This technique is not perfect, though, as it requires a column for every word in the dictionary. That means every English word would be represented by a vector of roughly 171,145 entries, all zeros except for a single 1. This is not efficient.
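A toy version in Python makes the idea (and the inefficiency) concrete; the four-word vocabulary here is just an illustration:

```python
import numpy as np

vocab = ["beach", "go", "today", "will"]  # a real vocabulary has 100,000+ entries
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 in the word's column and 0s everywhere else."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("beach"))  # [1. 0. 0. 0.]
```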

2- Recurrent networks

Recurrent networks are neural networks that take the words of a sentence as input one at a time. The network takes the first word as input and saves a summary of it. It then takes the second word together with that saved summary, updates it, and repeats this for every word in the sentence. This technique has the advantage of capturing the context of a sentence: a word’s meaning is influenced by the words around it, so knowing which word comes before another gives a better idea of the context and improves understanding.
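To make the “take a word, save it, take the next word” loop concrete, here is a bare-bones sketch in NumPy with random, untrained weights (a real recurrent network would learn these weights from data):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(hidden_dim, embedding_dim))  # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))     # hidden-to-hidden weights

word_vectors = rng.normal(size=(5, embedding_dim))  # stand-ins for 5 word vectors
h = np.zeros(hidden_dim)  # the saved state, empty at the start

for x in word_vectors:
    # Each step combines the current word with everything saved so far.
    h = np.tanh(W_xh @ x + W_hh @ h)

print(h)  # final state: a summary of the whole sentence
```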

3- Word2Vec

Word2Vec is a model based on the similarity between words. Every word is converted into a vector, and words with closer meanings, such as ‘king’ and ‘queen’, end up with similar vectors. You can see a graphical representation of some words’ vectors here.
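Libraries such as Gensim make it easy to train a small Word2Vec model. The corpus below is far too tiny to produce meaningful vectors, but it shows the idea:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["i", "will", "go", "to", "the", "beach", "today"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)
print(model.wv["king"][:5])                  # first 5 entries of the word's vector
print(model.wv.similarity("king", "queen"))  # cosine similarity between the two
```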

4- BERT

BERT (Bidirectional Encoder Representations from Transformers) is a model developed by Google. The distinguishing characteristic of BERT is that it takes into account the words on both sides of a given word to understand its context better.
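With the Hugging Face transformers library, a pretrained BERT model can be loaded in a few lines. Each token comes out with its own contextual vector, shaped by the words on both sides of it:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I will go to the beach today", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token (plus the special [CLS]/[SEP] tokens).
print(outputs.last_hidden_state.shape)  # torch.Size([1, 9, 768])
```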

All these models can be used to transform our text into numbers that machines can understand. After this transformation, the machine can make more sense of the words, and we can use the resulting representations to train models for specific tasks.

Jean Ghislain BILLA is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.
