THE BLOG: From waves to words
“OK Google”, “Hey Siri”, “Hey Cortana”. These are all commands we use to “wake up” our personal assistants. But how do these assistants know the words we are saying? How do they accommodate different accents? How is our speech transcribed into text? Everything starts with sound waves.
Sound waves
Every sound that is emitted can be translated into a sound wave. A sound wave is the pattern of disturbance caused by the movement of energy traveling through a medium (such as air, water, or any other liquid or solid matter) as it propagates away from the source of the sound. When we speak, we generate multiple sound waves. Waves are an ideal way to represent speech because they translate it into numerical values that are easier for a computer to use.
Since sound waves cause a change in air pressure, recording speech simply requires recording that change in air pressure as a function of time. This is exactly what microphones do.
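To make this concrete, here is a minimal sketch of what "air pressure as a function of time" looks like to a computer: a pure 440 Hz tone sampled at regular time steps. The frequency, sample rate, and duration are illustrative choices, not values from the post.

```python
import math

def record_pressure(freq_hz=440.0, sample_rate=16000, duration_s=0.01):
    """Simulate a microphone: sample the air-pressure deviation of a
    pure tone (a sine wave) at regular time steps."""
    n_samples = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]

samples = record_pressure()
print(len(samples))  # 160 samples for 10 ms of audio at 16 kHz
```

Real speech is a messy mixture of many such waves, but the principle is the same: the recording is just a long list of pressure values over time.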
Microphones are devices that convert the energy of sound waves into electrical energy. When a sound wave hits the diaphragm, the diaphragm vibrates, reproducing the frequency of the wave. A metal coil attached to the diaphragm moves along with it, and because the coil moves within the field of a magnet, its movement induces an electric current in the coil. This electric current is what gets recorded. To store this signal, it needs to be converted into numerical values by a Pulse Code Modulation encoder (PCM encoder). The encoder measures the voltage of the electrical current at regular intervals and turns those measurements into a sequence of numbers, effectively a graph of voltage over time, which is written and saved to a local file.
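The quantization step of PCM can be sketched in a few lines: each voltage measurement, normalized to the range -1.0 to 1.0, is mapped to a signed integer. The 16-bit depth below is an assumption for illustration (it happens to be the standard for CD-quality audio), not something the post specifies.

```python
def pcm_encode(samples, bits=16):
    """Quantize pressure/voltage samples in [-1.0, 1.0] to signed
    integers, the way a PCM encoder digitizes the microphone signal."""
    max_int = 2 ** (bits - 1) - 1  # 32767 for 16-bit audio
    return [round(s * max_int) for s in samples]

print(pcm_encode([0.0, 0.5, -1.0, 1.0]))  # [0, 16384, -32767, 32767]
```

These integers are what actually gets written into an audio file such as a WAV.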
From graph to word recognition
The graphs that we obtain are then fed into a neural network (if you are using OK Google or Amazon Alexa, this neural network runs on the company’s own servers. This is why you need an internet connection to reach the virtual assistant: your file needs to be sent to the remote neural network). This neural network is trained with thousands of hours of speech from different languages. Depending on the language settings on your device, the neural network makes a prediction of what you are saying and transcribes your speech into text.
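The prediction step can be illustrated with a drastically simplified toy: instead of a trained neural network, a lookup that picks whichever word's stored feature pattern is closest to the input. The feature vectors and the two wake words below are invented for the example; a real assistant computes far richer features and runs a deep network, not a nearest-match lookup.

```python
def recognize(features, templates):
    """Toy stand-in for the server-side model: return the word whose
    stored feature template is closest (Euclidean distance) to the
    input features."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(templates, key=lambda word: dist(features, templates[word]))

# Hypothetical feature templates for two wake words
templates = {"hey": [0.2, 0.9, 0.1], "ok": [0.8, 0.1, 0.5]}
print(recognize([0.25, 0.85, 0.2], templates))  # hey
```

The real system differs in scale, not in spirit: it still maps a numerical representation of your audio to the most likely words.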
It is one thing to have the words, but it is another to be able to understand them. The next blog post will dive into the techniques used in Natural Language Processing to transform words into numerical values that a machine can understand.
Jean Ghislain BILLA is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.