“Hang on to your dreams, Chip. The future is built on dreams. Hang on.”
Optimus Prime of The Transformers
Transformers have been an exciting part of the progress in artificial intelligence over the past few years, reminiscent of the emergence of deep learning a decade ago. Transformers are architectures that use attention to improve performance, and they are now being used in computer vision as well as natural language processing (NLP), where they serve for tasks such as text classification and as language models for machine translation, question answering, speech recognition, and text summarization. Examples include Google’s BERT (Bidirectional Encoder Representations from Transformers) and OpenAI’s GPT-3 (Generative Pre-Trained Transformer 3), both of which have demonstrated strong performance on these tasks.
Attention is the element that has enabled transformers to achieve these high levels of performance: it allows the model to pay “attention” to the other words in the input sequence that each word is most closely related to. For example, in the phrase “the doctor is performing a cardiac catheterization,” the word “catheterization” is closely related to “cardiac” and also to “performing,” but perhaps not as much to “doctor.” The attention mechanism thus captures the relationship of every word to every other word in the input sequence.
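To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a transformer. The function name and the toy four-word input are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query (word) attends to every key (word) and returns a weighted mix of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relatedness of every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                          # attention-weighted combination of value vectors

# Toy example: 4 "words", each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q, K, and V all come from the same input
print(out.shape)                              # (4, 8): one updated vector per word
```

Each row of the weight matrix sums to one, so every output vector is a blend of all the input word vectors, with more weight given to the words it is most closely related to.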
Transformers have proven superior to recurrent neural networks (RNNs) as well as their memory-equipped “relatives,” LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units). RNNs, LSTMs, and GRUs struggled with words that were relatively far apart in longer sentences, and therefore could not capture long-range dependencies between related words. In addition, RNNs had to process the input sequence one word at a time, which slowed training. Transformers process the words in parallel (rather than in sequence, as RNNs, LSTMs, and GRUs do), which makes the distance between words much less relevant.
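The following rough NumPy sketch, using made-up random weights purely for illustration, contrasts the two styles of processing: the recurrent step must wait for the previous step, while the attention scores for all word pairs are computed in a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 8))                       # 6 words, each an 8-dimensional embedding
W_h, W_x = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

# RNN-style processing: one word at a time; step t cannot start until step t-1 has finished.
h = np.zeros(8)
for x in seq:
    h = np.tanh(h @ W_h + x @ W_x)

# Transformer-style processing: all pairwise word interactions computed at once.
scores = seq @ seq.T / np.sqrt(seq.shape[-1])       # a 6x6 matrix of word-to-word relatedness
print(scores.shape)                                 # (6, 6)
```

Because the pairwise scores are produced in one shot, the computation parallelizes well on modern hardware, and a word at position 1 relates to a word at position 50 just as directly as to its immediate neighbor.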