In artificial intelligence, 2020 was the year of Transformer hype. A Transformer is a newer type of neural network architecture that has become ubiquitous in modern deep learning models. Why is that? In “Attention Is All You Need”, Google Brain introduced the Transformer as a novel neural network architecture based on a self-attention mechanism that was believed to be particularly well suited for language understanding and NLP.
Attention is a concept that helped improve the performance of neural machine translation applications. The attention model in a Transformer neural network captures the relationships between each word in a sequence and every other word. That is transforming NLP, and it is spreading to other domains in 2021 and beyond.
With a Transformer neural network, and with enough data, matrix multiplications, linear layers, and layer normalization, we can perform state-of-the-art machine translation. Back in 2017, Google found that on top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.
Transformers are a type of neural network architecture known for reliably producing strong results on their assigned tasks, which is why they gained popularity so rapidly. Transformers are used by big names like OpenAI, and by DeepMind for AlphaStar. The Transformer model uses attention to boost the speed of training while maintaining accuracy. In the evolution of deep learning, Transformers outperform the Google Neural Machine Translation model on specific tasks.
The paper ‘Attention Is All You Need’ describes the Transformer and what is called a sequence-to-sequence architecture. That paper is now four years old already. The Transformer performs only a small, constant number of steps (chosen empirically).
In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective positions. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important.
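The core of that mechanism can be sketched in a few lines of NumPy. This is a minimal, single-head sketch that omits the learned query/key/value projections and multi-head machinery of the real architecture, so treat it as an illustration of the attention pattern rather than the actual implementation:

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention (learned projections omitted):
    every position attends to every other position, regardless of distance."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise similarity between all words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X  # each output is a weighted mix of all inputs

# 4 "words", each a 3-dimensional embedding (random, for illustration)
X = np.random.randn(4, 3)
out = self_attention(X)
print(out.shape)  # (4, 3): same sequence length, each position now context-aware
```

Each row of the output mixes information from every position in the sequence, which is what lets the model relate a word to any other word in a single step instead of passing information along one position at a time.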
Transformers could also shape the future of computer vision. The success of Transformers comes from their extreme effectiveness and their ability to solve non-trivial problems in a superior way compared to previous architectures, such as RNNs, in natural language processing.
The Transformer is an architecture for transforming one sequence into another with the help of two parts (an encoder and a decoder), but it differs from previously existing sequence-to-sequence models because it does not rely on any recurrent networks (GRU, LSTM, etc.).
In a Transformer neural network, in other words, order is not baked into the processing; it is more like working with tokenized sets. Words are not discrete symbols. They are strongly correlated with each other. That’s why, when we project them into a continuous Euclidean space, we can find associations between them.
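That notion of association in a continuous space can be illustrated with cosine similarity. The vectors below are made-up toy embeddings, not trained ones, chosen only to show how geometric closeness encodes relatedness:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based closeness between two word vectors in embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-d embeddings: "king" and "queen" made deliberately close,
# "cactus" pointing in a different direction.
king = np.array([0.9, 0.8, 0.1, 0.0])
queen = np.array([0.8, 0.9, 0.2, 0.0])
cactus = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(king, queen))   # near 1: strongly associated
print(cosine_similarity(king, cactus))  # near 0: largely unrelated
```

In a real model these vectors are learned from data, and related words end up pointing in similar directions without anyone placing them there by hand.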
So in a Transformer neural network, how we see the parts in the whole is less like a step-by-step binary sequence and feels a bit more like quantum computing. Processing becomes a bit more holistic, a bit more distributed and simultaneous.
BERT (Bidirectional Encoder Representations from Transformers) was introduced in a 2018 paper by researchers at Google AI Language. BERT’s key technical innovation was applying bidirectional training of the Transformer, a popular attention model, to language modelling.
The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. So Transformers help NLP systems understand language context better, which should gradually improve through the 2020s and could scale into computer vision problems.
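Bidirectional training works by hiding some words and asking the model to predict them from context on both sides. Here is a rough sketch of that masking step; the 15% rate matches the BERT paper, but the rest is simplified (real BERT also sometimes substitutes random tokens or leaves the chosen token unchanged):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masked language modelling data preparation:
    hide a fraction of tokens so the model must use context from BOTH sides."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model is trained to recover this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```

Because the blank can be anywhere in the sentence, the model has to read both left and right context to fill it in, which is exactly what a left-to-right language model cannot do.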
Researchers applied this new architecture to several natural language processing problems, and it was immediately evident how much it could overcome the limitations that plague RNNs, traditionally used for tasks such as translating from one language to another.
Google Brain brought us one step forward with Transformers. In 2020, Transformers moved beyond natural language and into computer vision tasks.
Vision Transformers were born and, with some preliminary modifications to the images, they manage to exploit the classic Transformer architecture, quickly reaching the state of the art on many problems in this field as well.
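Those “preliminary modifications” boil down to cutting each image into fixed-size patches and flattening each patch into a token, so the image becomes a sequence a standard Transformer can read. A toy NumPy sketch follows; the shapes and patch size are illustrative, and a real Vision Transformer also adds a learned linear projection and position embeddings on top of this:

```python
import numpy as np

def image_to_patches(img, patch=4):
    """Cut an HxWxC image into non-overlapping patch-by-patch squares and
    flatten each one, so the image becomes a "sentence" of patch tokens."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    patches = (img[:rows * patch, :cols * patch]
               .reshape(rows, patch, cols, patch, C)
               .transpose(0, 2, 1, 3, 4)       # group pixels by patch
               .reshape(rows * cols, patch * patch * C))
    return patches

img = np.random.rand(32, 32, 3)       # a toy 32x32 RGB image
tokens = image_to_patches(img, patch=4)
print(tokens.shape)                   # (64, 48): 64 patch tokens of dim 48
```

Once the image is a sequence of 64 tokens, the exact same attention machinery used for words applies unchanged.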
Meanwhile, Google Brain researchers have scaled up their newly proposed Switch Transformer language model to a whopping 1.6 trillion parameters while keeping computational costs under control. The team simplified the Mixture of Experts (MoE) routing algorithm to efficiently combine data, model and expert-parallelism and enable this “outrageous number of parameters”.
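The simplified routing idea can be sketched as a toy top-1 router: each token is dispatched to exactly one expert, so adding experts grows the parameter count without growing per-token compute. Everything below (the router matrix, the matrix-multiply “experts”) is made up for illustration and is not the Switch Transformer implementation:

```python
import numpy as np

def switch_route(tokens, experts, router_W):
    """Toy Switch-style routing: each token goes to exactly ONE expert
    (top-1), so compute stays constant however many experts exist."""
    logits = tokens @ router_W          # (n_tokens, n_experts) routing scores
    choice = logits.argmax(axis=-1)     # top-1 expert per token
    out = np.empty_like(tokens)
    for e, expert in enumerate(experts):
        sel = choice == e
        if sel.any():
            out[sel] = expert(tokens[sel])  # only the chosen expert runs
    return out, choice

rng = np.random.default_rng(0)
d, n_experts = 8, 4
tokens = rng.standard_normal((10, d))
# Each "expert" is just a fixed random linear map, standing in for an FFN
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W
           for _ in range(n_experts)]
router_W = rng.standard_normal((d, n_experts))

out, choice = switch_route(tokens, experts, router_W)
print(out.shape, choice)
```

The point of the sketch is the sparsity: ten tokens flow through the layer, but each one touches only one of the four expert weight matrices.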
In NLP, the goal of neural language models is to create embeddings that encode as much of the semantics of a word in a text as possible. For readers who want a more in-depth understanding of self-attention and of the Transformer model, I would recommend taking a look at the great post by Jay Alammar. The beauty of Transformer neural networks is in how they add value to neural networks through their heavy use of parallelization.
In parallel to how Transformers leveraged self-attention to model long-range dependencies in text, novel works have presented techniques that use self-attention to efficiently overcome the limitations imposed by convolutional inductive biases. Self-attention layers in computer vision take a feature map as input.
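Concretely, a C-by-H-by-W convolutional feature map is flattened into a sequence of H·W spatial tokens, which standard self-attention can then process so that every spatial location can attend to every other. A minimal sketch, with shapes chosen arbitrarily:

```python
import numpy as np

def feature_map_to_sequence(fmap):
    """Flatten a CxHxW convolutional feature map into an (H*W)xC sequence,
    one "token" per spatial position, ready for self-attention."""
    C, H, W = fmap.shape
    return fmap.reshape(C, H * W).T

fmap = np.random.rand(64, 14, 14)   # e.g. a late-stage CNN feature map
seq = feature_map_to_sequence(fmap)
print(seq.shape)                    # (196, 64): 196 spatial tokens of dim 64
```

After this reshaping, the attention layer treats distant corners of the image exactly like adjacent pixels, which is precisely the long-range capability convolutions lack.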
Advances in Transformer neural networks for computer vision may help autonomous vehicles (among other things) make the breakthroughs they need in the 2023-2032 period to truly bring us smart cars at scale.
If you would like to be a contributor on the Last Futurist, contact me here. Maybe you are passionate about artificial intelligence like we are.