Breaking Down the Transformer: A Revolutionary Approach to NLP

The Transformer neural network architecture has reshaped the field of Natural Language Processing (NLP) since its introduction in 2017. It has been applied to a range of sequence-to-sequence problems, such as machine translation, summarization, and question answering. The Transformer uses an encoder-decoder design but, unlike Recurrent Neural Networks (RNNs), it processes the entire input sequence in parallel, which typically makes it much faster to train. Its key concept is self-attention, which allows the network to learn different relationships between words in a sentence. The code snippets in this article can be used as a starting point for building Transformer models for various NLP tasks.


Sequence Modelling

Before diving into the details of the Transformer architecture, it’s essential to understand the basics of sequence modelling, where the input has a defined ordering. Recurrent Neural Networks (RNNs) have long been the go-to model for this kind of data. However, RNNs process one element at a time, which makes them slow, and they struggle to retain information across long sequences. This is where the Transformer architecture comes in.

The Transformer Architecture

The Transformer employs an encoder-decoder architecture, much like RNN-based sequence-to-sequence models, with a significant difference in how the input sequence is processed. The Transformer network can process the entire input sequence in parallel, which makes it much faster than RNNs.

Let’s consider the example of translating a sentence from English to Spanish. With an RNN encoder, we pass an input English sentence one word after the other. However, with a Transformer encoder, there is no concept of time steps. Instead, we pass in all the words of the sentence simultaneously and determine the word embeddings simultaneously.
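The difference between step-by-step and all-at-once processing can be sketched in a few lines. This is only an illustration under assumed sizes: the random tensor stands in for word embeddings, and a single linear layer stands in for a Transformer layer, since the point here is the control flow rather than the model itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, d_model = 5, 8  # illustrative sizes, not taken from the article
tokens = torch.randn(1, seq_len, d_model)  # (batch, seq_len, d_model)

# RNN-style: one time step per iteration; each step needs the previous hidden state
rnn_cell = nn.RNNCell(d_model, d_model)
hidden = torch.zeros(1, d_model)
rnn_steps = []
for t in range(seq_len):
    hidden = rnn_cell(tokens[:, t, :], hidden)
    rnn_steps.append(hidden)
rnn_outputs = torch.stack(rnn_steps, dim=1)  # (1, seq_len, d_model)

# Transformer-style: a single call covers every position, with no loop over time
projection = nn.Linear(d_model, d_model)
parallel_outputs = projection(tokens)  # (1, seq_len, d_model)
```

Both paths produce one vector per word, but only the second can be computed for all positions in one batched operation.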

Input Embeddings

The first step in the Transformer architecture is to convert the input sequence into vector form. We achieve this by mapping every word to a point in a space where words with similar meanings lie closer to each other. This space is called the embedding space.

The input embeddings are then processed through a positional encoding layer, which adds positional information to the word embeddings.

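import torch
import torch.nn as nn

# Create an embedding layer (vocabulary of 100 tokens, 256-dimensional vectors)
embedding_layer = nn.Embedding(100, 256)

# Convert a sentence to a tensor of token indices, shape (1, 5)
# (the indices are hard-coded here in place of a real tokenizer)
input_sentence = "This is a test sentence"
input_tensor = torch.tensor([0, 1, 2, 3, 4]).unsqueeze(0)

# Get the input embeddings, shape (1, 5, 256)
input_embeddings = embedding_layer(input_tensor)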
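The positional encoding layer mentioned earlier is not shown in this article. A minimal sketch follows, assuming the fixed sinusoidal scheme from the original Transformer paper. The `max_len` parameter and the requirement that `d_model` be even are assumptions of this sketch; the `(d_model, dropout)` constructor arguments are chosen to match how `PositionalEncoding` is called later in the article.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to word embeddings (sketch)."""

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # One row per position, one column per embedding dimension
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        # A buffer moves with the module across devices but is not trained
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)


# Usage: with all-zero embeddings the output is the encoding itself
pos_encoding = PositionalEncoding(d_model=256, dropout=0.0)
encoded = pos_encoding(torch.zeros(1, 5, 256))
```

Because the encoding is added rather than concatenated, the output keeps the same shape as the input embeddings.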


Self-Attention

Self-attention is the heart of the Transformer architecture. It involves answering the question, “what part of the input should I focus on?” If we are translating from English to Spanish and doing self-attention, the question we want to answer is, “how relevant is the i-th word in the English sentence to the other words in the same English sentence?”

The self-attention block computes attention vectors for every word in the sentence. The attention vectors capture contextual relationships between words in the sentence.
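Stripped of learned projections and multiple heads, the core computation is scaled dot-product attention. The sketch below is a simplified illustration: it uses the input directly as queries, keys, and values, which is an assumption made here to keep the example short, and the function name is illustrative.

```python
import torch


def scaled_dot_product_attention(query, key, value):
    # Similarity of every query with every key, scaled by sqrt(d_k)
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5

    # Each row becomes a probability distribution over the input positions
    weights = torch.softmax(scores, dim=-1)

    # Each output vector is a weighted average of the value vectors
    return torch.matmul(weights, value), weights


torch.manual_seed(0)
x = torch.randn(1, 4, 8)  # (batch, seq_len, d_model): 4 words, 8 dimensions
context, weights = scaled_dot_product_attention(x, x, x)
```

Row i of `weights` holds the relevance of every word to word i, and each row sums to 1.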

class SelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(SelfAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        # d_model must split evenly across the heads
        assert d_model % self.num_heads == 0
        self.head_dim = d_model // self.num_heads

        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

        self.fc = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.shape[0]

        # Project the input to queries, keys, and values
        query = self.query(x)
        key = self.key(x)
        value = self.value(x)

        # Split d_model into num_heads and head_dim:
        # reshape to (batch_size, num_heads, seq_len, head_dim)
        query = query.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        key = key.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        value = value.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        # Compute the scaled dot product of query and key
        scores = torch.matmul(query, key.transpose(-1, -2))
        scores /= self.head_dim ** 0.5

        # Apply softmax to obtain attention weights
        attention_weights = torch.softmax(scores, dim=-1)

        # Compute the weighted sum of the values
        weighted_values = torch.matmul(attention_weights, value)

        # Merge the heads: reshape to (batch_size, seq_len, d_model)
        weighted_values = weighted_values.permute(0, 2, 1, 3).contiguous()
        weighted_values = weighted_values.view(batch_size, -1, self.d_model)

        # Apply the output linear projection
        output = self.fc(weighted_values)

        return output, attention_weights

Multi-Head Attention

In the Transformer architecture, we use multiple self-attention heads, which allows the network to learn different relationships between words. The outputs of the individual heads are concatenated and passed through a linear projection.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        # SelfAttention splits d_model into num_heads heads internally
        self.self_attention = SelfAttention(d_model, num_heads)

        self.fc = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Compute self-attention for all heads in parallel.
        # output: (batch_size, seq_len, d_model), heads concatenated per position
        # attention_weights: (batch_size, num_heads, seq_len, seq_len)
        output, attention_weights = self.self_attention(x)

        # The heads are already concatenated along the last dimension, so the
        # sequence axis must stay in place: permuting and flattening it here
        # would mix values from different positions.

        # Apply a linear projection to the concatenated heads
        output = self.fc(output)

        return output, attention_weights

Feed-Forward Networks

The output of the multi-head attention block is passed through a feed-forward network: two linear layers with a non-linear activation function (ReLU) between them, applied to each position independently.

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()

        # Expand to d_ff, then project back to d_model
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.linear_2 = nn.Linear(d_ff, d_model)

        self.relu = nn.ReLU()

    def forward(self, x):
        output = self.linear_1(x)
        output = self.relu(output)
        output = self.linear_2(output)

        return output

Encoder and Decoder

The Transformer architecture consists of an encoder and a decoder. The encoder processes the input sequence, while the decoder generates the output sequence.

# TransformerEncoderLayer and TransformerDecoderLayer are assumed to be
# defined separately; their call signatures below follow from that assumption.

class TransformerEncoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super(TransformerEncoder, self).__init__()

        self.num_layers = num_layers
        self.layers = nn.ModuleList([TransformerEncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x, mask=None):
        # Pass the input through each encoder layer in turn
        for i in range(self.num_layers):
            x = self.layers[i](x, mask)

        return x

class TransformerDecoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super(TransformerDecoder, self).__init__()

        self.num_layers = num_layers
        self.layers = nn.ModuleList([TransformerDecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        # Each decoder layer sees the target so far and the encoder output
        for i in range(self.num_layers):
            x = self.layers[i](x, encoder_output, encoder_mask, decoder_mask)

        return x
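The encoder above references a `TransformerEncoderLayer` that this article does not define. One possible shape for it is sketched below, using PyTorch's built-in `nn.MultiheadAttention` so that the example stands on its own. The residual connections, layer normalization, dropout, and the mask convention (a boolean tensor where True marks a real token) are assumptions of this sketch, following the layout of the original Transformer paper rather than anything stated here.

```python
import torch
import torch.nn as nn


class TransformerEncoderLayer(nn.Module):
    """Hypothetical encoder layer: self-attention, then a feed-forward network."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # mask: optional bool tensor (batch_size, seq_len), True = real token.
        # nn.MultiheadAttention expects the opposite (True = ignore), so invert it.
        key_padding_mask = None if mask is None else ~mask

        # Self-attention sub-layer with residual connection and normalization
        attended, _ = self.attention(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm_1(x + self.dropout(attended))

        # Feed-forward sub-layer with residual connection and normalization
        x = self.norm_2(x + self.dropout(self.feed_forward(x)))
        return x


torch.manual_seed(0)
layer = TransformerEncoderLayer(d_model=16, num_heads=4, d_ff=32)
layer.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(2, 5, 16)
mask = torch.tensor([[True, True, True, False, False],
                     [True, True, True, True, True]])
out = layer(x, mask)
```

The output has the same shape as the input, which is what lets the encoder stack these layers in a loop.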


The Transformer architecture is implemented by combining the encoder and decoder.

class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, source_vocab_size, target_vocab_size, dropout=0.1):
        super(Transformer, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        self.num_layers = num_layers
        self.source_vocab_size = source_vocab_size
        self.target_vocab_size = target_vocab_size

        # Separate embedding tables for the source and target vocabularies
        self.embedding_source = nn.Embedding(source_vocab_size, d_model)
        self.embedding_target = nn.Embedding(target_vocab_size, d_model)
        # PositionalEncoding is assumed to be defined separately
        self.pos_encoding = PositionalEncoding(d_model, dropout)

        self.encoder = TransformerEncoder(d_model, num_heads, d_ff, num_layers)
        self.decoder = TransformerDecoder(d_model, num_heads, d_ff, num_layers)

        # Final projection from d_model to target vocabulary scores
        self.fc = nn.Linear(d_model, target_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, source, target, source_mask, target_mask):
        source_embedding = self.embedding_source(source)
        target_embedding = self.embedding_target(target)

        # Scale the embeddings by sqrt(d_model), then add positional information
        source_embedding *= self.d_model ** 0.5
        source_embedding = self.pos_encoding(source_embedding)
        target_embedding *= self.d_model ** 0.5
        target_embedding = self.pos_encoding(target_embedding)

        # Encode the source, then decode the target against the encoder output
        encoder_output = self.encoder(source_embedding, source_mask)
        decoder_output = self.decoder(target_embedding, encoder_output, source_mask, target_mask)

        output = self.fc(decoder_output)
        return output
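The forward pass above takes a source mask and a target mask, but the article does not show how they are built. A minimal sketch of two common masks follows. The helper names, the padding index `pad_id=0`, and the convention that True means "may attend" are all assumptions here, since how the decoder layers consume the masks is not defined in this article.

```python
import torch


def make_causal_mask(seq_len):
    # Lower-triangular matrix: position i may attend to positions 0..i only,
    # so the decoder cannot look at words it has not generated yet
    return torch.tril(torch.ones(seq_len, seq_len)).bool()


def make_padding_mask(token_ids, pad_id=0):
    # True for real tokens, False for padding; shape (batch_size, seq_len)
    return token_ids != pad_id


causal_mask = make_causal_mask(4)
padding_mask = make_padding_mask(torch.tensor([[5, 7, 0]]))

# Applying a mask: blocked positions are set to -inf before the softmax,
# so they receive exactly zero attention weight
scores = torch.zeros(4, 4)
weights = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)
```

With uniform scores, each row of `weights` spreads its attention evenly over the allowed positions and assigns nothing to later ones.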

Overall, the Transformer architecture has significantly advanced the natural language processing field and opened up new research and development opportunities. With its ability to process long sequences in parallel and its powerful self-attention mechanism, the Transformer has become a popular choice for various NLP tasks.

However, many areas of research can still be explored to improve the Transformer and its variants. For example, there is ongoing work on developing more efficient and effective ways to pretrain and fine-tune Transformer models and better techniques for transfer learning across languages and domains.

In addition, researchers are exploring new architectures that build on the Transformer, such as the GPT (Generative Pretrained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) models, which have achieved state-of-the-art performance on various NLP benchmarks.

The Transformer architecture and its variants have enabled new applications and capabilities that were not possible before. As the field continues to evolve, it will be exciting to see what new breakthroughs and innovations emerge.

