The concept of Attention is essential to the working of the Transformer. Through attention, a model focuses on the information that is relevant for making its predictions.
Take the example of a sentence:
The sky is ___ and world is round.
What word might you use to fill in the blank in the above sentence? Given the context, it is quite evident that words like “hesitating”, “destroyed”, and “skipping” do not fit.
Language models need to understand the relationships between words in order to predict words that fit the context. This is where the concept of Attention enters; it is central to the functioning of Transformers, alongside other components such as positional encoding, tokenizers, and the feed-forward layers.
Self Attention
Put simply, self-attention is a sequence-to-sequence operation: a sequence of input vectors is transformed through a set of operations into a sequence of output vectors, where the input and the output have the same dimensions.
The operation utilizes three weight matrices, namely Query, Key, and Value, denoted by \(W_q\), \(W_k\), and \(W_v\).
The query, key, and value vectors are obtained by matrix multiplication between \(W\) and \(x\), where \(W\) is the corresponding weight matrix and \(x\) is the input embedding:
- \(q^{(i)} = W_q \times x^{(i)}\)
- \(k^{(i)} = W_k \times x^{(i)}\)
- \(v^{(i)} = W_v \times x^{(i)}\), where \(i \in [1, T]\), \(T\) is the length of the input sequence and \(i\) is the index of the token (the equivalent matrix form is shown right below)
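Stacking the token embeddings as rows of a matrix \(X\) (the row-vector convention used in the code below), these per-token products can be written compactly in matrix form:
\(Q = XW_q, \quad K = XW_k, \quad V = XW_v\)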
The query and key vectors have the same number of elements, while the number of elements in the value vector can be chosen arbitrarily. The operations that take place, in sequence, are listed below (and summarized in a single formula right after the list):
- Matrix multiplication
- Scaling
- Mask (needed in sequence-to-sequence translation or autoregressive generation)
- Softmax
- Matrix multiplication
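Taken together, these steps compute the familiar scaled dot-product attention, the same formula that reappears per head in the multi-head section:
\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)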
Let us now dive deep into these individual operations.
Generating Query, Key, Value matrices
Let us load the required libraries before proceeding.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
Start with an input sequence that is pre-processed and passed through an embedding layer (feel free to modify the input text and experiment). This results in feature vectors \(x_{1},\ldots,x_{n}\), each with a dimension of \(1 \times d\). Together, these vectors form a matrix \(X\) of size \(n \times d\).
= "The sky is blue and world is round"
text
= ",';#$@&)("
ignore_chars
for char in ignore_chars:
= text.replace(char,'')
text
= {s:i for i,s in enumerate(sorted(text.split()))} stoi
Convert the input text to token ids
# look up the token id of every word in the sentence
itos = torch.tensor([stoi[s] for s in text.split()], dtype=torch.int64)
itos

### output
tensor([0, 6, 5, 3, 1, 7, 5, 3])
Use the Embedding layer to transform the token ids into their respective feature vectors. The vocabulary size and the model dimension can be changed.
vocab_size = 100  # number of unique vocabulary entries
d_model = 32      # dimension of feature vectors

embeddings = torch.nn.Embedding(vocab_size, d_model)
X = embeddings(itos)
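As a quick sanity check, we can print the shape of \(X\): with the 8-token sentence above and d_model = 32, the result should be one 32-dimensional feature vector per token.

print(X.shape)   # expected: torch.Size([8, 32])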
Now that the feature vectors are available, we can use them to generate the Query, Key, and Value matrices.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = X.matmul(W_q)
K = X.matmul(W_k)
V = X.matmul(W_v)
The size of all three matrices will be torch.Size([8, 32]). We will now walk through the steps applied to these matrices.
Matrix Multiplication of Query and Key
This is the first operation performed. Building it up from the ground, it is the dot product of a query vector \(q_{i}\) with a key vector \(k_{j}\). For the dot product, one of them is treated as a row vector and the other as a column vector, which is why the key vector is transposed.
Since this has to be applied across the entire query and key matrices, a matrix multiplication is performed, and the resulting output is called the compatibility matrix.
compat_matrix = Q.matmul(K.T)
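The compatibility matrix holds one score for every pair of tokens, so with the 8-token input it is an 8 × 8 matrix; a quick check using the matrices defined above:

print(compat_matrix.shape)   # expected: torch.Size([8, 8]), one score per (query token, key token) pair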
Scaling of the compatibility matrix
We now scale the output of the previous step by multiplying the matrix by a factor of \(\frac{1}{\sqrt{d_{k}}}\), where \(d_{k}\) is the dimension of the key vectors (here equal to \(d_{model}\)).
A question should now arise: why do we have to scale by \(\frac{1}{\sqrt{d_{k}}}\), and what does it do?
This is done to control the range of the dot products and ensure more stable gradients. The dot product sums \(d_{k}\) terms, so as the dimension grows, the variance of the unscaled scores grows with it. Very large scores push the softmax into regions where its gradients become extremely small, a phenomenon known as vanishing gradients, which can lead to an unstable training process. Dividing by \(\sqrt{d_{k}}\) brings the scores back to a reasonable scale.
scaled_compat_matrix = 1.0 / math.sqrt(d_model) * compat_matrix
# understanding the effect of scaling on the variance
print(f"Variance of compatibility matrix: {torch.var(compat_matrix).item()}")
print(f"Variance of scaled compatibility matrix: {torch.var(scaled_compat_matrix).item()}")

### output
Variance of compatibility matrix: 38521.3046875
Variance of scaled compatibility matrix: 1203.790771484375
We can see a drastic difference: the variance drops by a factor of \(d_{model} = 32\), the square of the scaling factor \(\frac{1}{\sqrt{d_{model}}}\). This is why we scale the compatibility matrix.
Softmax operation on compatibility matrix
The softmax operation is used to turn the scaled scores into attention weights in the range 0 to 1. It is applied row-wise, so the weights in each row form a probability distribution over the tokens of the input sequence.
# softmax operation (row-wise, so each row sums to 1)
attention_matrix = torch.softmax(scaled_compat_matrix, dim=-1)
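We can verify that each row of the attention matrix is a valid probability distribution, using the matrix just computed:

print(attention_matrix.sum(dim=-1))   # each of the 8 rows should sum to (numerically) 1.0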
Matrix multiplication with Value matrix
We now perform a matrix multiplication between the attention matrix obtained in the previous step and the Value matrix to obtain the context matrix. No transpose is needed here, because the inner dimensions already match: the attention matrix is 8 × 8 and \(V\) is 8 × 32.
results = attention_matrix.matmul(V)
print(results.shape)

### output
torch.Size([8, 32])
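Each row of the context matrix is simply an attention-weighted sum of the value vectors. The small sketch below, using the tensors computed above, makes that explicit for the first token:

# reconstruct the context vector of token 0 by hand:
# sum over all tokens j of attention weight [0, j] times value vector j
manual_row0 = sum(attention_matrix[0, j] * V[j] for j in range(V.size(0)))
print(torch.allclose(manual_row0, results[0], atol=1e-5))   # True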
Summing it up, here is the complete code for the self-attention process:
class SelfAttention(torch.nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = torch.nn.Linear(embed_size, embed_size)
        self.key = torch.nn.Linear(embed_size, embed_size)
        self.value = torch.nn.Linear(embed_size, embed_size)
        self.scale = torch.sqrt(torch.FloatTensor([embed_size]))

    def forward(self, x):
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        attention = F.softmax(scores, dim=-1)
        out = torch.matmul(attention, V)
        return out
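A quick usage check of the module, reusing the \(X\) computed earlier (the expected shape follows from the 8-token input and embed_size = 32):

self_attention = SelfAttention(d_model)
out = self_attention(X)
print(out.shape)   # expected: torch.Size([8, 32])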
Multi Head Attention
We have seen how self-attention works; the natural next question is what the point of multi-head attention is.
Take the analogy of a detective trying to solve a case. While the detective may be smart enough to solve it alone, it takes a lot of time to investigate and then arrive at conclusions. Imagine the advantage the detective would have with a team of three or four people: one can investigate in person while another writes up reports, and so on. The team accelerates the process. It is the same with multi-head attention.
When a sequence contains several interrelated relationships, a single self-attention head struggles to capture them all, as it has a limited capacity to attend. Multiple heads allow the model to attend to different aspects of the sequence in parallel.
The same fundamental operations we performed in the self-attention process take place here too, but separately within each head. Once the per-head operations are completed, the outputs are concatenated at the end.
Computation of attention weights
Given an input sequence \(X\):
Project input into queries \(Q_i\), keys \(K_i\), and values \(V_i\) for each head \(i\):
\(Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V\)
Compute scaled dot-product attention for each head:
\(\text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i\)
Concatenate outputs from all heads:
\(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^o\)
where \(W^o\) is the output projection matrix.
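As a concrete example with the values used in the code below, \(d_{model} = 32\) and \(h = 4\) heads give each head queries, keys, and values of dimension \(d_k = \frac{d_{model}}{h} = \frac{32}{4} = 8\); concatenating the four 8-dimensional head outputs restores the original 32-dimensional representation before the final \(W^o\) projection.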
The code to perform this operation:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch_size, seq_length, _ = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size, seq_length, _ = query.size()

        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)

        # Concatenate heads and apply final linear layer
        attention_output = attention_output.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        output = self.W_o(attention_output)

        return output, attention_weights
Let us now initialize the number of heads and the dimension of the feature vectors.
num_heads = 4   # number of heads
d_model = 32    # dimension of feature vectors

mha = MultiHeadAttention(d_model, num_heads)

X_batch = X.unsqueeze(0)   # add a batch dimension: (1, 8, 32)
output, attention_weights = mha(X_batch, X_batch, X_batch)

print("Output shape:", output.shape)
print("Attention weights shape:", attention_weights.shape)
### output
Output shape: torch.Size([1, 8, 32])
Attention weights shape: torch.Size([1, 4, 8, 8])
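The attention-weight tensor contains one 8 × 8 matrix per head: its four dimensions are batch, head, query token, and key token. A single head's pattern can therefore be inspected directly; for example:

# attention pattern of the first head for the first (and only) batch element
head0_weights = attention_weights[0, 0]
print(head0_weights.shape)        # torch.Size([8, 8])
print(head0_weights.sum(dim=-1))  # each row sums to 1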
In sum, we’ve explored the inner workings of both Self-Attention and Multi-Head Attention, which are foundational to the Transformer model. We saw how the self-attention mechanism works, from matrix multiplication to scaling, and why softmax is used, and we also covered why Multi-Head Attention is needed and how it helps capture the complexities present in the input text.
References
- Attention is all you need - The OG paper
- Sebastian Raschka's elegant explanation of Self-Attention
- Modern Approach in NLP Book - Detailed walkthrough of the concepts