Normalization:
Standardizes the text, for example by removing redundant whitespace and stripping accents. This step is optional.
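A minimal sketch of what this step might look like in Python; the exact rules (lowercasing, accent stripping, etc.) vary by model and tokenizer:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative normalization: strip accents, collapse whitespace, lowercase."""
    # Decompose accented characters and drop the combining marks (e.g. "café" -> "cafe")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

print(normalize("  Héllo   Wörld\n"))  # -> "hello world"
```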
Tokenization:
Breaks the sentence into words or sub-words and maps them to integer token IDs from a vocabulary.
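An illustrative greedy sub-word lookup against a hypothetical toy vocabulary; real tokenizers (BPE, WordPiece, SentencePiece) learn vocabularies of tens of thousands of pieces from data:

```python
# Toy vocabulary, made up for this example.
vocab = {"<unk>": 0, "token": 1, "##ization": 2, "break": 3, "##s": 4, "text": 5}

def tokenize(word: str) -> list[int]:
    """Greedy longest-match sub-word tokenization of a single word (WordPiece-style)."""
    ids, start, prefix = [], 0, ""
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = prefix + word[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            return [vocab["<unk>"]]  # fall back to <unk> for the whole word
        prefix = "##"  # mark continuation pieces
    return ids

print(tokenize("tokenization"))  # -> [1, 2]
print(tokenize("breaks"))        # -> [3, 4]
```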
Embedding:
Converts each token ID to its corresponding high-dimensional embedding vector.
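In code, this step is just a row lookup into a learned matrix; the sizes and random weights below are toy values for illustration only:

```python
import numpy as np

vocab_size, d_model = 6, 8                 # toy sizes; real models use much larger ones
rng = np.random.default_rng(0)

# The embedding table is a learned matrix: one d_model-dimensional row per token ID.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([1, 2])               # e.g. the IDs produced by the tokenizer
embeddings = embedding_table[token_ids]    # simple row lookup
print(embeddings.shape)                    # -> (2, 8)
```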
Positional encoding:
Adds information about the position of each token in the sequence so the transformer can understand word order.
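A sketch of the sinusoidal positional encoding from the original Transformer paper; note that many newer models use learned or rotary position embeddings instead:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# These values are added element-wise to the token embeddings.
print(positional_encoding(seq_len=2, d_model=8).shape)  # -> (2, 8)
```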
Self-attention:
Determines the relationships between different words. How does it do that?
- Each input embedding is multiplied by three learned weight matrices to generate query (Q), key (K), and value (V) vectors.
- Query: Helps the model ask “Which other words in the sequence are relevant to me?”
- Key: It’s like a label that helps the model identify how a word might be relevant to other words in the sequence.
- Value: Holds the actual word content information
- Score calculation: Scores determine how much each word should attend to every other word. They are computed by taking the dot product of one word’s Q vector with the K vectors of all words in the sequence.
These scores are then normalized (typically scaled and passed through a softmax) to give attention weights.
Each value vector V is multiplied by its attention weight and the results are summed, producing a new, context-aware vector. The same is done for every word, as in the sketch below.
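A minimal single-head sketch of the steps above, assuming toy dimensions and random matrices in place of learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # dot product of each query with every key
    weights = softmax(scores, axis=-1)         # normalize scores into attention weights
    return weights @ V                         # weighted sum of values: context-aware vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # -> (4, 8), one context vector per word
```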
Multi-head attention:
Multiple sets of Q, K, V weight matrices run in parallel, each focusing on different aspects of the input relationships. The outputs of all heads are then concatenated and linearly transformed.
- Use of multi-head attention improves the model’s ability to handle complex language patterns and long-range dependencies.
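A sketch of multi-head attention under the same toy assumptions, showing the parallel heads, the concatenation, and the final linear projection:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    """Run each head in parallel, concatenate their outputs, then linearly project."""
    outputs = [attention_head(X, *head) for head in heads]
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, n_heads * d_head)
    return concat @ W_o                         # learned output projection

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)  # -> (4, 8)
```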
Mixture of Experts (MoE):
Instead of simply aggregating the predictions of all experts, the model learns to route different parts of the input to different experts, which allows each expert to specialize.
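A toy sketch of top-k routing; the ToyMoE class, shapes, and random weights are made up for illustration and stand in for a learned router and learned expert networks:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyMoE:
    """Toy mixture-of-experts layer with top-k routing (hypothetical, for illustration)."""
    def __init__(self, d_model=8, n_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
        self.gate = rng.normal(size=(d_model, n_experts))   # learned router in a real model
        self.top_k = top_k

    def __call__(self, x):
        # The router scores every expert for this token, then keeps only the top-k experts.
        logits = x @ self.gate
        top = np.argsort(logits)[-self.top_k:]
        weights = softmax(logits[top])
        # The token is processed only by its chosen experts; outputs are blended by gate weight.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

moe = ToyMoE()
token = np.random.default_rng(1).normal(size=8)
print(moe(token).shape)   # -> (8,)
```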
Large reasoning models:
Chain-of-thought prompting helps the model reason by generating intermediate steps before the final answer, which significantly improves results.
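An illustrative comparison of a plain prompt versus a chain-of-thought prompt; the question and exact wording are made up for this example:

```python
question = (
    "A store sold 14 apples in the morning and twice as many in the afternoon. "
    "How many apples did it sell in total?"
)

# Plain prompt: pushes the model to answer directly.
plain_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompt: the added instruction invites intermediate reasoning
# ("14 in the morning, 2 * 14 = 28 in the afternoon, 14 + 28 = 42") before the answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```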