Normalization:
Standardizes the text, for example by removing redundant whitespace and stripping accents. This step is optional.
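A minimal sketch of what this step might look like in Python; the exact rules (lowercasing, accent stripping, etc.) vary by model and tokenizer:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative normalization: strip accents, collapse whitespace, lowercase."""
    # Decompose accented characters and drop the combining marks (e.g. "café" -> "cafe")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

print(normalize("  Héllo   Wörld\n"))  # -> "hello world"
```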
Tokenization:
Breaks the sentence into words or sub-words and maps them to integer token IDs from a vocabulary.
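An illustrative greedy sub-word lookup against a hypothetical toy vocabulary; real tokenizers (BPE, WordPiece, SentencePiece) learn vocabularies of tens of thousands of pieces from data:

```python
# Toy vocabulary, made up for this example.
vocab = {"<unk>": 0, "token": 1, "##ization": 2, "break": 3, "##s": 4, "text": 5}

def tokenize(word: str) -> list[int]:
    """Greedy longest-match sub-word tokenization of a single word (WordPiece-style)."""
    ids, start, prefix = [], 0, ""
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = prefix + word[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            return [vocab["<unk>"]]  # fall back to <unk> for the whole word
        prefix = "##"  # mark continuation pieces
    return ids

print(tokenize("tokenization"))  # -> [1, 2]
print(tokenize("breaks"))        # -> [3, 4]
```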
Embedding:
Converts each token ID to its corresponding high-dimensional embedding vector.
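In code, this step is just a row lookup into a learned matrix; the sizes and random weights below are toy values for illustration only:

```python
import numpy as np

vocab_size, d_model = 6, 8                 # toy sizes; real models use much larger ones
rng = np.random.default_rng(0)

# The embedding table is a learned matrix: one d_model-dimensional row per token ID.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([1, 2])               # e.g. the IDs produced by the tokenizer
embeddings = embedding_table[token_ids]    # simple row lookup
print(embeddings.shape)                    # -> (2, 8)
```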
Positional encoding:
Adds information about the position of each token in the sequence so the transformer can understand word order.
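A sketch of the sinusoidal positional encoding from the original Transformer paper; note that many newer models use learned or rotary position embeddings instead:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# These values are added element-wise to the token embeddings.
print(positional_encoding(seq_len=2, d_model=8).shape)  # -> (2, 8)
```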
Self-attention:
Determines the relationships between different words. How does it do that?
- Each input embedding is multiplied by three learned weight matrices to generate query (Q), key (K), and value (V) vectors.
- Query: Helps the model ask “Which other words in the sequence are relevant to me?”
- Key: It’s like a label that helps the model identify how a word might be relevant to other words in the sequence.
- Value: Holds the actual word content information
- Score calculation: Scores determine how much each word should attend to every other word. They are computed by taking the dot product of one word’s Q vector with the K vectors of all words in the sequence.
These scores are then normalized (typically scaled and passed through a softmax) to give attention weights.
Each value vector V is multiplied by its attention weight and the results are summed, producing a new, context-aware vector. The same is done for every word, as in the sketch below.
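A minimal single-head sketch of the steps above, assuming toy dimensions and random matrices in place of learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # dot product of each query with every key
    weights = softmax(scores, axis=-1)         # normalize scores into attention weights
    return weights @ V                         # weighted sum of values: context-aware vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # -> (4, 8), one context vector per word
```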
Multi-head attention:
Multiple sets of Q, K, V weight matrices run in parallel, each focusing on different aspects of the input relationships. The outputs of all heads are then concatenated and linearly transformed.
- Use of multi-head attention improves the model’s ability to handle complex language patterns and long-range dependencies.
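A sketch of multi-head attention under the same toy assumptions, showing the parallel heads, the concatenation, and the final linear projection:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    """Run each head in parallel, concatenate their outputs, then linearly project."""
    outputs = [attention_head(X, *head) for head in heads]
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, n_heads * d_head)
    return concat @ W_o                         # learned output projection

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)  # -> (4, 8)
```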
Mixture of Experts (MoE):
Instead of simply aggregating the predictions of all experts, the model learns to route different parts of the input to different experts, which allows each expert to specialize.
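A toy sketch of top-k routing; the ToyMoE class, shapes, and random weights are made up for illustration and stand in for a learned router and learned expert networks:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyMoE:
    """Toy mixture-of-experts layer with top-k routing (hypothetical, for illustration)."""
    def __init__(self, d_model=8, n_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
        self.gate = rng.normal(size=(d_model, n_experts))   # learned router in a real model
        self.top_k = top_k

    def __call__(self, x):
        # The router scores every expert for this token, then keeps only the top-k experts.
        logits = x @ self.gate
        top = np.argsort(logits)[-self.top_k:]
        weights = softmax(logits[top])
        # The token is processed only by its chosen experts; outputs are blended by gate weight.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

moe = ToyMoE()
token = np.random.default_rng(1).normal(size=8)
print(moe(token).shape)   # -> (8,)
```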
Large reasoning models:
Chain-of-thought prompting helps the model reason by generating intermediate steps before the final answer, which significantly improves results.
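An illustrative comparison of a plain prompt versus a chain-of-thought prompt; the question and exact wording are made up for this example:

```python
question = (
    "A store sold 14 apples in the morning and twice as many in the afternoon. "
    "How many apples did it sell in total?"
)

# Plain prompt: pushes the model to answer directly.
plain_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompt: the added instruction invites intermediate reasoning
# ("14 in the morning, 2 * 14 = 28 in the afternoon, 14 + 28 = 42") before the answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```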