BPE
Byte-pair encoding, a tokenization method that builds its vocabulary by repeatedly merging the most frequent adjacent pair of bytes or characters into a new unit.
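A minimal sketch of a single BPE merge step on a toy corpus. The helper names `most_frequent_pair` and `merge_pair` are hypothetical, and production tokenizers add details (word frequencies, byte-level fallback, tie-breaking rules) that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Toy corpus: each word starts as a list of single characters.
corpus = [list("low"), list("lower"), list("lowest")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(word, pair) for word in corpus]
```

Repeating this step until a target vocabulary size is reached yields the merge table that a tokenizer applies at encoding time.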
RoPE
Rotary position embedding, a method that encodes position by rotating pairs of query and key dimensions through position-dependent angles, so that attention scores depend on the relative distance between tokens.
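A small sketch of the rotation, assuming adjacent dimensions are paired and the commonly used frequency base of 10000 (some implementations pair the first and second halves of the vector instead). The function names `rope` and `dot` are illustrative.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Lower-indexed pairs rotate faster; higher-indexed pairs rotate slower.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.1, 0.7, 0.4]

# Both pairs of positions are 2 apart, so the attention scores should match.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
```

Because only the offset between positions affects the dot product, the two scores agree up to floating-point error.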
Q, K, V
Query, key, and value vectors used by attention: each query is compared against the keys to compute weights, and those weights combine the values into the output.
GQA
Grouped-query attention, an attention variant in which each group of query heads shares a single key and value head, reducing KV cache memory.
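A sketch of the head-sharing arithmetic, assuming query heads are grouped contiguously. The function name and the head counts below are illustrative, not taken from a specific model.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# 8 query heads sharing 2 key/value heads: the cache stores 2 heads instead of 8.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
# mapping == [0, 0, 0, 0, 1, 1, 1, 1]
```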
MoE
Mixture of experts, an architecture that routes each token through a selected subset of feed-forward experts.
SwiGLU
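A gated feed-forward activation that multiplies a Swish (SiLU)-activated linear projection elementwise by a second linear projection; used by many current language models, including LLaMA.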
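A sketch of the gating step only. Here `gate` and `up` stand for the outputs of two separate linear projections of the same input; those projections and the final down-projection are omitted, and the function names are illustrative.

```python
import math

def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up):
    """Elementwise product of the SiLU-activated gate with the second projection."""
    return [silu(g) * u for g, u in zip(gate, up)]

hidden = swiglu([0.0, 2.0], [5.0, 1.0])
```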
RMSNorm
Root mean square normalization, a simplified alternative to layer normalization that rescales activations by their root mean square without subtracting the mean; used by architectures including LLaMA.
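A minimal sketch on a plain list of floats, assuming a learned per-dimension `gain` (defaulting to ones here) and a small `eps` for numerical stability. Both parameter names are illustrative.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide x by its root mean square, then apply a per-dimension gain."""
    if gain is None:
        gain = [1.0] * len(x)
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

normalized = rms_norm([3.0, 4.0])
```

With unit gain, the output has a root mean square of approximately 1, and no mean is subtracted at any point.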
Logit
A raw score assigned to a possible next token before probabilities are calculated.
Temperature
A decoding control that divides logits by a positive scalar before softmax: values below 1 concentrate probability on the highest-scoring tokens, and values above 1 spread it more evenly.
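A short sketch of how temperature reshapes next-token probabilities, using made-up logits for three candidate tokens. The function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then convert them to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more concentrated
flat = softmax_with_temperature(logits, 2.0)   # more varied
```

The top token receives a larger share of the probability mass in `sharp` than in `flat`, while both distributions still sum to 1.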
KV cache
Stored key and value vectors from previous tokens, reused at each step of autoregressive generation so they do not have to be recomputed.
Inference
Running a trained model to generate an output for a new input.
RLHF / DPO
Reinforcement learning from human feedback and direct preference optimization, post-training methods that use preference data to shape helpfulness, safety, and response behavior.