Q. Jiang et al., 2023
paper: https://arxiv.org/pdf/2310.06825v1.pdf
repo: https://github.com/mistralai/mistral-src
general notes
intro
- mistral 7b is an llm designed to achieve high performance on a variety of language modelling tasks while keeping the parameter count small and inference efficient
- mistral achieves its high performance by leveraging GQA and SWA
- Grouped Query Attention (GQA): an attention mechanism in which groups of query heads share a single key/value head, shrinking the kv cache; this speeds up decoding and allows for higher batch sizes and throughput (see the sketch after this list)
- Sliding Window Attention (SWA): an attention mechanism that caps the number of tokens a single token can attend to at a fixed window size, allowing the model to handle longer sequences at lower cost
- mistral 7b is open source, released under Apache 2.0, and crafted to make fine-tuning easy for public use
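to make the GQA idea concrete, here is a minimal pytorch sketch (not mistral's actual implementation; shapes and head counts are illustrative): groups of query heads share one key/value head, so the kv cache shrinks by the group factor

```python
import math
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim)
    # n_heads must be an integer multiple of n_kv_heads
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads
    # expand each kv head across its group of query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

# illustrative shapes: 8 query heads sharing 2 kv heads -> kv cache is 4x smaller
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 16, 64)
```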
architecture
mistral 7b is a 7 billion parameter model based on the transformer architecture (see my notes on that paper here)
mistral uses an architecture similar to the llama series of models, with some notable changes:
sliding window attention
in standard attention mechanisms, the number of operations is quadratic in the sequence length, which leads to high latency and low throughput at inference
SWA alleviates this by letting each token attend to at most $W$ previous tokens, making the per-token attention cost linear in the window size rather than the sequence length
- SWA exploits the stacked layers of a transformer to attend to information beyond the window size $W$
- the hidden state at position $i$ of layer $k$ can directly attend to hidden states at positions $i-W$ to $i$ of layer $k-1$
- applied recursively, this means position $i$ of layer $k$ has access to information from up to $W \times k$ positions back in the input layer
- given a large enough window, this allows the model to attend to a substantial amount of context at the final layer: the paper uses $W = 4096$, which over 32 layers gives a theoretical attention span of roughly 131K tokens
- in practice, for the mistral 7b model this results in increased processing speed while not greatly hindering the model's quality (see the mask sketch below)
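a minimal sketch of the attention mask SWA implies (window size and sequence length here are illustrative, not mistral's): each token sees only itself and the $W-1$ tokens before it

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where token i may attend to token j: causal and within the window
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# with window=3, token 5 attends only to tokens 3, 4, 5; stacking k such
# layers widens the effective receptive field to roughly window * k tokens
print(sliding_window_mask(6, 3).int())
```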
rolling buffer cache
because the attention window is fixed at $W$, the kv cache can be a fixed-size rolling buffer: the keys and values for position $i$ are stored in slot $i \bmod W$ of the cache, overwriting the oldest entries once the buffer is full
the rolling buffer cache caps the memory needed for the cache regardless of sequence length (the paper reports an 8x reduction in cache memory on 32k-token sequences) and leads to faster performance; a toy sketch follows
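a toy sketch of the rolling buffer idea (simplified assumptions: one layer, one vector per position, no batching): position $i$ is written to slot $i \bmod W$, so memory stays fixed at $W$ entries

```python
import torch

class RollingKVCache:
    """Toy rolling buffer: holds at most `window` entries; position i
    goes to slot i % window, overwriting the oldest entry when full."""

    def __init__(self, window, dim):
        self.window = window
        self.buf = torch.zeros(window, dim)
        self.pos = 0  # absolute position of the next token

    def append(self, x):
        self.buf[self.pos % self.window] = x
        self.pos += 1

    def contents(self):
        # return the cached entries in chronological order
        if self.pos <= self.window:
            return self.buf[: self.pos]
        start = self.pos % self.window
        return torch.cat([self.buf[start:], self.buf[:start]])

# with window=4, after 6 tokens only tokens 2..5 remain cached
cache = RollingKVCache(window=4, dim=2)
for t in range(6):
    cache.append(torch.full((2,), float(t)))
print(cache.contents()[:, 0])  # tensor([2., 3., 4., 5.])
```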