Q. Jiang et al., 2023
paper: https://arxiv.org/pdf/2310.06825v1.pdf
repo: https://github.com/mistralai/mistral-src
general notes
intro
- mistral 7b is an llm designed to achieve high performance on a variety of language modelling tasks while keeping the parameter count small and inference efficient
- mistral achieves its high performance by leveraging GQA and SWA
- Grouped Query Attention (GQA): an attention mechanism in which groups of query heads share a single key/value head, shrinking the kv cache; this speeds up decoding and allows for higher batch sizes and throughput (see the sketch after this list)
- Sliding Window Attention (SWA): an attention mechanism that caps the number of tokens a single token can attend to at a fixed window size, allowing the model to handle longer sequences at lower cost
- mistral 7b is open source, released under Apache 2.0, and crafted to make fine-tuning easy for public use
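to make the GQA idea concrete, here is a minimal pytorch sketch (not mistral's actual implementation; shapes and head counts are illustrative): groups of query heads share one key/value head, so the kv cache shrinks by the group factor

```python
import math
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim)
    # n_heads must be an integer multiple of n_kv_heads
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads
    # expand each kv head across its group of query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

# illustrative shapes: 8 query heads sharing 2 kv heads -> kv cache is 4x smaller
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 16, 64)
```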
architecture
mistral 7b is a 7 billion parameter model based on the transformer architecture (see my notes on that paper here)
mistral uses an architecture similar to the llama series of models, with some notable changes:
sliding window attention
in standard attention mechanisms, the number of operations is quadratic in the sequence length, which leads to high latency and low throughput at inference
SWA alleviates this by letting each token attend to at most $W$ previous tokens, making the per-token attention cost linear in the window size rather than the sequence length
- SWA exploits the stacked layers of a transformer to attend to information beyond the window size $W$
- the hidden state at position $i$ of layer $k$ can directly attend to hidden states at positions $i-W$ to $i$ of layer $k-1$
- applied recursively, this means position $i$ of layer $k$ has access to information from up to $W \times k$ positions back in the input layer
- given a large enough window, this allows the model to attend to a substantial amount of context at the final layer: the paper uses $W = 4096$, which over 32 layers gives a theoretical attention span of roughly 131K tokens
- in practice, for the mistral 7b model this results in increased processing speed while not greatly hindering the model's quality (see the mask sketch below)
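a minimal sketch of the attention mask SWA implies (window size and sequence length here are illustrative, not mistral's): each token sees only itself and the $W-1$ tokens before it

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where token i may attend to token j: causal and within the window
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# with window=3, token 5 attends only to tokens 3, 4, 5; stacking k such
# layers widens the effective receptive field to roughly window * k tokens
print(sliding_window_mask(6, 3).int())
```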
rolling buffer cache
because the attention window is fixed at $W$, the kv cache can be a fixed-size rolling buffer: the keys and values for position $i$ are stored in slot $i \bmod W$ of the cache, overwriting the oldest entries once the buffer is full
the rolling buffer cache caps the memory needed for the cache regardless of sequence length (the paper reports an 8x reduction in cache memory on 32k-token sequences) and leads to faster performance; a toy sketch follows
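a toy sketch of the rolling buffer idea (simplified assumptions: one layer, one vector per position, no batching): position $i$ is written to slot $i \bmod W$, so memory stays fixed at $W$ entries

```python
import torch

class RollingKVCache:
    """Toy rolling buffer: holds at most `window` entries; position i
    goes to slot i % window, overwriting the oldest entry when full."""

    def __init__(self, window, dim):
        self.window = window
        self.buf = torch.zeros(window, dim)
        self.pos = 0  # absolute position of the next token

    def append(self, x):
        self.buf[self.pos % self.window] = x
        self.pos += 1

    def contents(self):
        # return the cached entries in chronological order
        if self.pos <= self.window:
            return self.buf[: self.pos]
        start = self.pos % self.window
        return torch.cat([self.buf[start:], self.buf[:start]])

# with window=4, after 6 tokens only tokens 2..5 remain cached
cache = RollingKVCache(window=4, dim=2)
for t in range(6):
    cache.append(torch.full((2,), float(t)))
print(cache.contents()[:, 0])  # tensor([2., 3., 4., 5.])
```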