Q. Jiang et al., 2023

paper: https://arxiv.org/pdf/2310.06825v1.pdf

repo: https://github.com/mistralai/mistral-src

general notes

intro

architecture

mistral 7b is a 7 billion parameter model based on the transformer architecture (see my notes on that paper here)

the mistral model uses a similar architecture to the llama series of models, with some notable changes:

sliding window attention

in standard attention mechanisms, the number of operations is quadratic in the sequence length, which leads to high latency and low throughput at inference time

sliding window attention (SWA) alleviates this by letting each token attend only to the previous W tokens, so the number of operations per token is linear in the window size rather than the sequence length. stacking layers still lets information flow beyond the window: after k layers, a hidden state can draw on tokens up to roughly k × W positions back (see the sketch below)
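a minimal sketch of the idea, assuming pytorch; this is not code from the mistral-src repo, and the window size and tensor shapes are illustrative (mistral 7b uses W = 4096):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # position i may attend to positions j with i - window < j <= i
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]              # j <= i
    in_window = idx[:, None] - idx[None, :] < window   # i - j < window
    return causal & in_window

def sliding_window_attention(q, k, v, window: int):
    # q, k, v: (seq_len, d); scores outside the window are masked out
    d = q.shape[-1]
    scores = (q @ k.T) / d**0.5
    mask = sliding_window_mask(q.shape[0], window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 16)
out = sliding_window_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([8, 16])
```

note the sketch still materializes the full score matrix for clarity; a real implementation only computes scores inside the window, which is where the linear cost comes from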

rolling buffer cache

since the attention window is fixed at W tokens, the key/value cache can also be fixed at size W: the keys and values for position i are written to slot i mod W, overwriting entries that have fallen outside the window

the rolling buffer cache caps cache memory at W entries per layer regardless of sequence length, which cuts memory use on long sequences and speeds up inference (a sketch follows)
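a minimal sketch of a rolling (circular) kv cache, again assuming pytorch; class and method names are my own, not from the mistral-src repo:

```python
import torch

class RollingKVCache:
    def __init__(self, window: int, d: int):
        self.window = window
        self.k = torch.zeros(window, d)
        self.v = torch.zeros(window, d)
        self.seen = 0  # number of tokens written so far

    def update(self, k_t: torch.Tensor, v_t: torch.Tensor):
        # write the current position into slot (position mod window),
        # overwriting whatever fell out of the attention window
        slot = self.seen % self.window
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.seen += 1

    def get(self):
        # return cached keys/values in chronological order
        n = min(self.seen, self.window)
        if self.seen <= self.window:
            return self.k[:n], self.v[:n]
        start = self.seen % self.window  # oldest surviving entry
        order = torch.cat([torch.arange(start, self.window),
                           torch.arange(0, start)])
        return self.k[order], self.v[order]

cache = RollingKVCache(window=4, d=16)
for t in range(10):
    cache.update(torch.randn(16), torch.randn(16))
k, v = cache.get()
print(k.shape)  # torch.Size([4, 16]) -- memory never grows past the window
```

the point of the mod-indexing is that the cache never grows with the sequence: no matter how many tokens have been generated, only the last W keys and values are kept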