paper: https://arxiv.org/pdf/2401.04088.pdf
- a decoder-only transformer model
- based on mistral 7b
- the feed-forward block in a standard transformer layer is replaced with a router and 8 “experts”
- the router routes the input to 2 of the 8 experts for processing
- each expert is itself a SwiGLU function, i.e. essentially a gated FF layer
- $\mathrm{SwiGLU}(x) = \mathrm{Swish}(Wx + b) \odot (Vx + c)$, where $\mathrm{Swish}(z) = z \cdot \sigma(z)$
- $W$ and $V$ are parameter matrices and $b$ and $c$ are bias vectors
- the $\odot$ symbol represents element-wise multiplication
- the output of the “expert layer” is the weighted sum of the outputs of the two selected experts
- the weights are given by the router network: take the top-k logits of the linear router layer and apply a softmax over just those k values
- this routing is done independently for each input token, and happens in every decoder layer
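The per-token routing above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: parameter shapes and random initialisation are made up, and a down-projection `W2` back to the model dimension is included (Mixtral's experts have one, even though the notes above only write out the gated part) so the output can feed a residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 8, 2  # illustrative sizes, not Mixtral's

def swish(z):
    # Swish / SiLU: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_expert(x, W, b, V, c, W2):
    # Gated FF layer: element-wise product of a Swish-gated projection
    # and a linear projection, then a down-projection back to d_model
    return (swish(x @ W + b) * (x @ V + c)) @ W2

# hypothetical randomly initialised parameters for each of the 8 experts
experts = [
    (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
     rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
     rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]
W_router = rng.normal(size=(d_model, n_experts))  # router is a plain linear layer

def moe_layer(x):
    """Top-2 mixture-of-experts block for a single token vector x."""
    logits = x @ W_router                 # one logit per expert
    top = np.argsort(logits)[-top_k:]     # indices of the 2 largest logits
    z = np.exp(logits[top] - logits[top].max())
    gates = z / z.sum()                   # softmax over the selected logits only
    # weighted sum of the two selected experts' outputs
    return sum(g * swiglu_expert(x, *experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape)  # same dimension as the input token
```

Only 2 of the 8 experts run per token, so the layer's compute cost is roughly that of a dense FF layer twice this width, while the parameter count covers all 8 experts.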