paper: https://arxiv.org/pdf/2401.04088.pdf
- a decoder-only transformer model
- based on mistral 7b
- the feed-forward block in a standard transformer layer is replaced with a router and 8 “experts”
- the router routes the input to 2 of the 8 experts for processing
- each expert is itself a SwiGLU function, i.e. essentially a gated FF layer
- $\mathrm{SwiGLU}(x) = \mathrm{Swish}(Wx + b) \odot (Vx + c)$, where $\mathrm{Swish}(z) = z \cdot \sigma(z)$
- $W$ and $V$ are parameter matrices and $b$ and $c$ are bias vectors
- the $\odot$ symbol represents element-wise multiplication
- the output of the “expert layer” is the weighted sum of the outputs of the two selected experts
- the weights are given by the router network: take the top-k logits of the linear router layer and apply a softmax over just those k values
- this routing is done independently for each input token, and happens in every decoder layer
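The per-token routing above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: parameter shapes and random initialisation are made up, and a down-projection `W2` back to the model dimension is included (Mixtral's experts have one, even though the notes above only write out the gated part) so the output can feed a residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 8, 2  # illustrative sizes, not Mixtral's

def swish(z):
    # Swish / SiLU: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_expert(x, W, b, V, c, W2):
    # Gated FF layer: element-wise product of a Swish-gated projection
    # and a linear projection, then a down-projection back to d_model
    return (swish(x @ W + b) * (x @ V + c)) @ W2

# hypothetical randomly initialised parameters for each of the 8 experts
experts = [
    (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
     rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
     rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]
W_router = rng.normal(size=(d_model, n_experts))  # router is a plain linear layer

def moe_layer(x):
    """Top-2 mixture-of-experts block for a single token vector x."""
    logits = x @ W_router                 # one logit per expert
    top = np.argsort(logits)[-top_k:]     # indices of the 2 largest logits
    z = np.exp(logits[top] - logits[top].max())
    gates = z / z.sum()                   # softmax over the selected logits only
    # weighted sum of the two selected experts' outputs
    return sum(g * swiglu_expert(x, *experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape)  # same dimension as the input token
```

Only 2 of the 8 experts run per token, so the layer's compute cost is roughly that of a dense FF layer twice this width, while the parameter count covers all 8 experts.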