Vaswani et al., 2017

paper: https://arxiv.org/pdf/1706.03762.pdf

premise

state of the art neural nets up to this point (2017) have all made use of recurrence or convolutions to perform sequence tasks (language translation, text completion, anything that involves predicting output tokens from an input)

in addition, many of these networks make use of an attention mechanism between recurrent/convolutional layers which helps the model pay more or less “attention” to input tokens when predicting the next output.

this paper proposes a new architecture, the transformer, that relies only on this attention mechanism when predicting tokens, and the results are drastically better than those achieved by RNNs and CNNs.

RNNs are inherently sequential, which makes parallelization difficult. in both RNNs and CNNs, the number of operations needed to relate two distant positions grows with the distance between them, which makes it difficult to accurately capture long-range dependencies in inputs.

using self-attention, transformers are able to capture these dependencies more easily: every position can attend directly to every other position in a constant number of operations.
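concretely, the scaled dot-product attention the paper builds everything on, where $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the dimension of the keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$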

architecture

like RNNs and CNNs used for sequence tasks, the transformer architecture is composed of an encoder and a decoder. the encoder processes the input sequence into high dimensional, contextualized vector representations. these vectors encode the “relevance” of each token in the input to each other token.

the decoder then takes the target sequence (the actual correct result, shifted right, in the case of training, or the tokens generated so far, starting from the ‘start-of-sequence’ token, in the case of inference) as well as the output from the encoder and processes them together to obtain highly contextualized vector representations of a predicted output sequence. these output encodings are then projected back into the token space to obtain a score for each token’s likelihood. the scores are then softmaxed to obtain probabilities.
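a minimal numpy sketch of that last step, assuming the decoder stack has already produced one `d_model`-dimensional vector per output position; the names and sizes here (`W_proj`, the 10k vocab) are illustrative, not the paper’s:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the vocabulary axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# illustrative sizes: 3 output positions, d_model = 512, 10k token vocabulary
d_model, vocab_size = 512, 10_000
decoder_output = np.random.randn(3, d_model)            # from the decoder stack
W_proj = np.random.randn(d_model, vocab_size) * 0.02    # learned linear projection

logits = decoder_output @ W_proj    # one score per vocabulary token, per position
probs = softmax(logits)             # softmaxed into probabilities
next_token = probs[-1].argmax()     # greedy pick of the next output token
```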

the encoder

the encoder is composed of $N$ identical layers ($N = 6$ in the paper). each layer contains the same sequence of operations.
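from the paper, those operations are two sub-layers: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. a simplified single-head sketch in numpy (shapes and initialization are illustrative, not the paper’s exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x, W_q, W_k, W_v):
    # scaled dot-product attention, single head, no mask
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def encoder_layer(x, p):
    # sub-layer 1: self-attention + residual + layer norm
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # sub-layer 2: position-wise feed-forward (ReLU MLP) + residual + layer norm
    ff = np.maximum(0, x @ p["W_1"]) @ p["W_2"]
    return layer_norm(x + ff)

d_model, d_ff, seq_len = 64, 256, 5
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
          "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
p = {name: np.random.randn(*s) * 0.1 for name, s in shapes.items()}
x = np.random.randn(seq_len, d_model)   # token embeddings (+ positional encodings)
out = encoder_layer(x, p)               # same shape as x: (seq_len, d_model)
```

the full model stacks $N$ of these layers, each with its own weights.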

the decoder

the decoder is also composed of $N$ identical layers. each layer contains the same sequence of operations.
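from the paper, each decoder layer has three sub-layers: masked self-attention over the output positions, attention over the encoder output, and the same position-wise feed-forward network, again with residuals and layer norm. the key difference from the encoder is the causal mask, which stops a position from attending to later positions; a small illustrative numpy sketch of just that masked self-attention (single head, made-up sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(y, W_q, W_k, W_v):
    q, k, v = y @ W_q, y @ W_k, y @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # causal mask: position i may only attend to positions <= i
    seq_len = y.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)   # effectively zero weight after softmax
    return softmax(scores) @ v

d_model, seq_len = 64, 4
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
y = np.random.randn(seq_len, d_model)              # embedded target tokens so far
out = masked_self_attention(y, W_q, W_k, W_v)      # (seq_len, d_model)
```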