Vaswani et al., 2017

paper: https://arxiv.org/pdf/1706.03762.pdf

premise

state of the art neural nets up to this point (2017) have all made use of recurrence or convolutions to perform sequence tasks (language translation, text completion, anything that involves predicting output tokens from an input)

in addition, many of these networks make use of an attention mechanism between recurrent/convolutional layers which helps the model pay more or less “attention” to input tokens when predicting the next output.

this paper proposes a new architecture, the transformer, that relies only on this attention mechanism when predicting tokens, and the results are drastically better than those achieved by RNNs and CNNs.

RNNs are inherently sequential, which makes parallelization difficult. in both RNNs and CNNs, the number of operations needed to relate two distant positions grows with the distance between them, which makes it difficult to accurately capture long-range dependencies in inputs.

using self-attention, transformers are able to capture these dependencies more easily: every position can attend directly to every other position in a constant number of operations.
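concretely, the scaled dot-product attention the paper builds everything on, where $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the dimension of the keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$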

architecture

like RNNs and CNNs used for sequence tasks, the transformer architecture is composed of an encoder and a decoder. the encoder processes the input sequence into high dimensional, contextualized vector representations. these vectors encode the “relevance” of each token in the input to each other token.

the decoder then takes the target sequence (the actual correct result, shifted right, in the case of training, or the tokens generated so far, starting from the ‘start-of-sequence’ token, in the case of inference) as well as the output from the encoder and processes them together to obtain highly contextualized vector representations of a predicted output sequence. these output encodings are then projected back into the token space to obtain a score for each token’s likelihood. the scores are then softmaxed to obtain probabilities.
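a minimal numpy sketch of that last step, assuming the decoder stack has already produced one `d_model`-dimensional vector per output position; the names and sizes here (`W_proj`, the 10k vocab) are illustrative, not the paper’s:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the vocabulary axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# illustrative sizes: 3 output positions, d_model = 512, 10k token vocabulary
d_model, vocab_size = 512, 10_000
decoder_output = np.random.randn(3, d_model)            # from the decoder stack
W_proj = np.random.randn(d_model, vocab_size) * 0.02    # learned linear projection

logits = decoder_output @ W_proj    # one score per vocabulary token, per position
probs = softmax(logits)             # softmaxed into probabilities
next_token = probs[-1].argmax()     # greedy pick of the next output token
```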

the encoder

the encoder is composed of $N$ identical layers ($N = 6$ in the paper). each layer contains the same sequence of operations.
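from the paper, those operations are two sub-layers: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. a simplified single-head sketch in numpy (shapes and initialization are illustrative, not the paper’s exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x, W_q, W_k, W_v):
    # scaled dot-product attention, single head, no mask
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def encoder_layer(x, p):
    # sub-layer 1: self-attention + residual + layer norm
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # sub-layer 2: position-wise feed-forward (ReLU MLP) + residual + layer norm
    ff = np.maximum(0, x @ p["W_1"]) @ p["W_2"]
    return layer_norm(x + ff)

d_model, d_ff, seq_len = 64, 256, 5
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
          "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
p = {name: np.random.randn(*s) * 0.1 for name, s in shapes.items()}
x = np.random.randn(seq_len, d_model)   # token embeddings (+ positional encodings)
out = encoder_layer(x, p)               # same shape as x: (seq_len, d_model)
```

the full model stacks $N$ of these layers, each with its own weights.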

the decoder

the decoder is also composed of $N$ identical layers. each layer contains the same sequence of operations.
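from the paper, each decoder layer has three sub-layers: masked self-attention over the output positions, attention over the encoder output, and the same position-wise feed-forward network, again with residuals and layer norm. the key difference from the encoder is the causal mask, which stops a position from attending to later positions; a small illustrative numpy sketch of just that masked self-attention (single head, made-up sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(y, W_q, W_k, W_v):
    q, k, v = y @ W_q, y @ W_k, y @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # causal mask: position i may only attend to positions <= i
    seq_len = y.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)   # effectively zero weight after softmax
    return softmax(scores) @ v

d_model, seq_len = 64, 4
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))
y = np.random.randn(seq_len, d_model)              # embedded target tokens so far
out = masked_self_attention(y, W_q, W_k, W_v)      # (seq_len, d_model)
```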