Attention Is All You Need

Year: 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. (View Paper →)

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelisable and requiring significantly less time to train…We show that the Transformer generalises well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

This paper introduced the ‘Transformer’, a neural-network design that dispenses with sequential, word-by-word processing in favour of “self-attention,” letting the model weigh every word in a sentence against every other simultaneously. The shift cut training times from weeks to days, set new records on machine-translation benchmarks, and became the template for nearly every modern language model, driving the current AI boom. It is now one of the most cited papers in the field, with more than 170,000 citations.
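
As a pointer to the mechanism itself: the paper’s core building block is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The snippet below is a minimal NumPy sketch of that formula for a single head; the toy shapes and the use of the raw embeddings as Q, K and V are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled to keep the softmax stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: each position gets a weight distribution over all positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted mix of the value vectors.
    return weights @ V

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
# In self-attention, Q, K and V are all derived from the same input;
# here we pass the raw embeddings for brevity.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In the full model, Q, K and V come from learned linear projections of the input, and several such heads run in parallel (multi-head attention).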