We establish connections between the Transformer architecture, originally introduced for natural language processing, and Graph Neural Networks (GNNs) for
representation learning on graphs. We show how Transformers can be viewed
as message passing GNNs operating on fully connected graphs of tokens, where
the self-attention mechanism captures the relative importance of all tokens w.r.t. each other, and positional encodings provide hints about sequential ordering or structure.
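To make the analogy concrete, the following is a minimal NumPy sketch (not the reference implementation) of single-head self-attention read as one round of message passing on the fully connected token graph; the projection matrices Wq, Wk, Wv and the helper names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_as_message_passing(H, Wq, Wk, Wv):
    """Single-head self-attention over n tokens, viewed as one round of
    message passing on the fully connected token graph.

    H          : (n, d) token features (node features)
    Wq, Wk, Wv : (d, d) projection matrices (illustrative parameters)
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv          # per-node queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # one score per ordered token pair (edge)
    A = softmax(scores, axis=-1)              # attention weights = normalized edge weights
    return A @ V                              # each node aggregates messages from all others

# Toy usage: 4 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H_next = self_attention_as_message_passing(H, Wq, Wk, Wv)
print(H_next.shape)  # (4, 8)
```

In this reading, the attention matrix A plays the role of soft edge weights over the complete graph of tokens, and the weighted sum A @ V is the aggregation step of a message-passing GNN layer.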