Published on Sun Jun 13 2021

Thinking Like Transformers

Gail Weiss, Yoav Goldberg, Eran Yahav

Where recurrent neural networks have direct parallels in finite state machines, Transformers have no such familiar parallel. We propose a computational model for the transformer-encoder in the form of a programming language, RASP, and show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.

Abstract

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder -- attention and feed-forward computation -- into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.
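To make the select/aggregate style of computation described above concrete, here is a minimal Python sketch of RASP-like primitives and the histogram task (each position counts how many input tokens equal its own). The function names `select`, `aggregate`, and `selector_width` follow the paper's terminology, but the implementation below is an illustrative assumption, not the authors' RASP interpreter.

```python
# Minimal sketch of RASP-style primitives (illustrative only;
# not the reference implementation from the paper).

def select(keys, queries, predicate):
    """Boolean selection matrix: entry [q][k] is True when predicate(keys[k], queries[q])
    holds -- the analogue of an attention pattern."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selection, values):
    """Average the selected values at each query position --
    the analogue of attention-weighted averaging."""
    out = []
    for row in selection:
        picked = [v for v, sel in zip(values, row) if sel]
        out.append(sum(picked) / len(picked) if picked else 0)
    return out

def selector_width(selection):
    """Number of selected positions per query (how many keys each position attends to)."""
    return [sum(row) for row in selection]

# Histogram: each position counts how many tokens in the input equal its own token.
tokens = list("hello")
same_tok = select(tokens, tokens, lambda k, q: k == q)
hist = selector_width(same_tok)
print(hist)  # [1, 1, 2, 2, 1]
```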

Sat Feb 06 2016
Neural Networks
Strongly-Typed Recurrent Neural Networks
Recurrent neural networks are increasingly popular models for sequential learning. Most effective RNN architectures are perhaps excessively complicated. This paper imports ideas from physics and functional programming into RNN design to provide guiding principles.
Thu Nov 19 2015
Machine Learning
Neural Programmer-Interpreters
The neural programmer-interpreter (NPI) is a recurrent and compositional neural network that learns to represent and execute programs. NPI has three learnable components: a task-agnostic recurrent core, a persistent key-value program memory, and domain-specific encoders.
Fri Nov 08 2019
NLP
Memory-Augmented Recurrent Neural Networks Can Learn Generalized Dyck Languages
Memory-augmented architectures are easy to train in an end-to-end fashion. They can learn the Dyck languages over as many as six parenthesis-pairs, in addition to two deterministic palindrome languages and the string-reversal transduction task.
Mon Jun 03 2013
Neural Networks
Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences
Recurrent neural networks are powerful models for sequential data. Yet they are notoriously hard to train. Here we introduce a training procedure using a gradient ascent in a Riemannian metric. This produces an algorithm independent from design choices such as the encoding of parameters.
Fri Dec 02 2016
Machine Learning
Learning Operations on a Stack with Neural Turing Machines
Multiple extensions of Recurrent Neural Networks (RNNs) have been proposed to address the difficulty of storing information over long time periods. We experiment with the capacity of Neural Turing Machines (NTMs) to deal with these long-term dependencies on well-balanced strings of parentheses.
Fri Sep 11 2020
Machine Learning
Applications of Deep Neural Networks
Deep learning is a group of exciting new technologies for neural networks. It is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Readers will use the Python programming language to implement deep learning using Google TensorFlow and Keras.
Fri Apr 10 2020
NLP
Longformer: The Long-Document Transformer
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. We introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens.
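To make the scaling claim concrete, a sliding-window attention pattern restricts each token to a fixed-size neighborhood, so the number of attended pairs grows linearly with sequence length rather than quadratically. The sketch below builds such a banded mask in NumPy; it illustrates the general idea only and is not the Longformer implementation (which additionally uses dilated and global attention).

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask where position i may attend to positions within +/- window // 2.
    Each row has at most window + 1 True entries, so the total number of attended
    pairs is O(seq_len * window) instead of O(seq_len ** 2)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.sum())  # 34 attended pairs with the window (roughly seq_len * (window + 1))
print(8 * 8)       # 64 attended pairs with full self-attention
```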
Mon May 24 2021
NLP
Self-Attention Networks Can Process Bounded Hierarchical Languages
Self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure. This suggested that natural language can be approximated well with models that are too weak for formal languages. We prove that self-attention networks can process $\mathrm{Dyck}_{k,D}$, the subset of $\mathrm{Dyck}_k$ with depth bounded by $D$.
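For concreteness, $\mathrm{Dyck}_{k,D}$ denotes well-nested strings over $k$ bracket pairs whose nesting depth never exceeds $D$. The short recognizer below makes the definition explicit; it is a minimal sketch written for this summary, not code from the paper.

```python
# Recognizer for bounded-depth Dyck: balanced strings over k bracket pairs
# with nesting depth at most max_depth (illustrative sketch).

PAIRS = {"(": ")", "[": "]", "{": "}"}  # k = 3 bracket types for this example

def is_dyck_bounded(s, max_depth):
    stack = []
    for ch in s:
        if ch in PAIRS:                    # opening bracket
            stack.append(ch)
            if len(stack) > max_depth:     # depth bound exceeded
                return False
        else:                              # closing bracket must match the top
            if not stack or PAIRS[stack.pop()] != ch:
                return False
    return not stack                       # everything closed

print(is_dyck_bounded("([]{})", max_depth=2))  # True: maximum nesting depth is 2
print(is_dyck_bounded("((()))", max_depth=2))  # False: depth 3 exceeds the bound
```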
Tue Jun 11 2019
NLP
What Does BERT Look At? An Analysis of BERT's Attention
Large pre-trained neural networks such as BERT have had great recent success in NLP. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly covering the whole sentence.
Thu Oct 11 2018
NLP
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is designed to pre-train deep bidirectional representations from unlabeled text. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Mon Aug 17 2015
NLP
Effective Approaches to Attention-based Neural Machine Translation
An attentional mechanism has lately been used to improve neural machine translation (NMT). However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanisms.
Mon Sep 01 2014
NLP
Neural Machine Translation by Jointly Learning to Align and Translate
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be tuned to maximize translation performance.