Published on Sun Jun 13 2021

Thinking Like Transformers

Gail Weiss, Yoav Goldberg, Eran Yahav

Where recurrent neural networks have direct parallels in finite state machines, Transformers have no such familiar parallel. We propose a computational model for the transformer-encoder in the form of a programming language, RASP, and show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.

Abstract

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder -- attention and feed-forward computation -- into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.
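To make the select/aggregate style of computation described above concrete, here is a minimal Python sketch of RASP-like primitives and the histogram task (each position counts how many input tokens equal its own). The function names `select`, `aggregate`, and `selector_width` follow the paper's terminology, but the implementation below is an illustrative assumption, not the authors' RASP interpreter.

```python
# Minimal sketch of RASP-style primitives (illustrative only;
# not the reference implementation from the paper).

def select(keys, queries, predicate):
    """Boolean selection matrix: entry [q][k] is True when predicate(keys[k], queries[q])
    holds -- the analogue of an attention pattern."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selection, values):
    """Average the selected values at each query position --
    the analogue of attention-weighted averaging."""
    out = []
    for row in selection:
        picked = [v for v, sel in zip(values, row) if sel]
        out.append(sum(picked) / len(picked) if picked else 0)
    return out

def selector_width(selection):
    """Number of selected positions per query (how many keys each position attends to)."""
    return [sum(row) for row in selection]

# Histogram: each position counts how many tokens in the input equal its own token.
tokens = list("hello")
same_tok = select(tokens, tokens, lambda k, q: k == q)
hist = selector_width(same_tok)
print(hist)  # [1, 1, 2, 2, 1]
```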

Sat Feb 06 2016
Neural Networks
Strongly-Typed Recurrent Neural Networks
Recurrent neural networks are increasingly popular models for sequential learning. Most effective RNN architectures are perhaps excessively complicated. This paper imports ideas from physics and functional programming into RNN design to provide guiding principles.
Thu Nov 19 2015
Machine Learning
Neural Programmer-Interpreters
The neural programmer-interpreter (NPI) is a recurrent and compositional neural network that learns to represent and execute programs. NPI has three learnable components: a task-agnostic recurrent core, a persistent key-value program memory, and domain-specific encoders.
Fri Nov 08 2019
NLP
Memory-Augmented Recurrent Neural Networks Can Learn Generalized Dyck Languages
Memory-augmented architectures are easy to train in an end-to-end fashion. They can learn the Dyck languages over as many as six parenthesis-pairs, in addition to two deterministic palindrome languages and the string-reversal transduction task.
Mon Jun 03 2013
Neural Networks
Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences
Recurrent neural networks are powerful models for sequential data. Yet they are notoriously hard to train. Here we introduce a training procedure using a gradient ascent in a Riemannian metric. This produces an algorithm independent from design choices such as the encoding of parameters.
Fri Dec 02 2016
Machine Learning
Learning Operations on a Stack with Neural Turing Machines
Multiple extensions of Recurrent Neural Networks (RNNs) have been proposed to address the difficulty of storing information over long time periods. We experiment with the capacity of Neural Turing Machines (NTMs) to deal with these long-term dependencies on well-balanced strings of parentheses.
Fri Sep 11 2020
Machine Learning
Applications of Deep Neural Networks
Deep learning is a group of exciting new technologies for neural networks. It is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Readers will use the Python programming language to implement deep learning using Google TensorFlow and Keras.
Fri Apr 10 2020
NLP
Longformer: The Long-Document Transformer
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. We introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens.
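To make the scaling claim concrete, a sliding-window attention pattern restricts each token to a fixed-size neighborhood, so the number of attended pairs grows linearly with sequence length rather than quadratically. The sketch below builds such a banded mask in NumPy; it illustrates the general idea only and is not the Longformer implementation (which additionally uses dilated and global attention).

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask where position i may attend to positions within +/- window // 2.
    Each row has at most window + 1 True entries, so the total number of attended
    pairs is O(seq_len * window) instead of O(seq_len ** 2)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.sum())  # 34 attended pairs with the window (roughly seq_len * (window + 1))
print(8 * 8)       # 64 attended pairs with full self-attention
```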
Mon May 24 2021
NLP
Self-Attention Networks Can Process Bounded Hierarchical Languages
Self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure. This suggested that natural language can be approximated well with models that are too weak for formal languages. We prove that self-attention networks can process $\mathrm{Dyck}_{k,D}$, the subset of $\mathrm{Dyck}_k$ with depth bounded by $D$.
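For concreteness, $\mathrm{Dyck}_{k,D}$ denotes well-nested strings over $k$ bracket pairs whose nesting depth never exceeds $D$. The short recognizer below makes the definition explicit; it is a minimal sketch written for this summary, not code from the paper.

```python
# Recognizer for bounded-depth Dyck: balanced strings over k bracket pairs
# with nesting depth at most max_depth (illustrative sketch).

PAIRS = {"(": ")", "[": "]", "{": "}"}  # k = 3 bracket types for this example

def is_dyck_bounded(s, max_depth):
    stack = []
    for ch in s:
        if ch in PAIRS:                    # opening bracket
            stack.append(ch)
            if len(stack) > max_depth:     # depth bound exceeded
                return False
        else:                              # closing bracket must match the top
            if not stack or PAIRS[stack.pop()] != ch:
                return False
    return not stack                       # everything closed

print(is_dyck_bounded("([]{})", max_depth=2))  # True: maximum nesting depth is 2
print(is_dyck_bounded("((()))", max_depth=2))  # False: depth 3 exceeds the bound
```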
Tue Jun 11 2019
NLP
What Does BERT Look At? An Analysis of BERT's Attention
Large pre-trained neural networks such as BERT have had great recent success in NLP. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly covering the whole sentence.
Thu Oct 11 2018
NLP
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is designed to pre-train deep bidirectional representations from unlabeled text. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Mon Aug 17 2015
NLP
Effective Approaches to Attention-based Neural Machine Translation
An attentional mechanism has lately been used to improve neural machine translation (NMT). However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanisms.
Mon Sep 01 2014
NLP
Neural Machine Translation by Jointly Learning to Align and Translate
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be tuned to maximize translation performance.