
OpenAI used GPT-4 to explain the behavior of neurons in GPT-2

Experts from OpenAI have published a study describing how they used their GPT-4 language model to explain the behavior of neurons in its predecessor, GPT-2. The company’s developers now want to advance the “interpretability” of neural networks and understand why they produce the content they do.

In the first sentence of their paper, the authors from OpenAI admit: “Language models have become more functional and more pervasive, but we don’t understand how they work.” This inability to see exactly how the individual neurons in a neural network behave to produce its output is what is meant by the “black box.” According to Ars Technica, to look inside that “black box,” researchers from OpenAI used their GPT-4 language model to generate and evaluate natural-language explanations of neuron behavior in a simpler language model, GPT-2. Ideally, an interpretable AI model would help achieve a more global goal known as “AI alignment”: assurance that AI systems behave as intended and reflect human values.

OpenAI wanted to figure out which patterns in the text cause a neuron to activate, and proceeded in three stages. The first step was to generate an explanation of the neuron’s activations using GPT-4. The second was to simulate the neuron’s activations with GPT-4, given only the explanation from the first step. The third was to score the explanation by comparing the simulated activations with the real ones. GPT-4 identified specific neurons, neural circuits, and attention heads, and generated readable explanations of the roles of these components. The large language model also produced an explanation score, which OpenAI calls “a measure of the ability of the language model to compress and reconstruct neuronal activations using natural language.”
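To make the three stages concrete, here is a minimal Python sketch of the explain-simulate-score loop. It assumes a generic ask_model helper that sends a prompt to a language model and returns its parsed reply; the function names and prompt wording are illustrative placeholders rather than OpenAI’s actual tooling, and the correlation used for scoring is only a simple stand-in for the paper’s “compress and reconstruct” measure.

import numpy as np

def explain_neuron(ask_model, neuron_records):
    # Step 1: ask the explainer model (GPT-4 in the study) to describe what
    # pattern makes the neuron activate, given text excerpts and activations.
    prompt = "Summarize what pattern makes this neuron activate:\n"
    for text, activations in neuron_records:
        prompt += f"{text}\nactivations: {list(np.round(activations, 2))}\n"
    return ask_model(prompt)

def simulate_activations(ask_model, explanation, tokens):
    # Step 2: given only the explanation, have the model guess a per-token
    # activation (the reply is assumed to be parsed into a list of numbers).
    prompt = (f"A neuron activates on: {explanation}\n"
              f"Predict an activation from 0 to 10 for each token in: {tokens}")
    return np.array(ask_model(prompt), dtype=float)

def explanation_score(simulated, real):
    # Step 3: score the explanation by how well the simulated activations
    # track the real ones; correlation is one simple proxy for this.
    return float(np.corrcoef(simulated, real)[0, 1])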

During the study, OpenAI also had humans perform the same task as GPT-4 and compared the results. As the authors of the paper admit, both the neural network and the humans “performed poorly in absolute terms.”

One explanation for this failure, suggested by OpenAI, is that neurons can be “polysemantic”: a typical neuron in the study can carry multiple meanings or be associated with multiple concepts. In addition, language models may contain “alien concepts” for which people simply have no words. This could happen for various reasons: for example, because language models care about the statistical constructs useful for predicting the next token, or because a model has discovered natural abstractions that people have yet to find, such as a family of analogous concepts across disparate domains.
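As a rough illustration of why polysemanticity makes single labels score poorly, the toy example below (invented for this article, not taken from the paper) builds a fake neuron whose activation mixes two unrelated features; an explanation naming only one of the features can reproduce only part of the neuron’s behavior.

import numpy as np

rng = np.random.default_rng(0)
# Two unrelated binary features over 1,000 hypothetical tokens.
is_french = rng.integers(0, 2, size=1000)      # feature A: token is French
is_chemistry = rng.integers(0, 2, size=1000)   # feature B: token is a chemistry term

# A "polysemantic" neuron that responds to both concepts at once.
neuron = is_french + is_chemistry

for label, feature in [("fires on French text", is_french),
                       ("fires on chemistry terms", is_chemistry)]:
    # Treat the single-concept explanation as the simulated activation and
    # score it by correlation with the real activation, as in the sketch above.
    score = np.corrcoef(feature, neuron)[0, 1]
    print(f"Explanation '{label}': correlation with the neuron = {score:.2f}")
# Each single-concept explanation captures only about half of the neuron's
# variance (correlation near 0.7), so neither label explains it on its own.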

The bottom line, for OpenAI, is that not every neuron can be explained in natural language, and that, so far, researchers can only observe correlations between the input and the interpreted neuron over a fixed distribution, which past scientific work has shown may not reflect a causal relationship between the two. Despite this, the researchers are quite optimistic and believe they have laid the groundwork for machine interpretability. They have now published the code for the automated interpretation system on GitHub, along with the GPT-2 XL neurons and the explanation datasets.