LLMs such as GPT are very simple in execution. A textual prompt is provided to the model as a sort of seed. The model takes the prompt and generates the next token (a word or word fragment) in the sequence. The new token is appended to the end of the prompt, and the process repeats.
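As a rough sketch of that loop (not any particular model's API): `next_token_logits` below is a hypothetical stand-in for whatever actually scores the vocabulary, and the loop just greedily appends the highest-scoring token and goes around again.

```python
import numpy as np

def generate(prompt_tokens, next_token_logits, n_new=50, eos_id=None):
    """Greedy autoregressive decoding: score the vocabulary, take the most
    likely next token, append it to the prompt, and feed the longer
    sequence back in."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = next_token_logits(tokens)   # scores over the whole vocabulary
        next_id = int(np.argmax(logits))     # greedy pick; real systems often sample
        tokens.append(next_id)               # the prompt grows by one token
        if eos_id is not None and next_id == eos_id:
            break
    return tokens

# Toy usage with a random "model" over a 1000-token vocabulary:
print(generate([1, 2, 3], lambda toks: np.random.randn(1000), n_new=5))
```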
In recent “foundation” models, the LLM is capable of writing sophisticated stories with a beginning, middle, and end. Although it can get “lost,” given enough of an input prompt it goes in the “right direction” more often than not.
The LLM itself is stateless. Any new information, and all the context, lives in the prompt. The prompt is steering itself, based on the model it is interacting with.
I’ve been wondering about that interaction between the growing prompt and the different layers of the model. The core of a transformer is the concept of attention, where each vector in an input buffer is compared to all the others. Those that match are amplified, and the others are suppressed.
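A minimal sketch of that comparison step, leaving out the learned query/key/value projections, multiple heads, and causal masking:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over an input buffer x of shape
    (seq_len, d). Every vector is compared to every other vector; the
    softmax amplifies strong matches and suppresses the rest."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # all-pairs comparison
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x                               # each output mixes in its matches

buf = np.random.randn(6, 16)        # toy "input buffer" of 6 vectors
print(self_attention(buf).shape)    # (6, 16)
```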
All LLMs take their input as a series of tokens. These tokens are indexes into a vector dictionary. The vectors are placed into the input buffer. At this point attention is applied, then the buffer is successively manipulated through the rest of the architecture to find the next token, and the process repeats. A one-layer LLM is shown below (with a code sketch after the figure):

From LLaMA-2 from the Ground Up
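To make the figure concrete, here is a toy one-layer decoder in the same spirit: token ids index into the vector dictionary, a single attention-plus-feed-forward layer updates the buffer, and the last position is scored against the vocabulary to pick the next token. All sizes and weights here are made up, and normalization, positional encoding, multi-head splitting, and LLaMA's specific choices (RMSNorm, RoPE, SwiGLU) are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff = 1000, 64, 256     # toy sizes, not LLaMA's

embedding = rng.normal(size=(vocab_size, d_model))        # the "vector dictionary"
W_q, W_k, W_v = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3)]
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
unembed = rng.normal(size=(d_model, vocab_size)) * 0.02

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_layer_llm(token_ids):
    x = embedding[token_ids]                       # tokens -> vectors in the input buffer
    q, k, v = x @ W_q, x @ W_k, x @ W_v            # attention projections
    att = softmax(q @ k.T / np.sqrt(d_model)) @ v  # compare every vector to the others
    x = x + att                                    # residual connection
    x = x + np.maximum(0, x @ W1) @ W2             # feed-forward block
    logits = x[-1] @ unembed                       # score candidates for the *next* token
    return int(np.argmax(logits))

prompt = [3, 17, 42, 9]                            # made-up token ids
print(one_layer_llm(prompt))                       # index of the predicted next token
```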
The LLaMA-2 70B model from Meta has 80 transformer layers (the 7B version has 32). This means that the output of one layer is used as the input to the next layer. This is all in vector space – no tokens. Because attention is applied at each layer, the transformer stack is finding an overall location and direction for the current input buffer and using that as a way of finding the next token.
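The stacking itself is easy to picture as a loop over blocks that each map a (seq_len, d_model) buffer to another one; the block below is a stand-in, not LLaMA's actual layer:

```python
import numpy as np

n_layers, seq_len, d_model = 80, 12, 64        # 80 layers as in LLaMA-2 70B; other sizes are toy
rng = np.random.default_rng(1)

def transformer_block(h):
    """Placeholder for a full attention + feed-forward block. The only point
    here is that it maps (seq_len, d_model) to (seq_len, d_model)."""
    W = rng.normal(size=(d_model, d_model)) * 0.02
    return h + np.tanh(h @ W)                  # residual update, still in vector space

h = rng.normal(size=(seq_len, d_model))        # the embedded prompt: no tokens past this point
for _ in range(n_layers):
    h = transformer_block(h)                   # output of one layer is the input to the next
print(h.shape)                                 # still (seq_len, d_model) at the top of the stack
```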

In addition, recent large LLMs have a number of other tweaks that Meta discusses in LLaMA: Open and Efficient Foundation Language Models. For example, the larger LLaMA-2 models use grouped-query attention, which shares a single key and value head across each group of query heads:

From GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
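A sketch of that grouping, with toy sizes (LLaMA-2 70B actually uses 64 query heads sharing 8 key/value heads); each group of query heads attends against the same shared key and value head:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, head_dim = 8, 16
n_q_heads, n_kv_heads = 8, 2                 # toy sizes
group_size = n_q_heads // n_kv_heads         # query heads per shared key/value head

q = rng.normal(size=(n_q_heads, seq_len, head_dim))
k = rng.normal(size=(n_kv_heads, seq_len, head_dim))
v = rng.normal(size=(n_kv_heads, seq_len, head_dim))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

outputs = []
for h in range(n_q_heads):
    g = h // group_size                      # which shared K/V head this query head uses
    scores = q[h] @ k[g].T / np.sqrt(head_dim)
    outputs.append(softmax(scores) @ v[g])
out = np.stack(outputs)
print(out.shape)                             # (n_q_heads, seq_len, head_dim)
```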
This means that there is an overall reduction in the dimensionality of the “space” as you move from input to output, so there are fewer vectors competing for attention. Something resembling concepts, themes, and biases may emerge at different transformer layers. The last few layers would have more to do with calculating the next token, so the map, if you will, is in the middle layers of the system.
Because each layer feeds into a subsequent layer, there is a fixed relationship between these abstractions. Done right, you should be able to zoom in or out to a particular level of detail.
This space does not have to be physical, like lat/lon, or temporal, like year of publication. It can be anything, like the relationships between conspiracy theories. I’ve produced maps like this before using force-directed graphs, but this approach seems more direct and able to work naturally at larger scales.

Turning this into human-readable information will be the challenge here, though I think the model could help with that as well. The manifold reduction would try to maintain the relationships among nearby vectors in the transformer stack. Some work in this direction is Language Models Represent Space and Time, which works with LLaMA models and might provide insight into techniques. For example, it may be that the authors are evolving prompt vectors directly, using a fitness test that calculates the distance between a generated year or lat/lon and the actual value, then using that difference to make a better prompt. Given that they have a LLaMA model to work with, they could do backpropagation or, conceivably, a less coupled arrangement like an evolutionary algorithm. In other words, the prompt becomes the mapping function.
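As a sketch of that evolutionary variant (my speculation, not the paper’s actual method): `predict_latlon` below is a hypothetical stand-in for “run the model with this prompt vector and read out a lat/lon,” and the fitness is just the negative distance to the true location.

```python
import numpy as np

rng = np.random.default_rng(3)
d_prompt = 64                                 # toy prompt-vector size

def predict_latlon(prompt_vec):
    """Hypothetical stand-in for running the model with this prompt vector
    and reading out a lat/lon. Here it is just a fixed linear map."""
    W = np.linspace(-1.0, 1.0, 2 * d_prompt).reshape(d_prompt, 2)
    return prompt_vec @ W

def fitness(prompt_vec, true_latlon):
    # Negative distance between the generated lat/lon and the actual one.
    return -np.linalg.norm(predict_latlon(prompt_vec) - true_latlon)

# Simple (1 + lambda) evolutionary loop: mutate the prompt vector, keep the best.
true_latlon = np.array([48.85, 2.35])         # e.g., Paris
prompt = rng.normal(size=d_prompt)
for step in range(200):
    children = prompt + 0.1 * rng.normal(size=(16, d_prompt))
    scores = [fitness(c, true_latlon) for c in children]
    best = children[int(np.argmax(scores))]
    if fitness(best, true_latlon) > fitness(prompt, true_latlon):
        prompt = best                         # the prompt becomes the mapping function
print(fitness(prompt, true_latlon))
```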
