Coconut

a.k.a. Chain of Continuous Thought.

quote from Reflections on Neuralese:

To refresh, a language transformer starts by embedding input tokens as vectors in some high-dimensional latent space, and runs each of these embeddings through a series of repeated computational layers. Then, of the resulting modified vectors in latent space, the vector that previously corresponded to the final input token is projected and normalized to create a probability distribution over what the next token could be. Then, to actually get the next token, you sample from the distribution.

Chain of Thought reasoning works so well because the model does some computation, outputs a token, and then all future instances of that model have access to that information as well. In essence, this technique for storing information between different forward passes greatly increases the serial depth of computation that is possible for the model. Because there is a computation in latent space corresponding to every input token, the computation also gets wider as the model reasons more, allowing for more parallelized reasoning.

The recent Neuralese paper takes this process and removes a few steps. It notices that the projection and sampling process loses almost all of the information encoded in the last layer of the model, and to increase the bandwidth of information flowing through the reasoning process, you can simply remove that lossy part of the computation. Instead, they have the model directly output the aforementioned high-dimensional latent vector without projecting it, and then that is used as an embedding for the model in future steps:

[figure: Coconut, with the last hidden state fed back as the next input embedding]
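
to make the difference concrete, here's a minimal toy sketch (mine, not code from the paper) of the two feedback loops: regular chain of thought projects the last hidden state to a distribution, samples a token, and re-embeds it, while Coconut just appends the hidden state itself as the next input embedding. TinyLM, its sizes, and all the variable names are made up for illustration; it skips causal masking, the paper's special tokens, and the whole training setup.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for a language transformer (untrained, no causal mask)."""
    def __init__(self, vocab=100, d=64, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)                  # token id -> latent vector
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)   # "repeated computational layers"
        self.lm_head = nn.Linear(d, vocab)                   # projection back to the vocabulary

    def forward_embeds(self, embeds):
        # run the stack directly on a sequence of latent vectors, return final hidden states
        return self.blocks(embeds)

model = TinyLM()
prompt = torch.randint(0, 100, (1, 8))            # (batch, seq) of token ids
embeds = model.embed(prompt)                      # (1, 8, 64)
h = model.forward_embeds(embeds)[:, -1]           # hidden state at the last position

# ordinary chain of thought: project, normalize, sample, then re-embed the sampled token
probs = torch.softmax(model.lm_head(h), dim=-1)   # lossy step: vector -> distribution
next_tok = torch.multinomial(probs, 1)            # one discrete token survives
embeds_cot = torch.cat([embeds, model.embed(next_tok)], dim=1)

# Coconut-style latent feedback: skip projection and sampling entirely;
# the hidden state itself becomes the next "input embedding" (a continuous thought)
embeds_coconut = torch.cat([embeds, h.unsqueeze(1)], dim=1)

print(embeds_cot.shape, embeds_coconut.shape)     # both (1, 9, 64)
```

the only difference between the two branches is whether the lossy project-and-sample step happens before the vector goes back into the model.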

personally, I find the following idea in the paper the most interesting:

Given the intuition that continuous thoughts can encode multiple potential next steps, the latent reasoning can be interpreted as a search tree, rather than merely a reasoning "chain".
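
a rough way to poke at that intuition, continuing the toy sketch above (same made-up model, untrained): probe a continuous thought with the LM head and look at the top few candidates. nothing forces the probability mass onto a single token, so feeding the latent forward keeps several branches alive at once instead of committing to one, which is where the search-tree picture comes from.

```python
# probe the continuous thought h from the sketch above: which next steps does it keep alive?
topk = torch.topk(torch.softmax(model.lm_head(h), dim=-1), k=5, dim=-1)
for tok, p in zip(topk.indices[0].tolist(), topk.values[0].tolist()):
    print(f"candidate next token {tok}: p = {p:.3f}")
# sampling would commit to one of these; latent feedback carries the whole frontier forward
```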

queryMT

all about generating by querying


2025-05-27