Former Windows core developer Dave Plummer has run a Transformer model on a 47-year-old PDP-11/44 and trained it to completion using a 6 MHz CPU and 64 KB of memory. The model, called ATTN-11, was written in PDP-11 assembly language by Damien Boureille. It implements a single-layer, single-head Transformer with only 1,216 parameters.
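For readers unfamiliar with the term, a single attention head can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not ATTN-11's code: the function names, the tiny 2x2 weight matrices, and the use of floating point are all assumptions made for readability.

```python
import math

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]    # one distribution per position
    return matmul(weights, v), weights

# Hypothetical toy input: two positions, two features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(identity, identity, identity, identity)
```

Each row of `weights` says how strongly one position looks at every other position; training adjusts the projection matrices so those rows point at the useful inputs.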

The model's task looks simple: take a string of digits as input and output it in reverse. To do so, however, the model must learn the structural rule of sequence reversal on its own. Plummer believes this captures the essence of how modern large models such as ChatGPT work.
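The reversal task can be illustrated with a short Python sketch. The sequence length of 8, the 10-symbol vocabulary, and the helper names are assumptions chosen for the example; the article does not give ATTN-11's actual settings.

```python
import random

def make_example(length=8, vocab=10, rng=random):
    """Return (input_digits, target_digits) for the reversal task."""
    seq = [rng.randrange(vocab) for _ in range(length)]
    return seq, seq[::-1]

def accuracy(predict, examples):
    """Fraction of examples whose prediction matches the reversed target exactly."""
    hits = sum(1 for x, y in examples if predict(x) == y)
    return hits / len(examples)

rng = random.Random(0)
examples = [make_example(rng=rng) for _ in range(100)]

# The rule a trained model has to discover: output position i copies
# input position n - 1 - i.
def perfect(x):
    return [x[len(x) - 1 - i] for i in range(len(x))]

print(accuracy(perfect, examples))  # prints 1.0
```

A model is never handed the `n - 1 - i` rule. It only sees input and target pairs and has to arrive at an equivalent mapping through training.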

To run on such limited hardware, ATTN-11 is aggressively optimized. The forward pass is reduced to 8-bit fixed-point precision, and the code is tuned down to individual CPU cycles.
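As a rough illustration of what 8-bit fixed-point arithmetic involves, the Python sketch below stores each value as a signed 8-bit integer with an implied binary point. The split of 4 fractional bits and the function names are assumptions for the example; the article does not state which format ATTN-11 uses.

```python
FRAC_BITS = 4              # assumed number of fractional bits, not taken from ATTN-11
SCALE = 1 << FRAC_BITS     # 16 steps per unit

def saturate(q):
    """Clamp an integer to the signed 8-bit range."""
    return max(-128, min(127, q))

def to_fixed(x):
    """Quantize a float to signed 8-bit fixed point."""
    return saturate(int(round(x * SCALE)))

def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits, so it is shifted
    right by FRAC_BITS (rounding toward negative infinity) and saturated.
    """
    return saturate((a * b) >> FRAC_BITS)

a, b = to_fixed(1.5), to_fixed(0.75)
print(to_float(fixed_mul(a, b)))  # prints 1.125
```

The trade-off is visible in the constants: with this assumed format, values are limited to roughly -8 to 8 in steps of 1/16, in exchange for arithmetic that needs only integer multiplies and shifts.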

Finally, Plummer used a cache board to [...]. After about 350 training steps, the model reached 100% accuracy; the whole run took about 3.5 minutes.

Plummer describes the training process in the video: "The model starts out stupid, with high losses, and then at some point, the weights start to converge, the attention mechanism discovers the inversion mapping, and the machine crosses that invisible line from guessing to knowing."

His core point is that the essence of modern AI is not some mysterious power, but "the machine repeatedly updates the strength of thousands of weighted connections, making the next answer slightly less wrong than the last time."
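The idea of repeatedly nudging weights so that each answer is slightly less wrong can be shown with a one-weight toy example in Python. This is plain gradient descent on a made-up dataset (learning that y = 3x), chosen for illustration; it is not ATTN-11's training loop, and the learning rate and step count are arbitrary assumptions.

```python
def train(steps=50, lr=0.1):
    """Fit a single weight w so that w * x approximates y, using gradient descent."""
    data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]  # hypothetical samples of y = 3x
    w = 0.0
    losses = []
    for _ in range(steps):
        loss = 0.0
        grad = 0.0
        for x, y in data:
            err = w * x - y          # how wrong the current answer is
            loss += err * err        # squared error
            grad += 2.0 * err * x    # derivative of the squared error w.r.t. w
        losses.append(loss / len(data))
        w -= lr * grad / len(data)   # small step that reduces the error
    return w, losses

w, losses = train()
print(round(w, 6))            # prints 3.0
print(losses[0] > losses[-1])  # prints True
```

A real Transformer does the same thing with thousands (or billions) of weights at once, using backpropagation to compute each weight's gradient.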

Plummer concludes that as computing resources increasingly become the bottleneck, companies that return to a relentless pursuit of efficiency and optimization will hold a greater advantage in future AI competition.