Simple GPT implementation in pure Go. Trained on my favourite Jules Verne books.
The kind of response you can expect from the model:
Mysterious Island.
Well.
My days must follow
Or this:
Captain Nemo, in two hundred thousand feet weary in
the existence of the world.
To train the model:
$ go run .
Training takes about 40 minutes on a MacBook Air M3. You can train on your own dataset by pointing the data.dataset variable to your text corpus.
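For example, something along these lines (a hypothetical sketch; the actual declaration lives wherever data.dataset is defined in the repo, and the file name here is made up):

package data

// dataset points to the training corpus. Replace the path with your own text
// file to train on a different dataset (the file name below is made up).
var dataset = "jules-verne.txt"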
To run in chat-only mode once the training is done:
$ go run . -chat
You can use this repository as a companion to the Neural Networks: Zero to Hero course. Use git checkout <tag> to see how the model has evolved over time: naive, bigram, multihead, block, residual, full.
In main_test.go you will find explanations, starting from a basic neuron example:
// Our neuron has 2 inputs and 1 output (the number of columns in the weight matrix).
// Its goal is to predict the next number in the sequence.
input := V{1, 2} // {x1, x2}
weight := M{
    {2}, // how much x1 contributes to the output
    {3}, // how much x2 contributes to the output
}
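As a quick sanity check, here is a standalone sketch of that neuron's forward pass in plain Go (ordinary slices and a loop rather than the repo's V/M types; the output is just the weighted sum of the inputs):

package main

import "fmt"

func main() {
	// The same neuron, written with ordinary slices:
	// a single output is just the weighted sum x1*w1 + x2*w2.
	input := []float64{1, 2}  // {x1, x2}
	weight := []float64{2, 3} // how much x1 and x2 contribute to the output

	output := 0.0
	for i := range input {
		output += input[i] * weight[i]
	}
	fmt.Println(output) // 1*2 + 2*3 = 8
}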
All the way to the self-attention mechanism:
// To calculate the sum of all previous tokens, we can multiply by this triangular matrix:
tril := M{
    {1, 0, 0, 0}, // first token attends only to itself ("cat"); it can't look into the future
    {1, 1, 0, 0}, // second token attends to itself and the previous token ("cat" + ", ")
    {1, 1, 1, 0}, // third token attends to itself and the two previous tokens ("cat" + ", " + "dog")
    {1, 1, 1, 1}, // fourth token attends to itself and all the previous tokens ("cat" + ", " + "dog" + " and")
}.Var()
enrichedEmbeds := MatMul(tril, inputEmbeds)
// At this point each embedding is enriched with information from all the previous tokens.
// That's the crux of self-attention.
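To see that enrichment concretely, here is a standalone numeric sketch of the triangular multiply (plain slices rather than the repo's M/Var types; the embedding values are made up):

package main

import "fmt"

func main() {
	// Lower-triangular mask: row i has ones in columns 0..i.
	tril := [][]float64{
		{1, 0, 0, 0},
		{1, 1, 0, 0},
		{1, 1, 1, 0},
		{1, 1, 1, 1},
	}
	// Four tokens, each with a made-up 2-dimensional embedding.
	inputEmbeds := [][]float64{
		{1, 0}, // "cat"
		{0, 1}, // ", "
		{2, 0}, // "dog"
		{0, 2}, // " and"
	}
	// enriched[i] = embedding of token i plus the embeddings of every token before it.
	enriched := make([][]float64, len(tril))
	for i := range tril {
		enriched[i] = make([]float64, len(inputEmbeds[0]))
		for k := range inputEmbeds {
			for j := range inputEmbeds[k] {
				enriched[i][j] += tril[i][k] * inputEmbeds[k][j]
			}
		}
	}
	fmt.Println(enriched) // [[1 0] [1 1] [3 1] [3 3]]
}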
No batches.
I've dropped the batch dimension for the sake of understanding: it's far easier to build intuition with 2D matrices than with 3D tensors. Besides, batches aren't inherent to the transformer architecture.
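If you ever do want batches, they reduce to a loop over independent 2D examples. A minimal sketch, where forward is a hypothetical per-example function standing in for the real 2D computation:

package main

import "fmt"

// forward is a hypothetical per-example pass: here it just sums the example's
// values, standing in for the real 2D-matrix computation.
func forward(example [][]float64) float64 {
	var sum float64
	for _, row := range example {
		for _, v := range row {
			sum += v
		}
	}
	return sum
}

func main() {
	// A "batch" is nothing more than a slice of independent 2D examples;
	// looping over it gives you everything a 3D batch dimension would.
	batch := [][][]float64{
		{{1, 0}, {0, 1}},
		{{2, 0}, {0, 2}},
	}
	for _, example := range batch {
		fmt.Println(forward(example)) // 2, then 4
	}
}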
Removed gonum.
The gonum-based matmul gave us a ~30% performance boost, but it brought in an additional dependency. We're not striving for maximum efficiency here, but for radical simplicity. The current matmul implementation is reasonably efficient, and it's only 40 lines of plain, readable code.
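For reference, a plain matmul really is just three nested loops. This is only a minimal sketch, not the repo's exact implementation:

package main

import "fmt"

// matMul multiplies an (n x k) matrix by a (k x m) matrix with three nested loops.
func matMul(a, b [][]float64) [][]float64 {
	n, k, m := len(a), len(b), len(b[0])
	out := make([][]float64, n)
	for i := 0; i < n; i++ {
		out[i] = make([]float64, m)
		for p := 0; p < k; p++ {
			for j := 0; j < m; j++ {
				out[i][j] += a[i][p] * b[p][j]
			}
		}
	}
	return out
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	fmt.Println(matMul(a, b)) // [[19 22] [43 50]]
}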
Some relevant papers follow; you don't need to read them to understand the code :)
Attention Is All You Need
Deep Residual Learning
DeepMind WaveNet
Batch Normalization
Deep NN + huge data = breakthrough performance
OpenAI GPT-3 paper
Analyzing the Structure of Attention
Many thanks to Andrej Karpathy for his brilliant Neural Networks: Zero to Hero course.