diff --git a/README.md b/README.md
index fb91582f..7e741c15 100644
--- a/README.md
+++ b/README.md
@@ -40,6 +40,7 @@ The `_demo` directory contains our demos (it start with `_` to prevent the `go`
* [concat](_demo/concat/concat.go): call C fprintf with stderr, and Go variadic function
* [qsort](_demo/qsort/qsort.go): call C function with a callback (eg. qsort)
* [genints](_demo/genints/genints.go): various forms of closure usage (including C function, recv.method and anonymous function)
+* [llama2-c](_demo/llama2-c/run.go): inference a baby Llama 2 model (the first llgo AI example)
### How to run demos
diff --git a/_demo/llama2-c/README.md b/_demo/llama2-c/README.md
new file mode 100644
index 00000000..6b3e469d
--- /dev/null
+++ b/_demo/llama2-c/README.md
@@ -0,0 +1,137 @@
+llama2 - Inference Llama 2 in LLGo
+=====
+
+<p align="center">
+  <img src="assets/llama_cute.jpg" width="300" height="300" alt="Cute Llama">
+</p>
+
+Have you ever wanted to inference a baby [Llama 2](https://ai.meta.com/llama/) model in Go? No? Well, now you can!
+
+This is based on [llama2.c](https://github.com/karpathy/llama2.c); we didn't port anything! So it's very different from these Go implementations:
+* https://github.com/nikolaydubina/llama2.go
+* https://github.com/tmc/go-llama2
+
+llgo acts as the bridge here, letting the C ecosystem plug into Go seamlessly.
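+
+As a taste of what that bridge looks like in Go source, here is a minimal, hedged sketch that only uses helpers already visible in this demo's `run.go` (`c.Str`, `c.Fprintf`, `c.Stderr` from `github.com/goplus/llgo/c`); it is an illustration of the calling style, not part of the demo itself:
+
+```go
+package main
+
+// llgo maps this package onto the C standard library, so the format string
+// below is handed straight to C's fprintf without any cgo wrapper code.
+import "github.com/goplus/llgo/c"
+
+func main() {
+	// c.Str builds a C string literal; c.Fprintf calls the real fprintf(3).
+	c.Fprintf(c.Stderr, c.Str("hello from C, called by Go: %d\n"), 42)
+}
+```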
+
+You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper). The upstream llama2.c repo is a "fullstack" train + inference solution for the Llama 2 LLM, with a focus on minimalism and simplicity.
+
+As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B.
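+
+As a rough back-of-the-envelope check (plain Go arithmetic, not code from this demo), fp32 costs 4 bytes per parameter, which is where the ~26GB figure quoted for the 7B export below comes from:
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+	const bytesPerParam = 4 // fp32
+	params := 7e9           // roughly the parameter count of Llama 2 7B
+	fmt.Printf("7B model in fp32: ~%.0f GiB\n", params*bytesPerParam/(1<<30)) // ~26 GiB
+}
+```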
+
+## feel the magic
+
+How to run this example? The simplest way is to run it without any arguments:
+
+```bash
+llgo run .
+```
+
+This uses the default model checkpoint file (`stories15M.bin`) and the default prompt (`Once upon a time`).
+
+You need to download the model checkpoint file first. This 15M parameter model is trained on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (~60MB download):
+
+```bash
+wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
+```
+
+If you want to specify a prompt (e.g. `Long long ago`):
+
+```bash
+llgo run . 'Long long ago'
+```
+
+We can also try the slightly bigger 42M parameter model (i.e. `stories42M.bin`):
+
+```bash
+wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
+llgo run . -m stories42M.bin 'Long long ago'
+```
+
+There is also an even better 110M param model available, see [models](#models).
+
+Quick note on sampling: the recommendation for ~best results is to sample with `-t 1.0 -p 0.9`, i.e. temperature 1.0 (default) but also top-p sampling at 0.9 (default). Intuitively, top-p ensures that tokens with tiny probabilities do not get sampled, so we can't get "unlucky" during sampling, and we are less likely to go "off the rails" afterwards. More generally, to control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/), [this](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p) or [this](https://huggingface.co/blog/how-to-generate).
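+
+To make the two knobs concrete, here is a small self-contained Go sketch of temperature plus top-p (nucleus) sampling over a slice of logits. It illustrates the general idea only and is not the sampler used by llama2.c or this demo:
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+	"math/rand"
+	"sort"
+)
+
+// sample picks a token id from logits: apply temperature to a softmax, keep
+// only the smallest set of tokens whose cumulative probability reaches topP,
+// then draw one of the surviving tokens at random.
+func sample(logits []float64, temperature, topP float64, rng *rand.Rand) int {
+	if topP <= 0 || topP >= 1 {
+		topP = 1 // treat out-of-range values as "top-p disabled"
+	}
+
+	// softmax with temperature (subtract the max logit for numerical stability)
+	maxLogit := logits[0]
+	for _, l := range logits {
+		if l > maxLogit {
+			maxLogit = l
+		}
+	}
+	probs := make([]float64, len(logits))
+	sum := 0.0
+	for i, l := range logits {
+		probs[i] = math.Exp((l - maxLogit) / temperature)
+		sum += probs[i]
+	}
+	for i := range probs {
+		probs[i] /= sum
+	}
+
+	// sort token ids by probability, highest first
+	ids := make([]int, len(probs))
+	for i := range ids {
+		ids[i] = i
+	}
+	sort.Slice(ids, func(a, b int) bool { return probs[ids[a]] > probs[ids[b]] })
+
+	// keep the smallest prefix whose cumulative probability reaches topP
+	cum, cut := 0.0, len(ids)
+	for i, id := range ids {
+		cum += probs[id]
+		if cum >= topP {
+			cut = i + 1
+			break
+		}
+	}
+
+	// draw from the truncated distribution (renormalized by cum)
+	r := rng.Float64() * cum
+	acc := 0.0
+	for _, id := range ids[:cut] {
+		acc += probs[id]
+		if r < acc {
+			return id
+		}
+	}
+	return ids[cut-1]
+}
+
+func main() {
+	rng := rand.New(rand.NewSource(1))
+	logits := []float64{2.0, 1.0, 0.5, -1.0}
+	fmt.Println("sampled token id:", sample(logits, 1.0, 0.9, rng))
+}
+```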
+
+
+## Meta's Llama 2 models
+
+As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format.
+For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export.py` script from the llama2.c repo, e.g. for the 7B model:
+
+```bash
+python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
+```
+
+The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that, despite efforts, the 13B export currently doesn't work. I would not attempt to run anything above 7B right now for two reasons: first, 13B+ currently doesn't work because of integer overflow in pointer arithmetic, which is yet to be fixed, and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:
+
+```bash
+llgo run . -m llama2_7b.bin
+```
+
+This ran at about 4 tokens/s compiled with OpenMP on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with `make runfast`.) Example output:
+
+> The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.
+
+base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script, and going wild with finetunes all within the repo!
+
+You can also try Meta's Code Llama models even if support for them is incomplete. In particular, some hyperparameters changed (e.g. the constant in RoPE layer), so the inference is not exactly correct and a bit buggy right now. Looking into fixes. Make sure to build the tokenizer for the plain and instruct variants and pass it when doing inference.
+
+```bash
+python export.py codellama2_7b.bin --meta-llama /path/to/CodeLlama-7b
+python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b/tokenizer.model
+./run codellama2_7b.bin -z /path/to/CodeLlama-7b/tokenizer.bin
+```
+
+Chat with Code Llama Instruct:
+
+```bash
+python export.py codellama2_7b_instruct.bin --meta-llama /path/to/CodeLlama-7b-Instruct
+python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b-Instruct/tokenizer.model
+./run codellama2_7b_instruct.bin -m chat -z /path/to/CodeLlama-7b-Instruct/tokenizer.bin
+```
+
+
+## huggingface models
+
+We can load any huggingface models that use the Llama 2 architecture. See the [export.py](https://github.com/karpathy/llama2.c/blob/master/export.py) script in llama2.c and its `--hf` flag to export the model .bin file.
+
+## models
+
+For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. I am hosting them on huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin:
+
+| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) |
+| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) |
+| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) |
+| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) |
+
+You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. The only notable changes from the GPT-1/2 architecture are that Llama uses rotary positional embeddings (RoPE) instead of absolute/learned positional embeddings, a slightly fancier SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery.
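+
+Of these changes, RMSNorm is the easiest to show in a few lines. Here is a hedged Go sketch of the standard formulation (scale the input by weight / sqrt(mean(x^2) + eps)); it mirrors the usual definition rather than any particular implementation in this repo:
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+)
+
+// rmsnorm normalizes x by its root mean square and applies a learned
+// per-channel weight, instead of subtracting a mean and dividing by a
+// standard deviation as LayerNorm does.
+func rmsnorm(out, x, weight []float32, eps float32) {
+	var ss float32
+	for _, v := range x {
+		ss += v * v
+	}
+	inv := 1 / float32(math.Sqrt(float64(ss/float32(len(x))+eps)))
+	for i := range x {
+		out[i] = weight[i] * (inv * x[i])
+	}
+}
+
+func main() {
+	x := []float32{1, 2, 3, 4}
+	w := []float32{1, 1, 1, 1}
+	out := make([]float32, len(x))
+	rmsnorm(out, x, w, 1e-5)
+	fmt.Println(out) // each element divided by sqrt(mean of squares) ≈ 2.739
+}
+```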
+
+
+## training
+
+Let's see how we can train a baby Llama 2 from scratch using the training code in [llama2.c](https://github.com/karpathy/llama2.c). First let's download and pretokenize some source dataset, e.g. I like [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) so this is the only example currently available in this repo. But it should be very easy to add datasets, see the code.
+
+```bash
+python tinystories.py download
+python tinystories.py pretokenize
+```
+
+Then train our model:
+
+```bash
+python train.py
+```
+
+**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https://arxiv.org/abs/2203.15556) to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers.
+
+Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the _total_ batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs/LLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K.
+
+Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged.
+
+Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size * 1024 tokens in sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so.
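+
+The tokens-per-update arithmetic from that paragraph, spelled out as a tiny Go check (numbers copied from the 110M run described above):
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+	batchSize := 16 // sequences per micro-batch (what fit a 40GB A100)
+	seqLen := 1024  // tokens per sequence
+	gradAccum := 8  // micro-batches accumulated per optimizer update
+	fmt.Println("tokens per update:", batchSize*seqLen*gradAccum) // 131072, i.e. ~100K+
+}
+```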
+
+Totally understandable if you want to skip model training; for a simple demo, just download one of the pretrained models (see the [models](#models) section), e.g.:
+
+```bash
+wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
+```
+
+Once we have the model `.bin` file, we can run inference with llgo:
+
+```bash
+llgo run . -m stories15M.bin
+```
diff --git a/_demo/llama2-c/assets/llama_cute.jpg b/_demo/llama2-c/assets/llama_cute.jpg
new file mode 100644
index 00000000..c7fbda5e
Binary files /dev/null and b/_demo/llama2-c/assets/llama_cute.jpg differ
diff --git a/_demo/llama2-c/run.go b/_demo/llama2-c/run.go
index 9b9e6a57..fe131aaf 100644
--- a/_demo/llama2-c/run.go
+++ b/_demo/llama2-c/run.go
@@ -34,17 +34,15 @@ loop: // parse command line arguments
switch c.Getopt(c.Argc, c.Argv, c.Str("m:")) {
case 'm':
checkpointPath = c.Optarg
- c.Fprintf(c.Stderr, c.Str("use model: %s\n"), checkpointPath)
+ c.Fprintf(c.Stderr, c.Str("==> use model: %s\n"), checkpointPath)
case -1:
break loop
}
}
- /*
- if c.Optind < c.Argc {
- prompt = c.Index(c.Argv, c.Optind)
- c.Fprintf(c.Stderr, c.Str("prompt: %s\n"), prompt)
- }
- */
+ if c.Optind < c.Argc {
+ prompt = c.Index(c.Argv, c.Optind)
+ c.Fprintf(c.Stderr, c.Str("==> prompt: %s\n"), prompt)
+ }
// build the Transformer via the model .bin file
var transformer llama2.Transformer
diff --git a/x/llama2/README.md b/x/llama2/README.md
deleted file mode 100644
index b1846701..00000000
--- a/x/llama2/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-llama2 - LLGo support for Inference Llama 2
-=====
-
-TODO