I Let an AI Run 14 Experiments on My Sri Aurobindo Language Model — Here's What It Found

I've been training a small language model on Sri Aurobindo's collected works. 6.7 million tokens across 23 books — philosophy, yoga, poetry, political writing. The goal isn't to build the next ChatGPT. It's to see what a model learns when the entire training corpus is one thinker's lifetime of work.

A few weeks ago, I had a 22-million parameter model that was memorizing more than it was learning. Val loss at 3.95 with a train loss of 1.73 — a gap of 2.2 that screamed overfitting. An LLM judge rated the output 3.26 out of 10. Not great.

Then I came across Karpathy's autoresearch pattern.

The Idea Is Almost Stupidly Simple

Give an AI agent a training script. Let it modify one thing. Run for exactly 10 minutes. If val_loss improves, keep the change. If not, revert. Repeat.

No grand strategy. No sweeping architecture redesigns. Just: try something small, measure, keep or toss. The constraint is time, not compute.
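The keep-or-revert loop described above can be sketched in a few lines. This is a simplification under stated assumptions, not the actual harness: the real agent edits train.py directly, whereas here a change is modeled as a new config dict, and `propose` and `run_experiment` are hypothetical callables standing in for the agent's edit and the 10-minute training run.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
Record = Tuple[Config, float, bool]


def autoresearch(
    baseline: Config,
    propose: Callable[[Config], Config],
    run_experiment: Callable[[Config], float],
    n_experiments: int,
) -> Tuple[Config, float, List[Record]]:
    """Greedy search: accept a change only if it lowers val_loss."""
    best_cfg = dict(baseline)
    best_loss = run_experiment(best_cfg)  # the baseline run sets the bar
    history: List[Record] = []
    for _ in range(n_experiments):
        # Hypothetical stand-in for the agent modifying one thing.
        candidate = propose(dict(best_cfg))
        # Hypothetical stand-in for a fixed-budget training run.
        loss = run_experiment(candidate)
        kept = loss < best_loss
        if kept:
            best_cfg, best_loss = candidate, loss
        # When not kept, best_cfg is simply left alone: that is the "revert".
        history.append((candidate, loss, kept))
    return best_cfg, best_loss, history
```

One design choice worth noting: if `run_experiment` returns `float("inf")` for a crashed or out-of-memory run (an assumption about how failures would be reported), the failure is reverted by the same comparison with no special casing.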

I adapted this for my setup — M4 Pro with 24GB RAM, MPS backend, no cloud GPUs. The corpus is fixed. The tokenizer is fixed. The agent can only touch train.py. Everything else is locked down.

Then I let it run.

14 Experiments, 2 Hours, One Clear Story

The agent ran 14 experiments. Six improved val_loss and were kept. Eight were reverted. The whole thing took about two hours.

The headline: val_loss dropped from 5.286 to 4.987 — a 5.7% improvement.

But the number isn't the interesting part. The trajectory is.

Here's what happened, in order:

Experiment 1: The agent tried doubling the batch size. MPS ran out of memory. 35 steps in 10 minutes instead of 1,199. Reverted. Lesson learned the hard way.

Experiment 2: Dropped from 6 layers to 4, bumped the learning rate. Fewer parameters meant more training steps in the same time budget. First improvement.

Experiment 4: Cut the context length from 512 to 256 tokens. This was the single biggest win — 4.2% improvement in one move. Shorter sequences meant the model could chew through twice as many training examples.

Experiment 9: The agent went smaller again. 3 layers, 4 heads, 256 embedding dim. A 10-million parameter model. Less than half the original. And it was the new best.

Experiments 13-14: Pushed batch size up to 32, now that the model was small enough to handle it. Crossed below 5.0 for the first time.

The final configuration:

# 10.1M parameters (down from 22.3M)
N_LAYER = 3
N_HEAD = 4
N_EMBD = 256
BLOCK_SIZE = 256
BATCH_SIZE = 32
LEARNING_RATE = 6e-4
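
For a rough sense of where the parameters in a config like this go, here is a back-of-the-envelope counter for a GPT-style decoder. It is a sketch under assumptions the post does not confirm: learned absolute position embeddings, a 4x MLP width, an output head tied to the token embedding by default, and biases and LayerNorms ignored. The vocabulary size is not stated in the post, so `vocab_size` must be supplied by the caller.

```python
from typing import Dict


def estimate_gpt_params(
    n_layer: int,
    n_embd: int,
    block_size: int,
    vocab_size: int,
    tied_head: bool = True,
) -> Dict[str, int]:
    """Approximate weight count for a GPT-style decoder (matrices only)."""
    attn = 4 * n_embd * n_embd  # Q, K, V and output projections
    mlp = 8 * n_embd * n_embd   # two linear layers with a 4x hidden width
    blocks = n_layer * (attn + mlp)
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd  # assumes learned absolute positions
    head = 0 if tied_head else vocab_size * n_embd
    total = blocks + tok_emb + pos_emb + head
    return {
        "blocks": blocks,
        "embeddings": tok_emb + pos_emb,
        "head": head,
        "total": total,
    }


# vocab_size=32_000 is a placeholder, not the project's real tokenizer size.
estimate = estimate_gpt_params(n_layer=3, n_embd=256, block_size=256, vocab_size=32_000)
```

N_HEAD does not appear because splitting attention into heads leaves the weight count unchanged. With a placeholder vocabulary of that order, the embedding table outweighs the transformer blocks several times over, which is typical for models this small.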

The Counterintuitive Finding: Smaller Was Better. Every Time.

This surprised me. The 22M model felt like the right size — not too big, not too small for 6.7M tokens. But in a fixed 10-minute budget, the 10M model crushed it. Two reasons:

More steps. The tiny model gets 2,500-5,000 training steps in 10 minutes. The larger model gets about 1,200. When you're data-limited, more passes through the corpus matter more than model capacity.

Less overfitting. Fewer parameters relative to data means the model generalizes instead of memorizing. That 2.2 gap I was seeing earlier? This is how you close it.
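
The step counts above translate into passes over the corpus roughly as follows. This is a sketch with assumptions: every step consumes BATCH_SIZE x BLOCK_SIZE tokens (no gradient accumulation), the full 6.7M tokens are used for training, and the step counts are paired with the final config's batch size of 32 and context of 256, which the post does not state explicitly.

```python
CORPUS_TOKENS = 6_700_000  # corpus size given in the post


def tokens_per_step(batch_size: int, block_size: int) -> int:
    """Tokens consumed by one optimizer step, assuming no gradient accumulation."""
    return batch_size * block_size


def epochs_seen(
    steps: int,
    batch_size: int,
    block_size: int,
    corpus_tokens: int = CORPUS_TOKENS,
) -> float:
    """How many times the run has cycled through the corpus."""
    return steps * tokens_per_step(batch_size, block_size) / corpus_tokens


low = epochs_seen(2_500, batch_size=32, block_size=256)
high = epochs_seen(5_000, batch_size=32, block_size=256)
```

Under those assumptions the small model sees the corpus about 3.1 to 6.1 times in a single 10-minute run. The same calculation for the 22M model needs its batch size, which the post does not give.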

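This echoes the TinyStories paper, where models of around 10M parameters or fewer produced fluent text once the corpus was narrow enough. The takeaway is the same: match model capacity to the data. Sounds obvious in hindsight. Wasn't obvious before the agent tried it.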

What Didn't Work (Equally Interesting)

SwiGLU activation: often reported as a stronger MLP design at equal step counts, but here it made things worse. The extra compute per step meant fewer total steps. In a time-boxed regime, the cheapest forward pass that still learns wins. Fancier isn't better when the clock is ticking.
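
To see where the extra per-step cost comes from, compare weight counts for the two MLP designs. This is a sketch, not the agent's code: the post does not say what hidden width the SwiGLU variant used, so both the unadjusted 4x width and the commonly used two-thirds-scaled width are shown.

```python
def mlp_weights(d_model: int, hidden: int) -> int:
    """Standard MLP: one up projection and one down projection."""
    return 2 * d_model * hidden


def swiglu_weights(d_model: int, hidden: int) -> int:
    """SwiGLU MLP: gate, up and down projections."""
    return 3 * d_model * hidden


D_MODEL = 256  # N_EMBD from the final config

standard = mlp_weights(D_MODEL, 4 * D_MODEL)
swiglu_same_width = swiglu_weights(D_MODEL, 4 * D_MODEL)
# Shrinking the hidden width to about 2/3 of 4x keeps the weight count level.
swiglu_scaled = swiglu_weights(D_MODEL, (8 * D_MODEL) // 3)
```

A drop-in swap at the same hidden width carries 1.5x the MLP weights, and matmul cost scales with them. If that is what happened here (an assumption), fewer steps fit into 10 minutes.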

Learning rate of 1e-3: too aggressive even on the tiny model. The agent tried it twice (experiments 3 and 10) and reverted both times. 6e-4 was the best value tried for this corpus.

Label smoothing: it inflates the reported loss, so the number is no longer comparable with runs trained without it. The agent correctly reverted it, but the insight is important: your regularizer can't be allowed to change the metric you're comparing on.
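
A small pure-Python example shows the inflation. It assumes the same smoothed loss is reported at validation time, which is what the description suggests but the post does not confirm. The smoothed target mixes the one-hot label with a uniform distribution, the standard definition; the probabilities are made up for illustration.

```python
import math
from typing import Sequence


def smoothed_cross_entropy(probs: Sequence[float], target: int, eps: float = 0.0) -> float:
    """Cross-entropy against a label-smoothed target.

    probs must be strictly positive and sum to 1; eps=0.0 gives plain cross-entropy.
    """
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        one_hot = 1.0 if i == target else 0.0
        q = (1.0 - eps) * one_hot + eps / k  # smoothed target probability
        loss -= q * math.log(p)
    return loss


# A confident, correct prediction over a toy 4-token vocabulary (illustrative numbers).
probs = [0.97, 0.01, 0.01, 0.01]
plain = smoothed_cross_entropy(probs, target=0)
smoothed = smoothed_cross_entropy(probs, target=0, eps=0.1)
```

The prediction is identical in both calls, yet the smoothed loss is roughly ten times larger, because the target now assigns mass to tokens the model correctly considers unlikely. Evaluating with plain cross-entropy while training with smoothing would keep runs comparable.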

Context length of 128: too short. Sri Aurobindo's philosophical arguments develop across multiple sentences, and in these runs 256 tokens was the shortest window that still gave the ideas room to breathe. There seems to be a minimum coherence length for philosophical text, and the search found it.

The Outputs Are... Something

After 10 minutes of training, with a 10M parameter model, here's what comes out:

Prompt: "The Supermind is"
Output: "The Supermind is when Agni is the symbol of that he is sometimes a goal of the Vedic sacrifice, the one of the psychological sense of the Vedic Rishis, the supreme travelled to the goddess of the Truth of the Truth."

It's not coherent. But look at what it has learned: Supermind, Agni, Vedic sacrifice, Rishis, psychological sense, Truth. The vocabulary is right. The conceptual neighborhood is right. The grammar is attempting complexity. After 10 minutes.

The model is trying to speak Sri Aurobindo. It just needs more time.

Why Autoresearch Works (and When It Doesn't)

The pattern works because it removes human bias from hyperparameter search. I would never have tried a 10M model — it felt too small. The agent had no such prejudice. It just tried things, measured, and kept what worked.

It works especially well when:

Your training budget is fixed (time or compute)
You're early in the project and don't know the landscape
The search space is large but each experiment is cheap to evaluate

It works less well when:

You need architectural changes that span multiple files
The evaluation metric is noisy or slow
You're in a regime where the right answer is "train longer" not "change something"

What's Next

The autoresearch run found an architecture that works well under a 10-minute budget. Now I need to train it properly.

The plan: take this 10M model with its optimized config, move it to Colab or RunPod, and train for 1-2 hours instead of 10 minutes. Then re-evaluate with the LLM judge against the 3.26/10 baseline.

There are also unexplored directions the agent didn't get to — cosine annealing with warm restarts, gradient accumulation for larger effective batches, rotary positional embeddings. Each one is a potential improvement that the autoresearch loop can test.
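
As an illustration of the first of those directions, a cosine schedule with warm restarts fits in one function. This is a sketch with fixed-length cycles; `cycle_len` and `min_lr` are hypothetical values chosen for the example, and only the 6e-4 peak comes from the config above.

```python
import math


def cosine_warm_restarts(step: int, base_lr: float, min_lr: float, cycle_len: int) -> float:
    """Learning rate at `step` for fixed-length cosine cycles that restart at base_lr."""
    t = step % cycle_len  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_len))
    return min_lr + (base_lr - min_lr) * cosine


BASE_LR = 6e-4    # LEARNING_RATE from the final config
MIN_LR = 6e-5     # hypothetical floor
CYCLE_LEN = 1000  # hypothetical cycle length in steps

schedule = [cosine_warm_restarts(s, BASE_LR, MIN_LR, CYCLE_LEN) for s in range(2 * CYCLE_LEN)]
```

Each cycle decays from the peak toward the floor and then jumps back to the peak, so within a 10-minute budget the cycle length would itself become one more knob for the agent to tune.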

But the biggest insight isn't about this specific model. It's about the method. Two hours of automated experiments, running on a laptop, found a configuration that would probably have taken me days to find through manual tuning. The agent didn't have intuition. It had patience and a stopwatch.

Sometimes that's enough.

SASLM (Sri Aurobindo Small Language Model) is a personal research project training small language models on Sri Aurobindo's collected works. The autoresearch pattern is from Andrej Karpathy. The training runs on an M4 Pro MacBook — no cloud GPUs were harmed in the making of this experiment.
