Generating 50,000 Sanskrit Stories with AI

In 2023, researchers at Microsoft (Eldan & Li) published TinyStories. They showed that models as small as 1 million parameters can generate coherent English stories — if trained on the right synthetic data.

I've been learning Sanskrit for two years. I've been building NLP systems for over a decade. The question was obvious: can we do the same for Sanskrit?

Not translating English stories. Generating original Sanskrit narratives in Devanagari, with age-appropriate vocabulary, coherent plots, and something TinyStories never had to deal with — an ontological worldview. Sanskrit stories don't just use different words. They encode a different relationship between action and consequence, self and world. That difference matters at the prompt level, and it matters in evaluation.

That's SanskritKatha. The dataset is done.

Two Tiers, One Language

BalaKatha — for ages 4-5 (CBSE classes 1-3). Simple vocabulary, short sentences, familiar settings. The Dharmic principle is embedded gently: a child learns that helping others has its own reward, not because anyone says "this is karma-phala" but because the narrative demonstrates it.

KishoraKatha — for ages 14-15 (CBSE classes 8-10). Complex grammar including proper sandhi usage, longer narratives, philosophical themes woven into realistic scenarios. A teenager faces a moral choice and the story explores it through character and consequence, not didactic lecturing.

Each story has:

A required vocabulary list (age-appropriate word usage)
A target Dharmic principle (karma, ahimsa, satya, guru-shishya parampara, etc.)
A story feature (dialogue, description, internal monologue)
Cultural grounding (Bharatiya names, settings, customs — not Sanskrit words painted over Western scenarios)

25,000+ stories per tier. 50,000 total.

How We Generate

Four LLM providers: Anthropic (Claude), OpenAI (GPT-4), Google (Gemini), and one more. We ran a pilot of 500 stories per tier to evaluate which providers handle Sanskrit morphology best, then scaled.

The generation prompt is heavily structured. This isn't "write me a story in Sanskrit." It's a system prompt that specifies vocabulary constraints, grammar expectations, age targets, cultural requirements, and narrative structure. Per-story parameters customize each generation.

I hit Gemini's 10,000 API calls/day limit during scaling. Never faced that before — feels good in a way. Total cost: around $200 for 50,000 stories. Sanskrit is a low-resource language; token costs reflect that. The limiting factor was rate limits, not money.

Many iterations on those prompts before they stopped producing stories that read like textbook exercises.

The Evaluation Problem

Here's where it gets genuinely hard.

Automated evaluation (LLM-as-judge) has known problems. Self-preference bias — where the same model that generated content evaluates it favorably — averages 0.8 points on a 5-point scale. Leniency bias varies by 1+ point depending on which model is judging.

For a paper targeting ACL/EMNLP 2026, human evaluation isn't optional. Reviewers will ask "how do you know these stories are actually good Sanskrit?" The answer needs to be "because Sanskrit scholars scored them blind on 6 dimensions."

Those 6 dimensions:

Sanskrit grammar — correct vibhakti (case endings), verb conjugations, sandhi application
Vocabulary level — age-appropriate word choices
Story coherence — clear narrative arc
Dharmic integration — the principle emerges naturally, not as overlay
Word usage — required vocabulary words fit organically
Cultural authenticity — Bharatiya setting, names, customs

The Feedback Platform

So I built one. sanskritkatha.com

Sanskrit readers — scholars, teachers, students, enthusiasts — evaluate stories against those 6 criteria. The methodology borrows from Mechanical Turk research:

Blind review: Reviewers don't know which model generated the story
3-reviewer consensus: Each story needs at least 3 independent evaluations
Calibration stories: Known-quality stories establish baseline agreement
Trust scoring: Reviewers who consistently deviate from consensus are flagged
Anti-abuse safeguards: Rate limiting, minimum time-on-story thresholds, anomaly detection

I built the safeguards deliberately. Sanskrit content online attracts a particular kind of hostility. These aren't an afterthought.

What I've Learned

Cultural encoding is harder than grammar. Getting an LLM to produce grammatically correct Sanskrit isn't the main challenge. Getting it to produce stories that feel Bharatiya — where the worldview is lived, not described — requires careful prompt engineering. The ontological commitment has to be in the prompt structure, not just the content requirements.

Small languages deserve small models. The premise of TinyStories is that you don't need 70 billion parameters to tell a children's story. For Sanskrit, with its regular (if complex) morphology and relatively constrained educational vocabulary, small models should work even better. We'll find out in the training phase.

AI-assisted research is real. Claude Code drove the entire pipeline: vocabulary curation, prompt engineering, generation, evaluation framework, the feedback platform. 26 years in software and I'm a complete beginner at research. The AI is teaching me how to do this. Not automated — I made every substantive decision — but orchestrated. That's a meaningful distinction.

What's Next

Human evaluation via sanskritkatha.com (need reviewers)
Custom BPE tokenizer optimized for Sanskrit morphology
Model training on RunPod (GPT-Neo architecture, 1M to 33M params)
Cross-linguistic evaluation against English TinyStories
Paper draft, submission

The dataset goes on HuggingFace. Open data, open models.

Call for Reviewers

If you read Sanskrit at any level — beginner to expert — your evaluation matters. Each review session is 10 stories, about 25 minutes. Work at your own pace.

If you know Sanskrit teachers, scholars, or students who might be interested, please share this. Every review makes the research stronger.

SanskritKatha is a personal research project by Mahesh CR. Not affiliated with any institution. Dataset and trained models will be released under open licenses.