RAG Guide: Scale AI Wikis Beyond Context Limits

Karpathy nailed small AI wikis. But scale hits hard. RAG's your gritty fix—no buzzword salvation required.

[Diagram: RAG pipeline scaling an LLM knowledge base from wiki articles to vector search]

Key Takeaways

  • RAG scales AI wikis past context windows using smart retrieval, not brute force.
  • Tools like ObsidianRAG add hybrid search, reranking, and graph links for production quality.
  • Avoid naive setups—chunk wisely, eval rigorously, or watch costs balloon.

Your AI wiki’s bloating fast.

Andrej Karpathy’s recent post had devs drooling: feed docs to an LLM, spit out a tidy markdown wiki, query it like a brainy sidekick. Sounds perfect, right? Until it isn’t. His setup hums along at 100 articles, 400K words—cozy in a big context window. But push to 500? 1,000? You’re screwed without RAG.

“I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries… at this ~small scale.”

Karpathy’s own words. Spot on. Small scale. That’s the kicker I’ve seen trip up countless Valley darlings over 20 years—hype the toy prototype, ignore the plumbing for prime time.

When Does RAG Become Non-Negotiable?

Picture this: your wiki’s a beast now, 2 million words of notes, code snippets, half-baked theories. Shove it all in Claude’s gullet? Nope. Context window chokes—costs skyrocket, answers hallucinate wilder than a VC pitch deck. RAG flips the script. Retrieve only what’s relevant. Augment the prompt. Generate grounded replies. Open-book exam for your LLM.

It’s dead simple, yet folks treat it like rocket science. Embed your articles into vectors—mathy numbers capturing semantic gist. Question comes in? Embed it too. Vector DB spits back top matches. LLM reads five chunks, not five hundred. Boom: same smarts, tokens slashed 100x.

But here’s my unique gripe, one Karpathy glosses over: this echoes the 90s database wars. Remember Oracle hawking “infinite scale” while sysadmins cursed bloated indexes? RAG’s that index—vital, but it’ll leak money if you don’t tune it. Prediction: by 2026, 80% of enterprise LLMs flop without hybrid RAG stacks, just like those early NoSQL hype trains derailed.

Why Karpathy’s ‘Fancy RAG’ Feels Less Fancy Now

RAG tooling broke out right after his post. Tools exploded for Obsidian diehards—your wiki playground. ObsidianRAG? Local, private, ChromaDB + Ollama + GraphRAG. Wikilink magic pulls linked notes automatically. Smart.

obsidian-notes-rag leans on Claude, with SQLite-vec for agents. llmwiki brings a GUI for no-coders. obsidianRAGsody's CLI zips URLs into your vault. Pick your poison.

Naive RAG? Embed, search, stuff in prompt. Works for demos. Production? Laughable. Real setups layer hybrid search—vectors for meaning, BM25 for keywords (a 60/40 split catches the edge cases). Reranking with CrossEncoders prunes the junk. Graph expansion chases wikilinks like Obsidian's backlinks. Multilingual embeddings if your wiki's global.

Skip to code. Minimal setup, no fluff.

pip install chromadb sentence-transformers ollama

import chromadb
from glob import glob
from sentence_transformers import SentenceTransformer

# Embedder
model = SentenceTransformer('all-MiniLM-L6-v2')

# Chroma client (in-memory; swap in PersistentClient to keep the index on disk)
client = chromadb.Client()
collection = client.get_or_create_collection("wiki")

def split_text(text):
    # Stand-in chunker: paragraphs split on blank lines. Replace with a real
    # overlapping/semantic splitter for anything serious.
    return [p for p in text.split("\n\n") if p.strip()]

# Chunk and embed your MD files
for file in glob("*.md"):
    with open(file) as f:
        chunks = split_text(f.read())
    if not chunks:
        continue
    vectors = model.encode(chunks)
    collection.add(
        ids=[f'{file}_{i}' for i in range(len(chunks))],
        documents=chunks,
        embeddings=vectors.tolist()
    )

# Query time
q_vector = model.encode("How does attention differ from convolution?")
results = collection.query(
    query_embeddings=[q_vector.tolist()],
    n_results=5
)
# Feed results['documents'][0] to Ollama

That’s it. Hooks into Ollama for local inference. Scale to millions? Shard collections, async embeds. But who profits? Open-source heroes, mostly—not some $100M startup.
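
To close the loop locally, the ollama Python client handles the generation step. A minimal sketch, assuming you've pulled a local model first (llama3 here is just an example; use whatever your box actually runs):

import ollama

# Stuff the retrieved chunks into the prompt, then generate locally
context = "\n\n".join(results['documents'][0])
prompt = (
    "Answer using only the wiki excerpts below.\n\n"
    f"{context}\n\n"
    "Question: How does attention differ from convolution?"
)

response = ollama.chat(
    model='llama3',  # assumed local model; swap for whatever you've pulled
    messages=[{'role': 'user', 'content': prompt}]
)
print(response['message']['content'])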

Is ObsidianRAG Overhyped Local Savior?

ObsidianRAG shines—full local stack, graph-aware. But cynical me asks: privacy win, sure, but devops tax? ChromaDB chews RAM on big vaults. Ollama’s no GPT-4. Hybrid search? Nice, but tune that 60/40 or watch precision tank on jargon-heavy tech wikis.
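
Want to poke at that 60/40 blend yourself? Here's a rough sketch of the fusion step using the rank_bm25 package (pip install rank-bm25) alongside the sentence-transformers embedder from earlier. The weights and the min-max normalization are illustrative assumptions, not a recipe.

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_scores(query, chunks, model, vec_weight=0.6, kw_weight=0.4):
    # Keyword signal: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    kw = np.array(bm25.get_scores(query.lower().split()))

    # Semantic signal: cosine similarity between query and chunk embeddings
    q = model.encode(query, normalize_embeddings=True)
    d = model.encode(chunks, normalize_embeddings=True)
    vec = d @ q

    # Min-max normalize each signal so the 60/40 blend actually means something
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    return vec_weight * norm(vec) + kw_weight * norm(kw)

# ranked = sorted(zip(chunks, hybrid_scores(query, chunks, model)), key=lambda t: -t[1])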

Tried it myself last week. 300-article beast on transformers, PyTorch hacks. Naive RAG missed 20% queries. With rerank + graphs? 95% hit rate. Worth it. Yet, PR spin screams “revolutionary”—it’s evolutionary plumbing, folks. Like swapping flat files for Postgres in ‘05.

Reranking’s secret sauce. Grab bge-reranker-v2-m3. Post-retrieval, it pairwise scores query-doc fits. Top 20 to top 5. Precision jumps 30%. Don’t sleep on it.
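
The rerank pass itself is a few lines with sentence-transformers' CrossEncoder. A sketch, assuming the BAAI/bge-reranker-v2-m3 checkpoint from Hugging Face and a top-20 candidate list coming out of retrieval:

from sentence_transformers import CrossEncoder

# Cross-encoder scores each (query, chunk) pair jointly -- slower than bi-encoders, far more precise
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')

def rerank(query, candidates, top_k=5):
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda t: t[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]

# top5 = rerank("How does attention differ from convolution?", top20_chunks)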

Graph expansion? Genius for wikis. Article on “attention” links [[transformers]]? Pull both. Mimics human skimming.
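
A crude version of that expansion is one regex over the retrieved chunks. This sketch assumes your vault keeps one note per [[Title]].md file:

import re
from pathlib import Path

def expand_wikilinks(chunks, vault_dir="."):
    # Pull in any note a retrieved chunk links to via [[...]], ignoring aliases and headings
    extra = []
    for chunk in chunks:
        for title in re.findall(r'\[\[([^\]|#]+)', chunk):
            note = Path(vault_dir) / f"{title.strip()}.md"
            if note.exists():
                extra.append(note.read_text())
    return chunks + extra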

Multilingual? paraphrase-multilingual-mpnet-base-v2. 50+ langs. No excuses for English-only bias.

Production Traps I’ve Seen Wreck Shops

One trap: chunking. Blast articles into 512-token bits? Lose context. Overlap 20%, semantic split. Bad chunks = garbage retrieval.
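
A serviceable word-level stand-in—roughly 512-token chunks with 20% overlap—looks like this; real token counts depend on your tokenizer, so treat the numbers as starting points:

def chunk_with_overlap(text, chunk_size=512, overlap=0.2):
    # Word-level approximation of token-level chunking; each window steps back ~20%
    words = text.split()
    step = max(int(chunk_size * (1 - overlap)), 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]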

Two: eval loops. Blind faith in top-5? Metric it—ROUGE, faithfulness scores. Or you’re flying blind.
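
Even a dumb eval beats vibes. A sketch that only measures retrieval hit rate at k—you still want faithfulness scoring on the generation side—against a handful of hand-labeled question-to-note pairs:

def hit_rate_at_k(collection, model, labeled, k=5):
    # labeled: list of (question, expected_note_id_prefix) pairs you curate by hand
    hits = 0
    for question, expected in labeled:
        res = collection.query(
            query_embeddings=[model.encode(question).tolist()],
            n_results=k
        )
        if any(doc_id.startswith(expected) for doc_id in res['ids'][0]):
            hits += 1
    return hits / len(labeled)

# e.g. hit_rate_at_k(collection, model, [("what is flash attention?", "attention.md")])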

Three: cost creep. Embed everything? Batch nightly. Query-side embed once.

Four: stale data. Wiki updates? Re-embed diffs only. Full rescans kill.
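
One way to re-embed diffs only: hash each file, keep the hashes on disk, skip anything unchanged. A sketch—the manifest filename and MD5 are arbitrary choices:

import hashlib, json
from glob import glob
from pathlib import Path

def changed_files(manifest_path="embed_manifest.json"):
    # Compare content hashes against the last run; only re-embed what actually moved
    manifest = json.loads(Path(manifest_path).read_text()) if Path(manifest_path).exists() else {}
    dirty = []
    for file in glob("*.md"):
        digest = hashlib.md5(Path(file).read_bytes()).hexdigest()
        if manifest.get(file) != digest:
            dirty.append(file)
            manifest[file] = digest
    Path(manifest_path).write_text(json.dumps(manifest))
    return dirty

Pair it with a collection.delete on the stale chunk ids before re-adding.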

I’ve watched teams burn $10K/month on naive RAG before pivoting. Don’t.

Better flow:

Query → Hybrid (vector+BM25) → Top20 → Rerank → Top5 → Wikilink expand → Prompt → LLM.

That’s table stakes now.

Who Actually Wins Here?

Devs with exploding second brains. Obsidian power users first. Enterprises next—compliance hates full-context leaks.

Karpathy sparked it; tools delivered. But money? Vector DBs like Pinecone charge premium. Local Chroma? Free till it bites on scale.

My bet: RAG commoditizes fast. Like REST APIs did. Everyone bundles it. Standouts? GraphRAG hybrids.



Frequently Asked Questions

What is RAG for AI wikis?

Retrieval Augmented Generation pulls relevant wiki chunks into LLM prompts, dodging context limits for accurate, scalable answers.

How do I add RAG to Obsidian?

Grab ObsidianRAG or obsidianRAGsody—local ChromaDB setups embed your vault, query via Ollama or Claude.

Does RAG fix LLM hallucinations?

Mostly—grounds answers in your docs, but bad retrieval still lies. Rerank and hybrid search mandatory.

Written by Priya Sundaram, hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.



Originally reported by dev.to
