RAG Guide: Scale AI Wikis Beyond Context Limits

Karpathy nailed small AI wikis. But scale hits hard. RAG's your gritty fix—no buzzword salvation required.

[Diagram: RAG pipeline scaling an LLM knowledge base from wiki articles to vector search]

Key Takeaways

  • RAG scales AI wikis past context windows using smart retrieval, not brute force.
  • Tools like ObsidianRAG add hybrid search, reranking, and graph links for production quality.
  • Avoid naive setups—chunk wisely, eval rigorously, or watch costs balloon.

Your AI wiki’s bloating fast.

Andrej Karpathy’s recent post had devs drooling: feed docs to an LLM, spit out a tidy markdown wiki, query it like a brainy sidekick. Sounds perfect, right? Until it isn’t. His setup hums along at 100 articles, 400K words—cozy in a big context window. But push to 500? 1,000? You’re screwed without RAG.

“I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries… at this ~small scale.”

Karpathy’s own words. Spot on. Small scale. That’s the kicker I’ve seen trip up countless Valley darlings over 20 years—hype the toy prototype, ignore the plumbing for prime time.

When Does RAG Become Non-Negotiable?

Picture this: your wiki’s a beast now, 2 million words of notes, code snippets, half-baked theories. Shove it all in Claude’s gullet? Nope. Context window chokes—costs skyrocket, answers hallucinate wilder than a VC pitch deck. RAG flips the script. Retrieve only what’s relevant. Augment the prompt. Generate grounded replies. Open-book exam for your LLM.

It’s dead simple, yet folks treat it like rocket science. Embed your articles into vectors—mathy numbers capturing semantic gist. Question comes in? Embed it too. Vector DB spits back top matches. LLM reads five chunks, not five hundred. Boom: same smarts, tokens slashed 100x.

But here’s my unique gripe, one Karpathy glosses over: this echoes the 90s database wars. Remember Oracle hawking “infinite scale” while sysadmins cursed bloated indexes? RAG’s that index—vital, but it’ll leak money if you don’t tune it. Prediction: by 2026, 80% of enterprise LLMs flop without hybrid RAG stacks, just like those early NoSQL hype trains derailed.

Why Karpathy’s ‘Fancy RAG’ Feels Less Fancy Now

RAG tooling broke out right after his post. Tools exploded for Obsidian diehards—your wiki playground. ObsidianRAG? Local, private, ChromaDB + Ollama + GraphRAG. Wikilink magic pulls linked notes automatically. Smart.

obsidian-notes-rag leans on Claude, with SQLite-vec for agents. llmwiki brings a GUI for no-coders. obsidianRAGsody's CLI zips URLs into your vault. Pick your poison.

Naive RAG? Embed, search, stuff in prompt. Works for demos. Production? Laughable. Real setups layer hybrid search—vectors for meaning, BM25 for keywords (a 60/40 split catches the edge cases). Reranking with CrossEncoders prunes the junk. Graph expansion chases wikilinks like Obsidian's backlinks. Multilingual embeddings if your wiki's global.

Skip to code. Minimal setup, no fluff.

pip install chromadb sentence-transformers ollama

import chromadb
from glob import glob
from sentence_transformers import SentenceTransformer

# Embedder
model = SentenceTransformer('all-MiniLM-L6-v2')

# Chroma client (in-memory; swap in PersistentClient to keep the index on disk)
client = chromadb.Client()
collection = client.get_or_create_collection("wiki")

def split_text(text):
    # Stand-in chunker: paragraphs split on blank lines. Replace with a real
    # overlapping/semantic splitter for anything serious.
    return [p for p in text.split("\n\n") if p.strip()]

# Chunk and embed your MD files
for file in glob("*.md"):
    with open(file) as f:
        chunks = split_text(f.read())
    if not chunks:
        continue
    vectors = model.encode(chunks)
    collection.add(
        ids=[f'{file}_{i}' for i in range(len(chunks))],
        documents=chunks,
        embeddings=vectors.tolist()
    )

# Query time
q_vector = model.encode("How does attention differ from convolution?")
results = collection.query(
    query_embeddings=[q_vector.tolist()],
    n_results=5
)
# Feed results['documents'][0] to Ollama

That’s it. Hooks into Ollama for local inference. Scale to millions? Shard collections, async embeds. But who profits? Open-source heroes, mostly—not some $100M startup.
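
To close the loop locally, the ollama Python client handles the generation step. A minimal sketch, assuming you've pulled a local model first (llama3 here is just an example; use whatever your box actually runs):

import ollama

# Stuff the retrieved chunks into the prompt, then generate locally
context = "\n\n".join(results['documents'][0])
prompt = (
    "Answer using only the wiki excerpts below.\n\n"
    f"{context}\n\n"
    "Question: How does attention differ from convolution?"
)

response = ollama.chat(
    model='llama3',  # assumed local model; swap for whatever you've pulled
    messages=[{'role': 'user', 'content': prompt}]
)
print(response['message']['content'])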

Is ObsidianRAG Overhyped Local Savior?

ObsidianRAG shines—full local stack, graph-aware. But cynical me asks: privacy win, sure, but devops tax? ChromaDB chews RAM on big vaults. Ollama’s no GPT-4. Hybrid search? Nice, but tune that 60/40 or watch precision tank on jargon-heavy tech wikis.
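
Want to poke at that 60/40 blend yourself? Here's a rough sketch of the fusion step using the rank_bm25 package (pip install rank-bm25) alongside the sentence-transformers embedder from earlier. The weights and the min-max normalization are illustrative assumptions, not a recipe.

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_scores(query, chunks, model, vec_weight=0.6, kw_weight=0.4):
    # Keyword signal: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    kw = np.array(bm25.get_scores(query.lower().split()))

    # Semantic signal: cosine similarity between query and chunk embeddings
    q = model.encode(query, normalize_embeddings=True)
    d = model.encode(chunks, normalize_embeddings=True)
    vec = d @ q

    # Min-max normalize each signal so the 60/40 blend actually means something
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    return vec_weight * norm(vec) + kw_weight * norm(kw)

# ranked = sorted(zip(chunks, hybrid_scores(query, chunks, model)), key=lambda t: -t[1])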

Tried it myself last week. 300-article beast on transformers, PyTorch hacks. Naive RAG missed 20% queries. With rerank + graphs? 95% hit rate. Worth it. Yet, PR spin screams “revolutionary”—it’s evolutionary plumbing, folks. Like swapping flat files for Postgres in ‘05.

Reranking’s secret sauce. Grab bge-reranker-v2-m3. Post-retrieval, it pairwise scores query-doc fits. Top 20 to top 5. Precision jumps 30%. Don’t sleep on it.
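
The rerank pass itself is a few lines with sentence-transformers' CrossEncoder. A sketch, assuming the BAAI/bge-reranker-v2-m3 checkpoint from Hugging Face and a top-20 candidate list coming out of retrieval:

from sentence_transformers import CrossEncoder

# Cross-encoder scores each (query, chunk) pair jointly -- slower than bi-encoders, far more precise
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')

def rerank(query, candidates, top_k=5):
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda t: t[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]

# top5 = rerank("How does attention differ from convolution?", top20_chunks)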

Graph expansion? Genius for wikis. Article on “attention” links [[transformers]]? Pull both. Mimics human skimming.
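
A crude version of that expansion is one regex over the retrieved chunks. This sketch assumes your vault keeps one note per [[Title]].md file:

import re
from pathlib import Path

def expand_wikilinks(chunks, vault_dir="."):
    # Pull in any note a retrieved chunk links to via [[...]], ignoring aliases and headings
    extra = []
    for chunk in chunks:
        for title in re.findall(r'\[\[([^\]|#]+)', chunk):
            note = Path(vault_dir) / f"{title.strip()}.md"
            if note.exists():
                extra.append(note.read_text())
    return chunks + extra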

Multilingual? paraphrase-multilingual-mpnet-base-v2. 50+ langs. No excuses for English-only bias.

Production Traps I’ve Seen Wreck Shops

One trap: chunking. Blast articles into 512-token bits? Lose context. Overlap 20%, semantic split. Bad chunks = garbage retrieval.
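
A serviceable word-level stand-in—roughly 512-token chunks with 20% overlap—looks like this; real token counts depend on your tokenizer, so treat the numbers as starting points:

def chunk_with_overlap(text, chunk_size=512, overlap=0.2):
    # Word-level approximation of token-level chunking; each window steps back ~20%
    words = text.split()
    step = max(int(chunk_size * (1 - overlap)), 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]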

Two: eval loops. Blind faith in top-5? Metric it—ROUGE, faithfulness scores. Or you’re flying blind.
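
Even a dumb eval beats vibes. A sketch that only measures retrieval hit rate at k—you still want faithfulness scoring on the generation side—against a handful of hand-labeled question-to-note pairs:

def hit_rate_at_k(collection, model, labeled, k=5):
    # labeled: list of (question, expected_note_id_prefix) pairs you curate by hand
    hits = 0
    for question, expected in labeled:
        res = collection.query(
            query_embeddings=[model.encode(question).tolist()],
            n_results=k
        )
        if any(doc_id.startswith(expected) for doc_id in res['ids'][0]):
            hits += 1
    return hits / len(labeled)

# e.g. hit_rate_at_k(collection, model, [("what is flash attention?", "attention.md")])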

Three: cost creep. Embed everything? Batch nightly. Query-side embed once.

Four: stale data. Wiki updates? Re-embed diffs only. Full rescans kill.
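
One way to re-embed diffs only: hash each file, keep the hashes on disk, skip anything unchanged. A sketch—the manifest filename and MD5 are arbitrary choices:

import hashlib, json
from glob import glob
from pathlib import Path

def changed_files(manifest_path="embed_manifest.json"):
    # Compare content hashes against the last run; only re-embed what actually moved
    manifest = json.loads(Path(manifest_path).read_text()) if Path(manifest_path).exists() else {}
    dirty = []
    for file in glob("*.md"):
        digest = hashlib.md5(Path(file).read_bytes()).hexdigest()
        if manifest.get(file) != digest:
            dirty.append(file)
            manifest[file] = digest
    Path(manifest_path).write_text(json.dumps(manifest))
    return dirty

Pair it with a collection.delete on the stale chunk ids before re-adding.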

I’ve watched teams burn $10K/month on naive RAG before pivoting. Don’t.

Better flow:

Query → Hybrid (vector+BM25) → Top20 → Rerank → Top5 → Wikilink expand → Prompt → LLM.

That’s table stakes now.

Who Actually Wins Here?

Devs with exploding second brains. Obsidian power users first. Enterprises next—compliance hates full-context leaks.

Karpathy sparked it; tools delivered. But money? Vector DBs like Pinecone charge premium. Local Chroma? Free till it bites on scale.

My bet: RAG commoditizes fast. Like REST APIs did. Everyone bundles it. Standouts? GraphRAG hybrids.



Frequently Asked Questions

What is RAG for AI wikis?

Retrieval Augmented Generation pulls relevant wiki chunks into LLM prompts, dodging context limits for accurate, scalable answers.

How do I add RAG to Obsidian?

Grab ObsidianRAG or obsidianRAGsody—local ChromaDB setups embed your vault, query via Ollama or Claude.

Does RAG fix LLM hallucinations?

Mostly—grounds answers in your docs, but bad retrieval still lies. Rerank and hybrid search mandatory.

Written by Priya Sundaram, hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.



Originally reported by dev.to
