Bag-of-words. It’s the first hammer every ML student grabs for text data. But in a fresh UBC CPSC330 assignment, recipe titles from Food.com’s massive 180,000-recipe bank proved it’s often a dud.
Expectations? That simple counts would spit out tidy clusters from 9,100 cleaned titles: duplicates gone, NaNs axed, titles under five characters binned, tags limited to the top 300. Nope. On the assignment's toy Wikipedia data, the same bag-of-words counts lumped 'Quantum Computer' with 'Climate Change.' Absurd.
This changes everything for unsupervised text tasks. Suddenly, pre-trained embeddings aren’t just buzz — they’re table stakes.
Why Bag-of-Words Can’t Hack It Anymore
CountVectorizer. Bag-of-words encoding. Shallow as a puddle. It tallies frequencies, ignores semantics. “Chicken” and “cake” dominate the wordcloud, sure — but no grasp of context.
> Using a CountVectorizer with Bag-of-words encoding is intentionally shallow. We do not capture any meaning of the words, just their frequency, and pass this on to our model.
That’s the project’s own words. Spot on. Humans see nuance; models see counts. Result? Clusters that don’t scream ‘sensible.’
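To make the shallowness concrete, here's a minimal sketch of that counting step; the titles are illustrative, not the assignment's actual data:

```python
# Minimal bag-of-words pass over a few illustrative recipe titles.
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "bread",
    "zucchini muffins",
    "chocolate chip cookies",
    "baked tomatoes with a parmesan cheese crust and balsamic drizzle",
]

vec = CountVectorizer()               # tallies raw token frequencies, nothing more
X_counts = vec.fit_transform(titles)  # sparse matrix: rows = titles, cols = vocabulary

print(X_counts.shape)                    # (4, number_of_unique_tokens)
print(vec.get_feature_names_out()[:10])  # peek at the learned vocabulary
```

No meaning, just tallies: two titles sharing 'baked' look similar whether they're desserts or vegetables.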
Take the shortest title: ‘bread.’ Longest: ‘baked tomatoes with a parmesan cheese crust and balsamic drizzle.’ Common words cluster vaguely — muffins with cookies? Kinda. But mash quantum tech with green energy? Facepalm.
And here's the thing: we've known this since the '90s. TF-IDF patched it a bit, but bag-of-words remains the training-wheel baseline. This experiment? Proof it's time to ditch it.
Enter Sentence Embeddings: Meaning at Scale
Shift to ‘all-MiniLM-L6-v2’ from Sentence Transformers. Pre-trained magic. Vectors rich with baked-in semantics.
Those Wikipedia clusters? Fixed. ‘Quantum Computer’ jumps to ‘Unsupervised learning’ and ‘Deep learning.’ Climate stuff stays together. Boom — sense made.
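For flavor, a hedged sketch of how those vectors come out of the box; the model name matches the project, the titles are illustrative:

```python
# Encode a few illustrative titles with the pre-trained MiniLM model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
titles = ["bread", "zuppa toscana", "zydeco salad"]

embeddings = model.encode(titles)  # one dense 384-dimensional vector per title
print(embeddings.shape)            # (3, 384)
```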
Data glimpse (the first four dimensions of one title's embedding; the full vectors run to 384):

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.005857 | -0.004795 | -0.000976 | 0.011121 |
Numbers like that — dense 384-dim vectors — feed K-means first. Better. But not done.
Now, market dynamics: Embeddings aren’t new (BERT era, 2018), but for student projects on recipe titles? They’re exploding accessibility. Hugging Face drops models weekly; costs plummet. This isn’t hype — it’s deployable now.
K-Means vs. DBSCAN vs. Hierarchical: The Clustering Cage Match
K-means with Euclidean? Solid start. DBSCAN with cosine distance? Density-based, no k-guess hassle. Hierarchical? Dendrograms for the win.
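In scikit-learn terms the cage match fits in a few lines. A sketch with illustrative parameters (the assignment's tuned settings aren't shown here), where `X` stands in for the embedding matrix:

```python
# Three clusterers on the same embedding matrix; parameter values are illustrative.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

X = np.random.rand(100, 384)  # stand-in for the real title embeddings

kmeans_labels = KMeans(n_clusters=10, random_state=0).fit_predict(X)            # Euclidean by design
dbscan_labels = DBSCAN(metric="cosine", eps=0.3, min_samples=5).fit_predict(X)  # density-based, no k
hier_labels = AgglomerativeClustering(n_clusters=10).fit_predict(X)             # hierarchical merges
```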
The student picked hierarchical — ‘most sensible results.’ Plots showed it: tighter recipe groups. Zucchini everything together; salads apart. No more forced spheres.
Why does hierarchical edge out? Flexibility. No predefined k; you cut the dendrogram where the data says. In ragged, overlapping text, that's gold.
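That "cut where the data says" step looks like this in SciPy; a minimal sketch, with the distance threshold purely illustrative:

```python
# Build a dendrogram over embeddings and cut it at a distance threshold, not a fixed k.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 384)                       # stand-in for the title embeddings
Z = linkage(X, method="average", metric="cosine")  # the merge tree behind the dendrogram

labels = fcluster(Z, t=0.5, criterion="distance")  # clusters fall out of the cut height
print(len(set(labels)))                            # the number of clusters was never pre-chosen
```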
But wait: unique insight time. This mirrors Yahoo's early directory clustering in the 1990s, when manual tags crumbled under the web's explosion. Google's n-gram work prefigured embeddings. Today? Recipe apps like Yummly already do this implicitly. This project is a microcosm of why startups bet on embeddings over counts. Prediction: by 2025, 80% of unsupervised text pipelines ditch bag-of-words entirely.
Why Does This Matter for Recipe Data — or Any Text?
Food.com's bank is a goldmine for patterns. 'Zydeco salad' clusters with Cajun kicks; 'zuppa toscana' lands with the soups. Marketers drool: auto-categorization for search and recs.
Skeptical take: Corporate PR would spin this as ‘AI revolutionizes kitchens.’ Nah. It’s iterative ML hygiene. Bag-of-words was never for production; embeddings make unsupervised viable without labels.
Scale it? 180k recipes: embeddings handle it on a laptop. Cost: pennies via APIs. Devs: pip install sentence-transformers and you're integrated.
Look, courses like CPSC330 push this for a reason. Text ain’t tabular. Old tricks fail; modern vectors rule.
Trade-offs? Embedding vectors are dense and high-dimensional; reach for PCA or UMAP to visualize them. Compute runs about 10x versus counts. Worth it? Data says yes.
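The viz step, sketched with PCA (UMAP slots in the same way); the data here is a stand-in:

```python
# Project 384-dim embeddings down to 2-D for a scatter plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.random.rand(100, 384)                  # stand-in for the title embeddings
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.title("Recipe-title embeddings, PCA projection")
plt.show()
```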
Is Hierarchical Clustering Always King?
Not quite. K-means is faster at millions of rows. DBSCAN shines on noisy data. But cleanish, topical recipe titles are hierarchical's sweet spot.
Student’s viz (imagine the plot): Hierarchical carves cleanest lines. Editorial call: Smart pick over K-means defaults.
Real-world? Whisk or Allrecipes could A/B this tomorrow. Clusters feed personalization — ‘zucchini haters, try these swaps.’ Revenue.
One nit: Subsample bias. Top tags skew popular. Full 180k? Might shift. Still, proof-of-concept aced.
And, stylistic asides aside, this isn't academic fluff. It's a blueprint for any doc clustering: e-commerce titles, news headlines, logs.
Frequently Asked Questions
What is document clustering for recipe titles?
It groups similar recipe names without labels, using ML to spot patterns like ‘all zucchini breads together.’
How do sentence embeddings improve text clustering?
They pack word meaning into vectors, unlike bag-of-words counts — leading to semantic clusters, not just frequency matches.
Should I use hierarchical clustering over K-means for text?
Yes, for interpretable, non-spherical groups like recipes; K-means is quicker at huge scale.
Why avoid bag-of-words for modern ML projects?
It ignores semantics and lumps unrelated topics; embeddings, pre-trained on billions of sentences, capture context.