Haiku Detector

Production

June 1, 2023

A syllable-counting validation tool that determines whether text follows the 5-7-5 haiku form. Loads a library of syllable counts and checks content with good accuracy. Currently useful as a validation gate for AI-generated haikus — generate daily haikus, validate with the detector, regenerate if they do not pass.

Purpose

Built a tool that loads a large syllable dictionary and validates whether a given piece of text follows the 5-7-5 haiku syllable pattern. Useful for quality-controlling the 101 Potato Haikus pipeline and potentially for a daily haiku feature on the Potato Literature website.

Stack

JavaScript, Syllable Dictionary, NLP, Validation, Text Processing

What I Learned

Syllable counting in English is harder than it sounds. Is "fire" one syllable or two? Is "poem" one or two? Is "comfortable" three or four? English does not have consistent syllable rules the way Japanese does; haiku originated in Japanese, where every mora is unambiguous. The approach: load a large dictionary with known syllable counts (like the CMU Pronouncing Dictionary), and for unknown words, fall back to heuristic rules (count vowel groups, subtract silent e's, handle common suffixes).
The CMU Pronouncing Dictionary maps ~130,000 English words to their phonetic pronunciation using ARPAbet notation. Each entry includes stress markers on vowels (0=unstressed, 1=primary, 2=secondary). Counting the vowel phonemes gives you the syllable count. For example: "POTATO" → P AH0 T EY1 T OW2 → three vowels → three syllables. This dictionary is the backbone of most English syllable counters.
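To make the vowel-counting idea concrete, here is a minimal sketch rather than the tool's actual code. It assumes the classic plain-text cmudict layout (one "WORD  PHONEMES" entry per line, ";;;" comment lines, alternate pronunciations written as "WORD(1)"); the function names are hypothetical.

```javascript
// Count syllables in one CMU-style entry line, e.g. "POTATO  P AH0 T EY1 T OW2".
// Only vowel phonemes carry a stress digit (0, 1, or 2), so counting the
// phonemes that end in a digit gives the syllable count.
function syllablesFromCmuLine(line) {
  const [, ...phonemes] = line.trim().split(/\s+/);
  return phonemes.filter((p) => /[012]$/.test(p)).length;
}

// Build a word -> syllable-count Map from raw dictionary text.
// Assumption: ";;;" starts a comment line, as in the classic cmudict file.
function buildSyllableMap(dictText) {
  const map = new Map();
  for (const rawLine of dictText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.length === 0 || line.startsWith(";;;")) continue;
    const word = line.split(/\s+/)[0];
    // Alternate pronunciations look like "WORD(1)"; keep only the first entry seen.
    const key = word.replace(/\(\d+\)$/, "").toLowerCase();
    if (!map.has(key)) map.set(key, syllablesFromCmuLine(line));
  }
  return map;
}
```

Keeping only the first pronunciation is a simplification: words with alternate pronunciations can have different syllable counts, which is one place the "fire" ambiguity shows up.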
Haiku validation is a binary gate: 5-7-5 or not. This makes it a perfect automated quality check: no subjective judgment, just counting. AI can generate haikus all day, but AI models frequently miscount syllables (they tokenize text differently than humans syllabify it). A deterministic syllable counter as a validation gate catches what the AI misses.
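The binary gate itself can be sketched in a few lines. This is an illustration under assumptions, not the original implementation: `isHaiku` is a hypothetical name, the haiku is assumed to arrive as three newline-separated lines, and `countSyllables` is a caller-supplied function that maps a word to a number.

```javascript
// Return true only if the text has exactly three non-empty lines whose
// syllable totals are 5, 7, and 5.
// `countSyllables` is a hypothetical word -> number function passed in by the caller.
function isHaiku(text, countSyllables) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length !== 3) return false;

  const target = [5, 7, 5];
  return lines.every((line, i) => {
    // Keep letters and apostrophes; punctuation and digits are ignored.
    const words = line.toLowerCase().match(/[a-z']+/g) || [];
    const total = words.reduce((sum, word) => sum + countSyllables(word), 0);
    return total === target[i];
  });
}
```

Passing the counter in as a parameter keeps the gate independent of where the counts come from, whether a dictionary, a heuristic, or both.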
The accuracy gap comes from words not in the dictionary — proper nouns, slang, neologisms, brand names. "Potatuhs" is not in any syllable dictionary. The heuristic fallback has to handle it (Po-ta-tuhs → 3 syllables, which is correct). Getting the heuristics right for edge cases is where the tool goes from "cool demo" to "actually useful."
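A rough sketch of a dictionary-first lookup with a heuristic fallback follows. The specific rules (vowel groups, silent trailing "e", silent "-ed") are illustrative assumptions rather than the tool's actual rule set, and the function names are hypothetical. Heuristics like these will still miss cases, "poem" being one of them.

```javascript
// Estimate syllables for a word that is not in the dictionary.
function estimateSyllables(word) {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;

  // Start from the number of vowel groups, treating "y" as a vowel.
  let count = (w.match(/[aeiouy]+/g) || []).length;

  // A trailing "e" is usually silent ("make"), except in consonant + "le" ("table").
  if (w.endsWith("e") && !/[^aeiouy]le$/.test(w)) count -= 1;

  // A trailing "ed" is usually silent ("baked") unless it follows t or d ("wanted").
  if (/[^aeiouytd]ed$/.test(w)) count -= 1;

  // Every real word has at least one syllable.
  return Math.max(count, 1);
}

// Prefer the dictionary count; fall back to the heuristic for unknown words.
// `dictionary` is assumed to be a Map of lowercase word -> syllable count.
function countSyllables(word, dictionary) {
  const known = dictionary.get(word.toLowerCase());
  return known !== undefined ? known : estimateSyllables(word);
}
```

With these rules, "Potatuhs" has three vowel groups (o, a, u) and no silent ending, so the estimate is 3.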

Key Insights

The haiku detector has a clear application in the Potatuhs ecosystem: Potato Literature wants a daily haiku on potatoliterature.com. AI can generate candidate haikus on any topic (potatoes, seasons, cooking, farming). The detector validates syllable counts. If a haiku fails validation, generate another. This creates an automated pipeline: prompt → generate → validate → publish if valid, retry if not. Zero human intervention for daily content.
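The prompt, generate, validate, retry loop described above might look something like the sketch below. `generate` and `validate` are hypothetical callbacks standing in for the AI call and the detector, and the retry cap is an added assumption so the loop cannot run forever. The loop is kept synchronous for brevity; a real model call would be asynchronous, and the loop would await it.

```javascript
// Generate candidates until one passes validation or the attempt budget runs out.
// `generate(attempt)` returns a candidate haiku string (hypothetical callback).
// `validate(candidate)` returns true if the candidate passes the 5-7-5 check.
function generateValidHaiku(generate, validate, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate(attempt);
    if (validate(candidate)) {
      return { haiku: candidate, attempts: attempt };
    }
  }
  // No candidate passed; the caller decides whether to skip the day or alert someone.
  return null;
}
```

Returning the attempt count alongside the haiku makes it easy to log how often the generator misses, which is a useful signal for tuning the prompt.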
Validation tools are more valuable than generation tools. Anyone can generate content with AI. The bottleneck is knowing whether the content is correct. A syllable counter for haikus, a grammar checker for prose, a linter for code — these are the quality gates that make automated content pipelines trustworthy. The generator is the engine. The validator is the brakes. You need both.
The 5-7-5 rule is actually debated among haiku practitioners: traditional Japanese haiku counts morae (sound units), not syllables, and many modern English-language haiku poets use fewer than 17 syllables to better approximate the brevity of the Japanese form. But for the 101 Potato Haikus series, 5-7-5 is the standard, and the detector enforces it. Knowing the rule well enough to debate it is different from enforcing it consistently at scale.
Connecting this tool to the Potato Literature daily haiku feature is a small example of the larger Potatuhs pattern: build tools that serve the ecosystem. The haiku detector is not a standalone product. It is infrastructure for a content pipeline that feeds a division of the brand. Tools that serve the ecosystem are more durable than tools that serve the market.

#NLP #syllable-counting #haiku #validation #Potato-Literature #Potatuhs #CMU-dictionary #automation #content-pipeline #text-processing

This post was composed through a conversation between Brett Owers and Claude Code (Anthropic). The content reflects Brett's recollection of each project and the lessons drawn from it. Some details may be approximate or omitted — the purpose is to paint an honest picture of a software engineer's development over time, not to serve as a precise historical record.