WriteLike: How it works, Why it works
2025-10-01
TL;DR
A detailed breakdown of WriteLike’s technical core. I explain the pipeline (spaCy feature extraction, statistical summarization, compact JSON prompt → Gemini), why each feature matters, and how the system balances determinism with generative interpretation. I also cover safeguards, ethics, and why this approach is robust vs naïve LLM-only analysis.
First off, what is WriteLike?
Short Version:
WriteLike’s analyzer turns a text into a compact, explainable representation of how the text works. It combines robust syntactic parsing (spaCy) with simple statistical feature engineering (sentence/token length distributions, clause counts, POS & dependency frequencies, lemma motifs, verb tense signals, and a small morphological sample) and then feeds a short, structured summary to a generative model (Gemini) to produce human-friendly advice.
That mix is powerful because the NLP layer (spaCy) extracts objective, reproducible signals, and the LLM layer translates those signals into readable, actionable guidance, so the system is both explainable and usable. The analyzer code computes summary statistics (mean / stdev / range) and an oscillation ratio that together reveal rhythm and variation.
Longer technical explanation: what is extracted, and why it matters
Pipeline (end-to-end):
- Language detection (langdetect) decides between English / Chinese pipelines.
- Syntactic parsing with spaCy: tokenization, POS tagging, dependency parsing, lemma extraction, and morphological features. The parsing and feature extraction live in analyze.py.
- Statistical summarization: compute distributions (mean, std, range) and an oscillation ratio (how often length/structure changes between adjacent sentences/tokens) using variance_measures. This gives rhythm and regularity metrics.
- Prompt composition & LLM call: pack the compact JSON-like stat summary into a prompt template and call Gemini for human-readable interpretation; the integration lives in main.py and the FastAPI app (app.py). A self-contained sketch of the whole flow follows this list.
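To make the flow concrete, here is a minimal, self-contained sketch of the end-to-end orchestration. The model names, prompt wording, and function layout are illustrative assumptions, not the exact code in analyze.py or main.py (the Gemini call itself is sketched further down):

import json
import statistics
import spacy
from langdetect import detect

def analyze(text: str) -> str:
    # 1. Language detection decides which spaCy pipeline to load.
    lang = detect(text)  # e.g. "en" or "zh-cn"
    model = "zh_core_web_sm" if lang.startswith("zh") else "en_core_web_sm"
    doc = spacy.load(model)(text)

    # 2./3. Parse, then reduce the parse to a few summary statistics.
    sent_lens = [sum(1 for t in s if not t.is_punct) for s in doc.sents]
    summary = {
        "lang": lang,
        "sentences": {
            "count": len(sent_lens),
            "avg_len": round(statistics.mean(sent_lens), 1),
            "std_len": round(statistics.pstdev(sent_lens), 1),
        },
    }

    # 4. Pack the compact JSON summary into the prompt for the generative layer.
    return ("You are a writing coach. Interpret these style statistics and "
            "give concrete advice:\n" + json.dumps(summary, indent=2))

print(analyze("Short sentence. Then a much longer, winding sentence follows it."))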
Features extracted and why each matters:
1. Sentence-length distribution (avg, stdev, range, oscillation_ratio)
- What: Number of tokens per sentence; summarized by mean, standard deviation, min/max range, and a simple oscillation metric (how often successive sentences differ).
- Why it matters: Controls rhythm and readability. Low variance → staccato/consistent style (short news-like sentences); high variance → more complex, flowing prose. Oscillation shows whether the writer alternates sentence lengths (a stylistic rhythm).
2. Token-length distribution (avg, stdev, range, oscillation_ratio)
- What: Average token (word) length excluding punctuation.
- Why: Proxy for lexical complexity (longer tokens → more morphologically complex words / technical vocabulary). Useful for readability estimates and "register" (colloquial vs academic).
3. Clauses per sentence (counts + variance)
- What: Using dependency labels (ccomp, xcomp, advcl, relcl, conj) to estimate clause counts per sentence (see the combined extraction sketch below).
- Why: Measures syntactic complexity and embedding depth. More clauses → denser informational style, possibly more subordinate clauses and complex arguments.
4. POS distribution (frequency of NOUN/VERB/ADJ/ADV/etc.)
- What: Count/frequency table of part-of-speech tags.
- Why: Reveals focus and style — noun-heavy texts are descriptive/nominalized; verb-heavy texts are dynamic/action-oriented. POS mix is a compact fingerprint of voice.
5. Dependency relation distribution
- What: Frequency of dependency labels (e.g., nsubj, dobj, advmod).
- Why: Shows typical syntactic patterns and the text’s structural orientation (e.g., heavy use of nominal modifiers vs adverbial modification).
6. Verb tense signals (token tags starting with V)
- What: Frequency of verb tense/morphological tags.
- Why: Tense/aspect pattern maps to temporal framing (past narrative vs present commentary) and can be a strong style cue.
7. Lemma frequency / motif detection
- What: Top lemmas (normalized lexical items) and counts.
- Why: Reveals themes and repetitive motifs (words/topics that the author returns to), useful for voice and topical fingerprinting.
8. Morphological samples
- What: Short samples of token morphs (from token.morph) for a few sentences.
- Why: Shows fine-grained grammatical patterns (case, number, aspect), which is especially valuable in morphologically rich languages.
9. Sentences & token lists (raw stats)
- What: Basic counts (sentence count, token count, etc.) and the sample excerpt included in the prompt.
- Why: Provide context to the LLM and allow reproducible downstream checks.
These features together form a multidimensional style embedding (human-readable): rhythm (sentence/token stats), syntactic complexity (clauses, deps), lexical choice (lemmas/token-length), and temporal/frame cues (verb tense). Because each dimension is interpretable, the overall summary is both thorough and explainable.
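To ground this, here is a minimal sketch showing how several of the signals above fall out of a single parsed spaCy Doc. The clause-label set mirrors item 3; variable names and the sample text are illustrative:

from collections import Counter
import spacy

CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "relcl", "conj"}  # labels from item 3

nlp = spacy.load("en_core_web_sm")
doc = nlp("She wrote quickly because the deadline loomed. The editor, "
          "who had seen worse, smiled. Deadlines focus the mind.")

# Item 3: one main clause plus one per clause-introducing dependency.
clauses_per_sent = [1 + sum(t.dep_ in CLAUSE_DEPS for t in s) for s in doc.sents]
# Items 4-5: POS and dependency frequency tables.
pos_counts = Counter(t.pos_ for t in doc if not t.is_punct)
dep_counts = Counter(t.dep_ for t in doc if not t.is_punct)
# Item 6: verb tense/morphology tags (Penn tags starting with "V").
tense_tags = Counter(t.tag_ for t in doc if t.tag_.startswith("V"))
# Item 7: top lemmas as motif candidates.
top_lemmas = Counter(t.lemma_.lower() for t in doc if t.is_alpha).most_common(5)
# Item 8: a small morphological sample from the first sentence.
morph_sample = [str(t.morph) for t in next(doc.sents)]

print(clauses_per_sent, pos_counts.most_common(3), tense_tags, top_lemmas, sep="\n")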
Implementation note: the analyzer computes variance_measures (mean, sd, range, oscillation_ratio) and formats the result into a prompt_template that Gemini then turns into readable advice. See analyze.py for the exact code and template usage.
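For reference, a variance_measures-style helper can be as small as the following. The 0.2 oscillation threshold and the exact signature are assumptions; see analyze.py for the real implementation:

import statistics

def variance_measures(values, tol=0.2):
    """Summary stats for a sequence of lengths (sentence or token lengths).

    oscillation_ratio: fraction of adjacent pairs whose relative difference
    exceeds `tol`, i.e. how often the writer 'changes gear'. The 0.2 threshold
    is an illustrative choice, not necessarily the one used in analyze.py.
    """
    if len(values) < 2:
        return {"mean": values[0] if values else 0, "sd": 0.0,
                "range": 0, "oscillation_ratio": 0.0}
    flips = sum(
        abs(a - b) / max(a, b) > tol
        for a, b in zip(values, values[1:])
        if max(a, b) > 0
    )
    return {
        "mean": round(statistics.mean(values), 2),
        "sd": round(statistics.pstdev(values), 2),
        "range": max(values) - min(values),
        "oscillation_ratio": round(flips / (len(values) - 1), 2),
    }

print(variance_measures([5, 21, 6, 19, 7]))  # alternating lengths -> high oscillation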
Feature engineering — how I turn text into numbers (and why that matters)
Feature engineering is the heart of WriteLike: the tool’s job is to translate surface writing into a compact, explainable vector of signals that together describe how a piece of writing behaves.
Concretely, I treat each document as a small structured object of numeric features — a human-readable “style vector” — and I try to pick features that are: (1) interpretable, (2) testable, and (3) complementary (they each add different information about rhythm, syntax, or lexical choice). Below are the main ideas and the practical steps I use.
Feature groups and engineering choices
- Rhythm & shape (continuous features)
- Sentence-length stats: mean, standard deviation, min/max, oscillation_ratio (how often successive sentences differ).
- Token-length stats: the same set of summary statistics. Why: these capture pacing and lexical density. I compute simple summary statistics because they’re easy to test and explain.
- Syntactic complexity (discrete → normalized)
- Clauses per sentence (estimated from dependency labels), frequency of subordinate clauses, coordination rates.
- Dependency relation distribution (normalized counts for nsubj, dobj, advmod, etc.). Why: shows how the author uses subordination, embedding, and modification, which are key structural fingerprints.
- Lexical profile (categorical → frequency vectors)
- POS distribution (NOUN, VERB, ADJ, ADV, PRON, etc.).
- Top lemmas and motif frequency (a short list of the most common lemmas normalized by document length).
- Token-level morphological markers (small sample) for languages where morphology matters. Why: captures voice (nominal vs verbal), thematic repetition, and register.
- Temporal/frame signals
- Verb tense/aspect distribution (counts of past/present/progressive markers). Why: tells you whether the sample is narrative, reflective, or instructional.
Practical engineering details I apply
- Normalization: counts are normalized by sentence or token counts so features are comparable across documents of different lengths.
- Compact summary JSON: features are serialized to a small JSON payload (a few KB at most) that I send to the LLM rather than the whole document. This keeps token use — and costs — low.
- Feature sparsity control: for lemma motifs I cap the list (top N) and drop extremely rare items; for POS/dependency features I keep only the most frequent labels to reduce noise (a small sketch of both steps follows this list).
- Testability: each numeric feature has a small unit test (sanity checks on a toy string) so I can detect regressions when models or spaCy versions change. Recruiters like this — it shows engineering discipline.
- Feature composition: in practice I don’t hand a single flat vector to the user; I group features (rhythm, syntax, lexicon) and provide summary statistics and simple explanations the LLM can translate into advice.
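A sketch of the normalization and sparsity-control steps, with top_n and min_count as illustrative parameters:

from collections import Counter

def compact_profile(pos_counts: Counter, lemma_counts: Counter,
                    n_tokens: int, top_n: int = 10, min_count: int = 2) -> dict:
    """Normalize raw counts and cap rare items so the payload stays small."""
    # Normalize by token count so documents of different lengths are comparable.
    pos = {tag: round(c / n_tokens, 2) for tag, c in pos_counts.most_common(top_n)}
    # Keep only the top-N lemmas and drop hapaxes (extremely rare items).
    lemmas = [w for w, c in lemma_counts.most_common(top_n) if c >= min_count]
    return {"pos": pos, "top_lemmas": lemmas}

profile = compact_profile(Counter({"NOUN": 29, "VERB": 18, "ADJ": 7}),
                          Counter({"data": 5, "model": 3, "quux": 1}), n_tokens=100)
print(profile)  # {'pos': {'NOUN': 0.29, ...}, 'top_lemmas': ['data', 'model']}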
Example (very small) payload I send to the LLM
{
"lang": "en",
"sentences": {"count": 8, "avg_len": 15.2, "std_len": 6.3, "oscillation": 0.25},
"tokens": {"avg_len": 4.6, "std_len": 2.1},
"pos": {"NOUN": 0.29, "VERB": 0.18, "ADJ": 0.07},
"top_lemmas": ["data","analysis","model"],
"verbs": {"past": 0.6, "present": 0.3}
}
This compact summary is enough for the LLM to produce targeted advice: “Your sentences are mostly short with low oscillation — try varying length to add rhythm,” or “High noun ratio — consider swapping some nouns for verbs to make prose more active.”
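For completeness, the generative step itself can be this small. This sketch assumes the google-generativeai SDK and a model name like gemini-1.5-flash; the actual model choice and prompt template live in main.py and may differ:

import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: the real app reads this from config
model = genai.GenerativeModel("gemini-1.5-flash")

summary = {
    "lang": "en",
    "sentences": {"count": 8, "avg_len": 15.2, "std_len": 6.3, "oscillation": 0.25},
    "pos": {"NOUN": 0.29, "VERB": 0.18, "ADJ": 0.07},
}

prompt = (
    "You are a writing coach. Given these style statistics, give 2-3 concrete, "
    "actionable suggestions. Do not invent facts about the author.\n"
    + json.dumps(summary, indent=2)
)
print(model.generate_content(prompt).text)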
Why an engineered approach is better than raw LLM-only analysis (for this product)
- It’s explainable: every recommendation maps back to a measurable feature you can inspect and test.
- It’s efficient: fewer tokens to the LLM = lower cost + lower latency.
- It’s stable: deterministic spaCy parsing and numeric features let me write unit tests and detect regressions automatically.
- It’s compositional: you can mix and match features (e.g., rhythm + POS) to target specific stylistic edits.
Why this approach is more robust than naïve LLM-only analysis
- Determinism & interpretability: spaCy-derived signals are deterministic and testable, so you can explain why the LLM advised X.
- Token-efficiency / cost control: sending a tight, structured summary to the LLM is far cheaper and faster than sending whole documents, and it reduces noisy hallucination risk (main.py and app.py show that the system sends a compact prompt, not the entire text).
- Easier testing & unit coverage: stats are numeric and can be unit tested (the repo already includes small sanity checks) rather than relying wholly on subjective LLM outputs; a toy example follows this list.
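A toy example of the kind of sanity check this enables, assuming the variance_measures sketch from earlier is importable from analyze (the repo’s actual tests may look different):

# test_variance.py -- run with pytest
from analyze import variance_measures  # assumed module layout

def test_uniform_lengths_do_not_oscillate():
    stats = variance_measures([10, 10, 10, 10])
    assert stats["sd"] == 0.0
    assert stats["oscillation_ratio"] == 0.0

def test_alternating_lengths_oscillate():
    stats = variance_measures([5, 20, 5, 20])
    assert stats["oscillation_ratio"] == 1.0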
Ethics & limitations — what WriteLike is for (and what it isn’t)
I want to be upfront about what this tool is built to do, and where it stops. In short: WriteLike is meant to help you understand the components of style and give concrete, learnable edits — not to teach people how to copy another writer verbatim.
What I want WriteLike to do (the intent)
- Help writers answer the question I used to ask myself: “What made that passage work?” — at the linguistic level.
- Show how style is composed (rhythm, clause use, lemma choices) and give short, concrete edits or rewrites that illustrate those ideas.
- Help you incorporate bits of style into your writing repertoire while keeping authorship and originality: implement the pattern, not the sentence.
What WriteLike is not (and why I say this explicitly)
- It is not a tool to create verbatim copies of a living author’s style for publication without attribution. Style imitation raises ethical and copyright concerns, and writing is more than patterns — it’s intention, experience, and voice.
- It will not perfectly capture creative, cultural, or contextual nuances. Linguistic features are only a slice of what makes writing compelling.
Practical safeguards & ethical choices I built in
- Transparency: the UI shows both the numeric features and the generated advice so users see the source signals behind a recommendation. This makes the output interpretable and less like a magical rewrite.
- Privacy & rate-limits: I enforce per-IP rate limits (beta caps) and don’t persist submitted text beyond short-lived analysis caches (when caching is enabled). If you want longer retention, there should be explicit consent.
- Plagiarism & attribution guidance: rewrites are meant as illustrative templates — I include a short UI note encouraging users to rephrase and credit long-form inspiration when appropriate.
- Multilingual honesty: I mark Chinese analysis as experimental in the UI — I don’t hide model uncertainty. That’s important: being explicit about where the tool is reliable and where it isn’t builds trust.
- No hallucination dependency: because the LLM gets a compact, structured prompt (not the entire document), the generative layer’s role is to interpret statistics, not invent factual claims about the author or text.
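A minimal sketch of the per-IP limiter, using a plain in-memory counter; the production mechanism and the exact caps may differ (10 requests/hour here is illustrative):

import time
from collections import defaultdict
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WINDOW, CAP = 3600, 10            # illustrative beta cap: 10 analyses/hour/IP
hits: dict[str, list[float]] = defaultdict(list)

@app.post("/analyze")
async def analyze_endpoint(request: Request):
    ip = request.client.host
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]  # drop expired hits
    if len(hits[ip]) >= CAP:
        raise HTTPException(status_code=429, detail="Beta rate limit reached")
    hits[ip].append(now)
    ...  # run the analysis; submitted text is not persisted afterwards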
A short note on creativity & learning (personal)
For me, WriteLike was useful because I often read a piece and couldn’t say what exactly made it feel “good.” Breaking style into measurable pieces helped me learn — not copy — and that’s what I want the tool to do for other writers. I absorb elements from lots of authors, and I hope WriteLike helps people do the same intentionally: borrow structures and rhythms, not strings.