WriteLike: How it works, Why it works
2025-10-01
TL;DR
A detailed breakdown of WriteLike’s technical core. I explain the pipeline (spaCy feature extraction, statistical summarization, compact JSON prompt → Gemini), why each feature matters, and how the system balances determinism with generative interpretation. I also cover safeguards, ethics, and why this approach is robust vs naïve LLM-only analysis.
First off, what is WriteLike?
Short Version:
WriteLike’s analyzer turns a text into a compact, explainable representation of how the text works. It combines robust syntactic parsing (spaCy) with simple statistical feature engineering (sentence/token length distributions, clause counts, POS & dependency frequencies, lemma motifs, verb tense signals, and a small morphological sample) and then feeds a short, structured summary to a generative model (Gemini) to produce human-friendly advice.
That mix is powerful because the NLP layer (spaCy) extracts objective, reproducible signals, and the LLM layer translates those signals into readable, actionable guidance, so the system is both explainable and usable. The analyzer code computes summary statistics (mean / stdev / range) and an oscillation ratio that together reveal rhythm and variation.
Longer technical explanation: what is extracted, and why it matters
Pipeline (end-to-end):
- Language detection (langdetect) decides between English / Chinese pipelines.
- Syntactic parsing with spaCy: tokenization, POS tagging, dependency parsing, lemma extraction, and morphological features. The parsing and feature extraction live in analyze.py.
- Statistical summarization: compute distributions (mean, std, range) and an oscillation ratio (how often length/structure changes between adjacent sentences/tokens) using variance_measures. This gives rhythm and regularity metrics.
- Prompt composition & LLM call: pack the compact JSON-like stat summary into a prompt template and call Gemini for human-readable interpretation; the integration lives in main.py and the FastAPI app (app.py). A self-contained sketch of the whole flow follows this list.
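To make the flow concrete, here is a minimal, self-contained sketch of the end-to-end orchestration. The model names, prompt wording, and function layout are illustrative assumptions, not the exact code in analyze.py or main.py (the Gemini call itself is sketched further down):

import json
import statistics
import spacy
from langdetect import detect

def analyze(text: str) -> str:
    # 1. Language detection decides which spaCy pipeline to load.
    lang = detect(text)  # e.g. "en" or "zh-cn"
    model = "zh_core_web_sm" if lang.startswith("zh") else "en_core_web_sm"
    doc = spacy.load(model)(text)

    # 2./3. Parse, then reduce the parse to a few summary statistics.
    sent_lens = [sum(1 for t in s if not t.is_punct) for s in doc.sents]
    summary = {
        "lang": lang,
        "sentences": {
            "count": len(sent_lens),
            "avg_len": round(statistics.mean(sent_lens), 1),
            "std_len": round(statistics.pstdev(sent_lens), 1),
        },
    }

    # 4. Pack the compact JSON summary into the prompt for the generative layer.
    return ("You are a writing coach. Interpret these style statistics and "
            "give concrete advice:\n" + json.dumps(summary, indent=2))

print(analyze("Short sentence. Then a much longer, winding sentence follows it."))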
Features extracted and why each matters:
1. Sentence-length distribution (avg, stdev, range, oscillation_ratio)
- What: Number of tokens per sentence; summarized by mean, standard deviation, min/max range, and a simple oscillation metric (how often successive sentences differ).
- Why it matters: Controls rhythm and readability. Low variance → staccato/consistent style (short news-like sentences); high variance → more complex, flowing prose. Oscillation shows whether the writer alternates sentence lengths (a stylistic rhythm).
2. Token-length distribution (avg, stdev, range, oscillation_ratio)
- What: Average token (word) length excluding punctuation.
- Why: Proxy for lexical complexity (longer tokens → more morphologically complex words / technical vocabulary). Useful for readability estimates and "register" (colloquial vs academic).
3. Clauses per sentence (counts + variance)
- What: Using dependency labels (ccomp, xcomp, advcl, relcl, conj) to estimate clause counts per sentence (see the combined extraction sketch below).
- Why: Measures syntactic complexity and embedding depth. More clauses → denser informational style, possibly more subordinate clauses and complex arguments.
4. POS distribution (frequency of NOUN/VERB/ADJ/ADV/etc.)
- What: Count/frequency table of part-of-speech tags.
- Why: Reveals focus and style — noun-heavy texts are descriptive/nominalized; verb-heavy texts are dynamic/action-oriented. POS mix is a compact fingerprint of voice.
5. Dependency relation distribution
- What: Frequency of dependency labels (e.g., nsubj, dobj, advmod).
- Why: Shows typical syntactic patterns and the text’s structural orientation (e.g., heavy use of nominal modifiers vs adverbial modification).
6. Verb tense signals (token tags starting with V)
- What: Frequency of verb tense/morphological tags.
- Why: Tense/aspect pattern maps to temporal framing (past narrative vs present commentary) and can be a strong style cue.
7. Lemma frequency / motif detection
- What: Top lemmas (normalized lexical items) and counts.
- Why: Reveals themes and repetitive motifs (words/topics that the author returns to), useful for voice and topical fingerprinting.
8. Morphological samples
- What: Short samples of token morphs (from token.morph) for a few sentences.
- Why: Shows fine-grained grammatical patterns (case, number, aspect), which is especially valuable in morphologically rich languages.
9. Sentences & token lists (raw stats)
- What: Basic counts (sentence count, token count, etc.) and the sample excerpt included in the prompt.
- Why: Provide context to the LLM and allow reproducible downstream checks.
These features together form a multidimensional style embedding (human-readable): rhythm (sentence/token stats), syntactic complexity (clauses, deps), lexical choice (lemmas/token-length), and temporal/frame cues (verb tense). Because each dimension is interpretable, the overall summary is both thorough and explainable.
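To ground this, here is a minimal sketch showing how several of the signals above fall out of a single parsed spaCy Doc. The clause-label set mirrors item 3; variable names and the sample text are illustrative:

from collections import Counter
import spacy

CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "relcl", "conj"}  # labels from item 3

nlp = spacy.load("en_core_web_sm")
doc = nlp("She wrote quickly because the deadline loomed. The editor, "
          "who had seen worse, smiled. Deadlines focus the mind.")

# Item 3: one main clause plus one per clause-introducing dependency.
clauses_per_sent = [1 + sum(t.dep_ in CLAUSE_DEPS for t in s) for s in doc.sents]
# Items 4-5: POS and dependency frequency tables.
pos_counts = Counter(t.pos_ for t in doc if not t.is_punct)
dep_counts = Counter(t.dep_ for t in doc if not t.is_punct)
# Item 6: verb tense/morphology tags (Penn tags starting with "V").
tense_tags = Counter(t.tag_ for t in doc if t.tag_.startswith("V"))
# Item 7: top lemmas as motif candidates.
top_lemmas = Counter(t.lemma_.lower() for t in doc if t.is_alpha).most_common(5)
# Item 8: a small morphological sample from the first sentence.
morph_sample = [str(t.morph) for t in next(doc.sents)]

print(clauses_per_sent, pos_counts.most_common(3), tense_tags, top_lemmas, sep="\n")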
Implementation note: the analyzer computes variance_measures (mean, sd, range, oscillation_ratio) and formats the result into a prompt_template that Gemini then turns into readable advice. See analyze.py for the exact code and template usage.
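For reference, a variance_measures-style helper can be as small as the following. The 0.2 oscillation threshold and the exact signature are assumptions; see analyze.py for the real implementation:

import statistics

def variance_measures(values, tol=0.2):
    """Summary stats for a sequence of lengths (sentence or token lengths).

    oscillation_ratio: fraction of adjacent pairs whose relative difference
    exceeds `tol`, i.e. how often the writer 'changes gear'. The 0.2 threshold
    is an illustrative choice, not necessarily the one used in analyze.py.
    """
    if len(values) < 2:
        return {"mean": values[0] if values else 0, "sd": 0.0,
                "range": 0, "oscillation_ratio": 0.0}
    flips = sum(
        abs(a - b) / max(a, b) > tol
        for a, b in zip(values, values[1:])
        if max(a, b) > 0
    )
    return {
        "mean": round(statistics.mean(values), 2),
        "sd": round(statistics.pstdev(values), 2),
        "range": max(values) - min(values),
        "oscillation_ratio": round(flips / (len(values) - 1), 2),
    }

print(variance_measures([5, 21, 6, 19, 7]))  # alternating lengths -> high oscillation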
Feature engineering — how I turn text into numbers (and why that matters)
Feature engineering is the heart of WriteLike: the tool’s job is to translate surface writing into a compact, explainable vector of signals that together describe how a piece of writing behaves.
Concretely, I treat each document as a small structured object of numeric features — a human-readable “style vector” — and I try to pick features that are: (1) interpretable, (2) testable, and (3) complementary (they each add different information about rhythm, syntax, or lexical choice). Below are the main ideas and the practical steps I use.
Feature groups and engineering choices
- Rhythm & shape (continuous features)
- Sentence-length stats: mean, standard deviation, min/max, oscillation_ratio (how often successive sentences differ).
- Token-length stats: the same set of summary statistics. Why: these capture pacing and lexical density. I compute simple summary statistics because they’re easy to test and explain.
- Syntactic complexity (discrete → normalized)
- Clauses per sentence (estimated from dependency labels), frequency of subordinate clauses, coordination rates.
- Dependency relation distribution (normalized counts for nsubj, dobj, advmod, etc.). Why: shows how the author uses subordination, embedding, and modification, which are key structural fingerprints.
- Lexical profile (categorical → frequency vectors)
- POS distribution (NOUN, VERB, ADJ, ADV, PRON, etc.).
- Top lemmas and motif frequency (a short list of the most common lemmas normalized by document length).
- Token-level morphological markers (small sample) for languages where morphology matters. Why: captures voice (nominal vs verbal), thematic repetition, and register.
- Temporal/frame signals
- Verb tense/aspect distribution (counts of past/present/progressive markers). Why: tells you whether the sample is narrative, reflective, or instructional.
Practical engineering details I apply
- Normalization: counts are normalized by sentence or token counts so features are comparable across documents of different lengths.
- Compact summary JSON: features are serialized to a small JSON payload (a few KB at most) that I send to the LLM rather than the whole document. This keeps token use — and costs — low.
- Feature sparsity control: for lemma motifs I cap the list (top N) and drop extremely rare items; for POS/dependency features I keep only the most frequent labels to reduce noise (a small sketch of both steps follows this list).
- Testability: each numeric feature has a small unit test (sanity checks on a toy string) so I can detect regressions when models or spaCy versions change. Recruiters like this — it shows engineering discipline.
- Feature composition: in practice I don’t hand a single flat vector to the user; I group features (rhythm, syntax, lexicon) and provide summary statistics and simple explanations the LLM can translate into advice.
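A sketch of the normalization and sparsity-control steps, with top_n and min_count as illustrative parameters:

from collections import Counter

def compact_profile(pos_counts: Counter, lemma_counts: Counter,
                    n_tokens: int, top_n: int = 10, min_count: int = 2) -> dict:
    """Normalize raw counts and cap rare items so the payload stays small."""
    # Normalize by token count so documents of different lengths are comparable.
    pos = {tag: round(c / n_tokens, 2) for tag, c in pos_counts.most_common(top_n)}
    # Keep only the top-N lemmas and drop hapaxes (extremely rare items).
    lemmas = [w for w, c in lemma_counts.most_common(top_n) if c >= min_count]
    return {"pos": pos, "top_lemmas": lemmas}

profile = compact_profile(Counter({"NOUN": 29, "VERB": 18, "ADJ": 7}),
                          Counter({"data": 5, "model": 3, "quux": 1}), n_tokens=100)
print(profile)  # {'pos': {'NOUN': 0.29, ...}, 'top_lemmas': ['data', 'model']}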
Example (very small) payload I send to the LLM
{
"lang": "en",
"sentences": {"count": 8, "avg_len": 15.2, "std_len": 6.3, "oscillation": 0.25},
"tokens": {"avg_len": 4.6, "std_len": 2.1},
"pos": {"NOUN": 0.29, "VERB": 0.18, "ADJ": 0.07},
"top_lemmas": ["data","analysis","model"],
"verbs": {"past": 0.6, "present": 0.3}
}
This compact summary is enough for the LLM to produce targeted advice: “Your sentences are mostly short with low oscillation — try varying length to add rhythm,” or “High noun ratio — consider swapping some nouns for verbs to make prose more active.”
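For completeness, the generative step itself can be this small. This sketch assumes the google-generativeai SDK and a model name like gemini-1.5-flash; the actual model choice and prompt template live in main.py and may differ:

import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: the real app reads this from config
model = genai.GenerativeModel("gemini-1.5-flash")

summary = {
    "lang": "en",
    "sentences": {"count": 8, "avg_len": 15.2, "std_len": 6.3, "oscillation": 0.25},
    "pos": {"NOUN": 0.29, "VERB": 0.18, "ADJ": 0.07},
}

prompt = (
    "You are a writing coach. Given these style statistics, give 2-3 concrete, "
    "actionable suggestions. Do not invent facts about the author.\n"
    + json.dumps(summary, indent=2)
)
print(model.generate_content(prompt).text)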
Why an engineered approach is better than raw LLM-only analysis (for this product)
- It’s explainable: every recommendation maps back to a measurable feature you can inspect and test.
- It’s efficient: fewer tokens to the LLM = lower cost + lower latency.
- It’s stable: deterministic spaCy parsing and numeric features let me write unit tests and detect regressions automatically.
- It’s compositional: you can mix and match features (e.g., rhythm + POS) to target specific stylistic edits.
Why this approach is more robust than naïve LLM-only analysis
- Determinism & interpretability: spaCy-derived signals are deterministic and testable, so you can explain why the LLM advised X.
- Token-efficiency / cost control: sending a tight, structured summary to the LLM is far cheaper and faster than sending whole documents, and it reduces noisy hallucination risk (main.py and app.py show that the system sends a compact prompt, not the entire text).
- Easier testing & unit coverage: stats are numeric and can be unit tested (the repo already includes small sanity checks) rather than relying wholly on subjective LLM outputs; a toy example follows this list.
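A toy example of the kind of sanity check this enables, assuming the variance_measures sketch from earlier is importable from analyze (the repo’s actual tests may look different):

# test_variance.py -- run with pytest
from analyze import variance_measures  # assumed module layout

def test_uniform_lengths_do_not_oscillate():
    stats = variance_measures([10, 10, 10, 10])
    assert stats["sd"] == 0.0
    assert stats["oscillation_ratio"] == 0.0

def test_alternating_lengths_oscillate():
    stats = variance_measures([5, 20, 5, 20])
    assert stats["oscillation_ratio"] == 1.0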
Ethics & limitations — what WriteLike is for (and what it isn’t)
I want to be upfront about what this tool is built to do, and where it stops. In short: WriteLike is meant to help you understand the components of style and give concrete, learnable edits — not to teach people how to copy another writer verbatim.
What I want WriteLike to do (the intent)
- Help writers answer the question I used to ask myself: “What made that passage work?” — at the linguistic level.
- Show how style is composed (rhythm, clause use, lemma choices) and give short, concrete edits or rewrites that illustrate those ideas.
- Help you incorporate bits of style into your writing repertoire while keeping authorship and originality: implement the pattern, not the sentence.
What WriteLike is not (and why I say this explicitly)
- It is not a tool to create verbatim copies of a living author’s style for publication without attribution. Style imitation raises ethical and copyright concerns, and writing is more than patterns — it’s intention, experience, and voice.
- It will not perfectly capture creative, cultural, or contextual nuances. Linguistic features are only a slice of what makes writing compelling.
Practical safeguards & ethical choices I built in
- Transparency: the UI shows both the numeric features and the generated advice so users see the source signals behind a recommendation. This makes the output interpretable and less like a magical rewrite.
- Privacy & rate-limits: I enforce per-IP rate limits (beta caps) and don’t persist submitted text beyond short-lived analysis caches (when caching is enabled). If you want longer retention, there should be explicit consent.
- Plagiarism & attribution guidance: rewrites are meant as illustrative templates — I include a short UI note encouraging users to rephrase and credit long-form inspiration when appropriate.
- Multilingual honesty: I mark Chinese analysis as experimental in the UI — I don’t hide model uncertainty. That’s important: being explicit about where the tool is reliable and where it isn’t builds trust.
- No hallucination dependency: because the LLM gets a compact, structured prompt (not the entire document), the generative layer’s role is to interpret statistics, not invent factual claims about the author or text.
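A minimal sketch of the per-IP limiter, using a plain in-memory counter; the production mechanism and the exact caps may differ (10 requests/hour here is illustrative):

import time
from collections import defaultdict
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WINDOW, CAP = 3600, 10            # illustrative beta cap: 10 analyses/hour/IP
hits: dict[str, list[float]] = defaultdict(list)

@app.post("/analyze")
async def analyze_endpoint(request: Request):
    ip = request.client.host
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]  # drop expired hits
    if len(hits[ip]) >= CAP:
        raise HTTPException(status_code=429, detail="Beta rate limit reached")
    hits[ip].append(now)
    ...  # run the analysis; submitted text is not persisted afterwards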
A short note on creativity & learning (personal)
For me, WriteLike was useful because I often read a piece and couldn’t say what exactly made it feel “good.” Breaking style into measurable pieces helped me learn — not copy — and that’s what I want the tool to do for other writers. I absorb elements from lots of authors, and I hope WriteLike helps people do the same intentionally: borrow structures and rhythms, not strings.