At the heart of every captivating story lies tone—the elusive essence that breathes life into words, infusing them with personality, emotion, and rhythm. It's the subtle dance between structure and word choice that transforms mere sentences into art. For language models, mastering tone isn't just a luxury; I believe it's a key to taming them.

But how does one measure tone? How do we teach machines to grasp something so inherently subjective, so deeply woven into the human experience?

I: Charting the Unseen

[1] Reference

Programming Historian: Introduction to Stylometry with Python

Turns out there's an entire field dedicated to this pursuit: stylometry, the statistical study of literary style. After a brief conversation with ChatGPT, I found myself diving into stylometric analysis, guided by an excellent resource on Stylometry Methods in Python [1].

Allow me to introduce our cast of literary voices: J.K. Rowling, weaving magic with her words; Tade Thompson, navigating the complexities of science fiction; and Andre Agassi, sharing intimate reflections in his autobiography (though perhaps penned with the aid of a ghostwriter).

My journey began with the fundamentals: word length. Employing Mendenhall's Characteristic Curves of Composition, I examined how each author distributes words of different lengths throughout their texts. Interestingly, at first glance, a striking similarity emerged—a testament to the shared building blocks of language.

But that's just the surface. Digging deeper, I turned to word frequency, comparing not only books by the same author but also across different authors. Our dataset includes the first two books of the Harry Potter series, the opening volumes of The Wormwood Trilogy, and Andre Agassi's candid autobiography.

To quantify these stylistic nuances, I employed the Jensen-Shannon divergence—a metric that measures the similarity between probability distributions. In this context, lower values signify closer alignment in writing styles, while higher values highlight greater divergence. This metric captures the subtle shifts in word choice and pattern that define an author's unique voice.

Jensen-Shannon Heatmap

0.460

0.000

HP1

HP2

WT1

WT2

HP1

0.000

0.239

0.456

0.457

0.444

HP2

0.239

0.000

0.459

0.460

0.450

WT1

0.456

0.459

0.000

0.300

0.335

WT2

0.457

0.460

0.300

0.000

0.358

0.444

0.450

0.335

0.358

0.000

The resulting heatmap paints a picture of stylistic relationships. Unsurprisingly, works by the same author—such as the first two Harry Potter books or the initial entries in The Wormwood Trilogy—cluster together with lower divergence values, affirming the consistency of their narrative voices.

Yet, despite these insights, the method isn't foolproof. The varied genres and styles at play mean that Jensen-Shannon divergence alone isn't a definitive fingerprint for authorship. It's a piece of the puzzle, but not the whole picture.

Turning our gaze to punctuation, the differences become even more pronounced. J.K. Rowling's text dances with quotation marks, single quotes, and apostrophes—hallmarks of dialogue-rich storytelling. Her characters' voices leap off the page, a chorus of interaction that defines the Harry Potter series. In contrast, Tade Thompson and Andre Agassi wield punctuation with measured restraint, favoring the steady rhythm of periods and commas.

II: Data Alchemy

Storytelling often hinges on the interplay between its narrative elements, a dynamic I explored called the 'paragraph aspect ratio.' This concept offered a revealing glimpse into the mechanics of storytelling by analyzing the balance of different narrative components within a text. To explore this further, I first needed a robust and diverse training dataset capable of reflecting an author's stylistic tendencies.

To construct this dataset, I began by dividing each book into its component paragraphs and leveraging GPT-3.5 to generate concise, meaningful summaries for each. This approach created structured paragraph-summary pairs, which served as the foundation for the fine-tuning process. I found that limiting the dataset to no more than 1,000 training pairs was critical; exceeding this threshold often led to overfitting. At that point, the model began echoing the source material verbatim—becoming more of a parrot than a poet.

Training Samples	Duration	Tokens Billed	Tokens/Sample
800	4h 29m 7s	584,000	730
600	2h 42m 39s	438,000	730
300	1h 15m 18s	222,000	740

With the dataset prepared, I launched a series of fine-tuning experiments using Azure OpenAI Studio. My primary goal was to explore how varying the number of training samples influenced the model's ability to adapt and capture an author's distinct style. By testing different dataset sizes, I could identify the sweet spot where the model demonstrated flexibility in generating stylistically consistent text without falling into overfitting. These experiments provided valuable insights into the relationship between dataset size and model performance.

Once fine-tuning was complete, I turned my attention to dissecting the author's stylistic patterns. Using the Gemini Pro's enum output, I categorized the paragraphs into five key narrative elements: action, description, exposition, inner thoughts, and dialogue. This classification provided a precise lens through which to examine the structure of storytelling, revealing how each narrative element contributed to the overall composition. By aligning these categories with the fine-tuned model's outputs, I gained a deeper understanding of how style and structure interact in the art of storytelling.

What is truly striking is how these authors begin to diverge in their use of these narrative elements.

The radar chart maps the distinct storytelling signatures of our authors. J.K. Rowling's narrative, outlined in black, strikes a delicate equilibrium, with action and dialogue weaving seamlessly together. This blend conjures her signature: richly detailed yet fast-moving scenes that have captured imaginations around the globe.

In vivid orange, Tade Thompson's style emerges, leaning heavily into action and exposition. This combination reflects the dynamic yet thoughtful nature of his science fiction worlds, where every moment pulses with movement and every idea is grounded in carefully crafted context.

Andre Agassi's autobiography unfolds as an intimate first-person account, drawing heavily on inner thoughts and exposition. The result is a raw, unfiltered narrative—a window into his emotional landscape as he revisits his triumphs and struggles, both on and off the court.

Capturing the Magic of J.K. Rowling's Style

Step into the transformation of GPT-4 as it learns to embody the essence of J.K. Rowling's writing. With each fine-tuning step, the model's radar chart unfolds like a living narrative, tracing its evolution across five key stylistic dimensions. The chart, juxtaposing Rowling's original style (brown) with the model's developing voice (orange), tells a story of growing alignment.

An interactive slider on the right allows you to explore this journey visually, sliding through the stages to see how the model's voice evolves with additional training samples.

Base300600800

III: Drawing the Curtain

Base300600800

On the chart to the left, you can explore how fine-tuning and prompt engineering each contribute to the model's stylistic evolution. With the interactive slider, adjust the emphasis on inner thoughts in the prompt and see how this, alongside fine-tuning, brings the model closer to Rowling's unique blend of action and introspection.

The fusion of traditional stylometry with language models unveils a potential framework for taming these models.

While fine-tuning edged us closer to mirroring J.K. Rowling's distinctive harmony of action and inner thoughts, perhaps the most compelling discovery was that smaller, more focused training sets—around 600 samples—often outperformed their larger counterparts.

This hints at a fundamental truth: in the art of data science, the quality and precision of training data eclipses sheer quantity.

Most intriguingly, our exploration illuminated that writing style transcends mere word choice or sentence structure; it's an intricate choreography of narrative elements. Much like a master chef balancing flavors, great writers weave together action, dialogue, and description, each with their own unique rhythm.

As our story comes to a close, I find myself pondering how this applies to data extraction tasks as well. While we can quantify and even replicate aspects of style, the true alchemy lies not in the individual ingredients but in the ineffable way they meld. Our memories and experiences shape our voice in ways that defy precise measurement, coloring every phrase, every pause, and every turn of thought. Yet, one inescapable question emerges for me: how do writers with an inner voice differ from those without one?

Where to Take It from Here

While I can scarcely hold a candle to the work happening at labs like OpenAI and Anthropic, I believe that this field is within reach for anyone willing to consistently try. Here are three promising avenues I think would be interesting to explore, each with its own level of complexity. I hope to explore these paths more in the future, but should you happen to arrive first, I invite you to connect; I will happily comment and share your work.

Synthetic Dataset Generation

Easy

Synthesize paragraphs that deliberately skew toward specific narrative aspects—for instance, 80% dialogue and 20% description—to assess whether fine-tuning can be more precisely steered along chosen stylistic dimensions.

Use GPT-4 to generate targeted training data
Fine-tune on extreme distributions
Measure style transfer precision

Embeddings vs. Stylometry

Medium

Compare transformer embeddings with traditional stylometric measures to understand if modern language models naturally capture the same writing style attributes we analyzed.

Generate embeddings for text samples
Correlate with stylometric features
Map embedding dimensions to attributes

Statistical Validation

Hard

Compare human and LLM evaluations of style transfer quality against stylometric measures to validate whether statistical approaches effectively capture subjective writing style assessment.

Generate large sample outputs
Run full stylometric analysis
Compare evaluation approaches

The Grammar of Thought: Does Fine-Tuning Make a Difference?

I: Charting the Unseen

Jensen-Shannon Heatmap

II: Data Alchemy

Capturing the Magic of J.K. Rowling's Style

III: Drawing the Curtain

Where to Take It from Here

Synthetic Dataset Generation

Embeddings vs. Stylometry

Statistical Validation

Let's discuss how I can help your business