My Book and Machine Learning Odyssey: Part 2

The Holy Grail Of Publishing: Bestseller Prediction

After validating that I could successfully classify novels into industry standard book categories (BISAC), I turned my focus back to the prize of predicting the next bestseller. I tackled this problem along two lines:

  • Product validation: validate that literary agents (the gatekeepers for traditionally published books – typically the first people from the industry to set eyes on a future bestseller) would use a service like this. If I could prove viability with this group, I’d likely have a path into publishers and then on to spinning up an AI-driven book publisher.
  • Technical feasibility: train a model to successfully predict the marketability of a book.

Slush Filter

In publishing, the flood of unsolicited manuscripts an agent receives every day (in some cases 70-80 per day) lands in the so-called “slush pile”. Often a junior agent or even an intern will conduct the first pass at filtering these manuscripts. One of the challenges for agents, though, is passing over that diamond in the rough – as famously happened with Harry Potter, which was rejected twelve times.

The thinking was that if literary agents could run unsolicited manuscripts through a model, it might highlight a promising manuscript they would ordinarily have overlooked. Not wanting to build a whole web app to support the test, I built an email service that accepted manuscripts as attachments, analyzed them and emailed the results back within a few seconds. This fit the agents’ existing workflow – many received manuscripts via email and forwarded them to colleagues after review.

Finding A Ground Truth For Sales (A Tale Of Imbalance)

Agents are searching for books that sell, so I needed a value to represent successful book sales (since book sales data is not publicly available). The obvious first choice is the NYTimes bestseller list, but a little research makes it clear that it’s not a particularly reliable source given its subjectivity. Amazon bestseller ranks are another obvious choice, but sales rank is an aged value computed over a window of roughly an hour, so it wasn’t suitable either. I also played around with other potential features such as product count (the number of ISBNs a book has) as a measure of popularity – more editions typically signal that a book has sold well over a number of years.

In the end, I settled on predicting whether a book was traditionally published (accepted by publishing houses and invested in) vs. self-published (a proxy for books that may have been relegated to the slush pile). As you can imagine, the datasets were heavily imbalanced, so I tried a number of techniques, from under- and over-sampling to adjust the class distribution, to reweighting/boosting attributes. Later I would also experiment with cost-sensitive classifiers.
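To make the rebalancing concrete, here is a minimal sketch using Weka’s supervised resampling filters – illustrative only, not the exact pipeline I ran, and books.arff is a hypothetical dataset with one row per book and a traditional/self-published class:

```scala
import weka.core.converters.ConverterUtils.DataSource
import weka.filters.Filter
import weka.filters.supervised.instance.{Resample, SpreadSubsample}

object Rebalance {
  def main(args: Array[String]): Unit = {
    // Illustrative dataset: sparse term features + traditional vs. self-published class label
    val data = new DataSource("books.arff").getDataSet
    data.setClassIndex(data.numAttributes - 1)

    // Over-sampling (with replacement), biased towards a uniform class distribution
    val over = new Resample()
    over.setBiasToUniformClass(1.0)
    over.setSampleSizePercent(100.0)
    over.setInputFormat(data)
    val overSampled = Filter.useFilter(data, over)

    // Under-sampling: cap the majority class so the class spread is at most 1:1
    val under = new SpreadSubsample()
    under.setDistributionSpread(1.0)
    under.setInputFormat(data)
    val underSampled = Filter.useFilter(data, under)

    println(s"original=${data.numInstances}, over=${overSampled.numInstances}, under=${underSampled.numInstances}")
  }
}
```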

Feature Engineering & Dimension Reduction (Part 1)

But first, let’s talk a bit more about feature engineering, which we touched on in Part 1. Classifying books means dealing with very large, sparse matrices. My initial, naive feature engineering approach was to use Bag of Words and TF-IDF. The dictionary I landed on was over 7 million terms, since each word was tagged with a Part-of-Speech (POS) label to indicate how it was used in a sentence (rather than simply counting the raw word). On this large dictionary, I applied a number of simple approaches to shrink the feature set:

  • Removing stop words
  • Filtering out character names and places via Named Entity Recognition (NER), since classifiers would otherwise match on these terms inappropriately
  • Applying an inflector (Inflector) to transform plural uses of words to their singular form
  • Applying LingPipe to extract significant phrases

But the most effective naive dimension reduction method was to shrink the dictionary by IDF value (i.e. carve out the most common words by POS), which is really just an advanced form of stop word filtering.
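As a rough illustration of that IDF pruning step, the plain-Scala sketch below (over a hypothetical corpus of POS-tagged documents) drops the lowest-IDF slice of the dictionary:

```scala
// Rough sketch of IDF-based dictionary pruning over POS-tagged terms.
// `docs` is a hypothetical corpus: each document is a sequence of (word, posTag) pairs.
def pruneByIdf(docs: Seq[Seq[(String, String)]], dropLowestFraction: Double): Set[(String, String)] = {
  val numDocs = docs.size.toDouble

  // Document frequency per POS-tagged term
  val df: Map[(String, String), Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

  // idf = log(N / df); a low IDF means a very common term
  val idf = df.map { case (term, freq) => term -> math.log(numDocs / freq) }

  // Drop the lowest-IDF slice of the dictionary (the most common words), keep the rest
  val byIdfAscending = idf.toSeq.sortBy { case (_, value) => value }
  byIdfAscending.drop((byIdfAscending.size * dropLowestFraction).toInt).map { case (term, _) => term }.toSet
}
```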

Model Selection / Classifiers On Very Sparse Matrices

I played around with a number of machine learning models, including SVMs, which I knew wouldn’t scale. Still, I converted a Java SVM implementation to Scala and made a number of enhancements (academically coded libraries often aren’t especially performant!). I even took a stab at parallelizing the SVM model, but after some time spent tackling the problem I researched it a little further, found several papers written on the topic, and made the judgement call that I’d make better progress trying alternative models.

Other approaches I tried:

  • I threw LibLINEAR (a library for large-scale linear classification with many features) at the problem.
  • Pre-classifying books into broad categories, using the BISAC classifier I had built earlier (see Part 1), and then using specialized classifiers per category of book.
  • Many, many statistical machine learning models
  • Adding first/third person narration and author gender, and experimenting with sentiment (basically looking for more ways to include higher-level features beyond raw terms).

Most academic libraries are not production grade or built to scale, and significant performance gains can often be made by optimizing the code to run in parallel. My approach was to introduce Actors (and Futures) by extending the Java implementations in Scala. I did this for many libraries, including Stanford’s NLP library.
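The parallelization pattern itself was simple: fan per-document calls into the Java library out across cores with Futures. The sketch below shows the general shape, with a trivial stand-in tagger in place of the real Java call:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for the wrapped Java NLP call (e.g. POS tagging a chunk of a manuscript).
// The real work happens inside the Java library; here we only fan the calls out.
def tagDocument(text: String): Seq[String] =
  text.split("\\s+").toSeq.map(word => s"$word/UNK")

def tagAll(manuscriptChunks: Seq[String]): Seq[Seq[String]] = {
  // One Future per chunk; the ExecutionContext schedules them in parallel across cores
  val futures = manuscriptChunks.map(chunk => Future(tagDocument(chunk)))
  Await.result(Future.sequence(futures), 30.minutes)
}
```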

Bayes classifiers performed the best – in particular Multinomial Naive Bayes and Random SubSpace Discriminative Multinomial Naive Bayes: an ensemble approach that trains several learners on different random subspaces of the original feature space and votes on the combined results.
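For reference, here is a minimal sketch of that kind of ensemble using Weka’s RandomSubSpace meta-classifier – not my exact configuration. NaiveBayesMultinomial is used as the base learner here (a DMNB implementation could be dropped in instead), and train.arff is an illustrative dataset name:

```scala
import weka.classifiers.Evaluation
import weka.classifiers.bayes.NaiveBayesMultinomial
import weka.classifiers.meta.RandomSubSpace
import weka.core.converters.ConverterUtils.DataSource

object TrainEnsemble {
  def main(args: Array[String]): Unit = {
    // Illustrative dataset: sparse TF-IDF term vectors + a traditional/self-published class
    val data = new DataSource("train.arff").getDataSet
    data.setClassIndex(data.numAttributes - 1)

    // Random SubSpace ensemble: each base learner sees a random 50% slice of the
    // feature space, and the ensemble votes on the combined result
    val ensemble = new RandomSubSpace()
    ensemble.setClassifier(new NaiveBayesMultinomial()) // swap in a DMNB learner here if available
    ensemble.setSubSpaceSize(0.5)
    ensemble.setNumIterations(20)

    // 10-fold cross-validation to sanity-check the model
    val eval = new Evaluation(data)
    eval.crossValidateModel(ensemble, data, 10, new java.util.Random(1))
    println(eval.toSummaryString)
  }
}
```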

Unearthing Topics In A Book

The early Slush Filter product also reported on the topics a book was about. For this I used LDA, an unsupervised learning approach based on word co-occurrence probability. After multiple iterations, I found the most accurate topics came from extracting the nouns in a text (using a POS tagger) and using a topic model size of 1,000 (this could probably be reduced further, as there was some similarity among the topics the model produced). I then had a literary data analyst label each of the produced topics to display in the report.
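As a stand-in sketch of the noun-based LDA step (using Spark MLlib’s LDA, which may not be the library behind the original model), with purely illustrative data and a tiny topic count in place of the real 1,000:

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

object BookTopics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("book-topics").master("local[*]").getOrCreate()
    import spark.implicits._

    // Illustrative input: one row per book, holding only the nouns extracted by a POS tagger
    val books = Seq(
      (0L, Seq("wizard", "castle", "wand", "school")),
      (1L, Seq("detective", "murder", "alibi", "suspect"))
    ).toDF("id", "nouns")

    // Word-count vectors over the noun vocabulary
    val vectorizer = new CountVectorizer().setInputCol("nouns").setOutputCol("features")
    val vectorized = vectorizer.fit(books).transform(books)

    // k = 1,000 topics in the original experiments; a toy corpus obviously needs far fewer
    val lda = new LDA().setK(2).setMaxIter(50).setFeaturesCol("features")
    val model = lda.fit(vectorized)

    model.describeTopics(5).show(truncate = false)
    spark.stop()
  }
}
```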

Dimension Reduction (Part 2)

Still not completely satisfied with the pure Bag of Words classification approach, I explored other approaches to dimension reduction, including LSA (Latent Semantic Analysis – I optimized the academic library for a significant performance improvement) and PCA (Principal Component Analysis), and found that LSA performed better, but still not to the level I needed.
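At its core, the LSA step is just a truncated SVD of the TF-IDF matrix. A toy sketch with Breeze (dense matrices only, so purely illustrative for book-scale data, and not the optimized library mentioned above):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, diag, svd}

// Toy LSA sketch: truncated SVD of a (documents x terms) TF-IDF matrix.
// Keeps the top-k latent dimensions as the reduced document representation.
def lsa(tfidf: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
  val svd.SVD(u, s, _) = svd(tfidf)
  // Reduced representation: U_k * Sigma_k (one row per document, k latent features)
  u(::, 0 until k) * diag(DenseVector(s.toArray.take(k)))
}
```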

Finding Similar Books / Comparative Titles

One of the more interesting approaches I tried was applying Stochastic SVD (SSVD) to the full text of books. To do this, I explored Nathan Halko’s dissertation “Randomized methods for computing low-rank approximations of matrices” in depth: while an implementation was provided in Mahout/Hadoop (which I executed on AWS using Elastic MapReduce), it didn’t work out of the box for producing new observations, and required manipulation of the matrices (transposition, etc.) to make them compatible with the matrix multiplication the model requires.

After coding my own matrix converter, I was able to get the SSVD implementation working, and then coded an extension to produce the output in Weka ARFF format (for classification). While I was able to reduce a full-sized novel down to 150 or so features, the output didn’t work well for classification (my hypothesis is that this approach captures all the words in a book rather than the intrinsic experience those words create for a reader – but more on this later). It did, however, work very well for comparing books. Even a simple nearest neighbor search produced good comparisons (ones a reader would agree make sense).
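Once every book was reduced to a compact vector, the comparison step itself was straightforward – something along the lines of this cosine-similarity nearest-neighbour sketch (the library map of published titles here is hypothetical):

```scala
// Simple nearest-neighbour comparison over the ~150-dimensional SSVD vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norm == 0) 0.0 else dot / norm
}

// `library` is a hypothetical map of published titles to their reduced vectors
def comparableTitles(manuscript: Array[Double],
                     library: Map[String, Array[Double]],
                     n: Int = 5): Seq[(String, Double)] =
  library.toSeq
    .map { case (title, vec) => title -> cosine(manuscript, vec) }
    .sortBy { case (_, score) => -score }
    .take(n)
```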

Later, I ended up deploying this as a service for authors to upload their manuscript and see, in real time, comparisons to books that had sold well (using PCANoFoldIn to fold in new observations for analysis).

The Product

There was significant interest in the service amongst literary agents – I trialed the product with some of the biggest agents in publishing in NY. They would send through manuscripts and receive the analysis as an email response within seconds. Aside from the machine-learning-driven review, seeing a true apples-to-apples comparison of an inbound manuscript against published books helped with positioning a book in terms of market, audience, etc. – a key early task for agents looking to find and pitch a submission.

In terms of the backend tech, it was a Scala service (wrapping the various models), a JavaMail interface, MongoDB, and a service called SendWithUs that provided rich templated email generation.
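As a rough sketch of the email loop (not the production code – hosts, credentials and the analyse stub are placeholders, and the real service rendered rich reports via SendWithUs rather than plain-text replies):

```scala
import javax.mail._
import javax.mail.internet.MimeMessage
import javax.mail.search.FlagTerm
import java.util.Properties

object SlushFilterMailLoop {
  // Placeholder for the actual model pipeline: manuscript text in, report text out
  def analyse(manuscript: String): String = s"Report for: ${manuscript.take(40)}..."

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("mail.smtp.host", "smtp.example.com") // illustrative host
    val session = Session.getInstance(props)

    val store = session.getStore("imaps")
    store.connect("imap.example.com", "slush@example.com", "password") // illustrative credentials
    val inbox = store.getFolder("INBOX")
    inbox.open(Folder.READ_WRITE)

    // Process unread messages: pull the first attachment, analyse it, reply with the report
    val unread = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false))
    unread.foreach { msg =>
      msg.getContent match {
        case mp: Multipart =>
          val parts = (0 until mp.getCount).map(i => mp.getBodyPart(i))
          parts.find(p => Part.ATTACHMENT.equalsIgnoreCase(p.getDisposition)).foreach { attachment =>
            // Assumes a plain-text attachment for simplicity (real manuscripts needed conversion)
            val text = scala.io.Source.fromInputStream(attachment.getInputStream).mkString
            val reply = new MimeMessage(session)
            reply.setRecipient(Message.RecipientType.TO, msg.getFrom.head)
            reply.setSubject("Re: " + msg.getSubject)
            reply.setText(analyse(text))
            Transport.send(reply)
          }
        case _ => // ignore messages without attachments
      }
      msg.setFlag(Flags.Flag.SEEN, true)
    }
    inbox.close(false)
    store.close()
  }
}
```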

A very early text-based iteration, with a basic signal for an agent to Review or Pass a manuscript, alongside category classification and similar titles.

The Slush Filter report pulled together significant phrases from LingPipe weighted by TF-IDF, author gender, BISAC classification, SSVD nearest-neighbour comparative titles, and stylistic features (such as “-ing” words, adverbs and counts) benchmarked against popular books, plus LDA topics – all delivered via SendWithUs.

Here is an evolved version including statistics about writing style and readability (which can point to how well or poorly a manuscript is written), along with NER-extracted entities and topics from our LDA topic model.

It’s Art Not Science

In the end, while the analysis was credible and the insights were useful, the hurdle of getting an agent or publisher to trust the data enough to make a decision on it was too high. It was challenging to explain why a manuscript should be passed over, even when supplemental data on topics, categories and writing was provided. Traditional agents and publishers make their living by picking winners, based on years of pattern matching across thousands of manuscripts and published books.

Additionally, some poorly written manuscripts scored well. This isn’t to say those manuscripts would not have performed well (history is replete with examples of rejected manuscripts that became bestsellers), but taking a closer look was too much of a leap of faith for many agents, who were looking for confirmation of manuscripts they might have selected themselves. This all leads to the broader point that a good raw manuscript is not, by itself, enough to create a high-selling book. The professional editing and shaping of the manuscript is material, but even more so is the investment in promotion and marketing.

While I could have built this out further – harvesting indie-published books for hidden bestsellers and investing in the marketing myself – I didn’t have clear enough signals on marketing-budget impact to place that large a bet. Marketing spend is not publicly available data, and neither are book sales.

I still believe, though, that with the right data (much of which is held within publishing houses or within retailer sites like Amazon), strong signals on marketability can be generated. Overall, however, trying to predict a bestseller from a manuscript alone is unlikely to succeed without considering the marketing budget. I go into this in more detail in my DBW article (now moved to this blog), “Machine Learning and Bestseller Prediction: More Than Words Can Say”.