My Book and Machine Learning Odyssey: Part 1

I initially founded Kadaxis as a vehicle to experiment with and validate hypotheses I had about applying machine learning models to book data. The goal was to see which models worked, then wrap them in a product and bring it to market (a little in reverse, but I was intentional about this approach).


I had just come off a stint of running the engineering org at Bookish, a joint venture between three of the biggest names in publishing (Simon & Schuster, Penguin and Hachette). Our goal was to create a digital retail outlet for publishers that would bypass Amazon. A key aspect of the user experience was a book search and recommendation engine.

To accomplish this, we built a data pipeline stack primarily in Scala and used Spark 0.11 (during the early days of RDD-only design) and MongoDB. We sourced tens of millions of book data records from various outlets, applied various graph algorithms to deduplicate and consolidate book records, and transformed and calculated data for CQRS views, search (using Lucene), and rec (recommendations). Since this was in 2013 and earlier, easy-to-use cloud-native data pipeline solutions like the ones available today did not exist, so we had to build and configure our own infrastructure.

Bookish was my first foray into applying data science to the publishing industry: our team built and patented the recommendation engine. I personally hand-built the search engine (with help from my friend Alessio Signorini to design and pick out algorithms). It consisted of a custom-built (in Scala) Bayes classifier that classified search query intent as title, author, category or general, then sent requests off to the relevant search index.
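As a rough illustration of the query-intent idea (not the original Scala code), here is a minimal multinomial naive Bayes sketch in Python. The training queries, the smoothing value and the index names in `INDEX_FOR_INTENT` are all made up for the example; the real classifier's features and training data are not described here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesIntent:
    """Multinomial naive Bayes over query tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # smoothing constant (assumed value)
        self.class_counts = Counter()              # training queries per intent
        self.token_counts = defaultdict(Counter)   # intent -> token -> count
        self.vocab = set()

    def fit(self, examples):
        for query, intent in examples:
            tokens = query.lower().split()
            self.class_counts[intent] += 1
            self.token_counts[intent].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, query):
        tokens = query.lower().split()
        total = sum(self.class_counts.values())
        best_intent, best_score = None, float("-inf")
        for intent, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.token_counts[intent].values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[intent][tok] + self.alpha) / denom)
            if score > best_score:
                best_intent, best_score = intent, score
        return best_intent


# Hypothetical training queries, invented for this sketch.
TRAINING = [
    ("the great gatsby", "title"),
    ("pride and prejudice", "title"),
    ("to kill a mockingbird", "title"),
    ("stephen king", "author"),
    ("jane austen", "author"),
    ("books by agatha christie", "author"),
    ("science fiction", "category"),
    ("historical romance", "category"),
    ("young adult fantasy", "category"),
    ("good books for a long flight", "general"),
    ("best gifts for readers", "general"),
    ("what should i read next", "general"),
]

# Hypothetical index names; the classifier's output picks which index to query.
INDEX_FOR_INTENT = {
    "title": "title_index",
    "author": "author_index",
    "category": "category_index",
    "general": "general_index",
}

model = NaiveBayesIntent().fit(TRAINING)


def route(query):
    """Return the name of the search index a query would be sent to."""
    return INDEX_FOR_INTENT[model.predict(query)]


print(route("jane austen"))
print(route("science fiction"))
```

With a dozen training queries this is only a toy, but it shows the shape of the design: classify first, then dispatch to an index specialised for that intent.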

I used a Lucene index for each of the groups above (title, author, etc.). Weights for each feature in the index were created by training against a set of search results scraped from Amazon (our truth set). I built a custom random walk framework that used Discounted Cumulative Gain to score ranked results. We did significantly more with search, but the above was at the heart of our search index weight training.
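To make the weight-training idea concrete, here is a small Python sketch under stated assumptions. It scores a ranking with Discounted Cumulative Gain and takes random steps in weight space, keeping a step only when the score does not get worse. That accept rule is one plausible reading of "random walk" and may differ from the original framework; the documents, feature values and relevance grades are invented.

```python
import math
import random


def dcg(relevances):
    """Discounted Cumulative Gain: sum of rel_i / log2(i + 1) over 1-based rank i."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def rank_and_score(weights, docs):
    """Rank docs by the weighted sum of their features, then score that order with DCG."""
    ranked = sorted(
        docs,
        key=lambda d: sum(w * f for w, f in zip(weights, d["features"])),
        reverse=True,
    )
    return dcg([d["relevance"] for d in ranked])


def random_walk(docs, n_features, steps=500, step_size=0.2, seed=0):
    """Perturb the weights at random; keep a step only if DCG does not drop."""
    rng = random.Random(seed)
    best = [1.0] * n_features
    best_score = rank_and_score(best, docs)
    for _ in range(steps):
        candidate = [max(0.0, w + rng.uniform(-step_size, step_size)) for w in best]
        score = rank_and_score(candidate, docs)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical results for one query: per-field match scores plus a relevance
# grade taken from a truth set (higher means more relevant).
DOCS = [
    {"features": [0.9, 0.1, 0.3], "relevance": 3},
    {"features": [0.2, 0.8, 0.9], "relevance": 0},
    {"features": [0.6, 0.4, 0.1], "relevance": 2},
    {"features": [0.1, 0.2, 0.8], "relevance": 1},
]

baseline = rank_and_score([1.0, 1.0, 1.0], DOCS)
ideal = dcg(sorted((d["relevance"] for d in DOCS), reverse=True))
weights, tuned = random_walk(DOCS, n_features=3)
print(f"baseline DCG={baseline:.3f} tuned DCG={tuned:.3f} ideal DCG={ideal:.3f}")
```

A real setup would average the score over many queries rather than a single result list, so that the learned weights generalise beyond one search.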

After Bookish, I was hooked on the potential of uncovering more insights from and about books. The biggest question was whether an algorithm could predict a bestseller. But I had a lot to learn in the data science space before tackling such a lofty goal.

AutoML before AutoML / Scaled Model Optimization

Rather than study a series of machine learning models and painstakingly run individual experiments, iterate over parameter configurations and develop an intuition for a particular model and my data domain, I chose the “throw the kitchen sink at the problem” route. My skillset was in building scalable software in the cloud that could run many computations in parallel, on a single machine or distributed across multiple machines.

Remember, this was 2013, so the term AutoML didn’t exist yet. But effectively, I built my own AutoML framework in Scala (using an actor model design from Akka, with various routers to distribute work). Back then I called it a “brute force” framework: it would take datasets (post feature engineering), randomly distribute them into train/test sets and train multiple algorithms in parallel using wide ranges of parameters for each model. I used the Weka framework as it was Java-based (Scala runs on the JVM) and had a large number of implemented algorithms. The system ran on AWS on large EC2 instances. I would kick off a training run and go off and do something else; the framework would email me results after each training iteration completed.

This approach made it easy to zero in on a model and configuration set that worked, while allowing me to develop an intuition for the model/data domain. After casting a wide net, it was easy to then run a huge number of variations on a single model / model group. It also meant that when optimizing features, it was easy to run a broader model/parameter set to see if the optimal model parameters had shifted with data/feature set changes.
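The brute-force loop can be sketched in a few lines of Python. This is only an illustration under assumptions: the original used Scala, Akka and Weka, while this version uses a thread pool, two toy classifiers written from scratch, and a synthetic dataset. Every name and parameter range below is hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def make_dataset(n=200, seed=0):
    """Hypothetical toy data: two noisy 2-D clusters with alternating labels 0 and 1."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def split(rows, test_fraction=0.3, seed=0):
    """Randomly distribute rows into train and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def minkowski(a, b, p):
    """Minkowski distance of order p without the final root (ordering is unchanged)."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b))


def knn_predict(train, x, k, p):
    """Majority vote among the k nearest training points (k is odd, so no ties)."""
    nearest = sorted(train, key=lambda row: minkowski(row[0], x, p))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def centroid_predict(train, x, p):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        points = [features for features, y in train if y == label]
        centroid = [sum(col) / len(points) for col in zip(*points)]
        dist = minkowski(centroid, x, p)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


MODELS = {"knn": knn_predict, "centroid": centroid_predict}


def evaluate(config, train, test):
    """Run one (model, parameters) configuration and report its test accuracy."""
    name, params = config
    predict = MODELS[name]
    correct = sum(predict(train, x, **params) == y for x, y in test)
    return {"model": name, "params": params, "accuracy": correct / len(test)}


def brute_force(train, test, grid, workers=4):
    """Evaluate every configuration concurrently and return a leaderboard."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda cfg: evaluate(cfg, train, test), grid))
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)


train, test = split(make_dataset())
grid = [("knn", {"k": k, "p": p}) for k, p in product([1, 3, 5, 9, 15], [1, 2])]
grid += [("centroid", {"p": p}) for p in (1, 2)]
leaderboard = brute_force(train, test, grid)
for row in leaderboard[:3]:
    print(row)
```

Threads keep the sketch simple, but in CPython they do not run CPU-bound work in parallel; a process pool or a cluster of workers would be the closer analogue to distributing training across machines. The leaderboard takes the place of the emailed results.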

Feature Engineering for Books

Books contain words, so the intuition was that a number of signals should be encoded in all that language, waiting to be unearthed. I had a dataset of hundreds of thousands of books, and here again, running multi-threaded Scala code in the cloud allowed for quick processing of all these books. I didn’t use a formal data pipeline or persist data on S3 (other than for some Hadoop jobs later in the journey), but persisted processed features in MongoDB. The solution worked well for my use cases. As a functional programming advocate, I made heavy use of the Play Iteratees framework and ReactiveMongo to stream data into and out of Mongo.

Iterating over words in a book as features was an interesting NLP problem to solve. (Again, remember this was before more modern NLP deep learning methods were commonly available.) Most of the Weka algorithms available were bag-of-words approaches relying on term counts. So, I started with the typical first step of removing stop words, but then moved on to an approach that filtered terms by Inverse Document Frequency score (the IDF in TF-IDF).
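A minimal Python sketch of the two filtering steps follows. The stop word list, the three one-line "books" and the `min_idf` threshold are invented for illustration. The sketch drops low-IDF (very common) terms; which direction and cutoff the original pipeline used is not stated here, so treat that as an assumption.

```python
import math
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def idf_scores(docs):
    """idf(t) = log(N / df(t)), where df(t) is the number of documents containing t."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def bag_of_words(tokens, idf, min_idf):
    """Term counts after dropping stop words and terms whose IDF is below min_idf."""
    kept = [t for t in tokens if t not in STOP_WORDS and idf.get(t, 0.0) >= min_idf]
    return Counter(kept)


# Hypothetical one-line "books".
BOOKS = [
    "the whale hunted the captain across the sea",
    "the captain sailed the ship to the island",
    "a detective solved the murder in the city",
]

docs = [tokenize(book) for book in BOOKS]
idf = idf_scores(docs)
features = [bag_of_words(tokens, idf, min_idf=0.5) for tokens in docs]
print(features[0])
```

Here "captain" appears in two of three documents, so its IDF falls below the threshold and it is dropped even though it is not a stop word. That is the appeal of an IDF filter: the corpus itself decides which words are too common to be informative.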

I continued to experiment, learning more and more about the basics of feature engineering such as scaling and Euclidean normalization. I added further features such as Flesch-Kincaid readability, counts of long words, generic words and overused words, syllable counts, syllable ratios, etc.
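A few of these features are simple enough to sketch directly in Python. The Flesch-Kincaid grade formula below is the standard one; the vowel-group syllable counter, the sentence splitter and the 7-letter cutoff for "long" words are rough assumptions made for the example, not the original implementation.

```python
import math
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count groups of consecutive vowels, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def long_word_ratio(text, min_len=7):
    """Share of words with at least min_len letters (the cutoff is an assumption)."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) >= min_len for w in words) / len(words)


def min_max_scale(values):
    """Scale a feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def l2_normalize(vector):
    """Euclidean normalization: divide a feature vector by its L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


sample = "The cat sat. The dog ran."
print(flesch_kincaid_grade(sample))
print(long_word_ratio("A remarkably complicated sentence about whales."))
print(min_max_scale([2, 4, 6]))
print(l2_normalize([3.0, 4.0]))
```

Scaling works per feature across books, while Euclidean normalization works per book across features. The latter keeps a long novel from dominating distance-based models simply because it has more words.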

Beyond these primitive features, I ran books through the Stanford NLP libraries to produce additional features such as sentiment, extracted entity counts and terms with their Part of Speech (PoS) tags added. Adding PoS tags exploded the number of features per book.
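The feature explosion is easy to see on a toy example. The Python sketch below assumes the tokens have already been tagged; the two hand-tagged "books" use Penn Treebank-style tags and are invented, and no actual tagger is called.

```python
from collections import Counter

# Hypothetical, hand-tagged tokens; a real pipeline would get the tags from a PoS tagger.
TAGGED_BOOKS = [
    [("I", "PRP"), ("book", "VBP"), ("a", "DT"), ("flight", "NN")],
    [("I", "PRP"), ("read", "VBP"), ("a", "DT"), ("book", "NN")],
]


def term_features(tagged):
    """Plain bag of words: one feature per distinct term."""
    return Counter(word.lower() for word, _ in tagged)


def term_pos_features(tagged):
    """One feature per distinct (term, tag) pair, e.g. 'book_NN' vs 'book_VBP'."""
    return Counter(f"{word.lower()}_{tag}" for word, tag in tagged)


plain_vocab = set().union(*(term_features(book) for book in TAGGED_BOOKS))
tagged_vocab = set().union(*(term_pos_features(book) for book in TAGGED_BOOKS))

print(len(plain_vocab), sorted(plain_vocab))
print(len(tagged_vocab), sorted(tagged_vocab))
```

Every word that can play more than one grammatical role becomes several features. Over hundreds of thousands of books that multiplies the vocabulary, which is where a filter such as an IDF threshold earns its keep.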

Classifying Project Gutenberg

While I had started some initial experiments on predicting the marketability of books, I decided to first validate that my approach made sense by predicting the category a book belongs to. The publishing industry uses BISAC categories to categorize books, so everyone from Barnes & Noble to Amazon uses these codes; they were originally created so that booksellers knew which sections to place books in. I had a dataset of books along with the metadata that contained the categories for each book. I started with fiction, which back then had a few hundred unique categories and is arguably more challenging to classify than non-fiction. Using a mix of the techniques described above, the results were quite good: in many cases I had uncovered better categories than those originally assigned to a book by an expert.

I then took the feature engineering pipeline and best model, and classified tens of thousands of public domain books digitized and available on Project Gutenberg. The results were published and made searchable on the Kadaxis website.

All in all, I had strong enough signals to conclude that the approach to NLP and book classification was valid.