My Book and Machine Learning Odyssey: Part 3

Keywords / Deconstructing Amazon Search / Through The Dark Web

After indulging in an attempt to find the holy grail of publishing, I pivoted to applying data science to a less lofty, but real, need in publishing: keywords. The simple pitch was:

  • Most books are sold on Amazon
  • Most customers find books using Amazon search
  • Keywords supplied in book metadata are directly indexed by Amazon
  • Most books (in 2014) didn’t have (effective) keywords

There was a clear industry need to find keywords for books, at scale, to boost sales of books on Amazon.

Occam’s Razor Fallacy: Book Keywords From Book Text

The simplest and most obvious source for book keywords is the huge volume of data intrinsic to each book – the text of the book itself. Almost every company that attempts to solve this problem starts here (at least 3–4 did around that time, including one that applied the out-of-the-box Stanford NER library to the text of books and managed to pick up the 2015 BISG Innovation Award through some beguiling promotions). I started here too, but after experimenting with various models to extract terms and phrases, it quickly became evident that such a naive approach to keyword extraction was the wrong one in many cases. While it worked well for some non-fiction and specialist technical books, it failed spectacularly for most fiction.

A dystopian novel like The Hunger Games won’t have characters talking, explicitly, about the “dystopian” world they live in.
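To make the failure mode concrete, here’s a minimal sketch of the naive approach – plain TF-IDF term extraction over the text. The snippets and library choice are mine, for illustration, not the exact models I experimented with:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus: tiny stand-ins for full book texts.
    books = {
        "dystopian_novel": "The tributes entered the arena while the Capitol watched the games",
        "cookbook": "Toast the saffron threads and fold them into the steamed basmati rice",
    }

    # Score unigrams and bigrams by TF-IDF across the corpus.
    vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vec.fit_transform(books.values())
    terms = vec.get_feature_names_out()

    for title, row in zip(books, tfidf.toarray()):
        top = sorted(zip(terms, row), key=lambda t: -t[1])[:5]
        print(title, [term for term, score in top if score > 0])
    # The cookbook yields usable terms ("saffron", "basmati rice");
    # the novel yields plot nouns ("tributes", "arena") – never "dystopian".

The extracted terms describe what is in the text, not how a reader would search for the book.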

Reversing The Problem Space

Instead of generating keywords from the book text, how about starting with strong keywords and matching the most relevant subset to a book?

The strategy here was to classify and score search queries along a number of dimensions, then to classify books along similar dimensions and match the two together.

Mining Keywords

The first thing I had to do was mine keywords – the best source of Amazon keywords is Amazon search itself. Each country and product type has its own search index (along with the main search super index). The two indexes relevant to this problem space were the “Books” and “Kindle” indexes for the US. The easiest way to mine keywords is to hit the typeahead API: seed it with a starter phrase and you’re presented with several keyword suggestions. Starter phrases can be found anywhere, but I started with industry book categories, Amazon browse nodes, countries, cities, top Wikipedia topics, etc.
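Here’s a minimal sketch of that mining loop. The completion endpoint and parameters reflect what was publicly observable circa 2014 and may have changed since; the seed-expansion trick is my illustration:

    import requests

    # Circa-2014 public completion endpoint; parameters may have changed since.
    COMPLETION_URL = "https://completion.amazon.com/search/complete"

    def typeahead(seed, alias="stripbooks"):
        """Fetch autocomplete suggestions for a seed phrase from one search index."""
        params = {
            "method": "completion",
            "search-alias": alias,     # "stripbooks" = Books, "digital-text" = Kindle
            "client": "amazon-search-ui",
            "mkt": "1",                # US marketplace
            "q": seed,
        }
        resp = requests.get(COMPLETION_URL, params=params, timeout=10)
        resp.raise_for_status()
        return resp.json()[1]          # response shape: [seed, [suggestions...], ...]

    # Expand each seed with trailing letters to surface more of the index.
    seeds = ["contemporary romance", "persian cookbook"]
    queries = set()
    for seed in seeds:
        for suffix in [""] + [" " + c for c in "abcdefghijklmnopqrstuvwxyz"]:
            queries.update(typeahead(seed + suffix))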

Through the above process, I was able to amass 750k book search queries that were directly indexed on Amazon.

By the way, here’s a position I put together on the legality of scraping, which I used to close a deal with a big five publisher in the UK – a jurisdiction with a stricter position on scraping than the US. (The publisher’s legal team accepted the merits of my legal argument.)

Through The Dark Web

After obtaining a good set of search queries, I needed the search result data for each query, which meant executing each query and capturing which books were returned. Unlike typeahead, scraping search results required a bit more finessing to execute at scale. The approach I used was to create anonymized scrapers using Tor (The Onion Router). Because Tor operates at the network layer – each client exposes one proxy with its own circuit and exit identity – parallelism meant running a number of individual JVMs side by side, each paired with its own Tor client. I built this framework in Scala and controlled communication and management of each JVM node through remote Akka actors. I ran all this, ironically, on AWS. I could have run the framework on multiple distributed instances, but creating a single instance with tons of memory and running everything locally (communicating via the local NIC) scaled well enough for my timeline. (This was also before containers were en vogue.)
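The original framework was Scala/Akka on the JVM, but the core Tor mechanic is easy to show in a few lines of Python (the ports are illustrative; requires the PySocks extra, i.e. pip install "requests[socks]"):

    import requests

    # Each Tor client exposes its own SOCKS port (e.g. `tor --SocksPort 9052`),
    # giving each scraper worker its own circuit and exit IP.
    def session_via_tor(socks_port):
        session = requests.Session()
        proxy = f"socks5h://127.0.0.1:{socks_port}"  # socks5h: DNS also resolved via Tor
        session.proxies = {"http": proxy, "https": proxy}
        return session

    workers = [session_via_tor(port) for port in (9050, 9052, 9054)]
    # Sanity check: confirm the first worker is exiting through Tor.
    print(workers[0].get("https://check.torproject.org/api/ip", timeout=30).json())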

Next, I needed to understand how Amazon scored and indexed books in search.

Deconstructing Amazon Search

To understand how to get a book to rank well in search using keywords, you need to understand how the search engine parses and deconstructs search queries, then ranks a corpus of books against them. If you look hard enough, you can learn a lot about how Amazon search works. I studied Amazon A9 patents, talks from A9 data scientists (at MLConf, etc.), GitHub repos, search infrastructure job ads, etc., and was able to build a working knowledge of Amazon search design covering topics such as:

  • Query understanding system, inbound query manipulation + query tagging, parsing and cleaning
  • Product to query matching
  • Product ranking
    • Use of gradient boosted trees / ensemble of bagged trees, pairwise ranking, feature selection
  • Search analytics
  • Customer Intent
    • Session based behavioral matching, positive signal (clicks) calculation with signal decay

As you can see, most of Amazon’s search was built using machine learning models (not much deep learning back then). Model selection was based on a combination of inference speed and relevance accuracy.
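As an illustration of that ranking approach – gradient boosted trees with a pairwise objective – here’s a minimal LambdaRank sketch; the features, labels, and library choice are hypothetical stand-ins, not Amazon’s actual setup:

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)

    # Toy (query, book) feature rows: e.g. text-match score, historical CTR,
    # sales-rank percentile – all hypothetical stand-ins for A9's features.
    X = rng.random((1000, 3))
    y = rng.integers(0, 4, size=1000)   # graded relevance labels 0-3
    groups = [100] * 10                 # 10 queries, 100 candidate books each

    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
    ranker.fit(X, y, group=groups)

    # Score a fresh candidate set for one query and sort best-first.
    candidates = rng.random((5, 3))
    order = np.argsort(-ranker.predict(candidates))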

From this knowledge I could reverse engineer how attributes beyond pure topic relevance – customer intent signals, product ranking factors, etc. – impacted search ranking, and then use this insight to build hypotheses on how to maximize a book’s ability to rank well in search for specific keywords.

I wrote an article combining much of this knowledge for the industry layperson.

Connecting The Dots

For each search query, I pulled the top 100 book search results, which meant that running through the list of 750k search queries netted data for millions of unique books. For each book I also captured its assigned browse nodes (or categories, such as Contemporary Romance, Crime Thriller or Persian Cookbook).

For each search query, I applied a simple TF-IDF approach to the browse nodes for each book returned in the search results, weighted by rank – this provided a mapping from search query to ranked browse nodes.
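A minimal sketch of that mapping, assuming a hypothetical data shape and using a 1/rank weighting as a stand-in for the rank weighting:

    import math
    from collections import Counter, defaultdict

    def query_to_nodes(results):
        """results: {query: [browse-node lists, ordered by search rank]}.
        Returns {query: [(browse node, score), ...]} sorted best-first."""
        df = Counter()                       # node document frequency across queries
        for books in results.values():
            df.update({node for nodes in books for node in nodes})

        n_queries = len(results)
        scored = {}
        for query, books in results.items():
            tf = defaultdict(float)
            for rank, nodes in enumerate(books, start=1):
                for node in nodes:
                    tf[node] += 1.0 / rank   # rank-weighted term frequency
            scored[query] = sorted(
                ((node, w * math.log(n_queries / df[node])) for node, w in tf.items()),
                key=lambda pair: -pair[1],
            )
        return scored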

So, to connect the dots between a book and a universe of possible search queries, I scored the book’s browse nodes against each search query’s browse nodes to produce a universe of candidate search queries (typically in the hundreds to thousands). This worked reasonably well where the publisher had chosen good categories (or browse nodes); where they hadn’t, or for a book that had yet to be published, I could map from classified categories or topics to corresponding browse nodes to derive a keyword universe.
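Continuing the sketch, candidate generation for a single book might look like this (the scoring function is my simplification):

    def candidate_queries(book_nodes, query_node_scores, top_k=500):
        """Score every mined query against a book's browse nodes; keep the best.
        query_node_scores is the output of query_to_nodes above."""
        book_set = set(book_nodes)
        scores = {}
        for query, node_scores in query_node_scores.items():
            # Sum the query's node weights for nodes the book actually has.
            overlap = sum(weight for node, weight in node_scores if node in book_set)
            if overlap > 0:
                scores[query] = overlap
        return sorted(scores, key=scores.get, reverse=True)[:top_k]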

Applying the knowledge I had of Amazon search rank, I was able to rank the keyword universe to generate optimal keywords for a book. I could also apply simple distance calculations to search for similar queries by query text or browse node, etc.
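For the similar-query search, a plain cosine similarity over browse-node weight vectors is the kind of simple distance calculation meant here (again, a sketch):

    import math

    def cosine(a, b):
        """a, b: {browse node: weight} vectors for two search queries."""
        dot = sum(weight * b.get(node, 0.0) for node, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0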

All the above is of course a rather simplified snapshot of the solution, but hopefully it provides a sense of the design.

Big 4 Publisher Pilot

Through my connections from Bookish, I piloted my technology with one of the industry’s big 4 publishers. I exposed part of the capability as an API (backed by Elasticsearch) to allow searching for keywords by term, related keywords, category, topics, etc., and had the system integrated into an internal marketing system. I also generated keywords from the full text of bestsellers provided by the publisher, and did the same with smaller publishers, but found that the keywords, though they made sense, weren’t quite refined enough to match the pristine keywords a major publisher curates.
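For a sense of what such an API lookup might have looked like to a client, here’s a hypothetical Elasticsearch query; the index name, fields, and scores are invented for illustration, as the actual schema isn’t described here:

    import requests

    # Hypothetical index/mapping – the real schema isn't documented in this post.
    query = {
        "query": {
            "bool": {
                "must": [{"match": {"keyword": "dystopian"}}],
                "filter": [{"term": {"category": "Science Fiction"}}],
            }
        },
        "sort": [{"rank_score": "desc"}],
        "size": 20,
    }
    resp = requests.get("http://localhost:9200/keywords/_search", json=query)
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"]["keyword"], hit["_source"]["rank_score"])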

At this point in the journey, I took a break to really think about the problem that I was trying to solve. In its essence, I was trying to connect people to books. Why do people read books? They don’t read them for the words or the content: people read books because of how they make them feel. This realization prompted a huge shift in how I would solve the book discovery problem.