To close a deal with a large UK publisher, I had to put forward an analysis of the legality of scraping Amazon (UK). Below is the content of my case, which was ultimately accepted by the client’s legal team.
[Company] and Web Scraping
[Company] provides keywords to publishers to improve the visibility of books on retailer search engines. Several processes are applied to data about books to create keywords. One source of data is reader reviews, which are read from various sites, including Amazon.co.uk. Natural language processing and other algorithms are applied to the reviews, as part of a data transformation pipeline, to derive keywords.
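For illustration only – this is not [Company]’ actual pipeline, which is proprietary – the minimal sketch below shows the general shape of such a transformation: review text in, candidate one- and two-word phrases out. The function name, stop-word list and sample reviews are invented for the example.

```python
# Purely illustrative sketch of deriving candidate keyword phrases from review
# text. It is NOT [Company]'s actual algorithm; it only shows the kind of
# transformation described above.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "it", "this", "that", "with"}

def candidate_keywords(reviews, top_n=10):
    """Return the most frequent one- and two-word phrases across a set of reviews."""
    counts = Counter()
    for review in reviews:
        words = [w for w in re.findall(r"[a-z']+", review.lower()) if w not in STOP_WORDS]
        counts.update(words)                                              # single-word candidates
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))   # two-word phrases
    return [phrase for phrase, _ in counts.most_common(top_n)]

reviews = [
    "A gripping psychological thriller featuring an unreliable narrator.",
    "The twist ending makes this thriller impossible to put down.",
]
print(candidate_keywords(reviews))
```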
‘Keywords’ in this context refer to terms added to an ONIX file by a publisher, which is then sent on to retailers (such as Amazon). Keywords are typically single words or two-word phrases, and are never exposed to customers. Keywords are designed to be indexed by search engines, enabling books to be found by customers using search. Their use and visibility are dictated by the BISG (Book Industry Study Group).
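For readers unfamiliar with ONIX, the sketch below shows how keywords are typically carried in an ONIX 3.0 product record: a Subject composite whose scheme identifier 20 means “Keywords”. The keyword values themselves are invented for illustration.

```python
# Minimal sketch of a keyword Subject composite in ONIX 3.0.
# Scheme identifier 20 denotes "Keywords"; the keyword values are invented.
import xml.etree.ElementTree as ET

subject = ET.Element("Subject")
ET.SubElement(subject, "SubjectSchemeIdentifier").text = "20"  # 20 = Keywords
ET.SubElement(subject, "SubjectHeadingText").text = (
    "psychological thriller; unreliable narrator; twist ending"
)

print(ET.tostring(subject, encoding="unicode"))
# <Subject><SubjectSchemeIdentifier>20</SubjectSchemeIdentifier>
# <SubjectHeadingText>psychological thriller; ...</SubjectHeadingText></Subject>
```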
This paper concerns the use of Amazon customer review data by [Company], as part of a wider dataset used to generate book keywords. UK law is cited, along with US law where relevant principles are discussed.
Legal Claims
Three major legal claims exist to prevent undesired web scraping:
1. Copyright infringement
[Company] does not reproduce, in whole, any of the data from any website. [Company] crawls websites in a similar manner to search engines such as Google, and processes and enriches publicly available data to provide additional value to search engines, which positively impacts retailers, publishers and consumers.
The keyword data provided by [Company] is provided only to publishers, who then send it electronically to retailers. Reviews are not reproduced, but are heavily processed and form part of a larger dataset (including an existing keyword database), used to create keywords.
Search engines reproduce a portion of the web content they crawl, in the form of search results, to the public. The keyword data provided by [Company] is a significantly smaller fraction of review content, and is not viewable by the public.
2. Trespass to Chattel
This claim treats a website as personal property on which a user or computer is trespassing. For this claim to succeed, it must be proven that there was intentional and unauthorised interference with the plaintiff’s possessory interest in the system, and that the interference caused damage.
[Company] only accesses publicly available data, and does not attempt to circumvent technical controls designed to restrict access to data, for example, password protection or a firewall. [Company] does not apply any credentials to gain access to any protected section of any website.
[Company] accesses websites in a conservative manner, with substantial pauses between requests, so that its request rate falls below that of an average human user. All accesses are read-only; no attempt is made to modify data on any site. Other businesses in the publishing industry perform significantly more requests (see ‘Publishing Web Scraping Environment’ below).
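As an illustration of what such a conservative, read-only access pattern looks like in practice, the sketch below issues plain GET requests with a long fixed pause between them and performs no writes. The URLs, user-agent string and delay value are placeholders, not [Company]’ actual configuration.

```python
# Illustrative sketch of a conservative, read-only access pattern: plain GET
# requests separated by a long fixed pause. All values here are placeholders.
import time
import urllib.request

PAUSE_SECONDS = 30  # substantially slower than a human browsing session

def fetch_pages(urls):
    pages = []
    for url in urls:
        request = urllib.request.Request(url, headers={"User-Agent": "example-crawler/1.0"})
        with urllib.request.urlopen(request) as response:  # read-only GET; nothing is modified
            pages.append(response.read())
        time.sleep(PAUSE_SECONDS)  # substantial pause before the next request
    return pages
```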
[Company] does not attempt to access or make use of any person’s PII (Personally Identifiable Information).
3. Violation of the Computer Fraud and Abuse Act (“CFAA”)
[Company] does not perform any action that would be construed as a violation of the CFAA. No unauthorised access, intentional damage, extortion or defrauding is performed.
Terms of Service (TOS)
Section 2 of Amazon UK’s Terms of Service states:
“…you may not utilise any data mining, robots, or similar data gathering and extraction tools to extract (whether once or many times) for re-utilisation any substantial parts of the content of any Amazon Service, without our express written consent. You may also not create and/or publish your own database that features substantial parts of any Amazon Service (e.g. our prices and product listings) without our express written consent.”
A key legal term in the TOS is “substantial”, which defines the breadth of programmatic access to a website. Amazon.co.uk comprises tens of millions of product pages. [Company] accesses a small fraction of these pages, which, for the purposes of legal argument, is classified as ‘insubstantial’. (See: http://uk.practicallaw.com/2-107-4065, which discusses UK cases, their outcomes, and the differences between ‘substantial’ and ‘insubstantial’.)
Section 2 of the TOS only specifies accessing “substantial parts of the content of any Amazon Service”. The TOS also refers to using the data to create a database of prices and product listings. A price comparison website that benefits competitors to Amazon is an example of a competing use of substantial parts of Amazon’s content.
For completeness, where ‘insubstantial’ use is concerned, the law also examines the interests of the database owner and how the data is used: “insubstantial parts of the database contents will also not be permitted if it conflicts with the normal exploitation of the database or unreasonably prejudices the legitimate interests of the database maker.” [Company]’ use of the data does not prejudice Amazon’s legitimate interests, nor does it provide any competition for Amazon’s services or conflict with normal usage. In fact, [Company]’ service was created for the explicit purpose of improving Amazon’s search service, on behalf of book publishers. The net result of using [Company]’ services is an improvement to customers’ ability to find books they might buy, increasing revenue to Amazon as well as publishers.
Robots Exclusion Standard (RES)
“The Robots Exclusion Standard – also referred to as the ‘Robots Exclusion Protocol’, or simply, the ‘Robots.txt Standard / Protocol’ – is a standard that enables websites to control access to their content by web robots and crawlers.”
From: http://www.internet-guide.co.uk/robotsexclusionstandard.html See also: https://en.wikipedia.org/wiki/Robots_exclusion_standard
The RES was created decades ago, and is a well-known standard for programmatically and explicitly informing web crawlers which elements of a website may be visited by a robot, and which parts should be excluded from a crawl. All major search engines, such as Google, Bing and Yahoo, read and abide by the robots.txt file (RES) on websites from which they extract data.
Acceptable web crawling protocol dictates that a robot first download and interpret the robots.txt file, to understand the rules published by a website owner on what data the robot may extract. Amazon.co.uk has such a robots.txt file, which explicitly excludes certain parts of the site from programmatic access, as well as one specific type of robot. In accordance with the RES, all other parts of the site, not explicitly excluded, are implicitly included, and may therefore be accessed by a robot.
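For illustration, this check can be performed with Python’s standard robots.txt parser, as in the sketch below. The user-agent string and example URL path are placeholders, not [Company]’ actual values.

```python
# Sketch of the robots.txt check described above, using Python's standard
# urllib.robotparser. The user-agent string and example URL are illustrative.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://www.amazon.co.uk/robots.txt")
parser.read()  # download and parse the site's published crawling rules

url = "https://www.amazon.co.uk/example-product-page"
if parser.can_fetch("example-crawler/1.0", url):
    print("robots.txt permits crawling this URL")
else:
    print("robots.txt disallows this URL; do not fetch it")
```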
[Company]’ programmatic access to Amazon.co.uk adheres completely to Amazon’s robots.txt file. No part of Amazon.co.uk that is explicitly disallowed is accessed by [Company].
The inclusion of a robots.txt file (instructions to robot crawlers) by Amazon is a clear acknowledgement of the company’s acceptance of web crawling in the areas of the site that its instructions permit.
Crawling Claim Likelihood
As companies from all industries increasingly utilise data mining and machine learning techniques for business intelligence, web crawling activity has become widespread. Despite the prevalence of scraping, only fifty known web crawling cases have ever gone to court.
Tonia Klausner, an attorney partner at Wilson Sonsini Goodrich & Rosati specialising in the areas of internet data, privacy and mobile, affirms that incentives play a key role in the web data legal landscape.
“The value of legal claims against web crawlers is low where the crawler does not crash or otherwise harm the website, and the crawled data is not used in competition with the website operator.” Ms. Klausner says. “This is one reason why we don’t see many claims being filed in court against web-crawlers, and why the claims that are filed tend to be driven by the crawlers somehow damaging the business of the data owners, whether directly or due to opportunity cost.”
https://www.wsgr.com/WSGR/DBIndex.aspx?SectionName=attorneys/BIOS/7957.htm
Publishing Web Scraping Environment
While companies from all industries scrape data, below are three instances where publishing businesses scrape Amazon:
Vearsa, a UK company, scrapes millions of product pages each day (from Amazon and other retailers), to provide data about retailer listings, such as missing buy buttons, pricing, etc. Vearsa has several publisher clients, including Penguin Random House. A reference on their website states:
“Unless your distribution partner can do web scraping, there’s only one way to make sure your books are available: manually. Go to each retail site and look up new titles, then periodically spot check titles from your backlist.
That said, there are a few distribution partners you can work with who will keep tabs on retailers for you. INscribe Digital, Libre Digital and, of course, Vearsa can help you ensure retailer compliance.”
The other companies Vearsa references also scrape retailer websites such as Amazon. Libre Digital has had a contract with [Publisher] for several years.
Pronoun (recently acquired by Macmillan) have, for several years, scraped millions of product pages on Amazon each day, for use in analytical dashboards, to provide market/pricing intelligence to authors, and to identify trends. They currently provide email alerts to authors about changes in the status of their books on Amazon, for example, when a book receives new reviews, enters a bestseller list, etc. Use of their service is not restricted to the author of a book – any user can choose to monitor and receive email alerts on any book on Amazon.
Author Earnings is a well-known website targeting authors and publishers. The site operators openly discuss scraping millions of product pages on Amazon in the UK and the US, for example: http://authorearnings.com/report/november-2015-the-uk-report-author-earnings-on-amazon-co-uk/ One of the founders of the site is known as “Data Guy” and is a keynote speaker at Digital Book World 2017.
These companies scrape millions of pages on Amazon at a time (some daily). [Company]’ daily scraping volume is a fraction of this.
Summary
In analysing the risk of [Company]’ scraping activities, it is the opinion of the author that those activities fall within permitted use, being insubstantial and adhering to the published Robots Exclusion Standard (a well-known standard for permitted website crawling).
If this were to be tested, it should be noted that the courts have historically evaluated scraping activity on the basis of intent and business impact. The intent of [Company]’ use of scraped data is clearly to benefit and improve Amazon’s business through search optimisation on behalf of publishers. No content is publicly reproduced or used in a manner that assists Amazon’s competitors.
There is also no threat from [Company] to the operation of Amazon’s website, whose infrastructure handles hundreds of millions of requests daily. [Company]’ usage profile is highly conservative, and falls within the request frequency of a human user.
Further, any testing of the law would likely involve companies who scrape substantial parts of Amazon’s available book catalogue (such as those mentioned above), and who make this data more publicly accessible. [Company] scrapes a fraction of this data, and the highly processed output is never made public to website users.