This short paper explores some of the benefits and drawbacks of big data in the publishing industry, as big data is becoming a prevalent tool for analyzing and predicting trends in fields such as retail, healthcare, and crime prevention (Kobo, 2014, para. 1). To look at the role of “big data” in publishing, a general understanding of the term is required. An article by Forbes contributor Gil Press discusses the origin and popularization of the term “big data” and notes that its definition is problematic and still needs development. Although Press and other experts “predict a relatively short life span for this unfortunate term”, Press gathers various definitions in order to build a well-rounded idea of big data, as it is an amalgamation of many characteristics. The definitions that best match my understanding are:
        (#2) “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”
        (#6) The new tools helping us find relevant data and analyze its implications.
        (#7) The convergence of enterprise and consumer IT.
        (Press, 2014, p.1).
An article by Kobo also highlights that “analyzing these data sets is quickly becoming the basis for competition, productivity, and innovation” (Kobo, 2014, para. 1) as big data becomes “the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” (Press, 2014, p. 1, para. 10). In the publishing industry, the digitization of reading is paving the way for big data to become the driving force in evolving the relationships between commercial parties, publishers, authors, readers, and technology.
        Digitization allows reader behaviour to be tracked and reported. Different parties (e.g. commercial parties, publishers, and authors) are able to see what kinds of books readers have purchased, “what books were left unopened, which were read to the very last word and how quickly” (Kobo, 2014, para. 4). Statistics drawn from this data can connect the completion rates of books to their sales numbers and help publishers decide whether to invest in an author’s book. As Kobo says, “opportunity lies among those books that have high completion rates yet suffer low sales” because this combination shows high reader engagement and a rectifiable lack of marketing (Kobo, 2014, p. 3, para. 4). At the publisher’s level, big data helps to highlight trends and reader engagement, thereby giving publishers a picture of which authors to invest more resources in. This can also benefit authors, who can get a better sense of their reader demographic and compare results between their different works. However, where authors once worried mainly about whether their books were purchased, new payment models built on big data now determine how an author earns money. Unlike print book sales, these models rely on new parameters, such as the number of pages read in self-published ebooks.
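        As a rough illustration of the comparison Kobo describes, the sketch below flags titles that combine a high completion rate with low sales. Everything in it is an assumption made for the example: the field names, the 70% completion threshold, the 1,000-unit sales ceiling, and the sample figures are invented and are not drawn from Kobo or any publisher.

```python
from dataclasses import dataclass


@dataclass
class TitleStats:
    title: str
    readers_started: int   # readers who opened the book (hypothetical field)
    readers_finished: int  # readers who reached the last page (hypothetical field)
    units_sold: int


def completion_rate(stats: TitleStats) -> float:
    """Share of readers who finished the book, or 0.0 if nobody opened it."""
    if stats.readers_started == 0:
        return 0.0
    return stats.readers_finished / stats.readers_started


def marketing_opportunities(titles, min_completion=0.7, max_sales=1000):
    """Return titles with a high completion rate but low sales.

    Both thresholds are arbitrary assumptions chosen for illustration.
    """
    return [
        t.title
        for t in titles
        if completion_rate(t) >= min_completion and t.units_sold < max_sales
    ]


# Invented figures for illustration only.
SAMPLE_TITLES = [
    TitleStats("Title A", readers_started=400, readers_finished=340, units_sold=450),
    TitleStats("Title B", readers_started=9000, readers_finished=2700, units_sold=12000),
    TitleStats("Title C", readers_started=150, readers_finished=60, units_sold=200),
]

if __name__ == "__main__":
    print(marketing_opportunities(SAMPLE_TITLES))
```

        In this invented sample, only “Title A” is flagged: most of its readers finish it, yet it has sold few copies, which is the pattern Kobo associates with a fixable marketing gap rather than a weak book.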
        An example of such a payment model appears in “Tracking reader habits using tech: Good or bad for readers and writers?” by Troy Lambert, which examines various ebook platforms that collect, or once collected, reader data. One of the platforms discussed is Amazon’s Kindle Unlimited subscription service, in which authors are compensated according to how many pages are detected as read. This can be problematic because reader data has many possible causes behind it, any of which could affect the accuracy of the page-read counts. Applying a single algorithmic protocol to all genres, whether they are read linearly or non-linearly, is also a problem, as noted by VanDyke in the comments (VanDyke, 2016). The article also notes that some people cheat or work around the system, which can result in having one’s books pulled or being banned from operating on Amazon (Lambert, 2016, para. 1). While this article did not go into depth about the drawbacks of the payment model, I did find another recent article about the associated publishing platform, KDP (Kindle Direct Publishing) Select, that confirmed my suspicions.
        Experienced KDP Select author David VanDyke reports that he, along with fellow authors using the platform, is losing page reads due to software glitches (VanDyke, 2016). VanDyke also conducted a few experiments to check the odd statistics and concluded that there was some sort of issue with the way reader data was reported. This can lead to large discrepancies in compensation. The article and the comments below it also illustrate a lack of transparency between Amazon and its authors in diagnosing the precise issue, or even in Amazon admitting that there is a problem in the first place. The ongoing discussion in the comment section suggests that PageFlip, a newly incorporated feature, may have affected the way page reads are recorded, alongside reader behaviour such as returning to the book cover after reading or going back to a favourite passage.
        Readers may also have other reasons for stopping, and assumptions based on data that does not account for those reasons could be inaccurate. While this issue stems partly from the payment model itself, the model can be seen as a result of our attitude towards big data, and it is an example of how algorithmic defects become harder to troubleshoot as a system grows more complex. Trying to work towards a solution is even harder when a company withholds information because it wants to appear unaffected.
When we use data to gauge the success of sales, we must remain aware of the context of that data. Nielsen BookScan numbers, for example, are limited to point-of-sale figures tied to ISBNs and do not account for a large portion of sales, such as ebook sales (Michel, 2016, para. 10-12). This matters because we rely on the accuracy, handling, and interpretation of data to make decisions that affect the publishing industry on all levels.
        Another feature of big data is “the shift (for individuals) from consuming data to creating data” (Press, 2014, p. 2). The issues related to this shift tend to be ethical ones, such as privacy. When we read ebooks or sign up for Goodreads accounts, we usually give companies permission to track our habits and preferences. Still, we seem to be comfortable providing businesses with our attention and data if doing so helps to enhance or build products and services catered to us.
        Our participation makes it possible for companies to suggest pregnancy yoga books to a consumer if her data indicates that she is pregnant and likes yoga. My question is whether this ever crosses the line. In “5 Reasons to Liberate Your Ebooks From Their DRM Prison”, K.T. Bradford recounts five ways in which customers can lose their ebooks due to companies monitoring reader behaviour (Bradford, 2013). In one of these cases, reported data led Amazon to confiscate all the purchased ebooks on a woman’s device and deny her access to her account because the DRM detected that she was, along with a related account, in violation of the terms of service (Bradford, 2013, para. 6). Amazon neither told her why nor gave her any warning. The subsequent conversations with Amazon were also non-transparent and unfruitful (Bradford, 2013, para. 6). The mishandling of data in monitoring user behaviour contributed to dissatisfaction with another big data service and became an ethical issue.
        In conclusion, privacy, transparency, inaccuracy, and shortcomings in technology are recurring problems that have hampered big data’s transition into publishing. However, big data ultimately contributes to better services for authors and consumers. It enables higher levels of interactivity between commercial parties, publishers, authors, and readers. It will also provide the means for innovations and efficiencies, and it can help to map an industry that is often unpredictable.


Bradford, K. T. (2013, August 21). 5 Reasons to liberate your ebooks from their DRM prison. Digital Trends. Retrieved from

Kobo. (2014, October 9). Publishing in the era of big data. Kobo Newsroom. Retrieved from

Lambert, T. (2016, September 24). Tracking reader habits using tech: Good or bad for readers and writers? Teleread. Retrieved from

Michel, L. (2016, June 30). Everything you wanted to know about book sales (But were afraid to ask). Electric Lit. Retrieved from

Press, G. (2014, September 3). 12 Big data definitions: What’s yours? Forbes. Retrieved from

VanDyke, D. (2016, October 8). Amazon KDP Select authors are losing page reads, apparently due to software glitches. Teleread. Retrieved from

