THE BESTSELLER-OMETER, OR, HOW TEXT MINING MIGHT CHANGE PUBLISHING
Back in the spring of 2010, Stieg Larsson's agent was having a good day. On June 13, The Girl Who Kicked the Hornets' Nest, third in the series from a previously unknown author, debuted at number one in hardback in the New York Times. You can imagine the lists would have been a pleasing sight over morning coffee. Hornets' Nest straight in at the top, Dragon Tattoo at number one in two paperback formats, and The Girl Who Played with Fire a roundly satisfying number two. This had been going on for forty-nine weeks in the U.S., and for three solid years in Europe. It would have been hard not to be smug.
The following month Amazon would announce Larsson was the first author ever to sell a million copies on the Kindle, and over the next two years sales in all editions would top seventy-five million. Not bad for an unknown political activist-turned-novelist from a little Scandinavian country, especially one who had chosen a rather uncharming title in Swedish and had written some brutal scenes of rape and torture. Men Who Hate Women, or The Girl with the Dragon Tattoo as it was renamed in English, was the sensation book of the year in more than thirty countries.
The press didn't understand the success. Major newspapers commissioned opinion pieces on what on earth was going on in the book world. Why this book? Why the frenzy? What was the secret? Who could have known?
Answers were lackluster. Reviewers scratched their heads about it. They found fault with the novel's structure, style, plotting, and character. They groaned over the translations. They complained about the stupidity of the reading public. But still copies sold as fast as they were printed—whether you were in the UK, the U.S., in Japan, or in Germany; whether you were male, female, old, young, black, white, straight, or gay. Whoever you were, practically anywhere, you knew people who were reading those books.
That doesn't happen very often in the book world. The industry might enjoy a phenomenon breakout like Larsson once a year, if that. E. L. James has been the biggest breakout since, with Fifty Shades of Grey, and unlike Larsson she was available for a big publicity tour. Larsson had died before publication. The level of sales his trilogy achieved without even the backing of its author was supposedly just unfathomable. Freakish. Unpredictable.
Let's consider some numbers. A company in Delaware called Bowker is the global leader in bibliographic information and the exclusive provider of unique identification numbers (ISBNs) for books in the U.S. Their annual report states that approximately fifty to fifty-five thousand new works of fiction are published every year. Given the increasing number of self-published ebooks that carry no ISBN, this is a conservative number. In the U.S., about two hundred to two hundred twenty novels make the New York Times bestseller lists every year. Even with conservative numbers, that's less than half a percent of the works of fiction published. Of that half a percent, even fewer hit the bestseller lists and stay there week after week to become what the industry calls a "double-digit" book. Only a handful of authors manage those ten or more weeks on the list, and of those maybe just three or four will sell a million copies of a single title in the U.S. in one year. Why those books?
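The "less than half a percent" figure can be checked with a few lines of arithmetic. The sketch below uses only the ranges quoted above; the variable and function names are illustrative, not from any source.

```python
# Approximate annual U.S. figures quoted in the text.
NEW_FICTION_TITLES = (50_000, 55_000)  # new works of fiction per year (low, high)
NYT_BESTSELLERS = (200, 220)           # novels reaching the NYT lists per year (low, high)


def rate_percent(bestsellers: int, titles: int) -> float:
    """Share of published titles that reach the lists, as a percentage."""
    return 100.0 * bestsellers / titles


# Lowest rate: fewest bestsellers over the most titles. Highest: the reverse.
low = rate_percent(NYT_BESTSELLERS[0], NEW_FICTION_TITLES[1])
high = rate_percent(NYT_BESTSELLERS[1], NEW_FICTION_TITLES[0])

print(f"Bestseller rate: {low:.2f}% to {high:.2f}%")
# prints "Bestseller rate: 0.36% to 0.44%"
```

Even the most generous pairing of the quoted numbers stays under 0.5 percent.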
Traditionally, it is believed that there are certain skills a novelist needs to master in order to win readers: a sense of plot, compelling characters, more than basic competence with grammar. Writers with big fan bases have mastered more: an eye for the human condition, the twists and turns of plausibility, that rare but appropriate use of the semicolon. These are good writers, and with time and dedication almost all genuinely good writers will find their audience. But when it comes to the kind of success involved in hundreds of thousands of people reading the same book at the same time—this thriller and not that thriller, this potential Pulitzer and not that potential Pulitzer—well, unless Oprah is involved, that signals the presence of a fine stardust that's apparently just too difficult to detect. The sudden and seemingly blessed success of books like the Dragon Tattoo Trilogy, Fifty Shades of Grey, The Help, Gone Girl, and The Da Vinci Code is considered very lucky, but as random as winning the lottery.
The word "bestseller," by the way, has always been a book world term, and as a word it is relatively young. It first entered the dictionary in the late nineteenth century, about the time of the first list of books ranked by consumer sales. While it should be a neutral term, it has developed some connotations that are likely misleading. The literary magazine The Bookman started to print "Sales of Books during the Month" in 1891 in London and in 1895 in New York after the International Copyright Act of 1891 slowed down the distribution of cheap pirated copies of British novels. Until then, no sales statistics had really been possible. From the beginning, the lists—which were printed in each major city and typically reported the top six sellers of the month—were about two things that were new to the book world. The bestseller lists were about sales as the only criterion for inclusion, and a proxy recommendation system for what to read next. These recommendations were based not on the choices of a select few reviewers or publishers, but on the choices of everyday fellow readers. The reader's choice was and still is the only vote. The term "bestseller," then, should carry no intrinsic comment on quality or type of book, and is not a synonym for either "genre" or "popular fiction." While the word has often been used pejoratively by some members of the literary establishment, who have felt that the collective taste of the reading market signals bad literature, the data itself suggests a less subjective and more balanced truth. Bestsellers include Pulitzer Prize winners and Great American Novels as well as books by famous mass-market writers. The list can house Toni Morrison and Margaret Atwood alongside Michael Connelly and Debbie Macomber. This is why the bestseller list is such a rich cultural construct and so dynamic to study.
ThoughtMatters is a partnership between Macmillan Publishers and Huffington Post
Obviously there's a lot of value in writing one of those books. There's a lot of value in finding those books as an agent or editor. There's a lot of value for retailers, too—the top few titles alone are why some retailers are able to stay in business and keep selling books at all.
Of course, we are talking for now of value in monetary terms. Imagine a seven- or even eight-figure advance for finally getting onto the page that book you are always telling your friends is inside you. Not many authors command that kind of clout in one territory, but they are certainly around. And you can glamorize the impoverished artist with his pen and notebook as much as you like, but wouldn't it be nice to think of the story you just made up as appearing on bedside tables, beside bathtubs, and on commuter iPads and Kindles in different languages all over the world?
The key sellers of a given year bring the glamor and the drama. They represent the houses in the Hamptons, the fancy cars and diamond tiaras of the literary domain. Hit the lists and stay there for a while and you will be revered, respected, loathed, and condemned. You might be asked to judge a prize or review other books. Maybe your movie rights will be optioned. People will be talking.
Wouldn't it be fun if success weren't so random?
The bold claim of this book is that the novels that hit the New York Times bestseller lists are not random, and the market is not in fact as unknowable as others suggest. Regardless of genre, bestsellers share an uncanny number of latent features that give us new insights into what we read and why. What's more, algorithms allow us to discover new and even as yet unpublished books with similar hallmarks of bestselling DNA.
There is a commonly repeated "truth" in publishing that success is all about an established name, marketing dollars, or expensive publicity campaigns. Sure, these things have an impact, but our research challenges the idea that it's all about hype, and it does so in a way that should appeal to those writers who toil over their craft. Five years of study suggests that bestselling is largely dependent upon having just the right words in just the right order, and the most interesting story about the NYT list is about nothing more or less than the author's manuscript, black ink on white paper, unadorned.
Using a computer model that can read, recognize, and sift through thousands of features in thousands of books, we discovered that there are fascinating patterns inherent to the books that are most likely to succeed in the market, and they have their own story to tell about readers and reading. In this book we will describe how and why we built such a model and how it discovered that eighty to ninety percent of the time the bestsellers in our research corpus were easy to spot. Eighty percent of New York Times bestsellers of the past thirty years were identified by our machines as likely to chart. What's more, every book was treated as if it were a fresh, unseen manuscript and then marked not just with a binary classification of "likely to chart" or "likely not to," but also with a score indicating its likelihood of being a bestseller. These scores are fascinating in their own right, but as we show how they are made we will also share our explanation for why that book on your bedside table is so hard to put down.
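Treating every book "as if it were a fresh, unseen manuscript" is what statisticians call held-out evaluation: the book being scored is kept out of the data used to build the model that scores it. A minimal sketch of one such scheme, leave-one-out scoring, follows. It is an illustration only and not the model described in this book: the nearest-centroid classifier, the two made-up features, the six toy rows, the function names, and the 0.5 threshold are all assumptions introduced for the example.

```python
import math

# Hypothetical toy corpus: (feature vector, charted?) pairs. The two numeric
# features are invented for illustration and carry no real meaning.
CORPUS = [
    ((0.9, 0.8), True),
    ((0.8, 0.9), True),
    ((0.7, 0.7), True),
    ((0.2, 0.1), False),
    ((0.1, 0.3), False),
    ((0.3, 0.2), False),
]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def score(features, training):
    """Return a 0..1 score: relative closeness to the bestseller centroid."""
    hits = centroid([f for f, charted in training if charted])
    misses = centroid([f for f, charted in training if not charted])
    d_hit = math.dist(features, hits)
    d_miss = math.dist(features, misses)
    if d_hit + d_miss == 0:
        return 0.5  # both centroids coincide with the book: no information
    return d_miss / (d_hit + d_miss)


def leave_one_out(corpus, threshold=0.5):
    """Score each book with a model built from every other book."""
    results = []
    for i, (features, charted) in enumerate(corpus):
        training = corpus[:i] + corpus[i + 1:]  # hold the current book out
        s = score(features, training)
        results.append((s, s >= threshold, charted))
    return results


results = leave_one_out(CORPUS)
accuracy = sum(pred == actual for _, pred, actual in results) / len(results)
for s, pred, actual in results:
    print(f"score={s:.3f} likely_to_chart={pred} charted={actual}")
print(f"accuracy={accuracy:.2f}")
```

Each book gets both outputs mentioned above: a continuous score and a binary label obtained by thresholding that score. Accuracy is then measured only on books the model did not see while being built.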
Consider some of these percentages. The computer model's certainty about the success of Dan Brown's latest novel, Inferno, was 95.7 percent. For Michael Connelly's The Lincoln Lawyer it was 99.2 percent. Both were number one in hardback on the NYT list, which for a long time has been one of the most prestigious positions to occupy in the book world. These are veteran authors, of course, already established. But the model is unaware of an author's name and reputation and can just as confidently score an unknown writer. The score for The Friday Night Knitting Club, the first novel by Kate Jacobs, was 98.9 percent. The Luckiest Girl Alive, a very different debut novel by Jessica Knoll, had a bestselling success score of 99.9 percent based purely on the text of the manuscript. Both Jacobs and Knoll stayed on the list for many weeks. The Martian (before Matt Damon's interest in playing the protagonist) got 93.4 percent. There are examples from all genres: The First Phone Call from Heaven, a spiritual tale by Mitch Albom, 99.2 percent; The Art of Fielding, a literary debut by Chad Harbach, 93.3 percent; and Bared to You, an erotic romance by Sylvia Day, 91.2 percent.
These figures, which provide a measure of bestselling potential, have made some people excited, others angry, and more than a few suspicious. In some ways that is fair enough: the scores are disruptive, mind-bending. To some industry veterans, they are absurd. But they also could just change publishing, and they will most certainly change the way that you think about what's inside the next bestseller you read.
We should make it clear that none of the books we reference were acquired based on our model's figures, and figures, beyond the ones you'll read about here, have never been formally shared with any agent or publishing house. We should also be clear that these figures are specific to the closed world of our research corpus, a corpus we designed to look like what you'd see if you walked into a Barnes & Noble with a wide selection to choose from. Agents and editors do a good job of putting books in front of consumers—it's not as though we are short of things to read. And some individuals in publishing have a particular reputation for the Midas touch. But remember that the bestseller rate in the industry as it stands is less than one-half of one percent. That's a lot of gambling before a big win. Note, too, that year after year, the lists comprise the names of the same long-standing mega-authors. Stephen King is sixty-eight. James Patterson is sixty-eight. Danielle Steel is sixty-eight. As much as fans are still thrilled by another new novel from one of these veteran writers, it is telling that the publishing world has not discovered the next generation of authors who will similarly enjoy thirty to forty years of constant bestselling. Nor did the industry find, despite the thousands of manuscripts both rejected and published annually, a runaway bestseller for 2014 (Dragon Tattoo, Fifty Shades, and Gone Girl had been the standout hits of previous years), and neither did it publish a manuscript to impress the Pulitzer Prize committee in 2012. Why?
Well, it is a universal wisdom that bestsellers are freaks. They are the happy outliers. The anomalies of the market. Black swans. If that is the truth, then once you find a bestselling writer, why put your money anywhere else? Why put your millions on a new twenty-year-old writer instead of Stephen King? How could you possibly know if a new literary author is worth the sort of investment worthy of a future big-prize winner?
Copyright © 2016 by Jodie Archer and Matthew L. Jockers
JODIE ARCHER bought and edited books for Penguin UK before decamping for the doctoral program in English at Stanford University. After her PhD, she worked at Apple as their research lead on literature, and has since consulted with many writers and businesses about literary success. She is now a full-time writer.
MATTHEW L. JOCKERS is Susan J. Rosowski Associate Professor of English at the University of Nebraska-Lincoln, where he teaches and directs the Nebraska Literary Lab. His text mining research has been profiled in the New York Times, The LA Review of Books, The Sunday Times of London, and more.