2020 Book Analysis

Does anyone else remember that kid-friendly computer game from the 90s called Storybook Weaver? I have a lot of memories of loading that CD-ROM onto our Windows 95 PC (so archaic!) to create and illustrate my own stories. I was probably around 6-7 years old at the time.

I have since shied away from writing, but reading has always stuck with me. I may have Storybook Weaver to thank for bringing out that passion for books at a very young age.


Since May of 2012, when I was a junior in high school, I have been tracking every book that I read on Goodreads. Not only do I have a virtual archive for my books, but I also receive book recommendations based on the books I've enjoyed, and manage a "shelf" of books that I'd like to read (which I always consult when I go to the library!).

As someone whose sole focus at work is datasets, I realized that I've been collecting a great deal of my own data this whole time without even noticing! So many potential visualizations and analyses started running through my mind. My Goodreads dataset consisted of the following at the time of my analysis:
  • 114 books I've read
    • 112 of these have been rated by me on a scale of 1-5
  • 5 books I am currently reading
  • 41 books that I want to read
Each book has what you could consider a "profile" with even more data, including number of pages, year published, genre, community rating, and price.


To ring in the new year, I decided to use this data to reflect on everything I had read in 2020. This narrowed my dataset down to 21 books. I just wanted some cool summary statistics and highlights; I didn't think there would be much additional value in diving deep with web scraping, mathematics, or anything complex like that. So I chose Tableau Public for this analysis. You can view the interactive public dashboard here.

One note: sometimes I use a generic "Fiction" genre for a book, and sometimes a more specific one, such as "Historical Fiction". I based this on how users most often categorized the book (you can see that in the bottom right of the screenshot above), plus my own intuition. For example, The Handmaid's Tale was categorized as "Fiction" by 17k users and as "Dystopian Fiction" by 8k users. I went with Dystopian Fiction in this case because I felt it more accurately represented the book.



Above is a plot showing the number of days it took me to finish each book, based on when I started that book. Interestingly, I was reading slower on average at the end of the year than at the beginning. This may be related to the holidays and burnout at that time of the year.
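
For anyone who would rather script this than build it in Tableau, here is a minimal sketch of how the "days to read" value behind a chart like this could be computed. I did not use this code for the dashboard; the `reading_log` list, its titles and dates, and the `days_to_read` helper are all made-up placeholders, and it assumes you have a start date and a finish date for each book.

```python
from datetime import date

# Hypothetical reading log: (title, date started, date finished).
# The titles and dates are placeholders, not the real data behind the chart.
reading_log = [
    ("Book A", date(2020, 1, 5), date(2020, 1, 19)),
    ("Book B", date(2020, 6, 1), date(2020, 6, 22)),
    ("Book C", date(2020, 11, 10), date(2020, 12, 30)),
]


def days_to_read(started, finished):
    """Number of days between starting and finishing a book."""
    return (finished - started).days


# Sort by start date so the rows follow the chart's x-axis.
for title, started, finished in sorted(reading_log, key=lambda row: row[1]):
    print(f"{started}  {title}: {days_to_read(started, finished)} days")
```

Subtracting two `date` objects gives a `timedelta`, so `.days` is all that is needed to get the y-axis value for each book.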

Intuitively, one would think that there is a direct correlation between total time to read and page count, but that isn't necessarily the case with these end-of-year outliers. Learn Python 3 the Hard Way isn't particularly long, but it is dense, since it is a textbook, so that outlier makes sense. The Glass Castle, The Handmaid's Tale, and The Unicorn Project are all in line with the rest of the year in terms of book length.

Admittedly, in hindsight, it would be interesting to compare this chart with a Total Days to Read vs. Number of Pages chart, but I had not collected the page count data because I thought it would be an obvious correlation and therefore not interesting. Also, it is time-consuming to capture that data manually with no web scraping script in place.


Next is a plot showing Total Days to Read vs. My Rating, with a color indicator for each book's genre. I am not seeing a clear correlation between days to read and how highly I rated the book, although it does seem that I read nonfiction books relatively quickly. Also, there aren't any educational books on the low end of the x-axis, which makes sense and is worth noting. However, I don't think there are many credible insights to derive from this, since the sample size for each genre varies greatly, and some genres only have a couple of data points. I will not make any other generalizations about this one, but it is still interesting to ponder.



My third visualization is a bar chart showing how I rated each book minus the average rating of that book by the Goodreads community. Of course there are variables to consider here, such as: Are people who rate books on Goodreads representative of the population? Does the number of ratings vary for each book, making some averages more credible than others? (Yes.) But since this is all for fun, we will just take it as it is.
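
The calculation behind this chart is just a subtraction per book. As a rough sketch of the idea (this is not how the Tableau dashboard was built, and the `ratings` rows and the `rating_difference` helper below are placeholders invented for illustration), it could look like this in Python:

```python
# Hypothetical rows: (title, my rating, Goodreads community average).
# These numbers are placeholders for illustration only.
ratings = [
    ("Book A", 5, 4.10),
    ("Book B", 3, 3.95),
    ("Book C", 2, 3.90),
]


def rating_difference(my_rating, community_average):
    """Positive means I liked the book more than the average user did."""
    return round(my_rating - community_average, 2)


differences = [
    (title, rating_difference(mine, community))
    for title, mine, community in ratings
]

# One possible ordering: most negative difference first.
for title, diff in sorted(differences, key=lambda row: row[1]):
    print(f"{title}: {diff:+.2f}")
```

A positive value means I rated the book higher than the community did, and a negative value means I was the harsher critic.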

Some points that stand out to me:

  • I rated The Handmaid's Tale and The Woman in the Window significantly higher than the average user. Both of these are what I would consider to be dark novels, so perhaps I should pick those types of books up more often!
  • I am a harsh critic. I rated most of these books lower than the average user did. The negative difference goes up to almost 2 points, while the positive difference only goes up to 1.05.
  • The worst book I read was Miss Peregrine's Home for Peculiar Children. I actually went into this book blind -- I saw on my Goodreads feed that one of my close friends had rated it highly so I decided to read it too. Turns out, this is a fantasy book (which I'm not a fan of) and it was geared towards younger readers. I could appreciate that the writing was good, but I just wasn't a fan of the plot because I enjoy reading more realistic novels.

And finally, a heat map! I have such a love for heat maps, as my first project that introduced me to "real" data science was creating heat maps from genetic data.

Now, this one isn't very telling since the sample size for each genre varies so greatly. Fantasy and Self-Help, as two examples, only have a sample size of 1. We can see this recurring theme of how my data kind of sucks for cool visualizations, but that is OK!

I will break it down by number of data points. 
The bottom right squares (Business, Fantasy, Dystopian Fiction, Historical Fiction, and Self-Help) each have only 1 data point. We can see that out of those, I rated Dystopian Fiction and Historical Fiction the highest.

Of the Fiction (3), Computer Science (2), and Mystery (2) books, I enjoyed the mystery books the most.
And finally, the Nonfiction (5) and Memoir (4) books are pretty even, with high average ratings of 4.00 and 4.25 respectively. I'm actually very impressed with this, as I have proven to be a harsh critic. It seems like nonfiction books and memoirs may be my niche!
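
The two numbers behind each square, the count of books and the average rating per genre, are a simple group-and-average. Here is a minimal sketch of that step, assuming a list of (genre, rating) pairs. The `books` list and the `summarize_by_genre` helper are hypothetical placeholders, not my real 2020 list or the method used in Tableau.

```python
from collections import defaultdict

# Hypothetical (genre, my rating) pairs; placeholders, not the real 2020 list.
books = [
    ("Memoir", 4),
    ("Memoir", 5),
    ("Mystery", 4),
    ("Fantasy", 2),
]


def summarize_by_genre(rows):
    """Return {genre: (number of books, average rating)}."""
    grouped = defaultdict(list)
    for genre, rating in rows:
        grouped[genre].append(rating)
    return {
        genre: (len(ratings), sum(ratings) / len(ratings))
        for genre, ratings in grouped.items()
    }


summary = summarize_by_genre(books)

# Largest groups first, so single-book genres are easy to spot.
for genre, (count, average) in sorted(summary.items(), key=lambda item: -item[1][0]):
    print(f"{genre} ({count}): {average:.2f}")
```

Keeping the count next to the average makes it obvious when an "average" is really just one book.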

That wraps up my analyses for this year in books. I am already planning some improvements for next year and ways to take it up a notch or two. My book goal for this year is now 20 books. I am excited to see what next year's stats will look like and what types of books I will read!

Happy reading for 2021 everyone!