Tuesday, 9 February 2016


Introduction:

In the modern age of fast internet, the creation of various kinds of networks on the World Wide Web has become a natural, common and intriguing phenomenon. As people connect with each other through social networking sites, they generate large amounts of information through their interactions, and this leads to complex networks around each site. Goodreads (https://www.goodreads.com/) is a community-driven social cataloging site that has grown into one of the most popular sites for reading, reviewing and recommending books. It has a built-in social structure: users can put books on their personal bookshelves, read, rate and review books, follow the reading behavior of their friends and favorite authors, open discussions and groups on a variety of topics, and interact with the site in many other ways.

In the almost nine years since Goodreads was established, the site has gathered an enormous amount of information. The resulting data set can support a detailed analysis of the complex network that has developed through the site. Unfortunately, few studies of this network have been reported. One simple analysis is a measurement study of the interplay among three major entities on Goodreads: (i) general users, (ii) books and (iii) their authors.

Characteristic properties of Goodreads entities:

Of the three major entities on Goodreads, we characterize two here: the books and the users.

Books: Goodreads organizes books by topic into genres such as fiction, fantasy and thriller. Apart from these popular genres, users can define their own genres and add books to them. Goodreads has 798 user-defined genres in total, and any book may belong to several genres. If we treat every genre as a node and, for each book that belongs to two or more genres, create an undirected edge between each pair of those genres (or add weight to the edge if it already exists), we obtain a weighted undirected graph. The edge weights may depend on characteristics of the books such as ratings and reviews. This graph can be used to analyse the relations between genres and how they change as new books appear. We can also build a model that predicts the probability of a book falling into other genres when some of its genres are known. For example, if a book is in the fantasy genre, what are the chances that it is also in the fiction genre?
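The genre graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `books` dictionary is made-up toy data, every shared book adds a weight of 1 to an edge (the text leaves the weighting open), and the function names are hypothetical. The conditional probability is a simple co-occurrence estimate, not a full prediction model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: book title -> set of genres (not real Goodreads data).
books = {
    "Book A": {"fantasy", "fiction"},
    "Book B": {"fantasy", "fiction", "young-adult"},
    "Book C": {"thriller", "fiction"},
    "Book D": {"fantasy"},
}

def build_genre_graph(books):
    """Return (edge_weights, genre_counts) for the genre co-occurrence graph."""
    edge_weights = Counter()  # key: sorted genre pair, value: number of shared books
    genre_counts = Counter()  # number of books in each genre
    for genres in books.values():
        genre_counts.update(genres)
        # Assumption: each shared book adds weight 1 to the edge.
        for pair in combinations(sorted(genres), 2):
            edge_weights[pair] += 1
    return edge_weights, genre_counts

def conditional_probability(edge_weights, genre_counts, given, target):
    """Estimate P(target | given) as shared books / books in `given`."""
    if genre_counts[given] == 0:
        return 0.0
    pair = tuple(sorted((given, target)))
    return edge_weights[pair] / genre_counts[given]

edge_weights, genre_counts = build_genre_graph(books)
p = conditional_probability(edge_weights, genre_counts, "fantasy", "fiction")
print(f"P(fiction | fantasy) = {p:.2f}")
```

Replacing the unit increment with a rating- or review-based value would give the other weightings mentioned above.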

Another analysis can study the relevance of the ratings, reviews and user comments related to different books. We can create separate ranked lists based on book ratings, the number of reviews and the sentiment analysis score of user reviews. We can then pick a set of famous books, for example the Facebook book bucket challenge top 100 books list, and compare that ranked list with the ranked lists created earlier. To compare two ranked lists we can calculate the following coefficients:

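1. Kendall rank correlation coefficient (τ): for two rankings of the same n items, τ = (n_c - n_d) / (n (n - 1) / 2), where n_c and n_d are the numbers of concordant and discordant pairs of items.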

2. Spearman's rank correlation coefficient (ρ): Spearman's correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables. For a sample of size n, the n raw scores X_i, Y_i are converted to ranks x_i, y_i, and, when all ranks are distinct, ρ is computed from:

ρ = 1 - (6 Σ d_i^2) / (n (n^2 - 1))

where d_i = x_i - y_i is the difference between ranks.

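3. Jaccard overlap: for two lists treated as sets A and B (for example, their top-k entries), J(A, B) = |A ∩ B| / |A ∪ B|.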
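The three measures listed above can be sketched in plain Python as follows. The book identifiers and the three ranked lists are invented for illustration, and the function names are hypothetical. The Kendall and Spearman functions assume that both lists rank exactly the same items without ties; the Jaccard overlap is applied to top-k prefixes, since it would always be 1 for two full rankings of the same items.

```python
from itertools import combinations

def ranks(order):
    """Map each item to its 1-based position in a ranked list."""
    return {item: pos for pos, item in enumerate(order, start=1)}

def kendall_tau(list_a, list_b):
    """Kendall tau for two rankings of the same items (no ties, n >= 2)."""
    ra, rb = ranks(list_a), ranks(list_b)
    concordant = discordant = 0
    for x, y in combinations(list_a, 2):
        sign = (ra[x] - ra[y]) * (rb[x] - rb[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(list_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def spearman_rho(list_a, list_b):
    """Spearman rho from rank differences (same items, no ties, n >= 2)."""
    ra, rb = ranks(list_a), ranks(list_b)
    n = len(list_a)
    d_squared = sum((ra[x] - rb[x]) ** 2 for x in list_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))

def jaccard_overlap(list_a, list_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(list_a), set(list_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical rankings of four books (not real Goodreads data).
by_rating = ["b1", "b2", "b3", "b4"]
by_reviews = ["b3", "b1", "b2", "b4"]
reference = ["b2", "b1", "b3", "b4"]

print(kendall_tau(by_rating, reference))
print(spearman_rho(by_rating, reference))
print(jaccard_overlap(by_rating[:2], by_reviews[:2]))
```

When the lists do not contain the same books, one option is to restrict both to their common items before computing τ and ρ, and to report the Jaccard overlap alongside to show how much was dropped.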


We can also investigate the popularity of books in terms of the ratings and reviews they receive. Goodreads offers a 5-star rating system for rating books, authors etc.

Fig a shows the distribution of the average ratings for books. More than 96% of the books on Goodreads have an average rating above 3, and ∼40% have an average rating above 4. This suggests that Goodreads has a quality collection of books that are quite popular among its users. Fig b shows the rating distribution for the books. Clearly, most books receive high percentages of 4-star and 5-star ratings and very few 1-star and 2-star ratings. Fig c shows the distribution of the number of reviews received by the books. The distribution follows a power-law behavior with a heavy tail. Specifically, 23% of the books receive no reviews and 44% receive fewer than 10 reviews, whereas only 0.04% of the books have more than 10K reviews.
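Shares like the ones quoted above can be computed directly from a list of per-book review counts. The sketch below uses a made-up list of ten counts, so its outputs do not reproduce the Goodreads percentages, and the helper names are hypothetical. It also builds an empirical complementary CDF, which is a common way to inspect a heavy tail on log-log axes.

```python
def share(counts, predicate):
    """Fraction of books whose review count satisfies `predicate`."""
    return sum(1 for c in counts if predicate(c)) / len(counts)

def ccdf(counts):
    """Empirical P(X >= x) for each distinct review count x, in ascending order."""
    n = len(counts)
    return [(x, sum(1 for c in counts if c >= x) / n) for x in sorted(set(counts))]

# Hypothetical review counts for ten books (not real Goodreads data).
review_counts = [0, 0, 3, 7, 12, 150, 20000, 1, 0, 45]

no_reviews = share(review_counts, lambda c: c == 0)
# Assumption: "fewer than 10" includes books with zero reviews.
under_ten = share(review_counts, lambda c: c < 10)
over_10k = share(review_counts, lambda c: c > 10_000)

print(no_reviews, under_ten, over_10k)
for x, prob in ccdf(review_counts):
    print(x, prob)  # drop x = 0 before plotting on logarithmic axes
```

A roughly straight line in the log-log plot of the CCDF is consistent with a power law, although it does not prove one.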

Users: Studying the characteristic behavior of users, with a focus on their participation in groups, their reading behavior, their reviewing behavior etc., is important for understanding the large network of the site.

The above figure shows the temporal evolution of the number of users joining Goodreads from 2007 to 2013. From 2007 to 2011 the curve shows a linear increase, and in 2012 and 2013 the number of users joining Goodreads increases exponentially.

We can also study the distributions of various user characteristics. For example, the following are, respectively, the distributions of c) the number of groups a user belongs to, d) the number of friends a user has, e) the number of reading shelves a user maintains and f) the number of reviews a user has written.

Users are the main unit of this complex network, and studying their characteristics helps with the measurement analysis of the site. The information related to a user can be treated as that user's characteristics on Goodreads. Based on these characteristics, we can try to classify the readers on Goodreads into classes and subclasses, study the reading trends that develop within a given class of readers, and see how they differ from those of other classes.

With increasing popularity, Goodreads has developed a rich knowledge base for book lovers. It has been found that most of the users on Goodreads (~71%) are young adults in the 20-40 age group. The variety of books, authors, genres and users makes this site all the more interesting. There is certainly scope for further study of this topic.
