Wednesday 10 February 2016

Why do People Re-tweet?

Introduction:
The main objective is to better understand what makes people spread information in tweets and microblogs through retweeting. We find that although users in the majority of cases do not retweet information on topics that they themselves Tweet about, or from people who are like them (hence "anti-homophily"), models which do take homophily into account fit the observed retweet behaviours much better than more general models which do not.

Methodology:
The streams, people and links in these social media are typically treated as one large homogeneous mass. While such a high-level view of the world is of tremendous use for understanding large global behaviours, it is unfortunately not appropriate for fine-grained analysis of local behaviours. We therefore focus on generating a profile of "topics of interest" for a user based on the content they have posted in the past, and then use this profile to gain insight into what makes people propagate information, through different behavioural models. The contribution lies in building and using these user profiles.

This is done through automatic tagging of people and content into semantically meaningful categories, and then using these categories to develop context-specific behavioural models for information propagation. Our approach further relies on being able to match and disambiguate entities mentioned in content so that we can track what a person writes about over time. For example, rather than track that a person writes about "Obama" and "Bush" and "Clinton", we would like to learn that repeated instances of "Bush" likely refer to the president of the United States, and that the topic really is presidents and politics rather than these keywords. We do this by mapping found entities into an ontology, as we describe below, and then keeping track of which ontological concepts show up repeatedly in a user's content.

These repeated concepts can then be used as that person's "topic of interest profile", which we can map against other content, specifically with respect to what that person decides to propagate. Our approach to discovering a Twitter user's topic profile is based on the idea that the topics of interest can be identified by finding the entities about which a user Tweets, and then determining a common set of high-level categories that covers these entities. As a running example, consider the following real-world Tweet:

[example Tweet image not shown]

There are four entities of interest in this Tweet: Arsenal, which refers to the Arsenal Football Club of England; Walcott, which refers to Theo Walcott, a player for Arsenal; Becks, which refers to football superstar David Beckham; and England. A category that covers these entities within the Tweet might be "English Football".

Therefore, to develop a topic profile for a user, we analyse all of their Tweets and determine the set of common high-level categories that covers them. This set of categories defines the topic profile. In our example, the profile may include "English Football", "World Cup", etc. Our approach is to look for capitalized non-stopwords as possible named entities. This ensures high recall (we retrieve many possible entities) while conforming to the difficulty of our data. If an entity is not found in Wikipedia then we do not include it in the profile. Wikipedia may return a set of candidates that match the entity. To deal with this disambiguation problem, we leverage the "local context" of the Tweet. Specifically, we treat the text of the Tweet (excluding the entity term to disambiguate) as the context for that entity. For the example Tweet, with "Arsenal" as the current entity to disambiguate, the local context is {winger, Walcott, Becks, ...}. Again, note that we exclude stopwords from the context. More formally, we define the Tweet's local context, C_T, for an entity, E_T, as:

C_T = T_T − ({E_T} ∪ stopwords)
where T_T is the set of terms in the Tweet. We define each candidate entity from Wikipedia as e_i ∈ E (the set of candidates), and define the context for the page of each candidate entity, C_ei, as the set of non-stopword terms appearing on that candidate's Wikipedia page. We then choose the entity e_i from the set of entity candidates E with the maximum contextual overlap:

e* = argmax over e_i ∈ E of |C_T ∩ C_ei|
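A minimal sketch of these two steps, not the authors' code: the stopword list, the Tweet text and the candidate page contexts are illustrative assumptions (a real system would pull the page contexts from Wikipedia).

```python
import re

# Illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "to", "in", "of", "on", "for"}

def candidate_entities(tweet):
    """High-recall heuristic: capitalized, non-stopword tokens."""
    tokens = re.findall(r"[A-Za-z']+", tweet)
    return [t for t in tokens
            if t[0].isupper() and t.lower() not in STOPWORDS]

def disambiguate(tweet_terms, entity, candidates):
    """Pick the candidate e_i whose page context C_ei overlaps most with
    the Tweet's local context C_T = T_T - ({E_T} + stopwords)."""
    c_t = {t for t in tweet_terms
           if t != entity and t not in STOPWORDS}
    return max(candidates, key=lambda e: len(c_t & candidates[e]))

# Hypothetical Tweet text in the spirit of the running example.
tweet = "Arsenal need a new winger, Walcott is not it. Bring Becks back to England"
print(candidate_entities(tweet))
# → ['Arsenal', 'Walcott', 'Bring', 'Becks', 'England']

# Hypothetical page contexts for two Wikipedia senses of "arsenal".
pages = {
    "Arsenal F.C.": {"football", "club", "london", "walcott", "winger"},
    "Arsenal (military)": {"weapons", "ammunition", "storage", "military"},
}
terms = {t.lower() for t in re.findall(r"[A-Za-z']+", tweet)}
print(disambiguate(terms, "arsenal", pages))
# → Arsenal F.C.
```

Note that a sentence-initial word like "Bring" slips through the heuristic; this is the price of high recall, and the subsequent Wikipedia lookup is what filters out such non-entities.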
We are interested in higher-level concepts which are relevant to the entity in question. We retrieve these as a category tree based on the folksonomic category tree in which the identified entity page is situated. This is done by following the categories which can be found at the bottom of most Wikipedia pages. We start with the set of categories for the given entity, and trace through the links of each category, collecting the parent categories along the way. At the end of this process, we have a "sub-tree" of the folksonomy, rooted at the most specific term.
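The traversal above can be sketched as a breadth-first walk over a category-to-parents map. The PARENTS fragment below is a hypothetical stand-in for the Wikipedia folksonomy, and the depth cut-off is an assumption to keep the walk from climbing to overly general categories:

```python
from collections import deque

# Hypothetical fragment of the Wikipedia category folksonomy:
# category -> set of parent categories.
PARENTS = {
    "Arsenal F.C. players": {"English football", "Footballers"},
    "English football": {"Football in Europe", "Sport in England"},
    "Footballers": {"Association football people"},
}

def category_subtree(start_categories, max_depth=3):
    """Collect the parent categories reachable from an entity's own
    categories, giving the sub-tree rooted at the most specific terms."""
    seen = set(start_categories)
    queue = deque((c, 0) for c in start_categories)
    while queue:
        cat, depth = queue.popleft()
        if depth >= max_depth:
            continue  # assumed cut-off: stop before overly general parents
        for parent in PARENTS.get(cat, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen
```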

The root categories are more specific than the other categories. To even out the counts, we weight categories by their depth in the tree and then rank each category c in the set of sub-trees according to the following ranking function:

Rank(c) = Freq(c) × w_c

where Freq(c) is the frequency of the category's occurrence and w_c is a weight, inverse to the category's level in the sub-tree. Finally, we define a user's topic profile as the complete set of all observed categories for that user, ranked according to this ranking function.
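One way to realise this ranking in code; since the exact weighting is not spelled out here, the choice w_c = 1 / (1 + depth) below is an assumption, with depth 0 at the most specific (root) category:

```python
from collections import Counter

def rank_categories(observed):
    """observed: list of (category, depth) pairs collected across all
    sub-trees for a user.  Returns categories ranked by Freq(c) * w_c."""
    freq = Counter(c for c, _ in observed)
    depth = {}
    for c, d in observed:
        depth[c] = min(d, depth.get(c, d))  # most specific level seen
    # Assumed weighting: inverse to the category's level in the sub-tree,
    # so frequent-but-general parents do not swamp specific categories.
    score = {c: freq[c] * (1.0 / (1 + depth[c])) for c in freq}
    return sorted(score, key=score.get, reverse=True)
```

For example, a category seen twice at depth 1 outranks one seen three times at depth 3 (2 × 0.5 = 1.0 versus 3 × 0.25 = 0.75), which is the "evening out" the text describes.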

Homophily Model:
We want to compute P(retweet(x)), where x is a Tweet previously seen (up to and including the most recent Tweet). This model is based on profiles of users: it may be that a user is more likely to retweet another user if they share similar profiles. By observing what is retweeted, we can generate the underlying empirical distribution P_ps(x | sim_P(x, u)), where sim_P(x, u) is the similarity between a user's profile and the profile of the user who sent the original Tweet. Our profile-based model is then defined as:

P(retweet(x)) = P_ps(x | sim_P(x, u))

As above, sim_P(x, u) is an empirical model which comes from the data, and we may find that there are certain levels of similarity at which a user is more likely to retweet.
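A sketch of how such an empirical distribution might be estimated from observed (similarity, was-retweeted) pairs. The cosine similarity over topic profiles and the binning granularity are assumptions, since the text does not pin down sim_P:

```python
from collections import defaultdict

def cosine_sim(p, q):
    """Cosine similarity between two topic profiles (category -> weight);
    an assumed stand-in for sim_P(x, u)."""
    dot = sum(w * q.get(c, 0.0) for c, w in p.items())
    norm = (sum(w * w for w in p.values()) ** 0.5
            * sum(w * w for w in q.values()) ** 0.5)
    return dot / norm if norm else 0.0

def empirical_retweet_model(observations, n_bins=10):
    """observations: list of (similarity, was_retweeted) pairs.
    Returns the empirical P(retweet | similarity bin) from the data."""
    hits, totals = defaultdict(int), defaultdict(int)
    for sim, retweeted in observations:
        b = min(int(sim * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(retweeted)
    return {b: hits[b] / totals[b] for b in totals}
```

Inspecting the returned per-bin probabilities is what would reveal whether certain similarity levels make a retweet more likely.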

