Friday, 4 March 2016

Online Analysis of Information Diffusion in Twitter



In this post, I present a paper that studies and analyzes information diffusion on Twitter in real time. Note that this paper is the first to study the influence paths of information cascades on Twitter in an online manner, as opposed to previous works that used offline data. The paper was published at the WWW'14 conference.

Introduction


This paper focuses on analyzing, in an online manner, how information spreads over microblogging social networking services like Twitter. This involves reconstructing information cascades, which model how information propagates from user to user, from the stream of messages and the underlying social graph. Monitoring social media in real time has attracted a lot of interest from both academia and industry, because social media has changed the speed of interaction in addition to providing a large audience. In previous works, the analysis of how information spreads was performed in an offline fashion. This paper presents methods and results showing how information diffusion can be studied in real time, using retweets on Twitter as a starting point. It addresses the problem of determining influence paths, which express the relationship of who was influenced by whom. An information cascade is the set of influence paths that form a graph sharing a common root, the source of a tweet. The online analysis of diffusion relies on an algorithm and a supporting system that infer possible influence paths from the stream of tweets and the underlying social graph (the follower and friendship network), which no prior work has done.

The major research question is whether information propagates mostly over explicit social links or whether other means also play an important role. If the latter is the case, attributing influence spread is challenging. Other research aspects include how influential the root user is, whether cascades tend to be wide or deep, to what extent users are exposed to multiple influencers, and the effects of various influence models.

The two baseline approaches for information diffusion presented here are the Independent Cascades (IC) model and the Linear Threshold (LT) model. The IC model associates a diffusion probability with each edge, while the LT model [8] defines an influence degree on each edge and an influence threshold for each node. The statistical, structural and content aspects of information cascades have also been studied.
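To make the difference between the two baseline models concrete, here is a minimal Python sketch of both. It follows the usual textbook formulation rather than anything taken from the paper, and the function names and dictionary layouts (`out_edges`, `in_edges`, `thresholds`) are my own assumptions.

```python
import random


def independent_cascade(out_edges, seeds, rng):
    """Independent Cascades: every newly activated user u gets one chance
    to activate each neighbour v, succeeding with edge probability p.

    out_edges: {u: [(v, p), ...]}  (hypothetical layout)
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def linear_threshold(in_edges, thresholds, seeds):
    """Linear Threshold: user v becomes active once the summed influence
    degree of its already active in-neighbours reaches v's threshold.

    in_edges: {v: [(u, weight), ...]}, thresholds: {v: threshold}
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, edges in in_edges.items():
            if v in active:
                continue
            weight = sum(w for u, w in edges if u in active)
            if weight >= thresholds[v]:
                active.add(v)
                changed = True
    return active
```

Calling `independent_cascade(out_edges, {"a"}, random.Random(0))` returns the set of users reached from seed `a`. Passing a seeded `random.Random` keeps a run repeatable.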

Models and Algorithms


To track information diffusion in real time, we need to extract information cascades from the message stream. A cascade is formed when users forward the same original tweet from the root user to their own followers. Under the assumption that social connections serve as the means of information diffusion and influence, the influence paths can be derived from these connections among users. Users might be exposed to and influenced by a piece of information through multiple users, forming multiple influence paths. If no constraints are placed on the influence model, any earlier retweet from a friend should be considered a potential influencer, although not all received retweets carry equal influence. Various influence models can be considered: the least recent influencer, where users are influenced by the first exposure; the most recent influencer, where users are influenced by the last exposure; the most followed influencer, where users with the most followers tend to trigger diffusion; and the most retweeted influencer, which favors users whose messages are forwarded the most.
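The four influence models differ only in how they pick one influencer out of the candidates a user was exposed to. A small Python sketch of that selection step could look like the following. The `Candidate` record, its field names and the model keys are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A friend who (re)tweeted earlier (hypothetical record)."""
    user: str
    exposure_time: float   # when this friend's (re)tweet was posted
    followers: int         # follower count of this friend
    times_retweeted: int   # how often this friend's messages were forwarded


# One selection rule per influence model named in the text.
INFLUENCE_MODELS = {
    "least_recent": lambda cs: min(cs, key=lambda c: c.exposure_time),
    "most_recent": lambda cs: max(cs, key=lambda c: c.exposure_time),
    "most_followed": lambda cs: max(cs, key=lambda c: c.followers),
    "most_retweeted": lambda cs: max(cs, key=lambda c: c.times_retweeted),
}


def pick_influencer(candidates, model):
    """Return the candidate chosen by the given model, or None if there is none."""
    if not candidates:
        return None
    return INFLUENCE_MODELS[model](candidates)
```

For example, `pick_influencer(candidates, "most_recent")` returns the friend whose retweet arrived last before the user reacted.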


The model assumes a social graph SG = (V, F), a directed graph of follower/friend relationships in which F records, for each user in V, who follows that user.
The message stream is expressed as a sequence of messages M in temporal order. Each message m in M carries several attributes, such as a timestamp t, a user v that belongs to V, and an identifier i that is a retweet ID or a hashtag of the message. Any two messages m1 and m2 are said to belong to the same cascade iff m1.i = m2.i. A cascade graph C = (U, E) with U ⊆ V is then defined as a directed graph of influence paths among users. C is a subgraph of SG annotated with the influence time on the edges. It is important to note that C contains only the nodes of users who actually (re)tweeted, not those who were exposed to the information but did not react. This allows a smaller graph to be used to trace the possible influence paths.
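The data model above maps quite directly to code. Below is a minimal Python sketch that groups a message stream into cascades by the shared identifier i. The `Message` tuple simply mirrors the attributes t, v and i from the text, and the function name is an assumption of mine.

```python
from collections import defaultdict
from typing import NamedTuple


class Message(NamedTuple):
    t: float   # timestamp
    v: str     # user who posted the message
    i: str     # retweet ID or hashtag identifying the cascade


def group_into_cascades(stream):
    """Group messages by identifier: m1 and m2 share a cascade iff m1.i == m2.i.

    The stream is assumed to be in temporal order already, so every
    per-cascade list stays in temporal order as well.
    """
    cascades = defaultdict(list)
    for m in stream:
        cascades[m.i].append(m)
    return dict(cascades)
```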

Among the users in C that (re)tweeted, u0 in U is designated as the "root", the source of the original message. An influence path must fulfil the following condition: a user um who spreads information with a message m may have been influenced by a user ui if there is a social network connection from ui to um and ui is either the root or was exposed to the information by a message n that happened before m.
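The influence path condition can be turned into a simple single-pass reconstruction over one cascade. The Python sketch below is my own reading of that condition, not the paper's actual algorithm. It assumes `friends` maps each user to the set of accounts they follow, treats the first message as the root, and annotates each edge with the time of the influenced user's message.

```python
from typing import NamedTuple


class Message(NamedTuple):
    t: float   # timestamp
    v: str     # user who posted the message
    i: str     # cascade identifier


def reconstruct_cascade(messages, friends):
    """Infer possible influence edges for one cascade.

    messages: the cascade's messages in temporal order, root first.
    friends:  {user: set of users they follow} (assumed layout).
    Returns (users, edges), edges being (influencer, influenced, time).
    """
    seen = {}    # user -> time of their first (re)tweet in this cascade
    edges = []
    for m in messages:
        if m.v in seen:
            continue  # only the first reaction of a user is considered
        for u in friends.get(m.v, set()):
            # u is a possible influencer if m.v follows u and u
            # (re)tweeted strictly before m.
            if u in seen and seen[u] < m.t:
                edges.append((u, m.v, m.t))
        seen[m.v] = m.t
    return list(seen), edges
```

Only users who actually (re)tweeted end up in the result, which matches the definition of the cascade graph C. Choosing a single influencer per user would be a separate step on top of these candidate edges.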

To evaluate the connectivity of a diffusion graph, the paper defines two metrics. The first is the Connectivity-Rate (CR), which assesses whether there is a connection between two users (nodes) in the cascade and returns the percentage of users that have at least one connection and are thus influenced by another user. The second is the Root-Fragment-Rate (RFR), which assesses whether there is a path to the root user from every other user.
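A rough Python sketch of both metrics follows. The exact definitions are not spelled out here, so two assumptions are baked in: both rates are computed over the non-root users only, and RFR is taken as the share of those users that can be reached from the root along influence edges.

```python
from collections import defaultdict, deque


def connectivity_rate(users, edges, root):
    """Share of non-root users with at least one incoming influence edge."""
    others = [u for u in users if u != root]
    if not others:
        return 0.0  # undefined for a root-only cascade; 0.0 by convention here
    influenced = {dst for _, dst in edges}
    return sum(u in influenced for u in others) / len(others)


def root_fragment_rate(users, edges, root):
    """Share of non-root users connected to the root by an influence path."""
    others = [u for u in users if u != root]
    if not others:
        return 0.0
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    reached = {root}
    queue = deque([root])
    while queue:  # breadth-first search starting at the root
        u = queue.popleft()
        for v in children[u]:
            if v not in reached:
                reached.add(v)
                queue.append(v)
    return sum(u in reached for u in others) / len(others)
```

Here `edges` is a list of `(influencer, influenced)` pairs. With these definitions a user can count towards CR without counting towards RFR, namely when its influencer is itself cut off from the root.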

Dataset


The paper uses a dataset recorded from August 3rd to September 24th 2012, covering most of the Olympics and the Paralympics 2012. It was collected through the Twitter streaming API by subscribing to the filter terms "Olympics" and "London2012". In total, the dataset contains almost 11 million tweets, including 1.1 million separate retweet cascades. The largest cascade has more than 60,000 retweets, and around 150 cascades have more than 1,000 retweets. For 50% of the cascades the completeness is 90% or more, while for only 15% of the cascades it is less than 80%.

Evaluation


The evaluation of the models for reconstructing information cascades focuses on data quality, feasibility and cascade properties. The results on the cascade data cover four aspects. Firstly, the assumption that social links are carriers of information is verified. For 20% of the cascades the authors obtained a connectivity rate of more than 80% and a root fragment rate of more than 70%. Thus, it can be concluded that social links are indeed the predominant carriers of information. As a second step, it is shown how the quality of the input data affects the reconstruction of influence paths. Two cascades of size 1000 are taken, one a star network and the other with a complex structure, both with very good connectivity rates in the presence of full social network data and messages. To investigate the impact of incomplete data, the authors gradually removed follower lists and messages. Star cascades are expected to degrade less, since most of the users (retweeters) are connected to the root. When the follower lists are degraded to just 5% of the original data, the connectivity rate drops by only 2% for the star cascade and by 20% for the complex cascade. The reason is that most users actually don't exert much influence, while multiple diffusion paths compensate for the lost social connections in complex structures. When random messages were removed, a decrease of more than 20% for star cascades and 30% for complex cascades was observed when keeping 75% of the messages.
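The degradation experiment can be imitated with two small helpers that throw away part of the input before the cascade is reconstructed again. This is only a sketch under my own assumptions: whole follower lists are dropped per user, the root message is always kept, and the function names are made up for the illustration.

```python
import random


def degrade_follower_lists(follower_lists, keep_fraction, rng):
    """Keep the follower lists of a random share of users, drop the others.

    follower_lists: {user: set of followers} (assumed layout)
    """
    users = sorted(follower_lists)
    k = round(len(users) * keep_fraction)
    kept = set(rng.sample(users, k))
    return {u: fs for u, fs in follower_lists.items() if u in kept}


def drop_messages(messages, keep_fraction, rng):
    """Keep the root message plus a random share of the remaining ones,
    preserving their temporal order."""
    if not messages:
        return []
    root, rest = messages[0], messages[1:]
    k = round(len(rest) * keep_fraction)
    kept_idx = sorted(rng.sample(range(len(rest)), k))
    return [root] + [rest[j] for j in kept_idx]
```

Running the reconstruction on `degrade_follower_lists(lists, 0.05, random.Random(1))` and comparing the connectivity rate with the full-data run gives the kind of comparison described above.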
The distribution of cascade sizes shows a skewed distribution of retweet counts. The diameter analysis shows that cascades tend to be deep, with a mean diameter of 4. In fact, cascades are more deep than shallow, indicating that complex structures are more prevalent than star structures. In general, the influence or popularity of the root user of a cascade has little or no impact on the cascade size.
Comparing the various influence models, it is found that the temporal distribution of edges changes according to the model. The earliest influencer model produces edges closer to the root's timestamp, while the latest influencer model favors a more stretched distribution of late retweets. In fact, the latest influencer model considers the longest path from the root as the influence path triggering the retweet. In addition, the out-degree of users in the cascade changes with the model.


Conclusion



This paper presents the first steps towards the real-time analysis of information diffusion and user interaction in social media. It defines models to reconstruct information cascades over real-life Twitter data and shows that social links play a predominant role in information diffusion. The results show that such an inference is feasible even on noisy, large-scale, rapidly produced data. The paper also demonstrates the impact of incomplete data and the effect of different influence models on the cascades. The observed cascades show a significant amount of variety in scale and structure.


References



1. A. Guille et al. Information diffusion in online social networks: A survey. SIGMOD Record, 42(2):17, 2013.
