Filter Bubble: Detection of Spammers on Social Networking Sites

1. Introduction

Social networking has become an indispensable part of everyday life. It lets people stay in touch with friends and relatives, and find others who share their interests and hobbies. According to a study by alexa.com, an average user spends more time on social networking sites than on any other kind of site. People rarely think twice before accepting a friend request or following someone on a site like Facebook or Twitter, so spammers now focus on these sites to propagate their attacks. Notably, spammers were able to generate $2 million in just one year using the Koobface worm, which propagated through these social networking sites.

In this blog post we discuss research conducted at the University of California, which detected 15,857 spam profiles on Twitter.

2. Data Collection

To collect data for analysis, the researchers created 900 profiles, 300 each on Facebook, Twitter and Myspace. These accounts were used to log all friend requests, messages and invitations received from other users. The researchers called these profiles honey-profiles because of their similarity to honeypots. After creating the honey-profiles, they ran scripts on each profile to periodically check the activity on those accounts. The friend requests and the messages received on these profiles are shown in the table below:

Out of 3,831 friend requests on Facebook only 173 came from spammers, whereas on Twitter 361 out of 397 requests came from spammers. Also, the average lifetime of a spam account was around 4 days on Facebook and 31 days on Twitter.
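To make the contrast concrete, the share of friend requests that came from spammers can be worked out from the counts quoted above. This is only a small arithmetic sketch; the dictionary and function names are illustrative and not from the study.

```python
# Friend request counts quoted in the text: (total requests, requests from spammers).
requests = {
    "Facebook": (3831, 173),
    "Twitter": (397, 361),
}


def spam_share(total: int, spam: int) -> float:
    """Fraction of friend requests that came from spammers."""
    return spam / total


for site, (total, spam) in requests.items():
    print(f"{site}: {spam_share(total, spam):.1%} of requests came from spammers")
```

On these numbers roughly 4.5% of the Facebook requests and about 91% of the Twitter requests came from spammers.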

Spammers generally use bots (automated computer programs) to send spam to user profiles. The bots can be classified into 4 categories:

1. Displayer: These bots post spam messages on their own profiles, so a targeted user has to visit the profile page to see the spam. This type of bot is the least effective and has very limited reach. All Myspace bots belong to this category.

2. Bragger: Like the displayer, this bot posts spam on its own profile, but the message is propagated to the targeted users' network, as feed items on Facebook and as tweets on Twitter. The spam reaches only those people who are directly connected to the spam account. 163 bots of this kind were found on Facebook and 341 on Twitter.

3. Poster: These bots send a message directly to the victim. The message is public rather than private and is shown to all the friends of the victim, e.g. a spam message posted on a victim’s wall on Facebook. This type of attack is the most effective because it reaches a larger audience. Koobface belongs to this category. 8 bots of this kind were found on Facebook.

4. Whisperer: These bots send private messages to the victim. Unlike with poster bots, only the targeted user can see the message. This type of bot is most common on Twitter, where 20 of them were found.
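The four categories differ in only two respects: where the spam is posted, and who gets to see it. A minimal sketch of that decision rule follows. The three boolean flags and the function name are hypothetical simplifications made for illustration; the study does not describe its labelling in this form.

```python
from enum import Enum


class BotType(Enum):
    DISPLAYER = "displayer"
    BRAGGER = "bragger"
    POSTER = "poster"
    WHISPERER = "whisperer"


def classify_bot(posts_on_own_profile: bool,
                 pushed_to_feeds: bool,
                 private_message: bool) -> BotType:
    """Map observed behaviour to one of the four categories.

    The flags are illustrative assumptions, not fields from the study.
    """
    if posts_on_own_profile:
        # Spam stays on the bot's own profile; braggers also reach friends' feeds.
        return BotType.BRAGGER if pushed_to_feeds else BotType.DISPLAYER
    # Spam is sent to the victim: privately (whisperer) or publicly (poster).
    return BotType.WHISPERER if private_message else BotType.POSTER
```

For example, a bot that writes on a victim's Facebook wall would be passed as `classify_bot(False, False, False)` and come back as a poster.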

3. Spam Profile Detection

Since “bragger” and “poster” spammers do not require real profiles for detection and can be detected easily by checking their feeds, the researchers used machine learning to separate spammers from normal users. They used the Weka framework with the Random Forest algorithm as the classifier. To classify a profile as a spammer, six features were defined:

1. FF Ratio (R): It compares the number of friend requests a user has sent to the number of friends the user has. Since a bot is just a computer program, very few users will accept its requests, so this ratio can be expected to be very high for bots and low for normal users. On Twitter it can be calculated as R = following / followers.

2. URL Ratio (U): It measures how many of a profile's messages contain URLs. To lure users to spam web pages, bots generally send URLs in their messages. The ratio U is defined as:

U = Messages_Containing_URLs / Total_Messages

In the case of Facebook, only URLs pointing to third-party sites are counted, excluding Facebook pages.

3. Message Similarity (S): This feature focuses on the similarity between the messages sent by a profile. The similarity measure S is defined as

S = ( Σ_{p ∈ P} c(p) ) / ( l_a · l_p )

where P is the set of possible message-to-message combinations among any two messages logged for an account, p is a single pair, c(p) is a function that calculates the number of words the two messages share, l_a is the average length of the messages posted by that user and l_p is the number of message combinations. A bot sending similar messages to various users will have low S.

4. Friend Choice (F): It is used to detect whether a profile has many friends with the same name, since a spammer can program a bot with a list of names to add as friends. It is defined as

F = T_n / D_n

where T_n is the total number of names in the friend list and D_n is the number of distinct first names. The researchers observed that normal users have a value close to 1, whereas spammers have a value close to 2 or higher.

5. Messages Sent (M): It is the number of messages sent by a profile. Spammers were observed to send fewer than 20 messages, because the chance of being detected increases as more messages of a similar type are sent from a single profile. Normal users, by contrast, may send hundreds of messages without being misclassified as spammers by the networking site.

6. Friend Number (FN): It is the number of friends a profile has. A spam profile typically has few friends, whereas a normal user may have thousands.
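Several of the features above are simple ratios and counts, so they can be sketched in a few lines. The snippet below is an illustration only: the `Profile` container, its field names and the naive URL check are assumptions made for this post, not the researchers' code (they worked in Weka), and message similarity (S) is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Profile:
    """Hypothetical container for the raw data logged about one account."""
    requests_sent: int             # friend requests sent (accounts followed on Twitter)
    friends: int                   # friends (followers on Twitter)
    messages: List[str]            # text of every message sent
    friend_first_names: List[str]  # first name of every friend


def ff_ratio(p: Profile) -> float:
    """R: requests sent divided by friends; infinite if the profile has no friends."""
    return p.requests_sent / p.friends if p.friends else float("inf")


def url_ratio(p: Profile) -> float:
    """U: share of messages containing a URL (naive substring check, an assumption)."""
    if not p.messages:
        return 0.0
    with_url = sum(1 for m in p.messages if "http://" in m or "https://" in m)
    return with_url / len(p.messages)


def friend_choice(p: Profile) -> float:
    """F: total friend names divided by the number of distinct first names."""
    distinct = len(set(p.friend_first_names))
    return len(p.friend_first_names) / distinct if distinct else 0.0


def feature_vector(p: Profile) -> Dict[str, float]:
    """Collect R, U, F, M and FN for one profile."""
    return {
        "R": ff_ratio(p),
        "U": url_ratio(p),
        "F": friend_choice(p),
        "M": float(len(p.messages)),
        "FN": float(p.friends),
    }


# A made-up profile, purely for illustration.
bot = Profile(
    requests_sent=100,
    friends=4,
    messages=["win http://a.example", "hi", "buy http://b.example", "see https://c.example"],
    friend_first_names=["Anna", "Anna", "Bob", "Bob"],
)
print(feature_vector(bot))
```

The made-up profile shows the pattern the features are meant to catch: many requests but few friends, mostly URL-bearing messages, and repeated first names in the friend list.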

3.1 Spam Detection on Twitter

Since most profiles on Twitter are public, it is easier to pick spammers out of the crowd. To train the classifier, the researchers picked 500 spam profiles, obtained from the honey-profiles and from manually selected profiles with high values for the R, U and S features. They also manually picked 500 legitimate profiles for proper training of the classifier. The R feature was modified for Twitter because legitimate users may have few followers while following thousands of other profiles. Hence R was replaced by R’, defined as the R value divided by the number of followers the profile has. A 10-fold cross validation of the classifier estimated a false positive ratio of 2.5% and a false negative ratio of 3% on the training set.
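The modified ratio and the two error ratios can be sketched as follows. The function names are hypothetical, and the error ratios are assumed to be computed per class (false positives over legitimate profiles, false negatives over spam profiles); the post does not spell out the exact definition used in the study.

```python
from typing import Sequence, Tuple


def r_prime(following: int, followers: int) -> float:
    """R' = R / followers, where R = following / followers."""
    if followers == 0:
        return float("inf")
    r = following / followers
    return r / followers


def error_ratios(actual: Sequence[bool],
                 predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (false positive ratio, false negative ratio); True means spammer.

    Assumption: each ratio is taken over the profiles of its own true class.
    """
    pairs = list(zip(actual, predicted))
    legit = sum(1 for a, _ in pairs if not a)
    spam = sum(1 for a, _ in pairs if a)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    return (fp / legit if legit else 0.0, fn / spam if spam else 0.0)
```

Two accounts that both follow 2,000 profiles illustrate the effect of dividing by the follower count again: with 100 followers R’ is 0.2, while with only 10 followers it is 20.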

From 6 March 2010 to 6 June 2010 they crawled 135,834 profiles, of which 15,932 were classified as spammers. Of these 15,932, only 75 were reported as false positives by Twitter. The remaining profiles were deleted from Twitter.
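The figures in this paragraph tie back to the number quoted in the introduction. A quick arithmetic check, with variable names chosen only for illustration:

```python
crawled = 135_834        # profiles crawled
flagged = 15_932         # classified as spammers
false_positives = 75     # reported by Twitter as false positives

confirmed = flagged - false_positives   # profiles deleted by Twitter
flag_rate = flagged / crawled           # share of crawled profiles flagged
fp_share = false_positives / flagged    # share of flagged profiles that were wrong

print(confirmed)
print(f"{flag_rate:.1%} flagged, {fp_share:.2%} of those were false positives")
```

The difference, 15,857, is the number of spam profiles mentioned in the introduction.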

4. Conclusion

With the growing adoption of social networking sites around the globe, these sites have also become a hotspot for malicious users. In this post, we discussed how spammers can be detected on social networking sites using machine learning with a few carefully chosen features.

5. Reference

Stringhini, Gianluca, Christopher Kruegel, and Giovanni Vigna. "Detecting spammers on social networks." Proceedings of the 26th Annual Computer Security Applications Conference. ACM, 2010.

Monday, 7 March 2016