1. Introduction
Social networking has become an indispensable part of
everyday life. It lets people stay in touch with friends and relatives, and
helps them find others with similar interests and hobbies. According to a study
by alexa.com, the average user spends more time on social networking sites than
on any other site. Today, people do not think much before accepting a friend
request or following someone on a social networking site such as Facebook or
Twitter. As a result, spammers now focus on these sites to propagate their
attacks. Notably, spammers were able to generate $2 million in just one year
using the Koobface worm, which propagated through these social networking sites.
In this blog post we discuss research conducted at the
University of California that detected 15,857 spam profiles on Twitter.
2. Data Collection
To collect data for analysis, the researchers created 900
profiles, 300 each on Facebook, Twitter and Myspace. These accounts were used to
log all the friend requests, messages and invitations received from other users.
The researchers called these profiles honey-profiles because of their
similarity to honeypots. After the honey-profiles were created, scripts were run
on each profile to periodically check the activity on those accounts. The friend
requests and messages received on these profiles are shown in the table
below:
Notably, on Facebook only 173 of the 3831 friend requests
came from spammers, whereas on Twitter 361 of the 397 requests did. The average
lifetime of a spam account was also around 4 days on Facebook, compared with 31
days on Twitter.
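To make the contrast concrete, the figures above can be turned into shares. This is only a small arithmetic sketch; the dictionary layout and the `spam_share` helper are illustrative names, not something taken from the study.

```python
# Friend-request counts quoted above: (total requests, requests from spammers).
requests = {"Facebook": (3831, 173), "Twitter": (397, 361)}


def spam_share(total, spam):
    """Fraction of friend requests that came from spammers."""
    return spam / total


for site, (total, spam) in requests.items():
    print(f"{site}: {spam_share(total, spam):.1%} of friend requests were spam")
```

Roughly 4.5% of the Facebook requests were spam, against roughly 91% on Twitter.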
Spammers generally use bots (automated computer programs) to
send spam to user profiles. The bots can be classified into 4 categories:
- 1. Displayer: These bots post spam messages on their own profiles, so a targeted user has to visit the bot’s page to see the spam. This type of bot is the least effective and has very limited reach. All Myspace bots belong to this category.
- 2. Bragger: Like the displayer, this bot posts spam on its own profile, but the message is propagated to connected users as feed items on Facebook and as tweets on Twitter. The spam reaches only those people who are directly connected to the spam account. 163 bots of this kind were found on Facebook and 341 on Twitter.
- 3. Poster: These bots send a message directly to the victim. The message is public rather than private and is shown to all the victim’s friends, e.g. a spam message posted on a victim’s wall on Facebook. This type of attack is the most effective because it reaches a larger audience. Koobface belongs to this category. 8 bots of this kind were found on Facebook.
- 4. Whisperer: These bots send private messages to the victim. Unlike with poster bots, only the targeted user can see the message. This type of bot is common on Twitter, where 20 of them were found.
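The four categories above differ along two axes: whether the bot addresses a victim or only posts on its own profile, and whether the message reaches anyone beyond that. A minimal sketch of this reading follows. The two boolean parameters are hypothetical simplifications made for illustration; the researchers did not describe the categories as a function like this.

```python
from enum import Enum


class BotType(Enum):
    DISPLAYER = "displayer"
    BRAGGER = "bragger"
    POSTER = "poster"
    WHISPERER = "whisperer"


def classify_bot(sent_to_victim: bool, reaches_others: bool) -> BotType:
    """Map two observations about a spam message to a bot category.

    sent_to_victim: True if the message is addressed to a victim,
        False if the bot only posts on its own profile.
    reaches_others: for own-profile posts, True if the post is pushed to
        connected users' feeds; for messages addressed to a victim, True if
        the victim's friends can also see it (e.g. a wall post).
    """
    if sent_to_victim:
        return BotType.POSTER if reaches_others else BotType.WHISPERER
    return BotType.BRAGGER if reaches_others else BotType.DISPLAYER


# A private message to a single victim is the "whisperer" pattern.
print(classify_bot(sent_to_victim=True, reaches_others=False))
```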
3. Spam Profile Detection
“Bragger” and “poster” spammers do not require real profiles
for detection and can be detected easily by checking the feeds, so the
researchers focused on these two types. They used machine learning to classify
profiles as spammers or normal users, using the Weka framework with the Random
Forest algorithm as the classifier. To classify a profile as a spammer, six
features were defined:
- 1. FF Ratio (R): This compares the number of friend requests a user has sent to the number of friends he or she has. Since a bot is just a computer program, very few users will accept its requests, so this ratio is expected to be very high for bots and low for normal users. On Twitter it can be calculated as R = following / followers.
- 2. URL Ratio (U): This measures the share of messages that contain URLs. To lure users to spam web pages, bots generally include URLs in their messages. The ratio U is defined as:
U = Messages_Containing_URLs / Total_Messages
In the case of Facebook, only URLs pointing to third-party sites are
counted, excluding Facebook pages.
- 3. Message Similarity (S): This feature focuses on the similarity between the messages sent by a profile. The similarity measure S is defined as
S = ( Σ_{p ∈ P} c(p) ) / ( la · lp )
where P is the set of possible message-to-message combinations among any two
messages logged for an account, p is a single pair, c(p) is a function that
counts the number of words the two messages share, la is the average length of
the messages posted by that user and lp is the number of message combinations.
A bot sending similar messages to various users will have low S.
- 4. Friend Choice (F): This detects whether a profile has many friends with the same name, since a spammer can program the bot with a list of names to add as friends. It is defined as
F = Tn / Dn
where Tn is the total number of names in the friend list and Dn is the number
of distinct first names. The researchers observed that normal users have a
value close to 1, whereas spammers have a value of around 2 or more.
- 5. Messages Sent (M): This is the number of messages sent by a profile. Spammers were observed to send fewer than 20 messages, since the chance of detection increases as more similar messages are sent from a single profile. Normal users, however, may send hundreds of messages without being misclassified as spammers by the networking site.
- 6. Friend Number (FN): This is the number of friends a profile has. A spam profile typically has few friends, whereas a normal user may have thousands.
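The ratio features in the list above are simple to compute once the raw counts are available. The sketch below covers R, U and F, plus the two plain counts M and FN. The `Profile` class and its field names are hypothetical, and the URL check is a naive substring test that does not apply the third-party filter mentioned for Facebook.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical container for the raw data logged about one account."""
    following: int
    followers: int
    messages: list = field(default_factory=list)
    friend_first_names: list = field(default_factory=list)


def ff_ratio(p: Profile) -> float:
    """R = following / followers (the Twitter form given in the text)."""
    return p.following / p.followers if p.followers else float("inf")


def url_ratio(p: Profile) -> float:
    """U = messages containing URLs / total messages."""
    if not p.messages:
        return 0.0
    with_url = sum(1 for m in p.messages if "http://" in m or "https://" in m)
    return with_url / len(p.messages)


def friend_choice(p: Profile) -> float:
    """F = Tn / Dn: total names over distinct first names."""
    names = p.friend_first_names
    # With no friends F is undefined; 0.0 is a placeholder chosen here.
    return len(names) / len(set(names)) if names else 0.0


def messages_sent(p: Profile) -> int:
    """M: number of messages sent by the profile."""
    return len(p.messages)


def friend_number(p: Profile) -> int:
    """FN: number of friends the profile has."""
    return len(p.friend_first_names)


bot = Profile(
    following=1000,
    followers=10,
    messages=["cheap pills http://spam.example", "hello",
              "win now https://spam.example/x", "hi"],
    friend_first_names=["Ann", "Ann", "Bob", "Bob"],
)
print(ff_ratio(bot), url_ratio(bot), friend_choice(bot))
```

The made-up `bot` profile follows many more accounts than follow it back, puts a URL in half of its messages and has repeated friend names, so all three ratios come out on the spammer side of the descriptions above.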
3.1 Spam Detection on Twitter:
Since most profiles on Twitter are public, it is easier to
pick spammers out of the crowd. To train the classifier, the researchers picked
500 spam profiles. These were obtained through the honey-profiles and from
manually selected profiles that had high values for the R, U and S features.
They also manually picked 500 legitimate profiles so that the classifier was
trained properly. The R feature was modified for Twitter because legitimate
users there may have few followers while following thousands of other
profiles. Hence R was replaced by R’, which is the R value divided by the
number of followers a profile has. A 10-fold cross validation of the
classifier estimated a false positive ratio of 2.5% and a false negative ratio
of 3% on the training set.
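The modified ratio and the evaluation procedure can be sketched in a few lines. This is an illustration in plain Python, not the Weka setup the researchers used. It assumes the false positive ratio means the share of legitimate profiles flagged as spam and the false negative ratio means the share of spam profiles that were missed. The function names and the way folds are split are choices made here.

```python
def modified_ff_ratio(following: int, followers: int) -> float:
    """R' = R / followers, where R = following / followers."""
    if followers == 0:
        return float("inf")
    return (following / followers) / followers


def error_ratios(actual, predicted):
    """Return (false positive ratio, false negative ratio).

    Both arguments are sequences of booleans where True means "spammer".
    """
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if not a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    legit = sum(1 for a, _ in pairs if not a)
    spam = sum(1 for a, _ in pairs if a)
    return (false_pos / legit if legit else 0.0,
            false_neg / spam if spam else 0.0)


def k_fold_indices(n: int, k: int = 10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# With 1000 training profiles, each of the 10 rounds trains on 900
# profiles and tests on the remaining 100.
for train, test in k_fold_indices(1000, k=10):
    assert len(train) == 900 and len(test) == 100
```

In each round a classifier would be trained on the `train` indices and scored on the `test` indices with `error_ratios`, and the ten results averaged.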
From 6th March 2010 to 6th June 2010 they crawled 135,834
profiles, of which 15,932 were classified as spammers. Of these 15,932, only 75
were reported by Twitter as false positives. The remaining profiles were
deleted from Twitter.
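These crawl figures tie back to the 15,857 spam profiles mentioned in the introduction. The short calculation below takes the crawl size as 135,834 profiles and uses variable names chosen for illustration.

```python
crawled = 135_834        # profiles crawled over the three months
flagged = 15_932         # profiles the classifier labelled as spammers
false_positives = 75     # flagged profiles that Twitter reported as legitimate

confirmed = flagged - false_positives   # spam profiles actually removed
flagged_share = flagged / crawled       # share of crawled profiles flagged
precision = confirmed / flagged         # share of flagged profiles confirmed

print(f"confirmed spam profiles: {confirmed}")
print(f"flagged share of crawl:  {flagged_share:.1%}")
print(f"precision on the crawl:  {precision:.1%}")
```

About 11.7% of the crawled profiles were flagged, and about 99.5% of those flags were confirmed.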
4. Conclusion:
With the growing adoption of social networking sites around
the globe, they have also become a hotspot for malicious users. In this post,
we discussed how spammers can be detected on social networking sites using
machine learning techniques with some carefully chosen features.
5. Reference:
Stringhini, Gianluca,
Christopher Kruegel, and Giovanni Vigna. "Detecting spammers on social
networks." Proceedings of the 26th Annual Computer Security Applications
Conference. ACM, 2010.