Introduction:
The use of content-based spam filtering unleashed an arms race between spammers and filter developers. Spammers continuously change spam message
content in ways that might circumvent current filters. Here, a method is
described that takes into account the content of the web pages linked by e-mail messages for spam detection. First, a methodology
is described for extracting the web pages linked in e-mail messages, and then a machine learning technique is described for extracting classification rules
from those web pages.
Methodology:
For each e-mail that contains URLs of web pages, the pages are first downloaded and then
content analysis techniques are applied to their content. It is
basically a filtering technique: a spamicity score is assigned to the web pages
linked in e-mail messages.
The steps are described below:
1. First, all the URLs included in the e-mail messages
are extracted and the corresponding web pages are downloaded. Lynx, a text-based
browser, is used to extract all the text from the downloaded web pages and
to format it the way a user would perceive it. Lynx
effectively generates a text dump of each web page. The textual terms extracted by
Lynx are used for the spamicity calculation.
2. From the given text data, an associative classifier is
built. The classifier is trained beforehand on training data. Its
algorithm generates rules of the form X -> c, where X is a set of words and c is a class. Each
of these rules has a support and a confidence. The final result of each
page's classification is a score between 0 and 1 that indicates both the
predicted class (spam or ham) and the certainty of the prediction.
3. Finally, the messages are scored based on the
spamicity scores of the pages they link to. The page score is defined by [formula missing].
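The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the URL regular expression, the `Rule` class, the confidence-weighted vote in `page_spamicity`, the 0.5 fallback when no rule matches, and the use of the maximum page score in `message_score` are all assumptions made for the example, since the scoring formula is not given here. The only external tool assumed is the `lynx` binary, called with its `-dump` and `-nolist` options.

```python
import re
import subprocess
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical URL pattern; real e-mail parsing would also handle HTML hrefs.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(message_body: str) -> List[str]:
    """Step 1a: collect the distinct URLs of a message, in order of appearance."""
    seen: List[str] = []
    for url in URL_PATTERN.findall(message_body):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def dump_page_text(url: str, timeout: int = 30) -> str:
    """Step 1b: render a page to plain text with `lynx -dump` (Lynx must be installed)."""
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", url],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


@dataclass(frozen=True)
class Rule:
    """An association rule X -> c with its support and confidence."""
    words: FrozenSet[str]  # X: set of words that must all occur in the page
    label: str             # c: "spam" or "ham"
    support: float
    confidence: float


def page_spamicity(text: str, rules: List[Rule]) -> float:
    """Step 2 (ASSUMED scoring): confidence-weighted share of matching rules that predict spam."""
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    matching = [r for r in rules if r.words <= terms]
    total = sum(r.confidence for r in matching)
    if total == 0:
        return 0.5  # no rule applies: treated as undecided (assumption)
    spam = sum(r.confidence for r in matching if r.label == "spam")
    return spam / total


def message_score(page_scores: List[float]) -> float:
    """Step 3 (ASSUMED aggregation): a message is as spammy as its spammiest page."""
    return max(page_scores, default=0.0)
```

With a trained rule list in hand, a message would be scored as `message_score([page_spamicity(dump_page_text(u), rules) for u in extract_urls(body)])`. A score near 1 then suggests spam and a score near 0 suggests ham, which matches the 0 to 1 range described in step 2.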
Methodology Evaluation:
To evaluate the applicability of building anti-spam
filters using the content of web pages, a dataset was
built from all the spam messages collected between July 2010 and December 2010
from the Spam Archive dataset. The Spam Archive is updated daily, and each day the web
pages linked by the new messages were collected. In total, 157,114 web pages were obtained. Two
files were stored: one containing the HTML content of the web pages and another
containing the HTTP session information. The evaluation considers only unique
web pages, so when a group of messages points to the same web page, the repeats are not
considered in the evaluation.
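The restriction to unique pages can be illustrated with a small sketch. How uniqueness was decided is not stated here, so this example assumes that two links refer to the same page when their URLs are identical after trimming whitespace. The function name `unique_pages` and the `(message_id, url)` input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def unique_pages(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each distinct page URL to the messages that point to it.

    `links` yields (message_id, url) pairs. Pages are keyed by URL
    (an assumption), so a page shared by many messages appears once.
    """
    pages: Dict[str, List[str]] = defaultdict(list)
    for message_id, url in links:
        key = url.strip()
        if message_id not in pages[key]:
            pages[key].append(message_id)
    return dict(pages)
```

Each key of the returned dictionary would be evaluated once, while the list of message identifiers keeps track of which messages pointed to that page.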
Fig-2 shows 32,929 unique spam messages linking to 157,114 spam web
pages. It is interesting to notice that the average number of pages per
message was not very different between hams and spams.
Fig-3 shows the distribution of the number of pages
downloaded for each message in the spam dataset. The technique was evaluated
using all the resulting unique pages and the sampled e-mail messages that
pointed to them.
References
Ribeiro, Marco Túlio, et al. "Spam detection using web page content: a new battleground." Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference. ACM, 2011.