Monday, 7 March 2016

Spam Detection Method For e-mail


Introduction:
       The use of content-based Spam filtering has unleashed an arms race between spammers and filter developers. Spammers continuously change the content of Spam messages in ways that might circumvent current filters. This post describes a method that takes into account the content of the web pages linked by an e-mail for Spam detection. First, a methodology is described for extracting the web pages linked in an e-mail; then a machine learning technique is described that extracts classification rules from those web pages.
  

Methodology:
     For each e-mail that contains URLs of web pages, those pages are first downloaded and content analysis techniques are then applied to their content. It is basically a filtering technique: a spamicity score is assigned to the web pages that are linked in e-mail messages.
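
The extraction and download step can be sketched roughly as follows. This is only an illustration, not the code used in the referenced paper: the function names `extract_urls` and `page_text` and the URL regular expression are assumptions made for this example. The sketch uses Python's standard `email` module and calls Lynx with its `-dump` option, so Lynx must be installed for `page_text` to work.

```python
import re
import subprocess
from email import message_from_string
from typing import List

# Simple, deliberately loose URL pattern (an assumption for this sketch).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(raw_email: str) -> List[str]:
    """Return the distinct http(s) URLs found in the text parts of an e-mail."""
    msg = message_from_string(raw_email)
    urls: List[str] = []
    for part in msg.walk():
        # Only look at text/* parts (plain text or HTML bodies).
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;:!?)")  # drop trailing punctuation
            if url not in urls:
                urls.append(url)
    return urls


def page_text(url: str, timeout: int = 30) -> str:
    """Fetch a page with Lynx and return its text dump (requires Lynx)."""
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", url],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout
```

A call such as `extract_urls(raw_message)` gives the list of URLs to download, and each of them can then be passed to `page_text` to obtain the text that is scored later.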

Figure-1 shows the work-flow of this filtering technique.


All the steps are described below:

1. First, all the URLs included in the e-mail messages are extracted and the corresponding web pages are downloaded. ‘Lynx’, a text-based browser, is used to extract all the text from the downloaded web pages and to format that text the way a user would perceive it. Lynx generates a dump of each web page. The textual terms extracted by Lynx are used for the spamicity calculation.
2. The extracted text is given to an associative classifier, which has already been trained on training data. The classifier algorithm generates rules of the form X -> c, where X is a set of words and c is a class. Each of those rules has a support and a confidence. The final result of each page’s classification is a score between 0 and 1 that indicates both the predicted class (Spam or ham) and the certainty of the prediction.
3. The messages are then scored based on their various spamicity scores. The page score is defined by [formula missing].

  The choice of the weight given to the page score will influence how much impact the web page classification has on the final score. The page score is combined with the other spamicity scores obtained from the message. There is a threshold value for the spamicity score.

If the combined score > Threshold, it’s Spam.
If the combined score < Threshold, it’s ham.
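
As a rough illustration of steps 2 and 3, the sketch below scores a page with rules of the form X -> c and then applies a threshold. Everything specific here is an assumption made for the example: the `Rule` class, the confidence-weighted vote in `page_spam_score`, and the linear blend in `combined_score` are stand-ins, since the exact formulas of the referenced paper are not given in this post.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Rule:
    """A rule X -> c with its support and confidence (hypothetical container)."""
    words: FrozenSet[str]   # X: a set of words
    label: str              # c: "spam" or "ham"
    support: float
    confidence: float


def page_spam_score(page_words: Iterable[str], rules: List[Rule]) -> float:
    """Score in [0, 1]; higher means more Spam-like.

    Assumption: every rule whose word set is contained in the page votes
    for its class with a weight equal to its confidence.
    """
    words = set(page_words)
    spam_weight = 0.0
    total_weight = 0.0
    for rule in rules:
        if rule.words <= words:          # all words of X appear in the page
            total_weight += rule.confidence
            if rule.label == "spam":
                spam_weight += rule.confidence
    if total_weight == 0.0:
        return 0.5                       # no rule matched: undecided
    return spam_weight / total_weight


def combined_score(message_score: float, page_score: float,
                   page_weight: float) -> float:
    """Assumed linear blend; page_weight controls the impact of the page."""
    return (1.0 - page_weight) * message_score + page_weight * page_score


def classify(score: float, threshold: float) -> str:
    """Spam if the score exceeds the threshold, ham otherwise.

    A score exactly equal to the threshold is treated as ham here,
    which is a choice made for this sketch.
    """
    return "spam" if score > threshold else "ham"
```

With this sketch, a larger `page_weight` lets the web page classification dominate the final decision, while a value of 0 ignores the page entirely.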

Methodology Evaluation:

     To evaluate the applicability of building anti-Spam filters using the content of web pages, a dataset was built from all the Spam messages collected between July 2010 and December 2010 from the Spam Archive dataset. The Spam Archive is updated daily, and each day the web pages linked by the messages were collected. In total, 157,114 web pages were obtained. Two files were stored: one containing the HTML content of the web page and another containing the HTTP session information. The technique is evaluated only on unique web pages, so when a group of messages points to the same web page, the duplicates are not considered in the evaluation.
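
One simple way to keep only unique pages is sketched below. The post does not say how uniqueness was decided, so comparing pages by a SHA-256 hash of their HTML content is purely an assumption for this example, as is the function name `unique_pages`.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def unique_pages(pages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each distinct content hash to the first URL seen with that content.

    `pages` is an iterable of (url, html) pairs. Pages whose HTML is
    byte-for-byte identical are treated as the same page (an assumption).
    """
    first_url_by_digest: Dict[str, str] = {}
    for url, html in pages:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        # setdefault keeps the first URL and ignores later duplicates.
        first_url_by_digest.setdefault(digest, url)
    return first_url_by_digest
```

The number of keys in the returned dictionary is then the number of unique pages that would enter the evaluation.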



Fig-2 shows 32,929 unique Spam messages linking to 157,114 Spam web pages. It is interesting to notice that the average number of pages per message was not very different between hams and Spams.






Fig-3 shows the distribution of the number of pages downloaded for each message in the Spam dataset. The technique was evaluated using all the resulting unique pages and the sampled e-mail messages that pointed to them.


References
     Ribeiro, Marco Túlio, et al. "Spam detection using web page content: a new battleground." Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference. ACM, 2011.
