Saturday 26 March 2016

Opinion Fraud Detection : A Network-based Approach

“When the tap of a fingertip holds the capability to send shock waves across a nation, it becomes all but indispensable to be wary of those whose fingers could move for reasons not deemed praiseworthy by any standards.” –  a worried Netizen.


Motivation

Opinion fraud is hardly unheard of in today’s world. With services shifting online and reviews carrying ever more weight in purchase decisions, a whole new world of review spammers is emerging every day, trying their best to sway consumer opinion in favor of their products through aggressive review campaigns (hype-spam) and defamation of competitors (defamation-spam).

To make matters worse, businesses rely heavily on such content: positive reviews can bolster sales, while negative ones can cause a significant, and often irreversible, loss of reputation and revenue.

This article is based on work by researchers from Carnegie Mellon University and Stony Brook University on spotting fraudsters and fake reviews in online review datasets, with emphasis on the connectivity structure of the review data.

The authors have proposed an unsupervised, general and network-based framework, FraudEagle, to tackle the opinion fraud detection problem in online review data. The problem is formulated as a network classification task on signed networks.


Network  Description

The review dataset has a set of users, a set of products, and reviews. Each review is written by a particular user for a particular product, and carries a star rating, typically an integer from 1 to 5. This can be visualized as a bipartite relationship in which user nodes are connected to product nodes and each link represents the ‘reviewed’ relationship, weighted by the rating. More specifically, a signed network is considered, in which each network link (i.e. review) is marked as positive if its rating is above a threshold, and as negative otherwise.
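As a concrete illustration, a signed bipartite network of this kind could be built along the following lines (a minimal Python sketch using networkx; the review tuples and the 3-star threshold are made up for illustration and are not from the paper):

import networkx as nx

# Hypothetical review tuples: (user_id, product_id, star_rating).
reviews = [
    ("u1", "p1", 5), ("u1", "p2", 1),
    ("u2", "p1", 4), ("u3", "p2", 2),
]

RATING_THRESHOLD = 3  # assumed cut-off: ratings above 3 stars count as positive

G = nx.Graph()
for user, product, stars in reviews:
    G.add_node(user, kind="user")
    G.add_node(product, kind="product")
    # Each review becomes one signed edge between a user and a product.
    sign = +1 if stars > RATING_THRESHOLD else -1
    G.add_edge(user, product, sign=sign, rating=stars)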

The objects in the network are assumed to belong to certain classes: products are of either good or bad quality, users are either honest or fraudulent, and reviews are either real or fake. Intuitively, a product is good (bad) if it mostly receives positive (negative) reviews from honest users. Similarly, a user is honest (fraudulent) if s/he mostly writes positive (negative) reviews for good products and negative (positive) reviews for bad products. In other words, a user is a fraudster if s/he is trying to promote a set of targeted bad products (hype-spam) and/or damage the reputation of a set of targeted good products (defamation-spam).


A toy review network of 6 users and 4 products with reviews in between.



Problem Definition

Given

·         a bipartite network Gs = (V, E) of users and products connected with signed edges,
·         prior knowledge (probabilities) of network objects belonging to each class, and
·         the compatibility of two connected objects having a given pair of labels;

Classify the network objects Yi ∈ Y into one of their two respective classes: LU = {honest, fraud}, LP = {good, bad}, and LE = {real, fake}, where the assignments yi maximize the objective probability.


Approach

STEP 1: Scoring

Finding the best assignments turns out to be an NP-hard problem. The authors circumvent this by extending Loopy Belief Propagation (LBP) to signed networks. The algorithm works by iterative message passing: the users and products alternately exchange messages until the messages stabilize, and the marginal class probabilities (beliefs) are then computed from the converged messages. The prior beliefs of the users and products are initialized suitably if such information is available, and otherwise each object is assumed equally likely to belong to either class.
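To make the mechanics concrete, here is a minimal, simplified sketch of such message passing over the signed bipartite graph built earlier; it is a rough stand-in for the paper’s signed LBP, not the authors’ implementation. It expects a compatibility table psi (one 2x2 matrix per edge sign), which is discussed next.

import numpy as np

def propagate(G, psi, iters=20):
    # Simplified loopy belief propagation on the signed bipartite graph G.
    # Label index 0 stands for honest/good, index 1 for fraud/bad (assumed encoding).
    prior = {v: np.array([0.5, 0.5]) for v in G}      # no prior knowledge: equally likely
    msgs = {}
    for u, v in G.edges:
        msgs[(u, v)] = np.array([0.5, 0.5])
        msgs[(v, u)] = np.array([0.5, 0.5])

    for _ in range(iters):                            # iterate until messages stabilize
        new_msgs = {}
        for (i, j) in msgs:
            sign = G[i][j]["sign"]
            # Belief at i, excluding what j itself said in the previous round.
            b = prior[i].copy()
            for k in G.neighbors(i):
                if k != j:
                    b = b * msgs[(k, i)]
            # Push that belief through the compatibility matrix for this edge sign.
            m = psi[sign].T @ b
            new_msgs[(i, j)] = m / m.sum()
        msgs = new_msgs

    # Marginal beliefs: prior times all incoming converged messages, normalized.
    beliefs = {}
    for v in G:
        b = prior[v].copy()
        for k in G.neighbors(v):
            b = b * msgs[(k, v)]
        beliefs[v] = b / b.sum()
    return beliefs, msgs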

The concept of compatibility matrices has been introduced to account for the fact that behavior is not clear-cut: a genuine person could sometimes behave like a spammer, and vice versa.
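For instance, a compatibility table along these lines could be fed to the sketch above; the value of eps is an illustrative assumption, not the setting used in the paper.

import numpy as np

eps = 0.1   # assumed small slack allowing out-of-character behaviour

# psi[sign][user_label][product_label]; label 0 = honest/good, 1 = fraud/bad.
psi = {
    # On a positive review, an honest user most likely rated a good product,
    # while a fraud user most likely hyped a bad one.
    +1: np.array([[1 - eps,     eps],
                  [    eps, 1 - eps]]),
    # On a negative review the roles flip: honest users pan bad products,
    # fraud users defame good ones.
    -1: np.array([[    eps, 1 - eps],
                  [1 - eps,     eps]]),
}

# beliefs, msgs = propagate(G, psi)   # marginal class probabilities per node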

In order to judge the network edges (reviews) as fake or real, the converged messages from products to users are simply taken as the final beliefs on the reviews.
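In the sketch above, that would amount to reading each review’s belief off the converged product-to-user message on its edge (again a simplification assumed here, not the authors’ exact procedure):

def review_beliefs(G, msgs):
    # Belief that each review (edge) is fake, taken from the converged
    # message sent by the product endpoint to the user endpoint.
    fake_scores = {}
    for u, v in G.edges:
        user, product = (u, v) if G.nodes[u]["kind"] == "user" else (v, u)
        fake_scores[(user, product)] = msgs[(product, user)][1]   # index 1 = fraud/fake
    return fake_scores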

STEP 2: Grouping

The marginal class probabilities over users, products and reviews are used to order each of these sets into a ranked list. While a ranked list of, say, users by their likelihood of being fraudsters is a valuable resource, it does not put the top-ranked users in context with the products they rated.

In order to make further sense of the results, the subgraph induced by the selected users, together with the union of the products they rated, is obtained. The subgraph is then partitioned into clusters to gain more insight into how these users are organized in the network. Cross-Association (CA) clustering has been employed on the adjacency matrix of the induced subgraph, an approach which tends to produce near-bipartite cores with little computational effort.
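A rough sketch of this grouping step is given below, assuming the G and beliefs from the earlier sketches; scikit-learn’s SpectralCoclustering is used merely as a convenient stand-in for the Cross-Association clustering of the paper, and the choices of 50 top users and 4 clusters are arbitrary.

import numpy as np
from sklearn.cluster import SpectralCoclustering

def group_top_fraudsters(G, beliefs, top_k=50, n_clusters=4):
    # Rank users by their marginal probability of being a fraudster (index 1).
    users = [v for v in G if G.nodes[v]["kind"] == "user"]
    top_users = sorted(users, key=lambda u: beliefs[u][1], reverse=True)[:top_k]

    # Induced subgraph: the top users plus every product they rated.
    products = sorted({p for u in top_users for p in G.neighbors(u)})
    A = np.zeros((len(top_users), len(products)))
    for i, u in enumerate(top_users):
        for j, p in enumerate(products):
            if G.has_edge(u, p):
                A[i, j] = 1

    # Co-cluster the adjacency matrix to expose near-bipartite cores.
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
    model.fit(A)
    return top_users, products, model.row_labels_, model.column_labels_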

Dark red (green): fraud (honest) users; similarly for reviews, after running the algorithm.



Conclusion

The results of this analysis, which leverages the network structure, have been compared to baselines that do not use it, and are reported to show an improvement. The introduction of network features thus appears to have a significant overall impact when it comes to feature engineering for machine learning problems.


References:
·         Opinion Fraud Detection in Online Reviews by Network Effects, by Leman Akoglu, Rishi Chandy and Christos Faloutsos (ICWSM 2013)

