Friday, 4 March 2016

Fraud Detection in Online Reviews by Network Effects


Veracity means the quality of being truthful or honest. Now, as the internet grows in volume as well as importance, the importance of reviews and comments is gradually increasing, specially in field of eCommerce. Deception in reviews influences the customers as it gives fake information about the product on which they are dependent. As well as in this competition era, competitors write fake reviews to defame the product or services of their competitor. That's why it is important to detect the Veracity in the reviews.
But data-set is growing linearly with time, hence form a complex network.

Fraudeagle,  is the network based approach to tackle the opinion fraud detection. 

Problem Description: 
Consider the problem of spotting fraudulent reviewers, and consequently spotting fake reviews in online review datasets.

For eCommerce, Online review datasets consists of :
(1) Reviewer -  {'Fraud','Honest'}
(2) Product    -  {'Good','Bad'}
(3) Reviews   -  {'Positive','Negative'}

Reviewer is Honest if s/he writes mostly Positive review for Good product and Negative review for Bad product. and vice-versa for Fraud reviewer.
Figure 1 showing the Signed Bipartite Graph, where nodes are product and user(reviewer) and edges are signed as positive (+ / thumbs up) if rating is above threshold, otherwise negative (- / thumbs down).


Users 1-4 are honest while 5-6 are fraud, and Products 1-2 are good quality while 3-4 are bad.
The network structure and the sentiment (sign) of edges are observed. The fraudsters write fake reviews for the products except the + review from user 6 to product 2.

The goal is to develop a method to automatically label the users, products, and reviews in the large complex network.

Problem Formulation:
signed Inference Algorithm (sIA): Given a signed network G (V,E).

  • User set U={u1, u2, ..., un} and Product set P={p1, p2, ..., pm} are linked with edge e(ui, pj, s) ∈ E, s ∈ {+,-}.
  • U∪P=V.
  • Each node in V and each edge in E is a random variable that takes a value from an appropriate label domain.
  • LU = {honest, fraud}, LP = {good, bad}, and LE = {real, fake}. 
  • In this classification task,
    • let YV = YU ∪ YP and YE respectively denote the nodes and edges the values of which need to be assigned, and let yi refer to Yi’s label.
  • Let Ψ denote a set of clique potentials that consists of two types of functions:
    •  For each Yi ∈ YU and Yj ∈ YP , ψi ∈ ψ is a prior mapping ψU: LU → R ≥0, and ψP: LP → R ≥0, respectively
    • For each e(YUi, YPj, s) ∈ E, ψsij ∈ Ψ is a compatibility mapping ψsij : LU × LP → R ≥0

Given an assignment y to all the unobserved variables Yand x to observed ones XV (variables with known values), our objective is to maximize the probability distribution


Given,
- a bipartite network G = (V, E) of users and products connected with signed edges,
- prior knowledge (probabilities) of network objects belonging to each class, and
- compatibility of two objects with a given pair of labels being connected.



Now, Score the graph by Fraudeagle Scoring in Step 1 and then obtain induced subgraph and apply clustering algorithm to partition as apply in step 2.

(I) Scoring for fraud detection (Convergence algorithm) :
Proceeds by making each set of Yi ∈ YU and Yj ∈ YP alternately communicate messages with its neighbors in an iterative fashion until the messages stabilize, i.e. convergence. After they stabilize, calculate the marginal probabilities; say of assigning Yi with label yi by computing the final belief bi(yi).
where mi→j is a message sent by user Yi to product Yj (a similar equation can be written for messages from products to users), and bi(yi) denotes the belief of user i having label yi (again, a similar equation can be written for beliefs of products). α’s are the normalization constants, which respectively ensure that each message and each set of marginal probabilities sum to 1.
The prior beliefs ψUiand ψPj, respectively of users and products can be suitably initialized if there is any prior knowledge of the objects.

(II) Grouping for analysis and sensemaking (Clustering Algorithm):
  • Project the top users back on the review graph, and obtain the induced subgraph including these users along with the union of products that they rated. Then partition this subgraph into clusters to gain more insight about how they are organized in the network.

  • For partitioning, can use any graph clustering algorithm: Cross-associations (CA) clustering algorithm on the adjacency matrix of the induced graph performs clustering by finding a permutation of the rows (users) and columns (products) of the matrix such that the resulting matrix contains homogeneous blocks (defined by the clustering), where dense blocks correspond to near-bipartite cores (e.g., a team of users attacking on a target set of products).


Lemma 1 Proposed Fraudeagle is scalable to large data (complex graph), with computational complexity linear in network size. 
STEP 1 Time complexity O(|E|d2t), 
where |E| is number of edges in the network,d is maximum domain size of variable(i.e. number of classes), t is the number of iterations until convergence.

STEP 2 Cluster the induced graph on top users and their products using the CA algorithm, which is also linear in number of edges. The induced graph is also much smaller than the input graph.


No comments:

Post a Comment