Filter Bubble: Fraud Detection in Online Reviews by Network Effects

Veracity means the quality of being truthful or honest. Now, as the internet grows in volume as well as importance, the importance of reviews and comments is gradually increasing, specially in field of eCommerce. Deception in reviews influences the customers as it gives fake information about the product on which they are dependent. As well as in this competition era, competitors write fake reviews to defame the product or services of their competitor. That's why it is important to detect the Veracity in the reviews.
But data-set is growing linearly with time, hence form a complex network.

Fraudeagle, is the network based approach to tackle the opinion fraud detection.

Problem Description:
Consider the problem of spotting fraudulent reviewers, and consequently spotting fake reviews in online review datasets.

For eCommerce, Online review datasets consists of :

(1) Reviewer - {'Fraud','Honest'}

(2) Product - {'Good','Bad'}

(3) Reviews - {'Positive','Negative'}

Reviewer is Honest if s/he writes mostly Positive review for Good product and Negative review for Bad product. and vice-versa for Fraud reviewer.
Figure 1 showing the Signed Bipartite Graph, where nodes are product and user(reviewer) and edges are signed as positive (+ / thumbs up) if rating is above threshold, otherwise negative (- / thumbs down).

Users 1-4 are honest while 5-6 are fraud, and Products 1-2 are good quality while 3-4 are bad.
The network structure and the sentiment (sign) of edges are observed. The fraudsters write fake reviews for the products except the + review from user 6 to product 2.

The goal is to develop a method to automatically label the users, products, and reviews in the large complex network.

Problem Formulation:
signed Inference Algorithm (sIA): Given a signed network G (V,E).

User set U={u₁, u₂, ..., u_n} and Product set P={p₁, p₂, ..., p_m} are linked with edge e(u_i, p_j, s) ∈ E, s ∈ {+,-}.
U∪P=V.
Each node in V and each edge in E is a random variable that takes a value from an appropriate label domain.
L_U = {honest, fraud}, L_P = {good, bad}, and L_E = {real, fake}.
In this classification task,

let Y_V = Y_U ∪ Y_P and Y_E respectively denote the nodes and edges the values of which need to be assigned, and let y_i refer to Y_i’s label.

Let Ψ denote a set of clique potentials that consists of two types of functions:

For each Y_i ∈ Y_U and Y_j ∈ Y_P , ψ_i ∈ ψ is a prior mapping ψ^U_i: L_U → R ≥0, and ψ^P_j: L_P → R ≥0, respectively
For each e(Y^U_i, Y^P_j, s) ∈ E, ψ^s_ij ∈ Ψ is a compatibility mapping ψ^s_ij : L_U × L_P → R ≥0

Given an assignment y to all the unobserved variables Y^Vand x to observed ones X^V (variables with known values), our objective is to maximize the probability distribution

Given,

- a bipartite network G = (V, E) of users and products connected with signed edges,

- prior knowledge (probabilities) of network objects belonging to each class, and

- compatibility of two objects with a given pair of labels being connected.

Now, Score the graph by Fraudeagle Scoring in Step 1 and then obtain induced subgraph and apply clustering algorithm to partition as apply in step 2.

(I) Scoring for fraud detection (Convergence algorithm) :

Proceeds by making each set of Y_i ∈ Y_U and Y_j ∈ Y_P alternately communicate messages with its neighbors in an iterative fashion until the messages stabilize, i.e. convergence. After they stabilize, calculate the marginal probabilities; say of assigning Y_i with label y_i by computing the final belief b_i(y_i).

where m_i→j is a message sent by user Y_i to product Y_j (a similar equation can be written for messages from products to users), and b_i(y_i) denotes the belief of user i having label y_i (again, a similar equation can be written for beliefs of products). α’s are the normalization constants, which respectively ensure that each message and each set of marginal probabilities sum to 1.

The prior beliefs ψ^U_iand ψ^P_j, respectively of users and products can be suitably initialized if there is any prior knowledge of the objects.

(II) Grouping for analysis and sensemaking (Clustering Algorithm):

Project the top users back on the review graph, and obtain the induced subgraph including these users along with the union of products that they rated. Then partition this subgraph into clusters to gain more insight about how they are organized in the network.

For partitioning, can use any graph clustering algorithm: Cross-associations (CA) clustering algorithm on the adjacency matrix of the induced graph performs clustering by finding a permutation of the rows (users) and columns (products) of the matrix such that the resulting matrix contains homogeneous blocks (defined by the clustering), where dense blocks correspond to near-bipartite cores (e.g., a team of users attacking on a target set of products).

Lemma 1 Proposed Fraudeagle is scalable to large data (complex graph), with computational complexity linear in network size.

STEP 1 Time complexity O(|E|d²t),

where |E| is number of edges in the network,d is maximum domain size of variable(i.e. number of classes), t is the number of iterations until convergence.

STEP 2 Cluster the induced graph on top users and their products using the CA algorithm, which is also linear in number of edges. The induced graph is also much smaller than the input graph.

References: Opinion Fraud Detection by Network Effects, review spam detection, Review Graph based Spammer Detection

Filter Bubble

Friday, 4 March 2016

Fraud Detection in Online Reviews by Network Effects

No comments:

Post a Comment