Monday 7 March 2016

Topological vs topical community detection

Introduction

1. Communities detected in citation, co-citation and co-authorship networks can be based on two kinds of connections: social connections and similarity connections. Social connections are real relationships in the network, such as friendship and co-authorship. Similarity connections are derived relationships that do not physically exist, such as the number of times two authors were co-cited.

2. Among the different approaches to community detection, two stand out: one considers the graph structure of the network (topology-based community detection), and the other considers the textual information attached to the network nodes (topic-based community detection).

3. In topology-based community detection, communities are generally detected by graph partitioning, which tries to minimise the number of edges between communities. Nodes inside a community should therefore have more connections to each other than to nodes in other communities. The Girvan-Newman approach achieves this by repeatedly removing the edge with the highest betweenness centrality.
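The edge-removal idea can be illustrated with a minimal sketch in plain Python. It assumes an unweighted, undirected graph given as a list of edges, and it stops at the first split rather than building a full hierarchy. The function names and the toy graph (two triangles joined by one bridge) are made up for illustration and do not come from [14].

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every undirected path was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, part = [s], set()
        seen.add(s)
        while stack:
            v = stack.pop()
            part.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        parts.append(part)
    return parts

def girvan_newman_split(edges):
    """Remove highest-betweenness edges until the graph falls into more parts."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:          # no edges left to remove
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy graph: two triangles connected by the bridge c-d.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
             ("d", "e"), ("d", "f"), ("e", "f")]
print(sorted(sorted(p) for p in girvan_newman_split(toy_edges)))
# [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The bridge carries every shortest path between the two triangles, so it has the highest betweenness and is removed first.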

4. In topic-based community detection, communities are generally detected from the topics of the papers published by authors. Hierarchical clustering, which groups items using a distance or similarity metric, is a common topic-based approach.
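As a rough illustration of hierarchical clustering on topics, the sketch below merges authors bottom-up using average linkage and cosine distance. The author labels and their topic proportions are hypothetical, and the choice of average linkage and cosine distance is an assumption made for this example, not something stated in [14]. Vectors are assumed to be non-zero.

```python
import math

def cosine_distance(p, q):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns a list of sets of keys."""
    clusters = [{name} for name in vectors]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Hypothetical topic proportions per author: (retrieval, database, multimedia).
author_topics = {
    "A": [0.8, 0.1, 0.1],
    "B": [0.7, 0.2, 0.1],
    "C": [0.1, 0.8, 0.1],
    "D": [0.1, 0.7, 0.2],
}
print(sorted(sorted(c) for c in agglomerative(author_topics, 2)))
# [['A', 'B'], ['C', 'D']]
```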

5. Communities and topics are interwoven and co-evolving. A topology-based community might contain diverse topic-based sub-communities and vice versa. Two hypotheses have therefore been proposed in [14]:
(a) Hypothesis 1: Communities detected by the topology-based community detection approaches tend to contain topically-diverse sub-communities within each community.
(b) Hypothesis 2: Communities detected by the topic-based community detection approaches tend to contain topologically-diverse sub-communities within each community.


Example

6. Consider the figure below for co-authorship networks:




7. In the figure above, Community A, detected from graph topology, covers two topics (Topic A and Topic C), while the community of Topic D contains four different topology-based sub-communities (Communities c, d, e and f).
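The two-way overlap described in the figure can be checked mechanically by cross-tabulating the two labelings. The sketch below assumes each author carries one topology-based community label and one topic label. The author identifiers are hypothetical; only the community and topic names mirror the figure.

```python
from collections import defaultdict

def cross_tabulate(assignments):
    """Map each community to its topics, and each topic to its communities."""
    topics_in_community = defaultdict(set)
    communities_in_topic = defaultdict(set)
    for author, (community, topic) in assignments.items():
        topics_in_community[community].add(topic)
        communities_in_topic[topic].add(community)
    return topics_in_community, communities_in_topic

# Hypothetical authors: (topology-based community, topic).
assignments = {
    "author1": ("A", "Topic A"),
    "author2": ("A", "Topic C"),
    "author3": ("c", "Topic D"),
    "author4": ("d", "Topic D"),
    "author5": ("e", "Topic D"),
    "author6": ("f", "Topic D"),
}
by_community, by_topic = cross_tabulate(assignments)
print(sorted(by_community["A"]))    # ['Topic A', 'Topic C']
print(sorted(by_topic["Topic D"]))  # ['c', 'd', 'e', 'f']
```

A community with more than one topic supports Hypothesis 1, and a topic spread over more than one community supports Hypothesis 2.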

Dataset

8. The dataset used for the experiments by Ying Ding in [14] comes from the field of Information Retrieval (IR). Papers and citations were collected from Web of Science (WOS) for 1993-2008, a span of 15 years. The period was divided into two phases: Phase 1 (1993-2000) and Phase 2 (2001-2008). The details are given below:



9. Topology-based Community Detection Approach. The Clauset-Newman-Moore approach was applied to the co-authorship networks of Phase 1 and Phase 2 of the dataset. The table below shows the five largest communities for the two phases:



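The Clauset-Newman-Moore approach is a greedy agglomerative method that keeps merging the pair of communities giving the largest increase in modularity. The sketch below shows that greedy idea in a deliberately naive form: it recomputes every candidate merge on each pass instead of using the efficient data structures of the original algorithm. The function name and the toy graph are made up for illustration, and an unweighted graph without duplicate edges is assumed.

```python
def greedy_modularity(edges):
    """Greedily merge communities while a merge still increases modularity."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    communities = [{n} for n in degree]   # start with one community per node
    while True:
        best_gain, best_pair = 0.0, None
        for i in range(len(communities)):
            for j in range(i + 1, len(communities)):
                ci, cj = communities[i], communities[j]
                # Number of edges running between the two communities.
                e_ij = sum(1 for u, v in edges
                           if (u in ci and v in cj) or (u in cj and v in ci))
                if e_ij == 0:
                    continue
                d_i = sum(degree[n] for n in ci)
                d_j = sum(degree[n] for n in cj)
                # Change in modularity if ci and cj were merged.
                gain = e_ij / m - d_i * d_j / (2.0 * m * m)
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:             # no merge improves modularity
            break
        i, j = best_pair
        communities[i] |= communities[j]
        del communities[j]
    return communities

# Toy graph: two triangles connected by the bridge c-d.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
             ("d", "e"), ("d", "f"), ("e", "f")]
print(sorted(sorted(c) for c in greedy_modularity(toy_edges)))
# [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Merging the two triangles would lower modularity, so the procedure stops with two communities.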
10. Hypothesis 1. For each community, five topics were extracted using LDA (Latent Dirichlet Allocation). Diverse topics can be found within a single community. For Phase 1 the topics were: largest community (databases vs. image retrieval), second community (databases vs. query), third community (user feedback vs. information retrieval), fourth community (temporal vs. database vs. query language), and fifth community (query vs. multimedia vs. mining vs. Web). Diverse topics emerged for Phase 2 in the same way. From the Phase 1 and Phase 2 data, it was found that communities detected by the topology-based approach tend to contain different topics within each community. For Phase 1, the largest community had the following details:



The topic correlation, based on the Pearson correlation coefficient, for the above community is given below:




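For reference, the Pearson correlation coefficient between two topics can be computed as in the minimal sketch below. It assumes each topic is represented as a vector of word weights over a shared vocabulary, which is an assumption of this example. The numbers are made up and are not the values reported in [14].

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical word weights of three topics over a four-word vocabulary.
topic_1 = [0.40, 0.30, 0.20, 0.10]
topic_2 = [0.35, 0.35, 0.20, 0.10]
topic_3 = [0.10, 0.20, 0.30, 0.40]

print(round(pearson(topic_1, topic_2), 3))  # close to +1: similar topics
print(round(pearson(topic_1, topic_3), 3))  # close to -1: dissimilar topics
```

A low or negative coefficient between topics found in the same community is what indicates topical diversity.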
11. Topic-based Community Detection Approach. The Author-Topic Model was applied to the two phases, and five topics were extracted for each phase. The list of authors belonging to one topic is called a community here. The table below shows the number of authors in each of the five communities corresponding to the five extracted topics:



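Turning author-topic output into communities can be sketched as follows. The sketch assumes that each author is assigned to the single topic with the highest probability; [14] may define membership differently, so treat this rule as an assumption. The author labels and probabilities are hypothetical.

```python
from collections import defaultdict

def topic_communities(author_topic_probs):
    """Group authors by their most probable topic (topics are indexed from 0)."""
    communities = defaultdict(list)
    for author, probs in author_topic_probs.items():
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        communities[best_topic].append(author)
    return dict(communities)

# Hypothetical per-author topic distributions over three topics.
author_topic_probs = {
    "author1": [0.7, 0.2, 0.1],
    "author2": [0.6, 0.3, 0.1],
    "author3": [0.1, 0.1, 0.8],
}
communities = topic_communities(author_topic_probs)
print({topic: len(authors) for topic, authors in communities.items()})
# {0: 2, 2: 1}
```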
12. Hypothesis 2. The Author-Topic Model was applied to detect five topic-based communities for each phase. For each topic-based community, the Clauset-Newman-Moore approach was used to detect sub-communities. For Phase 2, the communities detected were: Community 1 (multimedia retrieval), Community 2 (database), Community 3 (medical retrieval), Community 4 (information retrieval), and Community 5 (mixture of different topics). The sub-communities of Community 1 serve as an example:
Community 1.
-  The largest sub-community shows the collaboration network of Smith JR, Jain AK and Ma WY, all of whom focus their research on image retrieval. Smith JR did not collaborate with Jain AK or Ma WY, but Jain AK and Ma WY wrote one paper together on relevance feedback for natural image retrieval.
-  The second sub-community features the collaboration network of Kittler J.
-  The third sub-community shows the network of Smeulders AW.
-  The fourth sub-community identifies the collaboration network of Rui Y.
All of them are well known for their multimedia retrieval research.
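The step of looking inside one topic-based community can be sketched as below. The sketch restricts the co-authorship edges to the members of the community and then groups authors who are linked by collaboration. It uses connected components as a simple stand-in for the Clauset-Newman-Moore step described above, so it is not the method of [14]. The author labels and edges are hypothetical.

```python
def sub_communities(members, edges):
    """Connected groups of co-authors inside one topic-based community."""
    members = set(members)
    adj = {m: set() for m in members}
    for u, v in edges:
        # Keep only co-authorship edges with both ends inside the community.
        if u in members and v in members:
            adj[u].add(v)
            adj[v].add(u)
    seen, groups = set(), []
    for start in sorted(members):
        if start in seen:
            continue
        group, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return groups

# Hypothetical topic community and co-authorship edges; "x" is outside it.
members = {"p", "q", "r", "s"}
edges = [("p", "q"), ("r", "s"), ("q", "x")]
print(sorted(sorted(g) for g in sub_communities(members, edges)))
# [['p', 'q'], ['r', 's']]
```

Authors who share a topic but never collaborate, directly or through others, end up in separate groups, which is the pattern Hypothesis 2 describes.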







Conclusion
13.     Among the many community detection approaches, two kinds are topology-based and topic-based. Topology-based approaches are commonly used. However, discovering a community purely from graph topology can be problematic, because it is hard to explain the semantic reason why such communities form. A topology-based and a topic-based community detection approach were applied to the co-authorship networks of the information retrieval field, and the results were consistent with the proposed hypotheses.

Reference and Further Reading
14.      Ding, Ying. "Community detection: Topological vs. topical." Journal of Informetrics 5.4 (2011): 498-514.

