XJIPC OpenIR  > 多语种信息技术研究室
基于非负矩阵分解的重叠社区发现研究及其应用
赵清华
Subtype硕士
Thesis Advisor蒋同海
2018-05-25
Degree Grantor中国科学院大学
Place of Conferral北京
Degree Discipline计算机技术
Keyword重叠社区发现 非负矩阵分解 复杂网络 社交网络
Abstract

目前,真实生产生活中存在大量的图数据,该种数据包含实体和实体间的关系,且实体间关系相对复杂。以社交网络数据为例,该数据包含用户数据和用户间的关注、评论、转发等交互数据,且该数据可以表示成以用户为顶点,以用户间的交互关系为边的图数据。这种图数据中的顶点、边和结构均包含了大量的信息。目前已经有多种用于揭示图数据中信息的技术,社区发现是该技术中的一种。社区发现旨在检测图数据中具有紧密联系的、由一群顶点组成的团体——社区。在该领域中,当社区间是不可重叠时,称为非重叠社区发现,亦简称社区发现;当社区间可以重叠时,称为重叠社区发现。本文主要针对重叠社区发现的研究与应用,即所检测的圈子是可以重叠的。社区发现的研究具有重要的理论意义和现实应用价值。在理论层面,社区发现技术可以帮助挖掘图数据的结构、图数据中顶点的交互关系和聚集情况;在应用层面,社区发现技术可以用于商品推荐和朋友推荐,购物平台刷单团伙检测,以及黄赌毒网站、垃圾网站等异常站点的检测等。本文以重叠社区发现为核心,围绕非负矩阵分解算法内容展开。 首先,为进一步提高基于非负矩阵分解的重叠社区发现算法的稳定性,以及从不同的角度探究基于非负矩阵分解的重叠社区发现算法的效果,本文提出了一个从概率角度描述网络的基于非负矩阵分解的重叠社区发现算法。该方法将网络的顶点和边看作观测变量,将社区看作隐藏变量,假设整个网络由观测变量和隐藏变量共同组成。为揭示该隐藏变量,使用最大期望算法进行迭代求解。并将该迭代过程生成的矩阵看作非负矩阵分解的一种,以欧式距离的平方和作为损失函数控制迭代次数。该算法不但可以计算顶点到社区的隶属度,而且具有相当好的社区划分效果。其次,在研究非负矩阵分解的重叠社区发现过程中,我们发现如果利用某种定义将文本数据转化成图数据,那么可以将重叠社区发现算法迁移至文本挖掘领域话题发现问题的解决中。因此为解决上述问题,本文利用词频-逆文档频率(TFIDF)将文档转化为文档-词语矩阵,利用文档-词语矩阵的稀疏性,将文档-词语矩阵离散化为二值矩阵,并利用基于非负矩阵分解的社区发现算法进行话题发现。相比于传统话题发现算法,例如LDA,该方法不但不用假设词语间是条件独立的能够捕获词语间的语义关联,而且具有更快的计算速度(30+倍)和更高的准确率。然后,在利用非负矩阵分解的算法检测到重叠社区后,这些社区是原始图数据的子集,具有高维稀疏的网络结构,但是图数据中的顶点和边往往带有自己的特征,这些特征以向量的形式存在,这使得检测得到的社区不能很好地与顶点特征和边特征相结合。为解决上述问题,本文利用顶点社区关系模型将表示网络社区映射为欧式空间特征向量,从而使得社区特征可以方便地与用户特征相结合。实验表明,该社区特征带来了7.2%的效果提升,证明将网络社区映射为欧式空间特征向量的操作是有效的,同时该社区特征中嵌入了网络社区信息。最后,新疆各地的加气站积累了大量的车辆加气数据,这些数据可以表示成一个二元异构网络。二元异构网络指的是,网络中包含两种不同类型的顶点,且仅在不同类型的顶点之间存在边,同种类型的顶点之间不存在边。为对该加气数据利用图挖掘的技术进行分析,本文利用基于非负矩阵分解的重叠社区发现算法进行分析,挖掘出车辆与加气站之间的聚集关系。该聚集关系,不仅反映了真实的车辆加气行为聚集情况而且在车辆异常检测、车辆团伙检测中可以发挥重要价值。

Other Abstract

At present,there are plenty of graph data in real-world which contains both entities and the relations of entities, and this kind of relations are pretty complex. Take social network data as an example, the data contain both users and the interactions between users such as comments, follow and forwarding. This data can be represented as graph data where nodes represent users and edges denote the relations between users.The nodes, edges and structures of the graph data contain a lot of information. And many techniques have been applied to reveal the information in graph data and one of them is community detection. Community detection aims at detection groups which are densely connected and consist of a group of nodes--communities. In this field, if the communities are disjoint, it is called non-overlapping community detection or community detection, otherwise, it is called overlapping community detection. This essay falls into the second category. That is the circles detected are overlapping.The research of community detection is of great value in theory and applications. At the theoretical level, community detection technique helps mine the structure of graph data, and the cluster and interactions between nodes. At the application level, community detection technique can be applied to products recommendation or friends recommendation, gang detection on shopping platform, and the detection of abnormal sites such as pornographic and gambling drug sites, junk websites, etc.The thesis focuses on overlapping community detection and unfold with non-negative matrix factorization algorithm. Firstly, to improve the stability of overlapping community detection based on non-negative matrix factorization and explore the performance from different views, and a novel overlapping community detection algorithm model network from probabilistic aspect is proposed. This algorithm treats vertices and edges in network as observations and communities as latent factors, the network includes unknown factors in addition to observations. To reveal this latent factors, expectation-maximization algorithm is adopted to solve it iteratively. The generation process of the vertices-communities matrix is treated as a type of non-negative matrix factorization, and the square sum of Euclidean distance is formulated as loss function to control iterations. This algorithm can computes the membership of nodes to communities, and has a comparable performance.Then, in the study of community detection algorithms based on non-negative matrix factorization, we find that if the text data are converted into graph data by means of definitions, then the overlapping community detection algorithms can be applied to solve problems in the filed of text mining.Therefore, to solve the problems mentioned above, we convert text data into document-word matrix by term-frequency inverse document frequency, and discretize document-word matrix into binary matrix by its nature of sparsity, and detect topics by non-negative matrix factorization methods. Compared with traditional topic models such as LDA, our method makes the conditional independence assumption between words unnecessary and can capture the semantic relations between words, and is more computationally efficient (30x faster) and comparative effective.Then, when the overlapping communities are detected by non-negative matrix factorization method, these communities are the subset of raw graph data with high dimensional sparse structure. However, vertices and edges in graph always own attributes existing in the form of vectors, which makes it difficult to combine both communities and attributes.To solve problems mentioned above, in this thesis, we detect network communities via non-negative matrix factorization method, and then map network communities into feature vector in Euclidean space, so that the network communities feature can be concatenated with other features vectors easily. The experimental results show that the performance improve by 7.2% with the addition of the community feature and the transformation of mapping communities into Euclidean space feature vector is effective.Finally, gas stations in the Xinjiang province accumulate a lot of vehicles refueling data, and these data can be represented as bipartite heterogeneous network. Bipartite heterogeneous network contains two types of vertices and links are only formed between different type of vertices. To analyse the data by graph analysis technique, overlapping community detection based on non-negative matrix factorization are adopted. Analysing the refueling network by overlapping community detection algorithm can mine the cluster relations between vehicles and gas stations. This cluster relations are of great importance for maintaining public safety, such as abnormal vehicles detection.

Document Type学位论文
Identifierhttp://ir.xjipc.cas.cn/handle/365002/5455
Collection多语种信息技术研究室
Recommended Citation
GB/T 7714
赵清华. 基于非负矩阵分解的重叠社区发现研究及其应用[D]. 北京. 中国科学院大学,2018.
Files in This Item:
File Name/Size DocType Version Access License
基于非负矩阵分解的重叠社区发现研究及其应(1914KB)学位论文 开放获取CC BY-NC-SAView Application Full Text
Related Services
Recommend this item
Bookmark
Usage statistics
Export to Endnote
Google Scholar
Similar articles in Google Scholar
[赵清华]'s Articles
Baidu academic
Similar articles in Baidu academic
[赵清华]'s Articles
Bing Scholar
Similar articles in Bing Scholar
[赵清华]'s Articles
Terms of Use
No data!
Social Bookmark/Share
File name: 基于非负矩阵分解的重叠社区发现研究及其应用.pdf
Format: Adobe PDF
All comments (0)
No comment.
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.