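Research on High-Dimensional Similar Record Detection Based on Key Attributes (基于关键属性的高维相似记录检测方法研究)
Song Guoxing (宋国兴)
Degree type: Master's
Advisor: Zhou Xi (周喜)
2017-05-21
Degree-granting institution: University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Degree discipline: Computer Application Technology
Keywords: similar record detection; high dimensionality; key attributes; SNM algorithm; R-tree index
Abstract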

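Since the start of the 21st century, information technologies such as computer networks, e-commerce, and the Internet of Things have developed rapidly, and the volume of information generated, whether inside individual IT companies or across the network as a whole, has grown explosively. More information does not mean more valuable information, however. In such a large volume, worthless information often far outnumbers information of real value, so valuable information is easily buried and loses its significance. This thesis studies how to detect, in massive data drawn from multiple sources, the similar records that describe the same physical entity. When multi-source information is fused and later mined and analyzed, differences in data format, representation, and data definitions across sources mean that the same real-world object ends up with different descriptions. If these records are simply stored together without processing, the result is redundant information and wasted storage space, as well as unnecessary overhead when mining useful information from the raw data and analyzing problems.

Starting from real engineering data, this thesis studies similar duplicate record detection for records that are high-dimensional and large in volume. The work has two parts. First, key attribute selection. Each record has many attribute dimensions; some attributes are essential to describing the record, some contribute nothing, and some are even harmful. Drawing on principal component analysis from data mining and on information theory, the thesis proposes a unified mutual information method that selects the key attributes characterizing the record entity from the high-dimensional attribute set, filters out noise attributes, and reduces record dimensionality, thereby improving detection accuracy and efficiency. Second, the classic SNM algorithm performs well in similar duplicate record detection, but with high-dimensional, large-scale data it has two clear weaknesses: the algorithm essentially projects records onto a one-dimensional space, so as record dimensionality grows the projection loses data and the error rate rises markedly; and because the records must be sorted, time efficiency inevitably drops when the data volume is large. Based on R-tree indexing and clustering, the thesis builds an R-tree to preserve the spatial characteristics of the records under test and uses clustering to gather potentially similar records into the leaf nodes, reducing the number of comparisons between records. To keep large numbers of null attribute values from distorting similarity detection, it also improves the traditional edit-distance-based method for judging record similarity. Finally, part of the data from a real engineering project is extracted to build training and test sets for the proposed algorithm, and its effectiveness is verified by comparing detection time efficiency and accuracy at different dimensionalities.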
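The baseline that the abstract criticizes can be illustrated with a minimal sketch. SNM is commonly expanded as the sorted neighborhood method: records are sorted on a key (the one-dimensional projection) and each record is compared only with its neighbors inside a sliding window. The abstract does not say how the thesis adapts edit-distance similarity to null values, so the rule used here (skip any attribute that is missing in either record) is an assumption, and all names (`record_similarity`, `sorted_neighborhood`, `window`, `threshold`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def record_similarity(r1: Record, r2: Record, attrs: Sequence[str]) -> float:
    """Mean normalized edit similarity over attributes present in both records.

    Assumption (not from the thesis): attributes that are None or empty in
    either record are skipped rather than counted as a mismatch.
    """
    scores = []
    for attr in attrs:
        v1, v2 = r1.get(attr), r2.get(attr)
        if not v1 or not v2:
            continue  # null-aware: a missing value gives no evidence either way
        longest = max(len(v1), len(v2))
        scores.append(1.0 - edit_distance(v1, v2) / longest)
    return sum(scores) / len(scores) if scores else 0.0


def sorted_neighborhood(
    records: Sequence[Record],
    attrs: Sequence[str],
    window: int = 3,
    threshold: float = 0.8,
) -> List[Tuple[int, int]]:
    """Return index pairs judged similar by a sliding-window comparison."""
    # The sort key concatenates attribute values: this is the 1-D projection.
    def key(idx: int) -> str:
        return "".join(records[idx].get(a) or "" for a in attrs)

    order = sorted(range(len(records)), key=key)
    pairs = []
    for pos, i in enumerate(order):
        # Compare only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if record_similarity(records[i], records[j], attrs) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


records = [
    {"name": "Zhang Wei", "city": "Beijing"},
    {"name": "Zhang Wei", "city": None},
    {"name": "Li Na", "city": "Urumqi"},
    {"name": "Zhang Wie", "city": "Beijing"},
]
pairs = sorted_neighborhood(records, ["name", "city"])
```

The sketch makes both weaknesses visible: the full sort dominates the cost for large inputs, and two similar records whose keys sort far apart never share a window. The R-tree approach described in the abstract is not reproduced here.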

Other Abstract

In the twenty-first century, information technologies such as computer networks, e-commerce, and the Internet of Things have developed rapidly, and information has grown explosively both within IT companies and across the information network as a whole. However, an increase in the amount of information does not mean an increase in its value. In such a huge volume, worthless information often far outnumbers truly valuable information; that is, valuable information is easily submerged and loses its significance. This thesis focuses on detecting similar records in massive, multi-source data. When multi-source information is fused and later mined and analyzed, each data source differs in data format, representation, and definitions, so the same thing ends up with different descriptions. If records describing the same thing are simply stored together without processing, the result is redundant information and wasted storage space, and the efficiency and accuracy of data mining and analysis are directly reduced.
Based on real engineering data, this thesis studies similar duplicate record detection for data that is high-dimensional and large in volume. The work is divided into two parts. First, in view of the high dimensionality of the records, and drawing on dimension reduction in data mining preprocessing, it proposes a key attribute selection method based on mutual information that selects the key attributes describing an entity from the high-dimensional data and filters out noise attributes, reducing record dimensionality and thereby improving detection accuracy and efficiency. Second, the classical SNM algorithm achieves good results in duplicate record detection, but with high-dimensional big data it has two obvious shortcomings: the algorithm essentially projects records onto a one-dimensional space, so as record dimensionality increases the projection loses data and the error rate rises significantly; and with large data volumes, sorting the records inevitably reduces time efficiency. To address this, the thesis uses an R-tree index to preserve the high-dimensional spatial characteristics of the records and clusters records into the leaf nodes to reduce the number of comparisons and improve efficiency. At the same time, to avoid the influence of large numbers of null attribute values on similarity detection, it improves the traditional edit-distance-based method. Finally, data extracted from a real project is used to build training and test sets, and the proposed algorithm is compared with the classical SNM algorithm to verify its validity on real data.
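As a rough illustration of mutual-information-based attribute selection, the sketch below ranks attributes by their empirical mutual information with an entity label in a labeled training set. The abstract does not give the thesis's actual scoring formula, so this plain I(X;Y) estimate, the use of entity labels as the target, and the names `mutual_information` and `rank_attributes` are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_attributes(
    records: Sequence[Dict[str, Optional[str]]],
    labels: Sequence[Hashable],
    attrs: Sequence[str],
) -> List[Tuple[str, float]]:
    """Score each attribute against the entity labels, highest score first."""
    scores = [
        (a, mutual_information([r.get(a) for r in records], labels))
        for a in attrs
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical training data: "name" determines the entity, "source" does not.
records = [
    {"name": "x", "source": "s1"},
    {"name": "x", "source": "s2"},
    {"name": "y", "source": "s1"},
    {"name": "y", "source": "s2"},
]
labels = ["A", "A", "B", "B"]
ranked = rank_attributes(records, labels, ["source", "name"])
```

Attributes with scores near zero behave like the noise attributes mentioned in the abstract and would be candidates for removal. Where to set the cutoff is left open here, because the abstract does not specify one.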

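Document type: Thesis
Identifier: http://ir.xjipc.cas.cn/handle/365002/4971
Collection: Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Song Guoxing. Research on High-Dimensional Similar Record Detection Based on Key Attributes [D]. Beijing: University of Chinese Academy of Sciences, 2017.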
Files in this item
File name/size: 基于关键属性的高维相似记录检测方法研究.(1425KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA
File name: 基于关键属性的高维相似记录检测方法研究.pdf
Format: Adobe PDF
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.