|Place of Conferral||Beijing|
|Keyword||multi-source heterogeneous data; data cleaning; data fusion; SNM; dynamic configurable rules; entity resolution|
In recent years, with the rapid development of IT and communications technologies such as the Internet, social networks, cloud computing and search engines, hundreds of millions of users have been generating large amounts of data every day. The emergence of large-scale data brings valuable opportunities to many industries, but the typical characteristics of such data, including large scale, multiple sources, heterogeneity (diverse types and schemas), high dimensionality and uneven quality, pose great challenges to data representation, understanding, computation and use. Data quality is the bottleneck restricting the use of data. As key techniques for improving data quality, data cleaning and data fusion are two active research fields in multi-source heterogeneous data processing and are of considerable value and significance. However, traditional data cleaning approaches usually hard-code the cleaning rules specified by business requirements, which leads to poor reusability, scalability and flexibility. In addition, many real-world applications need to integrate heterogeneous data from different sources, and ensuring the consistency of these data is becoming a problem that must be solved; this problem is known as Entity Resolution (ER). This thesis studies existing ER techniques, such as "Blocking" and "Windowing", which are often used to address the performance bottleneck of ER in multi-source heterogeneous environments.
However, existing solutions have several drawbacks. First, they usually incur high time overhead, and their running time grows exponentially with the number of attribute dimensions in the dataset. Second, existing blocking techniques usually assume that the dataset contains only string data and use a single similarity measure, so they struggle to meet the differing requirements of the multiple data types found in real-life datasets. Third, traditional blocking techniques place complete entities into one or more blocks according to the blocking keys; as a result, they lack flexibility and are especially difficult to combine with other methods for improving ER performance. Fourth, typical windowing methods such as the Sorted Neighborhood Method (SNM) depend strongly on the quality of the sorting key and on the size of the sliding window: dirty data in the sorting key seriously degrades the sort order and thus weakens the overall ER result, and a reasonable window size is difficult to determine. To address these problems, this thesis first proposes a novel Dynamic Rule-based Data Cleaning Method (DRDCM), which supports multiple types of rules, complex logical operations between rules, and three kinds of dirty-data repair behavior. DRDCM integrates data detection, error correction and data transformation in one system and is domain-independent, reusable and configurable. Second, the thesis proposes a Splitting Blocking Algorithm (SBA), which blocks data by attribute value type, and an Attribute Clustering Algorithm (ACA); together they reduce the dimensionality of the data and hence the complexity of subsequent search and computation. Finally, combining DRDCM, SBA and ACA, the thesis proposes Multi-blocking Sorted Neighborhood (MBN) to solve the ER problem for multi-source heterogeneous data.
The MBN method employs several flexible similarity metrics and different dynamically variable window strategies according to data type, and integrates a variety of effective strategies, such as Edge-Weighted Graph and Weighted Edge Pruning, to improve ER performance. In conclusion, this thesis studies key technologies of data cleaning and data fusion in multi-source heterogeneous environments from the perspectives of theory, method, technology and application, and presents corresponding solutions. Based on these solutions, a reference implementation system that integrates data cleaning and data fusion was designed and implemented. Finally, the performance of each phase was evaluated in detail through an experimental study on two large-scale, real-life multi-source heterogeneous datasets. The experimental results show that the proposed methods achieve good data cleaning and fusion results and can be integrated with multiple data sources across different application domains.
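For readers unfamiliar with the windowing baseline criticized in the abstract, the following is a minimal Python sketch of the classical Sorted Neighborhood Method with a fixed window. It is not the thesis's MBN algorithm. The function name `sorted_neighborhood`, the toy records, the window size and the similarity threshold are all hypothetical choices made for illustration, and the standard-library `difflib.SequenceMatcher` stands in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs of records judged to refer to the same entity.

    records   -- list of dicts (hypothetical toy schema)
    key       -- function that builds the sorting key string from a record
    window    -- fixed sliding-window size, in records (illustrative value)
    threshold -- minimum similarity ratio for a match (illustrative value)
    """
    # Sort record indices by their sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    matches = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1 : pos + window]:
            a, b = key(records[i]), key(records[j])
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                matches.add((min(i, j), max(i, j)))
    return matches


people = [
    {"name": "Zhang Wei", "city": "Beijing"},
    {"name": "Li Na", "city": "Shanghai"},
    {"name": "Zhang Wie", "city": "Beijing"},  # typo variant of record 0
    {"name": "Wang Fang", "city": "Guangzhou"},
]
pairs = sorted_neighborhood(
    people, key=lambda r: (r["name"] + r["city"]).lower()
)
print(pairs)  # {(0, 2)}
```

The sketch also makes the two weaknesses named in the abstract visible: a typo near the start of the sorting key can move a duplicate far from its partner in the sort order, and a fixed `window` then either misses that pair (too small) or wastes comparisons (too large).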
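|Zhu Huijuan (朱会娟). Research on Key Technologies of Data Cleaning and Data Fusion for Multi-source Heterogeneous Data Sources (多源异构数据源下数据清洗与数据融合关键技术的研究) [D]. Beijing: University of Chinese Academy of Sciences, 2017.|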
|Files in This Item:|
|多源异构数据源下数据清洗与数据融合关键技 (2305KB)||Thesis||Open Access||CC BY-NC-SA|