Research on Text Representation in Uyghur Text Classification (维吾尔文文本分类中文本表示的研究)
董瑞
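Subtype: Master's
Thesis Advisor: 周喜
2012-04
Degree Grantor: Graduate University of Chinese Academy of Sciences
Place of Conferral: Beijing
Degree Discipline: Computer Application Technology
Keyword: imbalanced dataset; feature selection; text classification; Uyghur; text representation; chi-square test; inverse document frequency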
Abstract

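The growth of the Internet has caused the number of electronic text documents to increase rapidly, and automatic text classification is increasingly in demand. As an active topic in data mining, information retrieval, machine learning, and other fields, text classification has progressed from manual classification to classification performed automatically by computer.

English and Chinese text classification have been studied extensively by many researchers; the work is now fairly mature and already in practical use. Research on Uyghur text classification, however, started relatively late and is still limited, and no mature, stable method has yet been applied to it.

Text representation is a very important aspect of text classification. Its purpose is to convert unstructured text documents into a form that a computer can process and recognize. Text representation comprises text preprocessing, feature selection, and feature weight calculation. Starting from Uyghur text representation, this thesis examines in detail how each factor of the representation affects the final classification results.

Comparative experiments on Uyghur text with and without stemming show that classification accuracy is higher with stemming. For feature selection, as in text classification for other languages, the traditional methods CHI and IG give similar results, and both achieve better accuracy than DF. For feature weighting, the thesis compares several weighting algorithms, and the experimental results show that TF*IDF performs better than the Boolean and TF methods.

To address the problem of imbalanced Uyghur datasets, a new feature selection method, CIDF, is proposed that combines CHI and IDF. Experiments show that it outperforms the traditional feature selection methods on imbalanced datasets.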

Other Abstract

With the rapid development of the World Wide Web, the number of electronic text documents is growing quickly, and automatic text classification is becoming increasingly important. As an active topic in data mining, information retrieval, machine learning, and related areas, text classification has evolved from manual classification to automatic classification by machine. Many researchers have worked on English and Chinese text classification, and the results have been put into practice. By contrast, Uyghur text classification is still at an early stage, with far less research than for English and Chinese, and there is not yet a stable method for the Uyghur text classification problem. Text representation is an important part of text classification; it aims to convert unstructured text documents into a form that a computer can process. It includes text preprocessing, feature selection, and feature weight calculation. This thesis studies the factors of Uyghur text representation and compares their effects on the classification results. A comparative experiment on stemmed and unstemmed Uyghur texts shows that classification accuracy is higher with stemming. In the comparison of feature selection methods, Uyghur text classification behaves like that of other languages: the traditional methods CHI and IG perform better than DF. In the comparison of feature weighting methods, TF*IDF performs better than the Boolean and TF methods. For the problem of imbalanced Uyghur datasets, a feature selection method combining CHI and IDF, called CIDF, is proposed. Experiments show that this method outperforms the traditional feature selection methods on imbalanced datasets.
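
For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of the textbook chi-square (CHI) term score and TF*IDF weight, which may differ in detail from the variants used in the thesis. The toy corpus, the English placeholder tokens, and all function names are assumptions made for illustration; none of them come from the thesis. The CIDF formula is not given in this record, so it is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, class label). Not data from the thesis.
DOCS = [
    (["goal", "team", "match"], "sports"),
    (["team", "win"], "sports"),
    (["market", "stock"], "finance"),
    (["stock", "team"], "finance"),
]

def contingency(docs, term, label):
    """Count documents by (contains term?, belongs to class?)."""
    n11 = n10 = n01 = n00 = 0
    for tokens, doc_label in docs:
        has_term = term in tokens
        in_class = doc_label == label
        if has_term and in_class:
            n11 += 1          # term present, in class
        elif has_term:
            n10 += 1          # term present, other class
        elif in_class:
            n01 += 1          # term absent, in class
        else:
            n00 += 1          # term absent, other class
    return n11, n10, n01, n00

def chi_square(n11, n10, n01, n00):
    """Textbook chi-square statistic for a 2x2 term/class table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0            # degenerate table: no evidence either way
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def document_frequencies(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens, _ in docs:
        df.update(set(tokens))
    return df

def tf_idf(tokens, df, num_docs):
    """Raw term frequency times log inverse document frequency."""
    tf = Counter(tokens)
    return {t: count * math.log(num_docs / df[t]) for t, count in tf.items()}

df = document_frequencies(DOCS)
print(chi_square(*contingency(DOCS, "stock", "finance")))
print(tf_idf(DOCS[0][0], df, len(DOCS)))
```

In this sketch a term that occurs only in one class ("stock") receives a higher chi-square score than a term spread across classes ("team"), and a term found in most documents receives a low TF*IDF weight. Ranking terms by such a score and keeping the top ones is the general idea behind feature selection.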

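Document Type: Thesis
Identifier: http://ir.xjipc.cas.cn/handle/365002/4372
Collection: Multilingual Information Technology Research Laboratory
Affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences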
Recommended Citation
GB/T 7714
董瑞. 维吾尔文文本分类中文本表示的研究[D]. 北京: 中国科学院研究生院, 2012.
Files in This Item:
File Name/Size: 维吾尔文文本分类中文本表示的研究.pdf (1377KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.