Research on Uyghur Text Classifiers (维文文本分类器研究)
李艳姣
Subtype: Master's
Thesis Advisor: 蒋同海
2012-05
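Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Beijing
Degree Discipline: Computer Application Technology
Keyword: Uyghur; Bayes; Support Vector Machine; Parameter Optimization; Weighting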
Abstract

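With the development of information technology, the number of Uyghur electronic documents is growing rapidly, and retrieving the required information quickly and accurately from this vast body of documents has become a pressing problem. Text classification is a key technology for processing and organizing Uyghur electronic documents, and building a Uyghur text classification system can improve the efficiency of knowledge acquisition and knowledge organization.

This thesis briefly introduces the framework of a text classification system and its related techniques, and describes in detail how the key techniques of a Uyghur text classification system are implemented in light of the characteristics of the Uyghur language. It also studies two classification algorithms in depth, naive Bayes and the support vector machine, and proposes an improved algorithm for each.

The naive Bayes classifier is a simple and effective pattern recognition algorithm that is widely used in text classification. However, its assumption that every condition attribute contributes equally to the classification decision does not hold in many cases. To improve the performance of the naive Bayes classifier, the thesis takes the differing importance of condition attributes into account and proposes a Bayesian classification algorithm based on feature selection weights. First, the chi-square value and the document frequency of each feature term are combined to represent the importance of that term. Next, this value is processed to obtain a weight for each feature term. Finally, a weighted Bayesian classifier is built from these weights. Experiments on a collected Uyghur corpus show that the algorithm achieves better classification performance than naive Bayes.

The support vector machine is a pattern recognition algorithm based on the principle of structural risk minimization and is widely recognized as one of the most effective text classification algorithms. It can achieve very good results even with small samples and high-dimensional feature spaces. Because Uyghur text classification has no unified, large-scale data set and the Uyghur feature space is very large, the support vector machine is a good choice for Uyghur text classification.

Training a support vector machine is relatively complex, with high time and space complexity. The support vector machine also has many parameters, so parameter optimization becomes the bottleneck of the training process. Based on the sequential uniform design method, the thesis proposes a new parameter optimization method to reduce training time. First, parameter combinations are designed from a uniform design table, and the combination with the best classification result is identified. Next, with that best combination as the center, the spacing between parameter values is reduced and a second batch of parameter combinations is designed, from which the combination with the best classification result is obtained through cross-validation. Finally, the SVM classifier is trained with that parameter combination. Experiments show that the method greatly reduces training time while preserving classification performance.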
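The three steps of the weighted Bayesian classifier summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the formula that combines the chi-square value with the document frequency, nor how the result is turned into a weight. Here the importance is assumed to be the maximum per-class chi-square multiplied by log(1 + DF), scaled into [0, 1], and the weight is assumed to multiply each term's log-likelihood. All function names are hypothetical, and the English tokens are placeholders for segmented Uyghur terms.

```python
import math
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return 0.0 if denom == 0 else n * (n11 * n00 - n10 * n01) ** 2 / denom


def feature_weights(docs, labels):
    """ASSUMPTION: importance = max-over-classes chi2 * log(1 + DF), scaled to [0, 1]."""
    n_docs = len(docs)
    class_size = Counter(labels)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    df_in_class = defaultdict(Counter)
    for d, y in zip(docs, labels):
        df_in_class[y].update(set(d))
    raw = {}
    for t, dft in df.items():
        best = 0.0
        for c in class_size:
            n11 = df_in_class[c][t]           # docs of class c containing t
            n10 = dft - n11                   # docs of other classes containing t
            n01 = class_size[c] - n11         # docs of class c without t
            n00 = n_docs - n11 - n10 - n01    # docs of other classes without t
            best = max(best, chi_square(n11, n10, n01, n00))
        raw[t] = best * math.log(1 + dft)
    top = max(raw.values()) or 1.0
    return {t: v / top for t, v in raw.items()}


def train(docs, labels):
    """Multinomial naive Bayes with Laplace smoothing plus per-term weights."""
    weights = feature_weights(docs, labels)
    vocab = set(weights)
    prior = {c: math.log(n / len(docs)) for c, n in Counter(labels).items()}
    counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    cond = {}
    for c in prior:
        total = sum(counts[c].values()) + len(vocab)
        cond[c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return prior, cond, weights


def predict(model, doc):
    """Pick the class maximizing log P(c) + sum_t w_t * log P(t | c)."""
    prior, cond, weights = model

    def score(c):
        return prior[c] + sum(weights[t] * cond[c][t] for t in doc if t in weights)

    return max(prior, key=score)


# Toy corpus; the tokens are placeholders, not real Uyghur data.
toy_docs = [["football", "match", "goal"], ["goal", "team", "match"],
            ["bank", "market", "stock"], ["stock", "price", "market"]]
toy_labels = ["sports", "sports", "economy", "economy"]
toy_model = train(toy_docs, toy_labels)
print(predict(toy_model, ["goal", "team"]))
```

Setting every weight to 1 reduces `predict` to ordinary multinomial naive Bayes, which makes it easy to compare the two variants on the same corpus.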

Other Abstract

With the development of information technology, the number of Uyghur electronic documents is increasing quickly, and obtaining the required information from such a large volume of documents is a problem that needs to be solved. Text classification is a key technology for processing and organizing Uyghur electronic documents, and a Uyghur text categorization system can improve the efficiency of information acquisition and organization. This thesis briefly introduces the framework of a text classification system and describes the technical details of a Uyghur text categorization system in accordance with the characteristics of the Uyghur language. The Naïve Bayes and Support Vector Machine classification algorithms are studied, and an improved algorithm is proposed for each. Naïve Bayes is a simple and efficient pattern recognition algorithm that has been widely used in text classification. However, its assumption that all condition attributes contribute equally to the classification decision often does not hold in real applications. To improve the performance of the Naïve Bayes classifier, a weighted Bayesian method based on feature selection weights is proposed, which takes into account that different condition attributes affect the decision to different degrees. First, an importance value is computed for every feature by combining its chi-square value and its DF (Document Frequency). Then, the weight of every feature is computed from this importance value. Finally, a weighted Bayesian classifier is built on these weights. Experimental results on a Uyghur corpus collected from the internet indicate that the method has better classification performance than the Naïve Bayes classifier. The Support Vector Machine is a pattern recognition algorithm based on structural risk minimization in statistical learning theory. It has been used successfully in text classification and is widely recognized as one of the most effective classification algorithms. The Support Vector Machine can also obtain very good results on a small corpus and in a high-dimensional feature space.
Because Uyghur text classification lacks uniform, large-scale data sets and the Uyghur feature space is very large, the Support Vector Machine is a good choice for Uyghur text classification. Training a Support Vector Machine is complicated, with high time and space complexity, and because the model has many parameters, parameter optimization becomes a bottleneck in the training process. This thesis presents a new parameter optimization method based on the sequential uniform design method to reduce training time. First, a first batch of parameter combinations is designed from a uniform design table, and the combination that gives the best classification result is identified. Then, with that best combination as the center, the spacing between parameter values is reduced and a second batch of parameter combinations is designed, from which the parameter set with the best cross-validation performance is obtained. Finally, a Support Vector Machine classifier is trained with that parameter set. The experiments show that this method significantly reduces training time while preserving classification performance.
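The two-batch parameter search described above could look roughly like the sketch below. It is an illustration under assumptions rather than the thesis's procedure: the abstract does not say which uniform design table, parameter ranges, or shrink factor were used. Here the design is generated with the good lattice point method for two factors, each range is halved around the best point after every batch, and `evaluate` is a hypothetical callable that would return the cross-validated accuracy of an SVM for a given (log2 C, log2 gamma) pair. In the demo a smooth made-up function stands in for real cross-validation.

```python
from math import gcd


def glp_design(n, h):
    """Two-factor U-type design by the good lattice point method.

    Returns n points whose levels lie in 1..n; each level appears exactly
    once per factor. Requires gcd(h, n) == 1.
    """
    if gcd(h, n) != 1:
        raise ValueError("h must be coprime to n")
    return [(i, (i * h) % n or n) for i in range(1, n + 1)]


def level_to_value(level, n, lo, hi):
    """Map level 1..n to the midpoint of the matching cell of [lo, hi]."""
    return lo + (level - 0.5) * (hi - lo) / n


def sequential_search(evaluate, bounds, n=9, h=4, stages=2, shrink=0.5):
    """Run `stages` batches of n trials, narrowing the ranges after each batch.

    ASSUMPTIONS: n, h, stages and shrink are illustrative defaults only.
    Returns ((best_score, x1, x2), number_of_evaluations).
    """
    (lo1, hi1), (lo2, hi2) = bounds
    best = None
    n_evals = 0
    for _ in range(stages):
        for l1, l2 in glp_design(n, h):
            x1 = level_to_value(l1, n, lo1, hi1)
            x2 = level_to_value(l2, n, lo2, hi2)
            score = evaluate(x1, x2)
            n_evals += 1
            if best is None or score > best[0]:
                best = (score, x1, x2)
        # Center the next, narrower batch on the best point found so far,
        # clipped so it never leaves the original search box.
        half1 = (hi1 - lo1) * shrink / 2
        half2 = (hi2 - lo2) * shrink / 2
        lo1, hi1 = max(bounds[0][0], best[1] - half1), min(bounds[0][1], best[1] + half1)
        lo2, hi2 = max(bounds[1][0], best[2] - half2), min(bounds[1][1], best[2] + half2)
    return best, n_evals


def fake_cv_accuracy(log2_c, log2_gamma):
    """Made-up stand-in for cross-validated SVM accuracy (peak at 3, -5)."""
    return 1.0 - 0.01 * ((log2_c - 3) ** 2 + (log2_gamma + 5) ** 2)


# Illustrative search box for (log2 C, log2 gamma); not taken from the thesis.
result, evaluations = sequential_search(fake_cv_accuracy, [(-5.0, 15.0), (-15.0, 3.0)])
print(result, evaluations)
```

With these settings the search evaluates 18 parameter pairs in total, whereas a full 9 x 9 grid over the same levels would need 81. That difference is the kind of saving the abstract attributes to the method, although the actual numbers in the thesis may differ.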

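Document Type: Thesis
Identifier: http://ir.xjipc.cas.cn/handle/365002/4370
Collection: Multilingual Information Technology Research Laboratory (多语种信息技术研究室)
Affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences (中国科学院新疆理化技术研究所)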
Recommended Citation
GB/T 7714
李艳姣. 维文文本分类器研究[D]. 北京. 中国科学院研究生院,2012.
Files in This Item:
File Name: 维文文本分类器研究.pdf
Size: 817KB
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.