Research on Uyghur Text Classifiers (维文文本分类器研究)
Author: Li Yanjiao (李艳姣)
Degree type: Master's
Advisor: Jiang Tonghai (蒋同海)
2012-05
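Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Degree discipline: Computer Application Technology
Keywords: Uyghur; Bayes; support vector machine; parameter optimization; weighting
Abstract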

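As information technology develops, the number of Uyghur electronic documents is growing rapidly, and retrieving the required information quickly and accurately from this vast body of documents has become a pressing problem. Text classification is a key technology for processing and organizing Uyghur electronic documents, and building a Uyghur text classification system can make knowledge acquisition and knowledge organization more efficient.

This thesis briefly introduces the framework of a text classification system and its related technologies, and describes in detail how the key technologies of a Uyghur text classification system are implemented in light of the characteristics of the Uyghur language. It also studies two classification algorithms in depth, naive Bayes and the support vector machine, and proposes an improved algorithm for each.

The naive Bayes classifier is a simple and effective pattern recognition algorithm that is widely used in text classification. However, its assumption that all condition attributes contribute equally to the classification decision does not hold in many cases. To improve the classifier's performance by accounting for the differing importance of condition attributes in the decision process, the thesis proposes a Bayesian classification algorithm based on feature-selection weights. First, the chi-square value and the document frequency of each feature term are combined to represent the term's importance; next, this value is processed to obtain a weight for each feature term; finally, a weighted Bayesian classifier is built from these weights. Experiments on a collected Uyghur corpus show that the algorithm classifies better than naive Bayes.

The support vector machine is a pattern recognition algorithm based on the principle of structural risk minimization, and is currently recognized as one of the most effective text classification algorithms. It also performs very well with small samples and high-dimensional feature spaces. Because Uyghur text classification lacks a reasonably standard, large-scale dataset and the Uyghur feature space is very large, the support vector machine is a good choice for Uyghur text classification.

Training a support vector machine is relatively complex, with high time and space complexity; it also has many parameters, so parameter optimization becomes the bottleneck of the training process. Based on the sequential uniform design method, the thesis proposes a new parameter optimization method to reduce training time. First, parameter combinations are designed from a uniform design table, and the combination with the best classification performance is identified; next, centered on that best combination, the spacing is narrowed and a second batch of parameter combinations is designed, from which cross-validation selects the best-performing combination; finally, the SVM classifier is trained with that combination. Experiments show that the method greatly reduces training time while maintaining classification performance.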
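The three weighted Bayes steps in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for combining the chi-square value with document frequency or for turning that value into a weight, so the product `chi * df` and the scale-to-mean-1 normalization below are assumptions, as are all function names and the toy English tokens standing in for Uyghur terms.

```python
import math
from collections import Counter

def chi_square(doc_sets, labels, term, cls):
    """Chi-square statistic of `term` against class `cls` from a 2x2 table."""
    a = b = c = d = 0
    for tokens, label in zip(doc_sets, labels):
        has = term in tokens
        if has and label == cls:
            a += 1          # term present, in class
        elif has:
            b += 1          # term present, other class
        elif label == cls:
            c += 1          # term absent, in class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def feature_weights(docs, labels):
    """Step 1 and 2: importance value per term, then a weight per term."""
    classes = sorted(set(labels))
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    raw = {}
    for term in vocab:
        df = sum(term in s for s in doc_sets)
        chi = max(chi_square(doc_sets, labels, term, c) for c in classes)
        raw[term] = chi * df            # ASSUMPTION: combine by product
    mean = sum(raw.values()) / len(raw)
    # ASSUMPTION: scale so the average weight is 1; all-zero falls back to 1.
    return {t: (v / mean if mean > 0 else 1.0) for t, v in raw.items()}

def train(docs, labels, weights):
    """Multinomial naive Bayes estimates with Laplace smoothing."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(weights)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in weights}
    return prior, loglik

def predict(tokens, prior, loglik, weights):
    """Step 3: each term's log-likelihood is scaled by its weight."""
    scores = {
        c: prior[c] + sum(weights[t] * loglik[c][t] for t in tokens if t in weights)
        for c in prior
    }
    return max(scores, key=scores.get)

# Toy corpus (fabricated for illustration only).
docs = [["goal", "match", "team"], ["team", "score", "goal"],
        ["market", "stock", "price"], ["price", "bank", "market"]]
labels = ["sport", "sport", "finance", "finance"]
weights = feature_weights(docs, labels)
prior, loglik = train(docs, labels, weights)
print(predict(["goal", "team"], prior, loglik, weights))
```

Setting every weight to 1 recovers ordinary naive Bayes, which makes it straightforward to compare the two classifiers on the same data.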

Alternative abstract

With the development of information technology, the number of Uyghur electronic documents is increasing quickly, and retrieving information from such a large volume of documents is a problem that needs to be solved. Text classification is a key technology for processing and organizing Uyghur electronic documents, and a Uyghur text categorization system can improve the efficiency of information acquisition and organization.

This thesis briefly introduces the framework of a text classification system and describes the technical details of a Uyghur text categorization system in accordance with the characteristics of the Uyghur language. The naive Bayes and support vector machine classification algorithms are studied, and an improved method is proposed for each.

Naive Bayes is a simple and efficient pattern recognition algorithm that has been widely used in text classification, but its assumption that all condition attributes affect the decision equally often does not hold in real applications. To improve the performance of the naive Bayes classifier, a weighted Bayesian method based on feature-selection weights is proposed, which takes into account that different condition attributes affect the decision differently. First, the importance value of every feature is computed by combining its chi-square value and its document frequency (DF). Then, the weight of every feature is computed from this importance value. Lastly, a weighted Bayesian classifier is built on the weights. Experiments on a Uyghur corpus collected from the internet indicate that the method has better classification performance than the naive Bayes classifier.

The support vector machine is a pattern recognition algorithm based on structural risk minimization in statistical learning theory. It has been used successfully in text classification and is widely recognized as one of the most effective classification algorithms. It also obtains very good results on small corpora and high-dimensional feature spaces. Because there is no standard, large-scale Uyghur dataset and the dimension of the feature space is very large, the support vector machine is a good choice for Uyghur text classification.

Training a support vector machine is a complicated process with high time and space complexity, and because the model has many parameters, parameter optimization is a bottleneck in training. This thesis presents a new parameter optimization method based on sequential uniform design to reduce the training time. First, a first batch of parameter combinations is designed from a uniform design table, and the combination with the best classification result is identified. Then, with that best combination as the center, the spacing is reduced and a second batch of parameter combinations is designed, from which the parameter set with the best cross-validation precision is obtained. Last, a support vector machine classifier is trained with this parameter set. The experiments show that the method significantly reduces training time while maintaining the classification result.
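The two-stage parameter search described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the design is generated with a good-lattice-point construction rather than taken from a published uniform design table, the run count `n=13`, generators `(1, 5)`, and shrink factor `0.5` are arbitrary choices, the search runs over `log2(C)` and `log2(gamma)`, and `toy_cv_accuracy` is a synthetic stand-in for a real cross-validation score.

```python
import math

def glp_design(n, generators):
    """Good-lattice-point design: n runs, one column per generator.

    Each column is a permutation of the levels 0..n-1 when the generator
    is coprime with n, so every level of every factor is tried once.
    """
    return [[(i * h) % n for h in generators] for i in range(1, n + 1)]

def sequential_ud_search(score, bounds, n=13, generators=(1, 5),
                         rounds=2, shrink=0.5):
    """Evaluate a design, then re-centre a smaller box on the best point."""
    best_point, best_score = None, -math.inf
    evaluations = 0
    for _ in range(rounds):
        for row in glp_design(n, generators):
            # Map integer levels onto the current range of each parameter.
            point = tuple(lo + level / (n - 1) * (hi - lo)
                          for level, (lo, hi) in zip(row, bounds))
            s = score(point)
            evaluations += 1
            if s > best_score:
                best_point, best_score = point, s
        # Narrow each range around the best point, clipped to the old box.
        bounds = [(max(lo, b - shrink * (hi - lo) / 2),
                   min(hi, b + shrink * (hi - lo) / 2))
                  for b, (lo, hi) in zip(best_point, bounds)]
    return best_point, best_score, evaluations

def toy_cv_accuracy(point):
    """HYPOTHETICAL stand-in for k-fold cross-validation accuracy of an SVM."""
    log2_c, log2_gamma = point
    return math.exp(-((log2_c - 3) ** 2 + (log2_gamma + 5) ** 2) / 50)

# Search log2(C) in [-5, 15] and log2(gamma) in [-15, 3] (assumed ranges).
point, score, evaluations = sequential_ud_search(
    toy_cv_accuracy, [(-5.0, 15.0), (-15.0, 3.0)])
print(point, score, evaluations)
```

With these settings the search makes 26 evaluations in total, whereas an exhaustive 13 x 13 grid over the same ranges would need 169. In practice `score` would train and cross-validate an SVM for each parameter pair, which is where the training-time saving comes from.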

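Document type: Thesis
Item identifier: http://ir.xjipc.cas.cn/handle/365002/4370
Collection: Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
Li Yanjiao. Research on Uyghur Text Classifiers (维文文本分类器研究) [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2012.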
Files in this item
File: 维文文本分类器研究.pdf (817 KB); document type: thesis; access: open access; license: CC BY-NC-SA