This paper studies feature extraction methods for Uighur text classification. Because Uighur is an agglutinative language, three experiments were designed to examine how different feature extraction methods affect classification accuracy. The first experiment applied the traditional methods, namely document frequency (DF), information gain (IG), mutual information (MI), and the chi-square statistic (CHI), to text with stem segmentation; DF gave the best classification accuracy, 91.34%. The second experiment applied the same methods to text without stem segmentation; CHI gave the best accuracy, 88.03%. The third experiment combined feature selection methods (DF+IG, DF+MI, and DF+CHI); DF+CHI reached an accuracy of 93.57%, the best result across all three experiments.
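As an illustration only, the following sketch shows one way a DF+CHI combination could work. The abstract does not say how the two methods are combined, so the two-stage design here (discard terms below a DF threshold, then rank the survivors by their highest per-class chi-square score) is an assumption, as are the function names, the `min_df` and `top_k` parameters, and the toy English tokens standing in for segmented Uighur stems.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def chi_square(docs, term, label):
    """Chi-square statistic of the 2x2 table (term present?) x (doc in class?)."""
    a = b = c = d = 0
    for tokens, doc_label in docs:
        has_term = term in tokens
        in_class = doc_label == label
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_df_then_chi(docs, min_df=2, top_k=2):
    """Assumed combination: DF filter first, then rank by max chi-square."""
    df = document_frequency(docs)
    labels = {label for _tokens, label in docs}
    candidates = [t for t, count in df.items() if count >= min_df]
    scored = {t: max(chi_square(docs, t, lab) for lab in labels)
              for t in candidates}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scored, key=lambda t: (-scored[t], t))[:top_k]

# Hypothetical toy corpus: (tokens, class label) pairs.
docs = [
    (["goal", "team", "match"], "sport"),
    (["team", "goal", "news"], "sport"),
    (["market", "bank", "news"], "finance"),
    (["bank", "market", "rare"], "finance"),
]
selected = select_df_then_chi(docs, min_df=2, top_k=4)
```

In this toy corpus the DF filter removes the terms that occur in only one document, and the chi-square ranking then places the class-neutral term `news` below the four class-specific terms.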
The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, China; Graduate University of Chinese Academy of Sciences, China; College of Computer Science, Xinjiang Normal University, China
Yong Yang, Jian Xue Hua, Hua Dong Xin, et al. Comparative study on feature selection in Uighur text categorization[J]. Advances in Information Sciences and Service Sciences, 2012, 4(3): 19-26.