|Uyghur word segmentation using a combination of rules and statistics|
|Xue, Huajian; Yang, Yong; Turghun, Osman; Li, Xiao; Zhang, Ronghui|
|Journal||Advances in Information Sciences and Service Sciences|
The rich morphology of Uyghur produces a very large number of word forms and leads to high out-of-vocabulary (OOV) rates, which cause many errors in Uyghur natural language processing (NLP). Morphological word segmentation is an important component for overcoming this problem. This paper describes a set of morphological rules, derived by analyzing the general structure of Uyghur words, and presents a partly supervised word segmentation method. In this method, a suffix corpus is used to generate all possible morphological segmentations of a word, from which the optimal segmentation is selected by a maximum a posteriori (MAP) model. In addition, a cascaded language model is used to improve segmentation accuracy. A test set of 5000 words was collected and segmented by hand. Experimental results on this test set show that the proposed method is more effective.
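The generate-then-select idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the suffix inventory, the stem lexicon, all probabilities, and the function names `candidates` and `segment` are hypothetical, and the cascaded language model is not modeled. Because every candidate segmentation reproduces the input word exactly, the MAP choice reduces here to picking the candidate with the highest prior probability.

```python
import math

# Hypothetical toy inventory (Latin-script Uyghur); all probabilities are
# invented for illustration and are not taken from the paper.
SUFFIX_PROB = {"lar": 0.4, "da": 0.3, "ni": 0.2, "a": 0.1}
STEM_PROB = {"kitab": 0.5}
UNKNOWN_STEM_PROB = 1e-4  # assumed floor for stems missing from the lexicon


def candidates(word, min_stem=2):
    """Enumerate every split of `word` into a stem plus a chain of known suffixes."""
    results = [(word, ())]  # the unsegmented word is always a candidate
    for suffix in SUFFIX_PROB:
        stem_part = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem_part) >= min_stem:
            # Keep stripping suffixes from what remains on the left.
            for stem, chain in candidates(stem_part, min_stem):
                results.append((stem, chain + (suffix,)))
    return results


def log_score(stem, chain):
    """Log prior of a segmentation under an independence (unigram) assumption."""
    score = math.log(STEM_PROB.get(stem, UNKNOWN_STEM_PROB))
    for suffix in chain:
        score += math.log(SUFFIX_PROB[suffix])
    return score


def segment(word):
    """Return the highest-scoring (stem, suffix chain) candidate for `word`."""
    return max(candidates(word), key=lambda c: log_score(*c))


stem, chain = segment("kitablarda")
print(stem, *chain, sep=" + ")
```

With this toy inventory, "kitablarda" has four candidate segmentations, and the known stem "kitab" makes the split into "kitab", "lar", and "da" score highest. A real system would estimate the probabilities from a corpus and would likely condition each suffix on its neighbours rather than treating them as independent.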
|Affiliation||Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, China|
|Xue, Huajian, Yang, Yong, Turghun, Osman, et al. Uyghur word segmentation using a combination of rules and statistics[J]. Advances in Information Sciences and Service Sciences, 2011, 3(11).|
|APA||Xue, Huajian, Yang, Yong, Turghun, Osman, Li, Xiao, & Zhang, Ronghui. (2011). Uyghur word segmentation using a combination of rules and statistics. Advances in Information Sciences and Service Sciences, 3(11).|
|MLA||Xue, Huajian, et al. "Uyghur word segmentation using a combination of rules and statistics". Advances in Information Sciences and Service Sciences 3.11 (2011).|