XJIPC OpenIR > Multilingual Information Technology Research Laboratory
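Thesis Advisor: 王磊; 唐新余
Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Beijing
Degree Discipline: Computer Application Technology
Keyword: Uyghur; language model; perplexity; modeling unit; morpheme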
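Abstract: A language model is a mathematical model that describes the inherent regularities of natural language, and it occupies an important place in natural language processing. Research on Uyghur language models, however, is still at an early, exploratory stage, so building a reliable language model is critical for Uyghur natural language processing. The Uyghur language model is a cornerstone of Uyghur natural language processing and is widely used in speech recognition, machine translation, information retrieval, and other fields; research on it is of great significance for advancing natural language information processing for the minority languages of Xinjiang.

This thesis addresses the problems of current Uyghur language models, namely scarce corpus resources, data sparseness, and high perplexity, and tries to identify the smoothing algorithm and modeling unit that yield the lowest perplexity. The specific work is as follows.

To address data sparseness, several smoothing algorithms were studied: additive smoothing, Good-Turing smoothing, Witten-Bell smoothing, Katz smoothing, absolute discounting, and Kneser-Ney smoothing. The experiments show that absolute discounting gives the lowest perplexity.

The experimental data consist of transcripts of Uyghur spoken dialogues recorded over telephone channels, textbook material from a bilingual teaching system, and some everyday expressions. These data were preprocessed and then used as the text corpus for building the Uyghur language models. The corpus was segmented with two methods: dictionary-based Uyghur word segmentation and unsupervised morphological segmentation. The results show that the latter segments better than the former.

Building on this segmentation, the traditional N-gram statistical language model was improved. Uyghur words were split into different units, which served as modeling units for three kinds of Uyghur language model, and an N-gram language model based on morpheme classes was proposed. Experiments were run with the SRILM 1.5.12 and MITLM 0.4 toolkits. The results show that the perplexity of the morpheme-based Uyghur language model is about 2/3 lower than that of the word-based model. In addition, the morpheme-based model effectively reduces the vocabulary size and achieves better word coverage.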
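For readers unfamiliar with the two quantities the abstract compares, the following is a minimal sketch of an interpolated absolute-discounting bigram model and its perplexity. It is not the thesis's implementation (the thesis used SRILM and MITLM); the function names `train`, `prob`, and `perplexity`, the discount value 0.75, the add-one smoothed unigram fallback, and the toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(sentences):
    """Collect the counts needed for a bigram model (hypothetical helper)."""
    hist = Counter()     # c(h): how often h occurs as a history
    words = Counter()    # c(w): how often w occurs as a predicted token
    bigrams = Counter()  # c(h, w)
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            hist[h] += 1
            words[w] += 1
            bigrams[(h, w)] += 1
    # Number of distinct word types seen after each history.
    types_after = Counter(h for h, _ in bigrams)
    return hist, words, bigrams, types_after


def prob(w, h, model, discount=0.75):
    """P(w | h) with interpolated absolute discounting.

    The lower-order distribution is an add-one smoothed unigram, which is
    an assumption of this sketch so that unseen words get nonzero mass.
    """
    hist, words, bigrams, types_after = model
    total = sum(words.values())
    vocab = len(words) + 1  # +1 reserves mass for one unseen-word type
    p_uni = (words[w] + 1) / (total + vocab)
    if hist[h] == 0:
        return p_uni  # unseen history: fall back to the unigram
    discounted = max(bigrams[(h, w)] - discount, 0.0) / hist[h]
    backoff_weight = discount * types_after[h] / hist[h]
    return discounted + backoff_weight * p_uni


def perplexity(sentences, model, discount=0.75):
    """exp of the negative mean log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(w, h, model, discount))
            n += 1
    return math.exp(-log_sum / n)


if __name__ == "__main__":
    # Toy placeholder tokens, not Uyghur data.
    corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
    model = train(corpus)
    print(perplexity(corpus, model))
```

Lower perplexity means the model assigns higher probability to the evaluated text, which is the criterion the abstract uses to rank smoothing methods and modeling units.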
Other Abstract: As a mathematical model that describes the inherent regularities of natural language, the language model occupies an important position in natural language processing. However, the study of Uyghur language models is still at an early stage, so building a reliable language model is essential for Uyghur natural language processing. As a foundation of Uyghur natural language processing, the Uyghur language model is widely used in speech recognition, machine translation, information retrieval, and other fields, so further study of it is of great significance for the development of natural language processing for minority languages in Xinjiang.

To address the problems of current Uyghur language models, such as scarce corpus resources, data sparseness, and high perplexity, this dissertation attempts to find the smoothing method and modeling units that give the lowest perplexity. The contents of this dissertation are as follows.

To solve the problem of data sparseness, several smoothing methods were studied: additive smoothing, Good-Turing smoothing, Witten-Bell smoothing, Katz smoothing, absolute discounting, and Kneser-Ney smoothing. The experimental results show that absolute discounting achieved the lowest perplexity.

The experimental data were collected from transcripts of telephone-channel Uyghur spoken dialogues, texts from a bilingual teaching system, and some everyday Uyghur expressions. After preprocessing, these data formed the Uyghur text corpus. Two segmentation methods were adopted: dictionary-based Uyghur word segmentation and unsupervised morphological segmentation. The results show that the latter performed better than the former.

Based on Uyghur segmentation, the traditional N-gram statistical language model was improved. Uyghur words were divided into different units; using these units, three kinds of Uyghur language model were built, and an N-gram language model based on morpheme classes was proposed. A series of experiments was conducted using the SRILM 1.5.12 and MITLM 0.4 toolkits. The results show that the perplexity of the morpheme-based Uyghur language model was far below that of the word-based model, lower by about 2/3. Moreover, the morpheme-based language model effectively reduces the vocabulary size and has better coverage.
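The claim that sub-word units shrink the vocabulary and improve coverage can be illustrated with a small sketch. It uses greedy suffix stripping against a tiny suffix list as a stand-in for dictionary-based segmentation; this is not the segmenter used in the thesis. The suffix set, the romanized toy word forms (which ignore vowel harmony and other real morphology), and the helper names `segment`, `to_units`, and `oov_rate` are all hypothetical.

```python
def segment(word, suffixes, min_stem=3):
    """Greedy right-to-left suffix stripping (illustrative only).

    Returns [stem, "+suffix", "+suffix", ...]; the "+" marks bound units
    so that a suffix and an identical-looking stem stay distinct types.
    """
    parts = []
    stem = word
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so the longest match wins.
        for suf in sorted(suffixes, key=len, reverse=True):
            if stem.endswith(suf) and len(stem) - len(suf) >= min_stem:
                parts.insert(0, "+" + suf)
                stem = stem[: -len(suf)]
                changed = True
                break
    return [stem] + parts


def to_units(words, suffixes):
    """Flatten a word list into a stream of stem and suffix units."""
    units = []
    for word in words:
        units.extend(segment(word, suffixes))
    return units


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose type never occurs in training."""
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)


if __name__ == "__main__":
    suffixes = {"lar", "im", "da"}  # hypothetical suffix list
    train_words = ["kitab", "kitablar", "kitabim", "kitabda",
                   "bala", "balalar", "balada", "baghda"]
    test_words = ["balalarim", "kitablarda", "baghlar"]

    train_units = to_units(train_words, suffixes)
    test_units = to_units(test_words, suffixes)

    print("word types:", len(set(train_words)))
    print("unit types:", len(set(train_units)))
    print("word OOV rate:", oov_rate(train_words, test_words))
    print("unit OOV rate:", oov_rate(train_units, test_units))
```

In this toy setting, unseen combinations of known stems and suffixes are out-of-vocabulary at the word level but fully covered at the unit level, which is the effect the abstract describes for morpheme-based models.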
Document Type: Thesis
Recommended Citation
GB/T 7714
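张小燕 (Zhang Xiaoyan). 维吾尔语统计语言模型中建模基元的研究 [Research on Modeling Units in Uyghur Statistical Language Models] [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2011.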
Files in This Item:
File Name/Size: 维吾尔语统计语言模型中建模基元的研究.p (1487KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA
File name: 维吾尔语统计语言模型中建模基元的研究.pdf
Format: Adobe PDF

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.