Research on Uyghur Lexical Analysis and Semantic Analysis for Machine Translation (面向机器翻译的维吾尔文词法分析及语义分析研究)
热娜·艾尔肯
Degree type: Doctoral
Supervisor: 李晓
2017-05-21
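Degree-granting institution: University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Major: Computer Application Technology
Keywords: machine translation; Uyghur; lexical analysis; semantic knowledge base; similarity computation
Abstract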

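Drawing on linguistics, mathematics, computer science, and translation studies as they bear on Chinese-Uyghur machine translation, this thesis studies Uyghur lexical and semantic analysis for machine translation. Its ultimate goal is to obtain more accurate translation results by optimizing the target-language (Uyghur) output of Chinese-Uyghur machine translation. A Uyghur semantic information dictionary is constructed, and on the basis of this dictionary similarity is computed for ambiguous words, which ultimately improves Chinese-Uyghur translation quality; this is also an innovation of the thesis. The output of automatic machine translation is often a string of characters in inflected form. Lexical analysis is the foundation of the optimization module and mainly analyzes the inflected forms of words; through it, the base form of a word and its morphological changes can be identified more reliably. Semantic analysis is then performed on the base forms, and the best candidate is selected after semantic similarity has been computed. The research consists of two parts, lexical analysis and semantic analysis, and concrete processing methods are proposed on the basis of the experimental results. The main work of the thesis is as follows:

1. Sentence boundary recognition. The smallest unit in machine translation is the word, but language arrives as a string of characters; the machine must delimit sentences before the words in a sentence can be analyzed and processed. Whether sentence boundaries are recognized correctly directly affects lexical analysis. The thesis proposes a primarily rule-based approach to sentence boundary recognition, with statistical methods handling the small number of ambiguous cases.

2. Stemming. Within lexical analysis, the thesis focuses on the study and implementation of stem extraction. Some Uyghur word classes have their own distinctive affixes; affix tables are built for these classes, and the affixes are segmented with rule-based and dictionary-based methods. For affixes that give rise to ambiguity, a statistical learning method is proposed. Dictionaries, rules, and statistical models are integrated into a Uyghur stemming system based on a hybrid strategy. Combining dictionary, rule, and statistical methods for stem extraction, in light of the distributional characteristics of the Uyghur morphological system, is also one of the innovations of the thesis.

3. Part-of-speech tagging. An automatic part-of-speech tagging module was developed. Using the Python platform and the natural language processing toolkit NLTK, stems are tagged automatically, completing the lexical-analysis work for machine translation. The Python implementation tags quickly, and the method is well suited to agglutinative languages; applied to Uyghur, it achieves relatively high accuracy.

4. Uyghur semantic knowledge base. Uyghur has a large number of homographic polysemous words. To improve the quality of Uyghur machine translation, the thesis focuses on studying and building a Uyghur language knowledge base. The usage contexts of Uyghur words are analyzed from the perspective of semantic relations. Analysis of different languages in the same field shows that methods developed for other languages can inform Uyghur machine translation. A Uyghur semantic base was built on the WordNet framework: a large number of dictionaries and corpora were collected and organized, and the words that met the criteria were entered into the dictionary. Following the WordNet framework, semantic-network relations such as synonymy and antonymy were established, in preparation for the machine translation optimization model.

5. Similarity computation. Using the information provided by the semantic knowledge base, similarity is computed for words in order to optimize the translation results. On the basis of the Uyghur semantic knowledge base, a hybrid method is proposed that combines the information-content-based Resnik measure with the feature-based Tversky similarity measure.

Finally, the lexical and semantic research was consolidated into modules. To verify their effectiveness for machine translation, we ran a validation with Hub, the free online translation platform provided by Microsoft. The BLEU score at the optimization module was only 42.75; after applying the semantic similarity computation we designed, the BLEU score was 54.25.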
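The hybrid similarity computation named in item 5 can be illustrated with a minimal sketch. The abstract does not say how the Resnik and Tversky scores are combined, so the weighted average below, the scaling of the Resnik score by the largest information content, and the toy taxonomy, counts, and feature sets (given in English for readability) are all assumptions made for illustration, not the thesis's actual data or method.

```python
import math

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

# Hypothetical corpus counts of each concept's own occurrences.
COUNT = {"entity": 0, "animal": 2, "plant": 1, "dog": 5, "cat": 4, "tree": 4}

# Hypothetical feature sets used by the Tversky measure.
FEATURES = {
    "dog": {"living", "fur", "four_legs", "barks"},
    "cat": {"living", "fur", "four_legs", "meows"},
    "tree": {"living", "leaves", "roots"},
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def probability(concept):
    """Share of all counts that fall on a concept or any of its descendants."""
    total = sum(COUNT.values())
    subsumed = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return subsumed / total


def information_content(concept):
    """IC(c) = -log p(c); rarer concepts carry more information."""
    return -math.log(probability(concept))


def resnik(a, b):
    """Resnik similarity: IC of the most specific common ancestor."""
    common = [c for c in ancestors(a) if c in ancestors(b)]
    return information_content(common[0])


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio model: shared features against distinctive ones."""
    fa, fb = FEATURES[a], FEATURES[b]
    shared = len(fa & fb)
    denom = shared + alpha * len(fa - fb) + beta * len(fb - fa)
    return shared / denom if denom else 0.0


def hybrid(a, b, weight=0.5):
    """Assumed combination: weighted mean of scaled Resnik and Tversky."""
    max_ic = max(information_content(c) for c in COUNT)
    return weight * resnik(a, b) / max_ic + (1 - weight) * tversky(a, b)


if __name__ == "__main__":
    for pair in [("dog", "cat"), ("dog", "tree")]:
        print(pair, round(hybrid(*pair), 3))
```

In a translation-optimization setting, a score of this kind would be used to rank candidate senses or candidate target words against their context; the `weight`, `alpha`, and `beta` parameters are tuning knobs introduced here, not values reported in the thesis.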

Other abstract

Drawing on the linguistics, mathematics, computer science, and translation studies involved in Chinese-Uyghur machine translation, this thesis carries out research on Uyghur lexical and semantic analysis and constructs a Uyghur semantic information dictionary. By analyzing the target-language (Uyghur) output of a Chinese-Uyghur machine translation system, more accurate results can be obtained. The dictionary can be used to compute the similarity of ambiguous words in the language, with the final goal of improving the quality of the Uyghur output; this is the most important part of the thesis. Lexical analysis is the basis of the optimization module. It mainly analyzes the forms of words; through lexical analysis, the base form of a word and its changes in form can be better understood. Synsets and similarity are then analyzed on the basis of the lexical analysis. The thesis studies two major parts, lexical analysis and semantic analysis, and puts forward concrete processing methods based on an analysis of the experimental results. The main work of the thesis is as follows:

1. Sentence boundary recognition. The smallest unit of machine translation is the word, but the unprocessed source text is a string of characters. Sentences must be delimited when the machine meets the character string; only then can the words in each sentence be analyzed and processed. Correct or incorrect recognition of sentence boundaries directly affects the lexical analysis, so the first step of the thesis is to study sentence boundary recognition. Uyghur sentence boundaries are regular in most cases; the approach is therefore rule-based, with the small number of ambiguous cases handled by statistical methods.

2. Stemming. Lexical analysis focuses on the research and implementation of stemming. Some Uyghur word classes have their own affixes; affix tables are built for these classes, and affixes are segmented with rule-based and dictionary-based methods. Morfessor is proposed for ambiguous affixes and unknown words. A Uyghur stem dictionary, rules, and a statistical model are integrated into a Uyghur stemming system based on a hybrid strategy. The innovation is that, in light of the distributional characteristics of the Uyghur morphological system, dictionary, rule, and statistical methods are combined to perform stemming.

3. Part-of-speech tagging. An automatic part-of-speech (POS) tagging system was developed. Stems are tagged automatically with the Natural Language Toolkit (NLTK), which is freely available for the Python platform, completing the research and implementation of lexical analysis for machine translation. The Python implementation achieves fast and accurate part-of-speech tagging, which is also one of the highlights of the thesis.

4. Uyghur semantic knowledge base. A Uyghur semantic knowledge base is established. A review of the literature on Uyghur machine translation systems shows that Uyghur language knowledge bases have hardly been applied in machine translation, even though Uyghur has many polysemous words. To improve the quality of Uyghur machine translation, the thesis focuses on the research and construction of a Uyghur language knowledge base. It mainly studies the semantic relations between Uyghur words; by studying the semantic environment and drawing on research results for English, Chinese, and other languages, a semantic knowledge base reflecting the characteristics of Uyghur is obtained. After a large amount of material had been collected, the WordNet framework was chosen for building the Uyghur semantic base. Many dictionaries and corpora were sorted, and the words that met the criteria were collected into the dictionary. Following the WordNet framework, semantic relations such as synonymy and antonymy were established with particular emphasis, in preparation for the machine translation optimization model. This is both an innovation and a highlight of the thesis.

5. Similarity computation. Using the information provided by the semantic knowledge base, similarity is computed for words in order to optimize the translation results. The semantic model is optimized by means of this similarity computation.

Finally, the lexical and semantic work was consolidated into a module. To verify its effectiveness for machine translation, we used Hub, a free online translation platform provided by Microsoft. Without lexical analysis, the BLEU score is only [value not given]; after the words are analyzed, the BLEU score of the translation is 42.75. After applying the semantic similarity computation we designed, the BLEU score is 54.25.
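The hybrid stemming strategy in item 2 (dictionary, then rules, then a statistical model for the remaining cases) can be sketched as follows. This is an illustrative outline only: the stem list and suffix table are tiny hypothetical samples in Latin transliteration, the greedy longest-suffix-first stripping is an assumed rule scheme, and the statistical component is represented by a caller-supplied `fallback` function rather than a real Morfessor model.

```python
# Hypothetical stem dictionary (Latin transliteration, illustration only).
STEMS = {"kitab", "mektep", "bala"}

# Hypothetical suffix table: plural, genitive, locative, accusative markers.
SUFFIXES = ["lar", "ler", "ning", "da", "de", "ta", "te", "ni"]


def stem(word, fallback=None):
    """Return a stem using dictionary lookup, then rules, then a fallback."""
    # Step 1: dictionary lookup. A known stem is returned unchanged.
    if word in STEMS:
        return word

    # Step 2: rules. Strip suffixes from the right, trying longer suffixes
    # first, and stop as soon as the remainder is a known stem.
    candidate = word
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if candidate.endswith(suffix) and len(candidate) > len(suffix):
                candidate = candidate[: -len(suffix)]
                stripped = True
                if candidate in STEMS:
                    return candidate
                break

    # Step 3: statistical fallback for ambiguous or unknown words.
    # Without a fallback model, the word is returned unchanged.
    return fallback(word) if fallback else word


if __name__ == "__main__":
    for w in ["kitablarda", "mektepte", "kitab"]:
        print(w, "->", stem(w))
```

Checking the dictionary before and after each stripping step keeps the rules from over-stripping known words, and deferring everything else to the fallback mirrors the division of labor the abstract describes between rules and statistical learning.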

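Document type: Thesis
Item identifier: http://ir.xjipc.cas.cn/handle/365002/4980
Collection: Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation
GB/T 7714
热娜·艾尔肯. Research on Uyghur Lexical Analysis and Semantic Analysis for Machine Translation[D]. Beijing: University of Chinese Academy of Sciences, 2017.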
Files in this item
File name/size: 面向机器翻译的维吾尔文词法分析及语义分析 (3530KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA