Institutional Repository of the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
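Title: Research on a Generalization Language Model for Chinese-Uyghur Machine Translation (面向汉维机器翻译的泛化语言模型研究)
Author: 李响 (Li Xiang)
Defense date: 2014-05-21
Advisor: 周喜 (Zhou Xi)
Major: Computer Application Technology
Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Beijing
Degree: Master's
Keywords: Chinese-Uyghur machine translation; generalization language model; maximum entropy; N-best hypothesis translations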
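Abstract (translated from the Chinese): A language model is a mathematical model built to capture the context-dependent nature of natural language. It occupies an important place in natural language processing and is widely used in machine translation, speech recognition, Chinese pinyin input, information retrieval, and other fields. In a machine translation system, training on the corpus automatically learns two kinds of models: a translation model and a language model. The translation model rests on statistical analysis of a parallel corpus so that the translation conveys the meaning of the source language, whereas the language model decides the fluency of the translation, and its quality determines whether the output is smooth and readable. At present, the Uyghur language model in Chinese-Uyghur machine translation systems still has shortcomings. First, Uyghur forms new words by attaching affixes to a stem, and the steadily growing number of such word forms causes severe data sparsity during language model training and poor recognition of out-of-vocabulary words during decoding. Second, the Uyghur verb agrees with the subject in person and tense, so long-distance grammatical dependencies between words are common, yet statistical language models describe these dependencies poorly, which lowers the quality of the final translation. This thesis studies these problems in the Uyghur language model with the aim of alleviating or overcoming them. The main work is as follows:
1. To address the data sparsity caused by the variety of Uyghur word formation and to improve the handling of long-distance dependencies, the thesis proposes a language model based on the idea of generalization. The model takes the language model text produced during Uyghur language model training, preprocesses it, and applies a string similarity algorithm to select similar phrase strings, from which rules are extracted and parameters are estimated.
2. Because decoding is essentially driven by a heuristic search algorithm, and poor recognition of out-of-vocabulary words also affects reordering, the highest-probability candidate found by the search may not be the best translation. Using the extracted generalization rules, the thesis builds a score generator that preprocesses the N-best hypothesis translations produced during decoding, computes "co-occurring words", reranks the hypotheses, and extracts the 1-best translation.
Finally, the generalization language model is applied to a Chinese-Uyghur machine translation system to test its effectiveness. The experimental results show that the method effectively improves translation quality.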
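The first contribution in the abstract (selecting similar phrase strings with a string similarity algorithm and extracting rules from them) can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the record does not describe in detail. The similarity measure (Python's `difflib` ratio), the 0.8 threshold, the `<X>` slot marker, the function names, and the romanized toy phrases are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

SLOT = "<X>"  # hypothetical placeholder marking the generalized position


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; difflib's ratio is a stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def generalize(a: str, b: str):
    """Return (pattern, token_a, token_b) if the phrases differ in exactly one token, else None."""
    ta, tb = a.split(), b.split()
    if len(ta) != len(tb):
        return None
    diff = [i for i, (x, y) in enumerate(zip(ta, tb)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    pattern = " ".join(ta[:i] + [SLOT] + ta[i + 1:])
    return pattern, ta[i], tb[i]


def extract_rules(phrases, threshold=0.8):
    """Map each generalized pattern to the set of tokens observed in its slot."""
    rules = {}
    for a, b in combinations(phrases, 2):
        if similarity(a, b) < threshold:
            continue  # only sufficiently similar phrase pairs are considered
        result = generalize(a, b)
        if result is not None:
            pattern, tok_a, tok_b = result
            rules.setdefault(pattern, set()).update({tok_a, tok_b})
    return rules


# Toy romanized phrases, invented for illustration (not data from the thesis).
phrases = ["men kitab oqudum", "men kitablar oqudum", "u mektepke bardi"]
rules = extract_rules(phrases)
print(rules)
```

In this sketch, two phrases that share a frame but differ in one inflected word form collapse into a single pattern, which is one simple way a model could share statistics across sparse word forms.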
English abstract:
A language model is a mathematical model built to capture the context-sensitive nature of natural language. It plays an important role in natural language processing and is widely used in machine translation, speech recognition, Chinese pinyin input, information retrieval, and other fields. In a machine translation system, two models are learned automatically during training on the corpus: a translation model and a language model. The main idea of the translation model is to perform statistical analysis of parallel corpora so that the translation conveys the meaning of the source language, while the language model decides the fluency of the translation, and the performance of the language model determines whether the translation is smooth and readable. At present, the Uyghur language model in Chinese-Uyghur machine translation systems still has shortcomings. Uyghur words are formed by adding affixes to stems, and the continuous growth in new word forms leads to severe data sparseness during language model training and to poor recognition of unknown words during decoding. Meanwhile, the form of a verb must agree with the subject in person and tense, so long-distance grammatical dependencies between words are common. However, statistical language models describe the long-distance dependencies of Uyghur poorly, which lowers the quality of the final translation. This thesis aims to alleviate or overcome these problems in the Uyghur language model. The main work is as follows:
1. To address the long-distance dependencies of Uyghur and the data sparseness of the language model in Chinese-Uyghur statistical machine translation, the thesis presents a Uyghur language model based on generalization. Using the text generated during Uyghur language model training, the model applies a string similarity algorithm to find similar Uyghur strings, from which it extracts rules and estimates parameters.
2. Because decoding is essentially driven by a heuristic search algorithm, and poor recognition of unknown words affects reordering, the candidate translation with the highest probability may not be the optimal translation. Based on the extracted generalization rules, a score generator is built to preprocess the N-best hypothesis translations generated during decoding, compute "co-occurring words", rerank the hypotheses, and extract the 1-best translation.
Finally, the generalization language model is applied to a Chinese-Uyghur machine translation system to test its effectiveness. The experimental results show that the method effectively improves translation quality.
Keywords: Chinese-Uyghur machine translation, generalization language model, maximum entropy model, N-best hypothesis translations
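The reranking step in the second contribution can be sketched as follows. The sketch assumes each N-best hypothesis carries a decoder score and that a secondary score is added to it before sorting. The `Hypothesis` class, the `rule_score` function (which simply counts matches of generalized patterns, with `<X>` as a wildcard), and the interpolation `weight` are hypothetical stand-ins. The thesis's actual score generator and its "co-occurring words" computation are not specified in this record.

```python
from dataclasses import dataclass

SLOT = "<X>"  # hypothetical wildcard used in generalized patterns


@dataclass
class Hypothesis:
    text: str
    decoder_score: float  # log-probability-like score assigned by the decoder


def rule_score(text: str, patterns) -> int:
    """Count token windows in `text` that match any pattern (SLOT matches any token)."""
    tokens = text.split()
    hits = 0
    for pattern in patterns:
        p = pattern.split()
        for i in range(len(tokens) - len(p) + 1):
            window = tokens[i:i + len(p)]
            if all(pt == SLOT or pt == wt for pt, wt in zip(p, window)):
                hits += 1
    return hits


def rerank(nbest, patterns, weight=1.0):
    """Sort hypotheses by decoder score plus a weighted rule score, best first."""
    return sorted(
        nbest,
        key=lambda h: h.decoder_score + weight * rule_score(h.text, patterns),
        reverse=True,
    )


# Toy N-best list with abstract tokens, invented for illustration.
nbest = [Hypothesis("c b a", -1.0), Hypothesis("a b c", -1.5)]
patterns = ["a <X> c"]
one_best = rerank(nbest, patterns)[0]
print(one_best.text)
```

With `weight=0.0` the order falls back to the decoder's own ranking, which makes it easy to compare translations before and after reranking.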
Content type: Thesis
URI: http://ir.xjipc.cas.cn/handle/365002/3442
Appears in Collections: Multilingual Information Technology Research Laboratory_Theses

Files in This Item:
File Name / File Size: 李响.pdf (1670 KB)
Content Type: Thesis
Version: --
Access: Not yet open (contact the repository to request the full text)

Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences

Recommended Citation:
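李响 (Li Xiang). Research on a Generalization Language Model for Chinese-Uyghur Machine Translation (面向汉维机器翻译的泛化语言模型研究) [D]. Beijing: University of Chinese Academy of Sciences, 2014.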
File name: 李响.pdf
Format: Adobe PDF

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
