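Research on Key Techniques of Word Alignment for Uyghur-Chinese Machine Translation (面向维汉机器翻译的词对齐关键技术研究)
Mi Chenggang (米成刚)
Degree type: Doctoral
Advisor: Li Xiao (李晓)
2015-05-25
Degree-granting institution: University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Major: Computer Application Technology
Keywords: Uyghur-Chinese machine translation; word alignment; asymmetric alignment; data sparsity; loanwords; chunk alignment
Abstract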

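As society develops, people from different cultural backgrounds and with different language habits interact ever more frequently in culture, trade, and other fields, and the language barrier has become the main obstacle to communication. The rapid growth of research on statistical machine translation (SMT) offers an opportunity to overcome this barrier. The core idea of SMT is to analyze a large bilingual parallel corpus statistically, build a statistical translation model from it, and use that model to translate test text. Bilingual word alignment is a crucial part of the SMT framework: it is the prerequisite for phrase table generation, reordering rule extraction, and similar steps, and its accuracy has a considerable effect on the performance of an SMT system.

However, academic research on word alignment for Uyghur-Chinese machine translation is still in its infancy. Uyghur is an agglutinative language that forms new words by attaching affixes to the end of a word, whereas Chinese is an isolating language that expresses different word meanings through changes in character form. Uyghur sentence structure is subject-object-predicate, whereas Chinese is subject-predicate-object. These differences in word formation and syntactic structure cause severe data sparsity and asymmetric alignment in Uyghur-Chinese word alignment, which degrade the performance of Uyghur-Chinese machine translation systems.

This dissertation takes word alignment in Uyghur-Chinese machine translation as its main thread and studies two problems in alignment: data sparsity and asymmetric alignment. Starting from the differences between the two languages in word formation and syntactic structure, it proposes an optimized word alignment strategy (co-occurrence-degree-based Uyghur-Chinese word alignment) and a new method for discovering bilingual resources (binary-classification-based loanword identification in Uyghur), which greatly alleviate the data sparsity problem in word alignment. It also proposes Uyghur-Chinese chunk alignment as a new approach (Uyghur-Chinese chunk alignment for machine translation), which effectively addresses the asymmetric alignment problem. In addition, to minimize the effect of word alignment errors on subsequent parameter tuning and decoding, it proposes a classification-based filtering model for the Uyghur-Chinese phrase table (Naive-Bayes-based Uyghur-Chinese phrase table filtering), which filters out unreasonable phrase pairs.

The innovations of this dissertation are as follows:

1. Co-occurrence-degree-based Uyghur-Chinese word alignment. To alleviate data sparsity in Uyghur-Chinese word alignment at the level of the alignment model, this dissertation proposes a word alignment method based on co-occurrence degree for Uyghur-Chinese machine translation. Unlike traditional methods based on word co-occurrence counts, it combines word co-occurrence counts with fuzzy co-occurrence weights to form the word alignment degree. Compared with methods based on stem segmentation, it preserves the completeness of the information on the Uyghur side.

2. Binary-classification-based loanword identification in Uyghur. Based on the observation that a loanword in Uyghur is pronounced similarly to the corresponding word in its source language, and taking full account of Uyghur word formation, this dissertation proposes a binary classification model for identifying Uyghur loanwords. The model borrows the idea of classification from statistical machine learning: similarity scores obtained from several string similarity algorithms are the classifier's input, and whether the word is a loanword is its output.

3. Uyghur-Chinese chunk alignment for machine translation. To alleviate asymmetric alignment in word alignment for Uyghur-Chinese machine translation, this dissertation proposes, from a statistical modeling perspective, a Uyghur-Chinese chunking method based on a log-linear model. With reference to the segmentation of the Chinese sentence, chunk boundary information on the Uyghur side is obtained by unsupervised feature learning. To make maximal use of the information in the bilingual resources and to fuse multiple features, the log-linear model serves as the baseline model.

4. Naive-Bayes-based Uyghur-Chinese phrase table filtering. This dissertation examines the correlations and differences between source-language and target-language phrases in the Uyghur-Chinese phrase table and, taking into account the characteristics of Uyghur and its word formation, which differs sharply from that of Chinese, proposes a Naive-Bayes-based phrase table filtering model for Uyghur-Chinese machine translation. The model uses information extracted from the phrase table as the four features of the Naive Bayes model, and whether the current phrase pair should be filtered out as its output.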
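The loanword identification idea in innovation 2 (several string similarity scores in, a loanword / not-loanword decision out) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the dissertation's implementation: the three similarity measures (normalized edit distance, bigram Dice coefficient, and the `difflib` matching ratio), the use of romanized strings as a stand-in for phonetic similarity, the helper names, and the two invented toy pairs. The perceptron is chosen because the English abstract names a perceptron model.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of character bigrams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity_features(word: str, donor: str) -> list:
    """Three similarity scores in [0, 1]; the choice of measures is an assumption."""
    longest = max(len(word), len(donor), 1)
    edit_sim = 1.0 - edit_distance(word, donor) / longest
    bw, bd = bigrams(word), bigrams(donor)
    dice = 2 * len(bw & bd) / (len(bw) + len(bd)) if (bw or bd) else 1.0
    ratio = SequenceMatcher(None, word, donor).ratio()
    return [edit_sim, dice, ratio]


def train_perceptron(samples, epochs=10):
    """Classic perceptron updates; labels are +1 (loanword) or -1 (not)."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in samples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            if label * score <= 0:  # misclassified or on the boundary
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
    return weights, bias


def is_loanword(word, donor, weights, bias) -> bool:
    """Decide whether word looks like a borrowing of donor."""
    features = similarity_features(word, donor)
    return sum(w * x for w, x in zip(weights, features)) + bias > 0


# Invented toy strings, not real lexical data.
toy_pairs = [("telefon", "telefon", +1), ("abc", "xyz", -1)]
samples = [(similarity_features(w, d), y) for w, d, y in toy_pairs]
weights, bias = train_perceptron(samples)
```

Keeping the similarity measures as separate features, rather than thresholding a single score, lets the classifier learn how much weight each measure deserves, which matches the abstract's description of feeding several similarity scores to one classifier. A realistic setup would need labeled Uyghur and donor-language word pairs and a phonetically motivated transliteration step, neither of which is shown here.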

Other Abstract

With the rapid development of society, people from different cultures and with different languages communicate with each other frequently in areas such as culture and trade, and the language barrier has become increasingly significant in daily life. Research on statistical machine translation (SMT) offers a good opportunity to overcome this barrier. The main idea of SMT is that translations are generated by statistical models whose parameters are derived from the analysis of bilingual text corpora. Bilingual word alignment is one of the most important components of SMT and is the basis of phrase extraction and reordering rule generation. The precision of word alignment affects the performance of statistical machine translation. Research on word alignment for Uyghur-Chinese machine translation is still at an early stage. Uyghur is a morphologically rich language that displays vowel harmony and agglutination and lacks noun classes and grammatical gender; it is a left-branching language with Subject (S) - Object (O) - Verb (V) word order, which is very different from Chinese (S-V-O). The data sparsity and asymmetric word alignment that arise in Uyghur-Chinese word alignment because of the significant differences between the two languages can degrade the performance of Uyghur-Chinese machine translation models. This dissertation takes Uyghur-Chinese word alignment as its main line of research and tries to solve the data sparsity and asymmetric alignment problems in it. To alleviate data sparsity in word alignment, I propose an optimized Uyghur-Chinese word alignment model (co-occurrence-degree-based word alignment model), which improves precision significantly; I also propose a novel method for detecting loanwords in Uyghur texts (binary-classification-based loanword detection model for Uyghur texts), which enriches the bilingual resources effectively.
I propose a chunk-based Uyghur-Chinese word alignment model (chunk alignment for Uyghur-Chinese machine translation) to solve the asymmetric alignment problem in Uyghur-Chinese word alignment. Moreover, to reduce the effect of alignment errors on machine translation, this dissertation proposes a phrase table filtering model based on Naïve Bayes (Uyghur-Chinese phrase table filtering model based on Naïve Bayes). The main innovations of this dissertation are as follows: (1) Co-occurrence-degree-based word alignment model. I propose a Uyghur-Chinese word alignment method based on word co-occurrence degree to alleviate the data sparseness problem. The approach combines co-occurrence counts and fuzzy co-occurrence weights into a word co-occurrence degree; the fuzzy co-occurrence weights are obtained by searching for fuzzy co-occurrence word pairs and computing the differences in length between the current Uyghur word and the other Uyghur words in those pairs. (2) Binary-classification-based loanword detection model for Uyghur texts. To enrich bilingual resources, I detect Chinese and Russian loanwords in Uyghur texts according to the phonetic similarity between a loanword and its corresponding donor-language word. I propose a novel approach based on the perceptron model to discover loanwords in Uyghur texts, which treats loanword detection in Uyghur as a classification procedure. (3) Chunk alignment for Uyghur-Chinese machine translation. To alleviate the asymmetric alignment problem in Uyghur-Chinese word alignment, this dissertation proposes a Uyghur-Chinese bilingual chunking method based on a log-linear model.
The main idea of this method is as follows. Chinese is a resource-rich language, so with the help of the Chinese Treebank we can parse the Chinese side of the Uyghur-Chinese corpus and chunk it; then, with reference to the Chinese sentence and based on an unsupervised feature learning algorithm, the Uyghur part can also

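Document type: Dissertation
Item identifier: http://ir.xjipc.cas.cn/handle/365002/4230
Collection: Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Mi Chenggang. Research on Key Techniques of Word Alignment for Uyghur-Chinese Machine Translation [D]. Beijing: University of Chinese Academy of Sciences, 2015.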
Files in this item
File name: 面向维汉机器翻译的词对齐关键技术研究.p
Size: 2650KB
Document type: Dissertation
Access type: Open access
License: CC BY-NC-SA
File name: 面向维汉机器翻译的词对齐关键技术研究.pdf
Format: Adobe PDF