Research on Syntactic and Morphological Information in Chinese-Uyghur Statistical Machine Translation
Chen Lijuan (陈丽娟)
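Degree type: Master's
Supervisor: Zhou Junlin (周俊林)
Date: 2011-05-30
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of conferral: Beijing
Major: Computer Application Technology
Keywords: statistical machine translation; syntactic reordering; morphology; factored model; translation model; Uyghur
Abstract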
Machine translation research in China has concentrated on translation between Chinese and English. Among minority languages, Mongolian has received the most attention, while machine translation for Uyghur is still in its infancy. Xinjiang is a multi-ethnic region whose population is mainly Han and Uyghur. As the information age brings more frequent contact between ethnic groups, language differences become an obstacle to communication, so translation between these languages is important for promoting exchange between ethnic groups.

Phrase-based statistical machine translation is a classical approach in statistical machine translation. Given the current state of Chinese-Uyghur machine translation research, we build a phrase-based Chinese-Uyghur statistical machine translation platform from existing techniques and tools and use it as the baseline system for a first study of Chinese-Uyghur translation. Chinese-Uyghur machine translation faces three main problems: (1) no large-scale Chinese-Uyghur parallel corpus is available; (2) Chinese and Uyghur differ widely in word order; (3) Chinese and Uyghur differ widely in morphology, Uyghur being morphologically rich. In addition, the phrase-based method is weak at long-distance reordering and uses no linguistic knowledge such as syntactic or morphological information. As a result, Chinese-to-Uyghur statistical translation produces many unknown words and Uyghur output with confused word order.

To address these problems, this thesis incorporates Chinese syntactic information and Uyghur morphological information into the construction of the Chinese-to-Uyghur translation model, in order to correct the word order of the Uyghur output and reduce the word-form error rate:

1. Reorder the phrases in Chinese sentences so that they approximate Uyghur syntax. Based on a systematic study of Chinese and Uyghur word order, we derive a set of Chinese syntactic reordering rules. Before training, each source sentence is parsed, and the reordering rules are applied to the resulting Chinese phrase-structure tree so that the Chinese word order comes close to that of Uyghur.
2. Use Uyghur morphological information in model training. We analyze the morphological differences between Chinese and Uyghur and, based on a systematic summary of Uyghur morphological features, study how to extract these features and what form the corpus takes once they are introduced. We then examine the effect of morphology on translation performance in the factored model.

To reduce the syntactic and morphological differences between Chinese and Uyghur, we perform syntactic reordering on the source side and morphological analysis (segmentation) on the target side, and introduce the syntactic and morphological information into the log-linear model as "factors". Experiments show that the proposed method achieves a substantial improvement in translation quality over the baseline phrase-based system.
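The two preprocessing steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `Node` class, the single rule "VP -> VV NP becomes VP -> NP VV" (chosen only because Uyghur is verb-final), the hand-segmented toy Uyghur tokens, and the pipe-separated factor layout, which follows the convention of factored phrase-based toolkits such as Moses (the abstract does not name the toolkit used).

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Node:
    """Phrase-structure tree node: leaves carry a word, internal nodes carry children."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""

    def leaves(self) -> List[str]:
        """Return the words of the subtree in left-to-right order."""
        if not self.children:
            return [self.word]
        words: List[str] = []
        for child in self.children:
            words.extend(child.leaves())
        return words


def reorder(node: Node) -> Node:
    """Apply one hypothetical rule bottom-up: VP -> VV NP becomes VP -> NP VV."""
    kids = [reorder(child) for child in node.children]
    if (node.label == "VP" and len(kids) == 2
            and kids[0].label == "VV" and kids[1].label == "NP"):
        kids = [kids[1], kids[0]]  # move the verb behind its object
    return Node(node.label, kids, node.word)


def factored(tokens: Sequence[Tuple[str, str, List[str]]]) -> str:
    """Format (surface, stem, suffixes) triples as surface|stem|suffixes tokens."""
    return " ".join(
        f"{surface}|{stem}|{'+'.join(suffixes) if suffixes else '-'}"
        for surface, stem, suffixes in tokens
    )


if __name__ == "__main__":
    # Source side: toy parse of the Chinese sentence "我 读 书" (I read book).
    tree = Node("S", [
        Node("NP", [Node("PN", word="我")]),
        Node("VP", [Node("VV", word="读"),
                    Node("NP", [Node("NN", word="书")])]),
    ])
    print(" ".join(reorder(tree).leaves()))

    # Target side: illustrative, hand-segmented Uyghur tokens (Latin script),
    # not the output of any real morphological analyzer.
    print(factored([("kitablarni", "kitab", ["lar", "ni"]),
                    ("oquymen", "oqu", ["y", "men"])]))
```

Reordering the source before training means the word aligner and phrase extractor see Chinese and Uyghur in similar order, which is the motivation the abstract gives. A real system would need a full rule set and a trained parser and morphological analyzer in place of the toy inputs here.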
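Document type: Thesis
Item identifier: http://ir.xjipc.cas.cn/handle/365002/4416
Collection: Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences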
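Recommended citation (GB/T 7714):
Chen Lijuan. Research on Syntactic and Morphological Information in Chinese-Uyghur Statistical Machine Translation [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2011.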
Files in this item
File name/size: 汉维统计机器翻译中的句法形态信息研究.p (1369KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA
File name: 汉维统计机器翻译中的句法形态信息研究.pdf
Format: Adobe PDF