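Research on Automatic Corpus Acquisition and Domain Adaptation for Uyghur-Chinese Machine Translation (维汉机器翻译语料自动获取及领域自适应研究)
朱少林 (Zhu Shaolin)
Degree type: Doctoral
Supervisor: 李晓 (Li Xiao)
2018-05-25
Degree-granting institution: University of Chinese Academy of Sciences (中国科学院大学)
Place of conferral: Beijing
Major: Computer Application Technology
Keywords: Uyghur-Chinese machine translation; low-resource; deep learning; domain adaptation; bilingual corpus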
Abstract

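The emergence of writing marked the birth of human civilization. Writing is the carrier of information: through it people exchange ideas and spread culture. Different countries, however, use different languages, which seriously constrains human development, and in today's era of rapid globalization, translation between languages has become an important research topic. The currently popular approaches, statistical machine translation (SMT) and neural machine translation (NMT), have advanced rapidly and produced encouraging results. Although translation between pairs such as English-Chinese, English-French, and Portuguese-English has reached good quality in specific domains, translation from many regional or less widely spoken languages, such as Uyghur, Kazakh, and Turkish, into Chinese is still at an early stage of research, and its quality remains unsatisfactory. The core idea of both SMT and NMT is to obtain a translation system by training on bilingual corpora, so bilingual corpora are crucial to machine translation. Uyghur-Chinese and Kazakh-Chinese machine translation, however, currently suffer from a severe shortage of bilingual corpora; methods for acquiring such corpora automatically make it possible to build translation systems quickly and to improve translation quality. A second factor is the domain of the translation system: systems built for different domains perform differently, and translating text whose domain differs greatly from that of the system substantially lowers translation quality.

Taking Uyghur-Chinese machine translation as its entry point, and aiming to build machine translation systems quickly and improve their quality, this thesis focuses on the automatic acquisition of Uyghur-Chinese bilingual corpora and on domain adaptation for Uyghur-Chinese machine translation. On the one hand, given the low-resource status of the Uyghur-Chinese pair, it proposes first building Uyghur and Chinese word vector models that carry semantic information, then deriving bilingual word vectors with deep learning methods, and from these deriving sentence-aligned bilingual corpora. This approach can greatly alleviate the scarcity of bilingual resources, since it acquires sentence-aligned bilingual corpora automatically with as little bilingual knowledge as possible. On the other hand, to improve translation quality, the thesis proposes a domain adaptation method that adapts both the translation model and the language model; at translation time, word vectors are combined with a topic analysis model to select the translation system most relevant to the domain of the input.

The main contributions of the thesis are as follows:

1. Automatic acquisition of a Uyghur-Chinese bilingual dictionary. In view of the scarcity of Uyghur-Chinese bilingual corpus resources, the thesis proposes a method for learning translation word pairs from monolingual corpora in the two languages. Unlike traditional methods that extract translation word pairs from parallel sentence-aligned corpora, this method requires no sentence-aligned bilingual corpus: a few hundred bilingual word pairs are enough to obtain translation word pairs from the monolingual corpora. The method is particularly suited to semantic representation between low-resource languages.

2. Automatic acquisition of bilingual corpora for low-resource Uyghur-Chinese machine translation. The main current approach to obtaining sentence-aligned bilingual corpora is to build a classifier that identifies parallel text. Training a good classifier, however, requires enough sentence-aligned bilingual data, which is extremely hard to obtain for low-resource languages. The thesis proposes a deep-learning-based method for acquiring sentence-aligned bilingual corpora automatically. It first draws on the derivation of translation word pairs: the learned word pairs are used to obtain sentence-aligned training data for the classifier. A deep bidirectional recurrent neural network classifier is then built, treating bilingual corpus acquisition as a classification task, and the machine translation bilingual corpus is constructed automatically.

3. Domain adaptation for Uyghur-Chinese machine translation. To further improve the quality of Uyghur-Chinese statistical machine translation, the thesis proposes a domain adaptation method consisting of translation model adaptation and language model adaptation. For translation model adaptation, word vectors and a topic analysis model are used to classify the training corpus automatically by topic domain; at translation time, the system most relevant to the domain of the input is selected. For language model adaptation, a weight-based edit distance method selects training data for the specific domain, improving the quality of the language model.

4. Coverage-based selection of training data for Uyghur-Chinese machine translation. To explore further ways of improving Uyghur-Chinese machine translation, the thesis proposes selecting training data according to bilingual sentence-pair coverage. Taking into account the pronounced one-to-many phenomenon between Uyghur and Chinese, the method uses n-grams to measure the redundant information in sentences and filters that redundancy out of the bilingual sentences. With less training data, it yields a translation system whose performance is nearly equivalent to that of a system trained on a much larger corpus.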
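As an illustration of contribution 1, the idea of turning a small seed dictionary into word translations over two monolingual embedding spaces can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the thesis: the thesis describes a deep learning approach, whereas the sketch substitutes a plain linear mapping fitted by gradient descent, and the function names, word labels, and 2-D toy vectors are all hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def learn_mapping(seed_pairs, dim, lr=0.1, epochs=500):
    """Fit W so that W @ x is close to y for every seed pair (x, y)."""
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        for x, y in seed_pairs:
            err = [p - t for p, t in zip(matvec(W, x), y)]
            for i in range(dim):
                for j in range(dim):
                    # Gradient step on the squared error 0.5 * ||W x - y||^2.
                    W[i][j] -= lr * err[i] * x[j]
    return W

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def translate(word, src_vecs, tgt_vecs, W):
    """Map a source word into the target space and return its nearest target word."""
    mapped = matvec(W, src_vecs[word])
    return max(tgt_vecs, key=lambda t: cosine(mapped, tgt_vecs[t]))

# Hypothetical toy embeddings: the target space is the source space rotated by 90 degrees.
src_vecs = {"src_1": [1.0, 0.0], "src_2": [0.0, 1.0], "src_3": [1.0, 1.0]}
tgt_vecs = {"tgt_1": [0.0, 1.0], "tgt_2": [-1.0, 0.0], "tgt_3": [-1.0, 1.0]}

# Seed dictionary: only two known word pairs (a real seed would hold a few hundred).
seed = [("src_1", "tgt_1"), ("src_2", "tgt_2")]
W = learn_mapping([(src_vecs[s], tgt_vecs[t]) for s, t in seed], dim=2)

# Induce a translation for a word that is not in the seed dictionary.
best = translate("src_3", src_vecs, tgt_vecs, W)
print(best)
```

In practice the embeddings would come from word vector models trained separately on each monolingual corpus, and the seed dictionary is the only bilingual supervision needed.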

Other abstract

The appearance of writing marks the birth of human civilization. Writing is the carrier of information, and people use it to exchange ideas and spread culture. Different countries, however, have different languages, and this seriously restricts human development. Especially today, with the rapid progress of global integration, mutual translation between languages has become an important research topic. The currently popular statistical machine translation (SMT) and neural machine translation (NMT) have developed rapidly and yielded promising results. Although language pairs such as English-Chinese, English-French, and English-Portuguese have achieved good translation quality in specific domains, translation from many regional or less widely spoken languages, such as Uyghur, Kazakh, and Turkish, into Chinese is still at an early stage of research, and its quality is not yet satisfactory. For both SMT and NMT, the core idea is to obtain a translation system by training on bilingual corpora, so bilingual corpora play a crucial role in machine translation. Uyghur-Chinese and Kazakh-Chinese machine translation, however, face a serious shortage of bilingual corpora. Methods that acquire bilingual corpora automatically make it possible to build a translation system quickly and to improve translation quality. In addition, the domain of a translation system matters: machine translation systems for different domains perform differently, and translating text from a domain that differs greatly from that of the system can sharply reduce translation quality.

This thesis takes Uyghur-Chinese machine translation as its starting point, with the aim of rapidly building a machine translation system and improving its quality. It focuses on the automatic harvesting of Uyghur-Chinese bilingual corpora and on domain adaptation for Uyghur-Chinese machine translation. On the one hand, given that Uyghur-Chinese is a low-resource pair, it proposes first constructing Uyghur and Chinese word vector models that carry semantic information, then using a deep learning method to derive bilingual word vectors, and from these inferring sentence-aligned bilingual corpora. This method can greatly relieve the scarcity of bilingual resources, acquiring sentence-aligned corpora automatically with as little bilingual knowledge as possible. On the other hand, to improve the quality of Uyghur-Chinese machine translation, the thesis proposes a domain adaptation method that adapts both the translation model and the language model. At translation time, it builds word vectors, combines them with a topic model, and selects the translation system most relevant to the domain.

The main contributions of this thesis can be summarized as follows:

1. Automatic acquisition of a Uyghur-Chinese bilingual dictionary. In view of the scarcity of Uyghur-Chinese bilingual corpus resources, the thesis proposes a method for learning bilingual translation word pairs from monolingual corpora in the two languages. Unlike the traditional approach of extracting translation word pairs from parallel sentence-aligned corpora, this method does not require a sentence-aligned corpus. It needs only a few hundred bilingual word pairs to obtain translation word pairs from the monolingual corpora. The method is particularly suitable for semantic representation between low-resource languages.

2. Automatic acquisition of bilingual corpora for low-resource Uyghur-Chinese machine translation. At present, the main way to obtain a sentence-aligned corpus is to build a classifier that recognizes parallel text. Training a good classifier requires enough sentence-aligned data, but for low-resource languages such data is extremely difficult to obtain. The thesis therefore proposes a deep learning method for automatically acquiring sentence-aligned corpora. The method first draws on the derivation of bilingual translation word pairs and uses the learned pairs to obtain sentence-aligned training data for the classifier. It then builds a deep bidirectional recurrent neural network classifier, treats bilingual corpus acquisition as a classification task, and automatically constructs the bilingual corpus for machine translation.

3. Domain adaptation for Uyghur-Chinese machine translation. To further improve the quality of Uyghur-Chinese statistical machine translation, the thesis proposes a domain adaptation method divided into translation model adaptation and language model adaptation. For translation model adaptation, it uses word vectors and a topic model to classify the training corpus automatically by topic domain; at translation time, the system most relevant to the domain of the input is selected. For language model adaptation, a weight-based edit distance method selects training data from the specific domain to improve the quality of the language model.

4. Coverage-based selection of training data for Uyghur-Chinese machine translation. To study further ways of improving Uyghur-Chinese machine translation, the thesis proposes a method that selects training data according to bilingual sentence-pair coverage. Taking into account the pronounced one-to-many phenomenon between Uyghur and Chinese, the method uses n-grams to calculate the redundant information in sentences and filters that redundancy out of the parallel sentences. With less training data, the resulting system achieves translation performance nearly equal to that of a system trained on a much larger corpus.
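The coverage-based selection in contribution 4 can be illustrated with a small greedy filter. This is a hedged sketch rather than the procedure of the thesis: the abstract states only that n-grams are used to measure redundancy, so the scoring rule (share of previously unseen n-grams on both sides), the threshold, the function names, and the placeholder tokens are all assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def select_by_coverage(pairs, n=2, min_new_ratio=0.6):
    """Greedily keep sentence pairs that add enough unseen n-grams.

    A pair is kept when at least `min_new_ratio` of its source and target
    n-grams have not been covered by the pairs kept so far.
    """
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        src_grams = ngrams(src.split(), n)
        tgt_grams = ngrams(tgt.split(), n)
        total = len(src_grams) + len(tgt_grams)
        if total == 0:
            continue  # sentences shorter than n tokens carry no n-grams
        new = len(src_grams - seen_src) + len(tgt_grams - seen_tgt)
        if new / total >= min_new_ratio:
            kept.append((src, tgt))
            seen_src |= src_grams
            seen_tgt |= tgt_grams
    return kept

# Placeholder tokens stand in for segmented Uyghur (left) and Chinese (right) words.
corpus = [
    ("a b c", "x y z"),
    ("a b c", "x y z"),   # exact duplicate: adds nothing new
    ("a b d", "x y w"),   # half of its bigrams are already covered
    ("e f g", "u v w"),   # entirely new material
]
kept = select_by_coverage(corpus)
print(len(kept), "of", len(corpus), "pairs kept")
```

The threshold trades corpus size against coverage: a higher `min_new_ratio` discards more partially redundant pairs, so it would need to be tuned against translation quality on held-out data.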

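Pages: 123
Document type: Thesis
Item identifier: http://ir.xjipc.cas.cn/handle/365002/5456
Collection: Multilingual Information Technology Research Laboratory (多语种信息技术研究室)
Recommended citation
GB/T 7714
朱少林. 维汉机器翻译语料自动获取及领域自适应研究[D]. 北京. 中国科学院大学,2018.
Files in this item
File name/size, Document type, Version type, Access type, License
维汉机器翻译语料自动获取及领域自适应研究 (3044KB), Thesis, (none listed), Open access, CC BY-NC-SA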