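Research on Unsupervised Uyghur Word Normalization in Uyghur-Chinese Spoken-Language Machine Translation (维汉口语机器翻译中维语无监督词汇归一化研究)
Author: Luo Yangen (罗延根)
Degree type: Master's
Advisor: Li Xiao (李晓)
2017-05-21
Degree-granting institution: University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Degree discipline: Computer Technology
Keywords: spoken Uyghur; Uyghur-Chinese machine translation; non-standard words; word vectorization; resampling
Abstract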

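Machine translation is the process of using computers to translate automatically between different languages. The mainstream approach to Uyghur-Chinese machine translation is currently statistical machine translation. Because Uyghur-Chinese parallel corpora for the spoken-language domain are scarce, and because spoken-language text contains many irregularly written words (non-standard words), statistical machine translation performs poorly in this domain. Little research in recent years has addressed the normalization of non-standard Uyghur words, and most existing methods are rule-based or combine rules with statistics. Meanwhile, unsupervised normalization is becoming an active research topic for English and other languages. This thesis improves on existing unsupervised normalization methods and proposes an unsupervised lexical normalization method suited to Uyghur, with the aim of improving Uyghur-Chinese spoken-language machine translation. Normalizing non-standard Uyghur tokens in spoken-language text to standard words of similar meaning found in formal text makes the text more regular and better suited to existing machine translation systems while preserving its meaning. The method first vectorizes words with a neural network, then obtains a list of candidate standard words for each non-standard word from the vector space, and finally completes normalization with a decoding search algorithm.

The research focuses on two questions. First, how to obtain a list of semantically similar standard-word candidates for a non-standard word. Second, how to normalize each non-standard word in a spoken-language sentence to the closest standard word in its candidate list. For the first, word-vector comparison experiments were run on Uyghur text, and fasttext was chosen to vectorize Uyghur words because it captures n-gram information within a word, which matches the way Uyghur words are built from stems and affixes. For the second, the thesis proposes a greedy decoding search algorithm that uses a language model to take into account the context of a non-standard word in the current sentence. In addition, a resampling method is proposed: after normalization is complete, resampling is iterated, and the non-standard words that have already been normalized successfully are used to adjust the vector space so that non-standard words that previously failed can be normalized. Finally, the experimental results show that the proposed lexical normalization method outperforms existing methods, and that using it as a preprocessing step for Uyghur-Chinese machine translation improves system performance.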
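As a rough illustration of the candidate-generation step described in the abstract, the following Python sketch ranks standard words by how much their character n-grams overlap with those of a non-standard word. It is not the thesis's implementation: fasttext learns dense vectors from a corpus, whereas this stand-in compares raw n-gram counts. The function names, the n-gram range of 3 to 5, and the English toy words are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def ngram_vector(word):
    """Sparse count vector over character n-grams (a stand-in for a learned embedding)."""
    return Counter(char_ngrams(word))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidates(nonstandard, vocabulary, k=3):
    """Return up to k standard words ranked by similarity to the non-standard word."""
    query = ngram_vector(nonstandard)
    scored = [(cosine(query, ngram_vector(w)), w) for w in vocabulary]
    # Sort by descending similarity, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for score, w in scored[:k] if score > 0]


if __name__ == "__main__":
    # Hypothetical toy vocabulary of "standard" words.
    vocab = ["tomorrow", "today", "tonight", "borrow"]
    print(candidates("tomorow", vocab, k=2))
```

Because the comparison is made on subword units, a misspelled form still shares most of its n-grams with the intended word, which is the property the abstract attributes to fasttext for stem-and-affix languages such as Uyghur.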

Other abstract

Machine translation is the automatic translation of text between languages. At present, the dominant approach to Uyghur-Chinese machine translation is statistical machine translation. Because parallel bilingual corpora for the spoken-language domain are scarce, and because spoken Uyghur text contains many non-standard words, existing Uyghur-Chinese machine translation systems do not perform well in that domain. Little research has been done on normalizing non-standard Uyghur words, and what exists relies mainly on rule-based and statistics-based methods. Unsupervised word normalization is becoming popular for other languages such as English. We improve on existing unsupervised methods and propose an unsupervised normalization method for Uyghur that maps non-standard words in spoken Uyghur text to their canonical forms. This converts non-standard words in spoken text into standard forms that existing machine translation systems handle better, while preserving the original meaning. The method first embeds words as low-dimensional vectors, then retrieves canonical-word candidates, and finally normalizes each word with a decoding search algorithm.

The research focuses on two aspects. The first is how to generate canonical-word candidates for a non-standard word. The second is how to map a non-standard word to the most semantically similar canonical word in its candidate list. For the first, different word embedding methods are compared on Uyghur words. Fasttext performs best because it takes the n-grams inside a word into account, which suits the fact that Uyghur words are composed of stems and affixes. For the second, a greedy decoding algorithm is proposed; it uses a language model to take into account the context of a non-standard word in the spoken-language sentence being decoded.

In addition, a bootstrapping method is proposed: after the normalization process, we resample the corpus and repeat the full normalization process, so that words already normalized help normalize words that previously failed. Finally, experiments show that the proposed method performs better than existing methods on Uyghur word normalization, and that applying it as a preprocessing step in a Uyghur-Chinese machine translation system improves translation performance.
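The greedy decoding step mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the abstract does not give the scoring function, so the sketch simply adds the log of a candidate's similarity to a weighted bigram log-probability. The add-one-smoothed bigram model, the `weight` parameter, the function names, and the English toy data are all hypothetical.

```python
from collections import Counter
from math import log


def train_bigram_lm(sentences):
    """Build an add-one-smoothed bigram log-probability function from tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        vocab.update(tokens)
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(vocab)

    def log_prob(prev, word):
        return log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))

    return log_prob


def greedy_normalize(sentence, candidate_table, log_prob, weight=1.0):
    """Normalize left to right, committing to the best candidate at each position.

    candidate_table maps a non-standard token to a list of
    (standard_word, similarity) pairs with similarity > 0.
    """
    output = []
    prev = "<s>"
    for token in sentence:
        options = candidate_table.get(token)
        if options:
            # Combine vector-space similarity with the language-model context score.
            best = max(
                options,
                key=lambda o: log(o[1]) + weight * log_prob(prev, o[0]),
            )
            token = best[0]
        output.append(token)
        prev = token  # later decisions see the already-normalized left context
    return output


if __name__ == "__main__":
    corpus = [
        ["see", "you", "tomorrow"],
        ["see", "you", "tonight"],
        ["we", "borrow", "books"],
    ]
    lm = train_bigram_lm(corpus)
    table = {
        "u": [("you", 0.9)],
        "tmrw": [("tomorrow", 0.6), ("borrow", 0.6)],
    }
    print(greedy_normalize(["see", "u", "tmrw"], table, lm))
```

In the toy run, the two candidates for the last token have equal similarity, so the language-model term decides between them using the normalized left context. A greedy search never revisits an earlier choice, which keeps decoding linear in sentence length at the cost of possibly missing a better joint assignment.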

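Document type: Thesis
Item identifier: http://ir.xjipc.cas.cn/handle/365002/4933
Collection: Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Luo Yangen. Research on Unsupervised Uyghur Word Normalization in Uyghur-Chinese Spoken-Language Machine Translation [D]. Beijing: University of Chinese Academy of Sciences, 2017.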
Files in this item
File name / size: 维汉口语机器翻译中维语无监督词汇归一化研 (1850KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA
File name: 维汉口语机器翻译中维语无监督词汇归一化研究.pdf
Format: Adobe PDF
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.