Unsupervised Uyghur Lexical Normalization for Uyghur-Chinese Spoken-Language Machine Translation (维汉口语机器翻译中维语无监督词汇归一化研究)
罗延根
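Subtype: Master's
Thesis Advisor: 李晓
2017-05-21
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Beijing
Degree Discipline: Computer Technology
Keyword: spoken Uyghur; Uyghur-Chinese machine translation; non-standard words; word vectorization; resampling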
Abstract

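Machine translation is the automated, computer-based process of translating between languages. The mainstream approach to Uyghur-Chinese machine translation is currently statistical machine translation. Because Uyghur-Chinese parallel corpora are scarce in the spoken-language domain, and because spoken-language text contains many non-standardly written words (non-standard words), statistical machine translation performs poorly in this domain. Normalization of Uyghur non-standard words has received little research attention in recent years, and most existing methods are rule-based or combine rules with statistics. Unsupervised normalization, meanwhile, is becoming an active research topic for English and other languages. This thesis improves on existing unsupervised normalization methods and proposes an unsupervised lexical normalization method suited to Uyghur, with the aim of improving Uyghur-Chinese spoken-language machine translation. Normalizing non-standard Uyghur terms in spoken-language text to standard words of similar meaning in formal text makes the text more regular and better suited to existing machine translation systems while preserving its meaning. Words are first vectorized with a neural network, a list of candidate standard words for each non-standard word is then obtained from the vector space, and normalization is finally completed with a decoding search algorithm. The research focuses on two questions: first, how to obtain a candidate list of semantically similar standard words for a non-standard word; second, how to normalize each non-standard word in a spoken-language sentence to the closest standard word in its candidate list. For the first, word-embedding methods were compared experimentally on Uyghur text, and fasttext was adopted because it captures n-gram information within words, which fits the stem-plus-affix structure of Uyghur words. For the second, the thesis proposes a greedy decoding search algorithm that uses a language model to take into account the context of the non-standard word in the current sentence. In addition, a resampling method is proposed: after a normalization pass, resampling is iterated, and the non-standard words already normalized successfully are used to adjust the vector space so that words that previously could not be normalized can be. Experimental results show that the proposed method outperforms existing methods, and that using it as a preprocessing step for Uyghur-Chinese machine translation improves system performance.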
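
The candidate-retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: fasttext learns dense vectors from subword n-grams, whereas this stand-in simply counts character n-grams and ranks a standard-word vocabulary by cosine similarity. The function names, the n-gram range, and the Latin-script toy strings are all assumptions made for illustration.

```python
import math
from collections import Counter

def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of the word padded with boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def embed(word):
    """Sparse bag-of-n-grams vector (assumed stand-in for a learned embedding)."""
    return Counter(char_ngrams(word))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidates(word, vocabulary, top_k=3):
    """Return up to top_k standard words most similar to `word`."""
    query = embed(word)
    scored = [(cosine(query, embed(v)), v) for v in vocabulary]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [v for score, v in scored[:top_k] if score > 0.0]

# Hypothetical toy vocabulary of standard forms.
standard_vocab = ["kitab", "kitablar", "mektep", "bala"]
print(candidates("kitap", standard_vocab, top_k=2))
```

Because shared n-grams drive the score, a misspelling that keeps most of the stem tends to land near its standard form, which is the property the abstract attributes to subword-aware embeddings.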

Other Abstract

Machine translation is the automatic process of translating between languages. At present, the dominant approach to Uyghur-Chinese machine translation is statistical machine translation. Because parallel bilingual corpora are scarce for spoken-language text, and because spoken Uyghur text contains many non-standard words, existing Uyghur-Chinese machine translation systems perform poorly in the spoken-language domain. Little research has been done on normalizing non-standard Uyghur words, and what exists relies mainly on rule-based and statistical methods. Unsupervised word normalization is becoming popular for other languages such as English. We improve on existing unsupervised methods and propose an unsupervised normalization method for Uyghur that maps each non-standard word in spoken Uyghur text to its canonical word. This converts non-standard words into a standard form that existing machine translation systems handle better, while preserving the original meaning. The method first embeds each word as a low-dimensional vector, then retrieves canonical word candidates, and finally normalizes the word with a decoding search algorithm. The research focuses on two aspects: how to generate the canonical word candidates for a non-standard word, and how to bind a non-standard word to the most semantically similar canonical word in its candidate list. For the first, we compare different word embedding methods on Uyghur words. Fasttext performs best because it takes the n-grams inside a word into account, which matches the fact that Uyghur words are composed of stems and affixes. For the second, we propose a greedy decoding algorithm that uses a language model to capture the context of a non-standard word in the spoken-language sentence being decoded. In addition, we propose a bootstrapping method: after the normalization process, we resample the corpus and repeat the full normalization process, so that words already normalized help normalize the words that previously failed. Experiments show that the proposed method outperforms existing methods on Uyghur word normalization, and that applying it as a preprocessing step in a Uyghur-Chinese machine translation system improves translation performance.
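
The greedy decoding idea in the abstract can be sketched as follows. This is an illustrative reading, not the thesis's algorithm: it assumes a left-to-right pass, an add-one smoothed bigram language model, and a score that sums log similarity and log bigram probability. The function names, the `sim_weight` parameter, and the toy counts are hypothetical.

```python
import math

def bigram_logprob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed log P(word | prev)."""
    num = bigram_counts.get((prev, word), 0) + 1
    den = unigram_counts.get(prev, 0) + vocab_size
    return math.log(num / den)

def greedy_normalize(tokens, candidate_lists, bigram_counts, unigram_counts,
                     vocab_size, sim_weight=1.0):
    """Left-to-right greedy pass; each choice is final and conditions the next.

    candidate_lists maps a non-standard token to (standard_word, similarity)
    pairs with similarity > 0. Tokens without candidates are kept unchanged.
    """
    output = []
    prev = "<s>"  # sentence-start marker
    for token in tokens:
        options = candidate_lists.get(token)
        if not options:
            chosen = token
        else:
            chosen = max(
                options,
                key=lambda opt: sim_weight * math.log(opt[1])
                + bigram_logprob(prev, opt[0], bigram_counts,
                                 unigram_counts, vocab_size),
            )[0]
        output.append(chosen)
        prev = chosen
    return output

# Hypothetical toy data: one non-standard token with two candidates.
candidate_lists = {"kitap": [("kitab", 0.6), ("kitablar", 0.5)]}
bigram_counts = {("men", "kitab"): 3, ("men", "kitablar"): 1}
unigram_counts = {"men": 4}
print(greedy_normalize(["men", "kitap", "oqudum"], candidate_lists,
                       bigram_counts, unigram_counts, vocab_size=10))
```

Only the left neighbour is consulted here; a language model with wider context, or a rescoring pass over the whole sentence, would be closer to "context information" in the fuller sense. The bootstrapping step from the abstract would wrap this pass in a loop that re-estimates the vector space after each round.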

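Document Type: Thesis
Identifier: http://ir.xjipc.cas.cn/handle/365002/4933
Collection: Multilingual Information Technology Research Laboratory
Affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences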
Recommended Citation
GB/T 7714
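罗延根. Unsupervised Uyghur Lexical Normalization for Uyghur-Chinese Spoken-Language Machine Translation (维汉口语机器翻译中维语无监督词汇归一化研究) [D]. Beijing: University of Chinese Academy of Sciences, 2017.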
Files in This Item:
File Name/Size: 维汉口语机器翻译中维语无监督词汇归一化研 (1850KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA
File name: 维汉口语机器翻译中维语无监督词汇归一化研究.pdf
Format: Adobe PDF
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.