词法规则在维吾尔语语音识别中的应用 (Application of Morphological Rules in Uyghur Speech Recognition)
薛化建
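Thesis Advisor: 李晓
2012-12-04
Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Beijing
Degree Name: Doctorate
Degree Discipline: Computer Application Technology
Keyword: Uyghur; speech recognition; morphological rules; word segmentation; modeling of speech phenomena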
Abstract

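After more than 60 years of research and development, automatic speech recognition (ASR) has made great progress. Speech recognition for major languages such as English and Chinese is maturing and has begun to enter commercial use. Uyghur speech recognition has received attention only in recent years, and research so far has mainly borrowed mature techniques developed for the major languages. Because the linguistic characteristics of Uyghur differ from those of English and Chinese, many problems in Uyghur speech recognition remain to be solved.

Uyghur is an agglutinative language. Its most prominent feature is the rich morphological variation of words, which causes the vocabulary that a speech recognition system must handle to grow sharply and also produces a large number of out-of-vocabulary (OOV) words. This poses a great challenge for Uyghur speech recognition, and the choice of strategy for handling it is an important research question. In addition, speech phenomena such as vowel weakening, vowel deletion, and sound insertion occur during morphological change. These phenomena affect recognition performance to some extent, so methods for handling them need to be studied.

To address these problems, this thesis focuses on the morphological system of Uyghur. It studies Uyghur word segmentation algorithms, a Uyghur speech recognition system based on subword units, and methods for modeling vowel weakening, deletion, and insertion.

The main work and contributions of this thesis are as follows:

1. The OOV problem in Uyghur speech recognition
The morphological variation of Uyghur causes a severe OOV problem in speech recognition. To study quantitatively how the OOV problem affects the performance of a Uyghur speech recognition system, this thesis proposes a triphone-based optimal text selection algorithm that controls the OOV rate of the test set and thereby builds different test sets. The algorithm was implemented in Python and applied to the text transcriptions of a telephone speech database to build a Uyghur telephone speech corpus. Experimental results show that when the test-set OOV rate is high, only techniques that reduce the test-set OOV rate effectively improve recognition performance.

2. Uyghur word segmentation algorithms
Word segmentation is foundational work for Uyghur natural language processing. This thesis studies the morphological variation of Uyghur words and describes the morphological rules that stems and affixes must follow when they combine into words. Using a collected stem lexicon and affix lexicon, it implements a rule-based Uyghur word segmentation algorithm and proposes a segmentation algorithm that combines rules with statistics. The combined algorithm retains the advantages of the rule-based algorithm and can also segment out-of-vocabulary words. Experimental results show that it achieves the best segmentation performance among the algorithms compared.

3. Uyghur speech recognition based on subword units
The rich morphology of Uyghur produces a large number of OOV words, which poses a major challenge for Uyghur speech recognition. To address this, the thesis studies and builds a Uyghur speech recognition system based on subword units. In the subword-based recognition experiments, different word segmentation algorithms are used to generate subword sequences, and the performance of the resulting subword units in speech recognition is compared.

4. Modeling speech phenomena in speech recognition
In Uyghur, attaching certain affixes to a word triggers speech phenomena such as vowel weakening, deletion, and insertion. This thesis studies these phenomena and proposes a method for modeling them in speech recognition. The method uses the rule-based word segmentation algorithm to identify the phenomena and build a lexicon of variant stem forms, and then uses that lexicon to generate a multi-pronunciation dictionary that models them. Experimental results show that the method effectively improves the recognition rate of the recognition units.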
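As a rough illustration of the quantity that the first contribution controls, the sketch below computes a token-level OOV rate for a test set against a training vocabulary, first with whole words and then with stem and suffix units. The function name `oov_rate`, the `+` marker on suffixes, and the toy Latin-transliterated tokens are assumptions made for this example. It is not the triphone-based text selection algorithm proposed in the thesis, and it does not use the thesis's data.

```python
from typing import Iterable, Set


def oov_rate(train_vocab: Set[str], test_tokens: Iterable[str]) -> float:
    """Fraction of test tokens that are absent from the training vocabulary."""
    tokens = list(test_tokens)
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in train_vocab)
    return missing / len(tokens)


# Hypothetical toy data: whole words as recognition units.
train_words = ["bala", "balilar", "kitab"]
test_words = ["kitablar", "bala", "kitab", "balilar"]
word_rate = oov_rate(set(train_words), test_words)

# The same text split into hypothetical stem and suffix units.
train_units = ["bala", "bali", "+lar", "kitab"]
test_units = ["kitab", "+lar", "bala", "kitab", "bali", "+lar"]
unit_rate = oov_rate(set(train_units), test_units)

print(f"word-level OOV rate: {word_rate:.2f}")
print(f"unit-level OOV rate: {unit_rate:.2f}")
```

In this toy case the unseen word form is covered once it is split into a known stem and a known suffix, which is the intuition behind using subword units when the OOV rate is high.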

Other Abstract

After sixty years of research and development, Automatic Speech Recognition (ASR) has made great progress. Speech recognition technology for mainstream languages such as English and Chinese is maturing and has been put into practice. Recently, there has been growing interest in porting state-of-the-art speech recognition systems to Uyghur in order to develop Uyghur ASR applications. However, Uyghur differs in nature from English and Chinese, and many problems in Uyghur speech recognition remain to be solved.

Uyghur is an agglutinative language. Its main characteristic is its rich morphology, which leads not only to very fast vocabulary growth but also to an enormous number of out-of-vocabulary (OOV) words. This is a great challenge for Uyghur ASR and an important research topic. In addition, speech phenomena such as vowel weakening, dropping, and insertion occur in Uyghur morphology and decrease the accuracy of speech recognition. Handling them is another research topic of this thesis.

To address these challenges, this thesis concentrates on Uyghur morphology, develops Uyghur word segmentation algorithms, develops a Uyghur speech recognition system based on subword units, and develops a method for handling the speech phenomena that arise in morphology. The main contributions and innovations are as follows:

1. OOV problem in Uyghur speech recognition
The complex morphology of Uyghur causes a severe OOV problem in speech recognition. To study the OOV problem quantitatively, an optimal text selection algorithm is presented that can tune the OOV rate of the test set in the corpus. The algorithm is implemented in Python and has been applied to text transcriptions of Uyghur telephone speech in order to build a telephone speech corpus. Experimental results show that when the OOV rate of the test set is high, the key to improving recognition accuracy is to use techniques that reduce that OOV rate.

2. Uyghur word segmentation algorithm
Word segmentation is basic work in Uyghur natural language processing. This thesis studies Uyghur morphology, describes Uyghur morphological rules, collects a stem corpus and a suffix corpus, and implements a rule-based word segmentation algorithm. It also proposes a word segmentation algorithm that combines rules and statistics, which keeps the accuracy of the rule-based algorithm and can also segment OOV words. Experimental results show that the new algorithm achieves the best segmentation accuracy.

3. Research on Uyghur speech recognition based on subword units
The complex morphology of Uyghur produces a large number of OOV words, which is a great challenge for Uyghur speech recognition. To address this problem, this thesis studies and constructs a Uyghur speech recognition system based on subword units. In the experiments, various word segmentation algorithms are used to produce subword units, and the thesis investigates how the different subword units perform in Uyghur speech recognition.

4. Dealing with the speech phenomena in Uyghur speech recognition
Speech phenomena such as vowel weakening, dropping, and insertion occur when certain suffixes are attached to words in Uyghur. This thesis studies these phenomena and proposes a method for handling them in Uyghur speech recognition. The rule-based word segmentation algorithm is used to identify the phenomena and to build a corpus of variant stem forms, which is then used to build a pronunciation lexicon containing words with more than one pronunciation. Experimental results show that this method improves the accuracy of Uyghur speech recognition.
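The rule-based segmentation in contribution 2 and the variant stem forms in contribution 4 can be pictured with a minimal sketch like the one below. It assumes a stem table that maps each surface stem form to a canonical stem, plus a flat suffix list. All names (`STEM_FORMS`, `SUFFIXES`, `segment`, `variant_lexicon`) and the toy Latin-transliterated entries are hypothetical, and the plain longest-stem-first matching stands in for the thesis's morphological rules, which the abstract does not spell out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicons in Latin transliteration; not the thesis's data.
# Each surface stem form maps to its canonical (dictionary) stem.
STEM_FORMS: Dict[str, str] = {
    "bala": "bala",    # canonical form
    "bali": "bala",    # variant form after vowel weakening
    "oghul": "oghul",  # canonical form
    "oghl": "oghul",   # variant form after vowel dropping
}
SUFFIXES: List[str] = ["lar", "ni", "i"]


def split_suffixes(rest: str) -> Optional[List[str]]:
    """Split `rest` into known suffixes, or return None if that is impossible."""
    if not rest:
        return []
    for suffix in SUFFIXES:
        if rest.startswith(suffix):
            tail = split_suffixes(rest[len(suffix):])
            if tail is not None:
                return [suffix] + tail
    return None


def segment(word: str) -> Optional[Tuple[str, str, List[str]]]:
    """Return (surface stem, canonical stem, suffixes), trying the longest stem first."""
    for end in range(len(word), 0, -1):
        surface = word[:end]
        if surface in STEM_FORMS:
            suffixes = split_suffixes(word[end:])
            if suffixes is not None:
                return surface, STEM_FORMS[surface], suffixes
    return None


def variant_lexicon() -> Dict[str, List[str]]:
    """Group all surface stem forms under their canonical stem."""
    lexicon: Dict[str, List[str]] = {}
    for surface, canonical in STEM_FORMS.items():
        lexicon.setdefault(canonical, []).append(surface)
    return lexicon


print(segment("balilarni"))
print(segment("oghli"))
print(variant_lexicon())
```

Grouping surface forms under one canonical stem mirrors the idea of a lexicon entry with more than one pronunciation. A real system would store phone sequences rather than spellings, and a word whose stem is missing from the table (which returns `None` here) is the case that the combined rule and statistics algorithm is meant to cover.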

Document Type: Thesis
Identifier: http://ir.xjipc.cas.cn/handle/365002/4362
Collection: 多语种信息技术研究室 (Multilingual Information Technology Research Laboratory)
Recommended Citation
GB/T 7714
薛化建. 词法规则在维吾尔语语音识别中的应用[D]. 北京. 中国科学院研究生院,2012.
Files in This Item:
File Name/Size: 词法规则在维吾尔语语音识别中的应用.pd (1251KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.