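Research on Key Technologies of Modern Uyghur Lexical Information Processing
艾孜尔古丽·玉素甫
Degree type: Doctoral
Advisor: 李晓
2016-05-29
Degree-granting institution: University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Degree discipline: Computer Application Technology
Keywords: modern Uyghur; balanced corpus construction; lexical information processing technology; quantitative lexical research
Abstract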

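This thesis presents a comprehensive, systematic study of key technologies for multi-strategy statistics and multidimensional dynamic feature data analysis of modern Uyghur, the development of a common-word list and a word annotation specification for modern Uyghur, and key technologies for part-of-speech tagging. Together these form a unified research framework for modern Uyghur lexical information processing and lay the technical and resource foundation for developing intelligent systems for Uyghur public opinion analysis and language understanding. The work yields a body of quantitative results on the modern Uyghur lexicon and opens up quantitative linguistics of modern Uyghur as a new research area. It aims to construct, systematically and scientifically, a theory of modern Uyghur quantitative linguistics in the modern scientific sense. Multi-level, comprehensive quantitative studies of modern Uyghur are carried out to uncover the structure and evolutionary regularities of the Uyghur language system. The results can be applied directly to the analysis of online public opinion in Uyghur and to cross-border language research and public opinion monitoring between China and Central Asia, and they can provide "Internet Plus" language services for the Belt and Road strategy.

Targeting social media and drawing on computational linguistics, corpus linguistics, and quantitative linguistics, the thesis addresses three areas: resource construction for a balanced corpus of modern Uyghur, key technologies for modern Uyghur lexical information processing, and quantitative research on the modern Uyghur lexicon. With the goal of improving the representativeness, reliability, and authority of word-level information processing technology for modern Uyghur, it explores the basic theoretical framework, systematic basic methods, and key technologies of modern Uyghur lexical information processing.

(1) Resource construction for a balanced corpus of modern Uyghur. The thesis studies techniques for building a balanced Uyghur corpus, constructs resources such as a Uyghur lexicon and a grammatical-semantic stem dictionary, and develops a fairly comprehensive common-word list for modern Uyghur together with a grammar- and semantics-based specification for stem part-of-speech tagging. To ensure the reliability, representativeness, and authority of the balanced corpus, corpus sources, scope, and carriers are examined further, and, given the state of the existing material, the medium of dissemination is used as the selection criterion. Building on the existing corpus, large-scale text corpora for four media types (online media, print media, audio media, and educational textbooks) are continually refined and optimized; the total corpus size is 1.42 G (more than 80 million words). Resources built so far include a dynamic Uyghur lexicon of more than 850,000 word types, a stem dictionary with grammatical and semantic information of more than 100,000 entries, a phrase bank of more than 200,000 phrases, and knowledge bases covering Uyghur personal names in Uyghur and Chinese, world place names, Xinjiang place names, and Uyghur-Chinese bilingual vocabulary.

(2) Key technologies for modern Uyghur lexical information processing. The focus is on how to combine and apply knowledge from computational linguistics to improve methods for lexical statistics, data analysis, and part-of-speech tagging. The thesis studies key techniques including stem extraction, suffix segmentation, feature data analysis, common-word extraction, and stem part-of-speech tagging algorithms for modern Uyghur, and builds, on the balanced corpus, a multi-strategy statistical model, a dynamic lexical feature data analysis model, and a part-of-speech tagging model.

Research on lexical statistics and data analysis covers stem extraction, suffix segmentation, and data analysis. The work on stem extraction discusses the main functional modules (the stem extraction algorithm, text format conversion and standards, and text adjustment) and examines how stems are used in online media corpora. The work on suffix segmentation describes Uyghur morphological structure, the word restoration method, the application domains of the corpus, the time span of the collected material, the statistical methods, and the analysis results. The work on data analysis introduces the components of the analysis method and studies the relationship between frequency and word types, word-type coverage, word-type distribution, and word length.

The work on common-word extraction covers the key techniques and methods for constructing a modern Uyghur corpus, in particular the corpus construction itself, and studies corpus preprocessing, corpus statistics, stem extraction, and data analysis techniques. A candidate list of common words is developed: words are examined in terms of both usage frequency and distribution, and the number of word types, frequency count, relative frequency, number of texts, and word length of Uyghur words serve as the basis for the candidate list.

A validation study of the tag set for stem word classes uses Uyghur primary-school language textbooks as the test material. Stems from these textbooks are tagged with the "Modern Uyghur Stem Word-Class Tag Set", which was drawn up from a combined grammatical and semantic perspective, to verify the feasibility, adaptability, and reliability of the tag set specification. The semantic classification and tag codes of some word classes are supplemented and corrected, and suggestions for extending the specification are put forward.

The study of noun stem recognition based on morphological analysis sets out the concept of morphological analysis and the significance of using morphological features to identify part of speech accurately. It summarizes the criteria for word-class division in Uyghur, analyzes the morphological features of nouns, and summarizes affix ambiguities and the rules for resolving them. The thesis proposes an overall approach and designs an algorithm for recognizing nouns among new words in modern Uyghur, covering feature selection and parameter estimation, word-internal features, and features of neighboring dependent words.

(3) Quantitative research on the modern Uyghur lexicon. Quantitative lexical analysis and its applications put the preceding research to use and open up quantitative linguistics of modern Uyghur as a new research area. Taking modern Uyghur as the object of study, the thesis expands the large-scale dynamic text corpus, develops and improves the existing quantitative processing tools, and applies a multi-strategy method combining grammar and semantics to carry out multi-level, comprehensive, systematic quantitative analysis of modern Uyghur suffixes, stems, and words. It builds a modern Uyghur knowledge base and explores the regularities of Uyghur from the perspective of quantitative linguistics. The analyses cover the vocabulary of modern Uyghur websites, Uyghur-language textbooks for nine-year compulsory education, Uyghur high-school language textbooks, and modern Uyghur audio media. For the educational-textbook, online-media, and audio-media corpora, the lexical information processing techniques described above are applied for quantitative analysis, exploring the theory and methods of quantitative lexical study of modern Uyghur.

Through the quantitative features of the language and the synergetic relations among them, the thesis seeks to use quantitative relations abstracted from real texts to describe and understand how the Uyghur lexical system and its components develop and operate. The regularities found also help to describe and explain Uyghur linguistic phenomena more precisely, and to construct, systematically and scientifically, a theory of modern Uyghur quantitative linguistics in the modern scientific sense. Multi-level, comprehensive quantitative studies of modern Uyghur are carried out to uncover the structure and evolutionary regularities of the Uyghur language system.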
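The abstract names five statistics as the basis of the common-word candidate list (number of word types, frequency count, relative frequency, number of texts, and word length) and also mentions word-type coverage. The sketch below shows one plausible way to compute them; it is not the thesis's implementation. The function name `candidate_table`, the assumption that texts are already tokenized, and the toy Latin-script tokens are all illustrative assumptions.

```python
from collections import Counter


def candidate_table(texts):
    """Build per-word-type statistics from pre-tokenized texts.

    texts: a list of token lists, one list per text.
    Returns one row per word type, sorted by descending count, with
    count, relative frequency, number of texts containing the type,
    word length, and cumulative coverage of all tokens.
    """
    counts = Counter()      # total occurrences of each word type
    doc_counts = Counter()  # number of texts in which each type occurs
    for tokens in texts:
        counts.update(tokens)
        doc_counts.update(set(tokens))

    total = sum(counts.values())
    rows = []
    covered = 0
    # Sort by count (descending), then alphabetically for a stable order.
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        covered += n
        rows.append({
            "word": word,
            "count": n,
            "frequency": n / total,
            "texts": doc_counts[word],
            "length": len(word),
            "coverage": covered / total,
        })
    return rows


# Toy tokens for illustration only; not data from the thesis.
texts = [
    ["kitab", "oqu", "kitab"],
    ["mektep", "kitab"],
    ["oqu", "mektep", "bala"],
]
table = candidate_table(texts)
for row in table:
    print(row)
```

The number of word types is simply `len(table)`. A real candidate list would presumably apply thresholds on both frequency and the number of texts, since the abstract says words are examined by usage frequency and by distribution; the thresholds themselves are not given in the abstract.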

Other Abstract

This thesis carries out a comprehensive and systematic study of key technologies for multi-strategy statistics and multidimensional dynamic feature data analysis of modern Uyghur, of the development of a common-word list and a word tagging specification, and of key techniques for part-of-speech tagging. The result is a unified research framework for modern Uyghur lexical information processing that provides the technical and resource basis for developing intelligent systems for public opinion analysis and language understanding in Uyghur. The work produces a number of quantitative findings on the modern Uyghur lexicon and opens up quantitative linguistics of modern Uyghur as a new area. It aims to build, systematically and scientifically, a quantitative linguistic theory of modern Uyghur in the modern scientific sense, and its multi-level, comprehensive quantitative studies reveal the structure and evolution of the Uyghur language system. The results can be applied directly to online public opinion analysis in Uyghur and to cross-border language research and public opinion monitoring between China and Central Asia, and can provide "Internet Plus" language services for the Belt and Road strategy.

The thesis is oriented toward social media and builds on computational linguistics, corpus linguistics, quantitative linguistics, and related disciplines. It approaches the subject from three directions: resource construction for a balanced modern Uyghur corpus, key technologies for modern Uyghur lexical information processing, and quantitative research on the modern Uyghur lexicon. Its goal is to improve the representativeness, reliability, and authority of word-level information processing technology for modern Uyghur, and to explore the basic theoretical framework, systematic basic methods, and key technologies of the field.

(1) Resource construction for a balanced modern Uyghur corpus. The thesis studies how to build a balanced Uyghur corpus, constructs a Uyghur lexicon and a grammatical-semantic stem dictionary, and develops a comprehensive common-word list and a grammar- and semantics-based standard for stem part-of-speech tagging. To make the balanced corpus reliable, representative, and authoritative, the sources, scope, and carriers of the material are studied further, and the medium of dissemination is used as the selection criterion in light of the existing material. On the basis of the existing corpus, the large-scale text corpora for four media types (online media, print media, audio media, and educational textbooks) are refined and optimized, for a total size of 1.42 G (more than 80 million words). The resources established include a dynamic Uyghur lexicon of more than 850,000 word types, a stem dictionary with grammatical and semantic information of more than 100,000 entries, a phrase bank of more than 200,000 phrases, and knowledge bases of Uyghur personal names in Uyghur and Chinese, world place names, Xinjiang place names, and Uyghur-Chinese bilingual vocabulary.

(2) Key technologies for modern Uyghur lexical information processing. This part focuses on how to combine and use knowledge from computational linguistics to improve lexical statistics, data analysis, and part-of-speech tagging. It studies the stem extraction algorithm, the suffix segmentation algorithm, the feature data analysis algorithm, common-word extraction, and the stem part-of-speech tagging algorithm, and builds a multi-strategy statistical model, a dynamic lexical feature data analysis model, and a part-of-speech tagging model on the balanced corpus.

Lexical statistics and data analysis comprise stem extraction, suffix segmentation, and data analysis. The stem extraction work discusses the main functional modules (the stem extraction algorithm, text format conversion and standards, and text adjustment) and studies how stems are used in online media corpora. The suffix segmentation work describes Uyghur word structure, the word restoration method, the application domains of the corpus, the time span of the collected material, the statistical method, and the analysis results. The data analysis work introduces the components of the analysis method, studies the relationship between frequency and word types, word-type coverage, word-type distribution, and word length, and discusses the automatically generated results for each of these.

The common-word extraction work covers the key techniques and methods for constructing a modern Uyghur corpus, in particular the construction of the corpus itself, and studies corpus preprocessing, corpus statistics, stem extraction, and data analysis. A candidate list of common words is developed by examining words in terms of usage frequency and distribution; the number of word types, frequency count, relative frequency, number of texts, and word length serve as the basis for the list.

A validation study of the stem word-class tag set takes Uyghur primary-school language textbooks as the test material. Using the "Modern Uyghur Stem Word-Class Tag Set", drawn up from a combined grammatical and semantic perspective, the stems in these textbooks are tagged for part of speech to verify the feasibility, adaptability, and reliability of the tag set. The semantic classification and tag codes of some word classes are supplemented and corrected, and suggestions for extending the specification are made.

The study of noun stem recognition based on morphological analysis presents the concept of morphological analysis and the value of morphological features for identifying part of speech accurately. It summarizes the criteria for dividing word classes in Uyghur, analyzes the morphological features of nouns, and summarizes affix ambiguities and their resolution rules. The thesis sets out the overall approach and designs an algorithm for recognizing nouns among new words in modern Uyghur, including feature selection and parameter estimation, word-internal features, and features of neighboring dependent words. Finally, Uyghur-language junior-high and high-school physics textbooks are used as the test material, and their noun stems are counted and analyzed.

(3) Quantitative research on the modern Uyghur lexicon. Quantitative lexical analysis and its applications apply the research described above and open up a new research field, quantitative linguistics of modern Uyghur. With modern Uyghur as the object of study, the thesis expands the large-scale dynamic text corpus, develops and improves the existing quantitative processing tools, and uses a multi-strategy method combining grammar and semantics to carry out multi-level, comprehensive, scientific, and systematic quantitative analysis of suffixes, stems, and words. It builds a modern Uyghur knowledge base and explores the regularities of Uyghur from the perspective of quantitative linguistics. The studies include vocabulary usage surveys of modern Uyghur websites, of Uyghur-language textbooks for nine-year compulsory education, of Uyghur high-school language textbooks, and of modern Uyghur audio media. For the educational-textbook, online-media, and audio-media corpora, the key lexical information processing techniques described above are applied for quantitative analysis, exploring the theory and methods of quantitative lexical study of modern Uyghur.

Through the quantitative features of the language and the synergetic relations among them, the thesis aims to describe and understand, using quantitative relations abstracted from real texts, how the Uyghur lexical system and its components develop and operate. The regularities found also help to describe and explain Uyghur linguistic phenomena more precisely and to build, systematically and scientifically, a quantitative linguistic theory of modern Uyghur in the modern scientific sense. The multi-level, comprehensive quantitative studies reveal the structure and evolution of the Uyghur language system.
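Stem extraction and suffix segmentation, as described above, can be illustrated for an agglutinative language with a greedy longest-match suffix stripper. This is a minimal sketch of a generic technique, not the algorithm developed in the thesis. The suffix inventory is a tiny illustrative sample in Latin transliteration (for example the plural -lar/-ler and the genitive -ning), and the names `SUFFIXES`, `segment`, and `min_stem` are assumptions made for the example.

```python
# Tiny illustrative suffix sample in Latin transliteration (an assumption,
# far from a complete inventory of Uyghur suffixes).
SUFFIXES = ["lar", "ler", "ning", "da", "de", "din", "ni"]


def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Split a word into (stem, [suffixes]).

    Repeatedly strips the longest matching suffix from the right,
    as long as at least `min_stem` characters remain as the stem.
    """
    found = []
    by_length = sorted(suffixes, key=len, reverse=True)
    while True:
        for suf in by_length:
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                found.insert(0, suf)      # keep suffixes in surface order
                word = word[: -len(suf)]
                break
        else:
            # No suffix matched: what is left is taken as the stem.
            return word, found


print(segment("kitablarning"))  # ('kitab', ['lar', 'ning'])
print(segment("kitabni"))       # ('kitab', ['ni'])
```

A purely string-based stripper like this will over-segment stems that happen to end in a suffix-like sequence and cannot undo sound changes at the stem boundary. That is presumably why the abstract pairs segmentation with a stem dictionary, a word restoration method, and rules for resolving affix ambiguity.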

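Document type: Thesis
Item identifier: http://ir.xjipc.cas.cn/handle/365002/4594
Collection: Multilingual Information Technology Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
艾孜尔古丽·玉素甫. Research on Key Technologies of Modern Uyghur Lexical Information Processing [D]. Beijing: University of Chinese Academy of Sciences, 2016.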
Files in this item
File name / size: 现代维吾尔语词汇信息处理关键技术研究.p (5302KB)
Document type: Thesis
Version type: (not listed)
Access type: Open access
License: CC BY-NC-SA
File name: 现代维吾尔语词汇信息处理关键技术研究.pdf
Format: Adobe PDF