Abstract
Implementation of a Term Extraction Module Based on
Statistical and Linguistic Analysis
71114317 Ling Liyang
Advisor Yang Wang
New concepts and products bring new terms with them. To keep up with changing terminology, enterprise localization departments must invest considerable manpower and time in terminology analysis and statistics. Automatic term extraction is an important research topic in Natural Language Processing. Its significance is twofold: it reduces the burden of manual review on domain experts, and it helps keep Machine Translation output as consistent as possible.
To this end, this thesis designs a term extraction module for an enterprise product project.
First, a data cleaning strategy is formulated according to the characteristics of the enterprise's original documents, and the initial real-term table is pre-processed. Next, a part-of-speech tagging model is trained with NLTK and used, together with tokenization and chunking of the text, to obtain a list of candidate terms. Once the candidates are obtained, the designed algorithm applies existing statistical measures to score the candidate terms and extract the real terms. Finally, the performance of the module is evaluated and optimized to form a complete and reliable prototype of a term extraction system for the enterprise.
KEY WORDS: term extraction, NLTK, model training, linguistic analysis, statistical measures, stop-words
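Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Background and Significance of the Topic
1.2 Application Domain and Objectives
1.3 Organization of the Thesis
Chapter 2 Theoretical Foundations of Term Extraction
2.1 Linguistic Analysis Methods
2.2 Statistical Methods
2.2.1 Statistical Measures
2.3 Hybrid Methods
2.4 Evaluation Methods
2.4.1 Current State of Term Evaluation
2.4.2 Common Evaluation Methods for Term Extraction
Chapter 3 Design of the Term Extraction Scheme
3.1 Overall Framework of Term Extraction
3.2 Analysis of the Raw Data
3.3 Preprocessing Design
3.4 Linguistic Processing Scheme
3.4.1 Sentence Segmentation, Tokenization, and Part-of-Speech Tagging
3.4.2 Sequence Labeling and Term Extraction
3.4.3 Post-processing
3.5 Statistical Processing Scheme
3.5.1 Candidate Term Statistics
3.5.2 Post-processing
3.6 System Evaluation Scheme
3.6.1 Evaluation of Extraction Results
3.6.2 System Evaluation
3.7 Chapter Summary
Chapter 4 Implementation of the Term Extraction Module
4.1 Implementation of Preprocessing
4.1.1 Preprocessing of the Raw Corpus Documents
4.1.2 Preprocessing of the Manually Extracted Term Set
4.2 Implementation of Linguistic Processing
4.2.1 Training the Part-of-Speech Tagging Model
4.2.2 Part-of-Speech Tagging Implementation
4.2.3 Sequence Labeling Implementation
4.2.4 Stop-word List Generation
4.3 Implementation of Statistical Processing
4.3.1 Statistical Algorithm
4.3.2 Real Term Extraction
4.3.3 Post-processing
4.4 Chapter Summary
Chapter 5 Data Analysis and Scheme Evaluation
5.1 Analysis of Preprocessing and Cleaning Data
5.2 Evaluation of the Part-of-Speech Tagging Model
5.3 Evaluation of Statistical Measures
Chapter 6 Conclusion and Outlook
6.1 Summary of the Thesis
6.2 Future Work
Acknowledgements
References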
Chapter 1 Introduction