Research on Scholar Topic Tag Extraction in a Scientific Research Collaborative Innovation Platform (Graduation Thesis)
2020-04-23 20:13:49
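Abstract (Chinese)
Now that information technology and the Internet are maturing, employers increasingly demand well-rounded talent, and scientific research is moving toward multidisciplinary collaboration. Open information is the prevailing trend, yet extracting useful information from massive, disorganized data remains a problem of real practical importance.
In practice, information flows poorly within universities. Departments, administrators and faculty, senior leadership and newly appointed teachers, and students all have only a limited understanding of what individual teachers research, and the scholar and research information on departmental websites is poorly maintained and badly out of date. As The Art of War puts it: know the enemy and know yourself, and you will never be in peril in a hundred battles. It is therefore imperative to consolidate information across the whole university and build a university-wide collaborative innovation platform for scientific research, and extracting research-field tags for the scholars on that platform is one of its core problems.
As the university grows, outstanding young teachers keep joining it. If each scholar's research-field tags could be extracted quickly from existing information without manual judgment, the manual workload would drop substantially and the information would stay more timely.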
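This thesis studies scholar topic tag extraction using two of the most practical keyword extraction methods available today, TF-IDF and LDA, implemented in Python. The initial data set consists of the publication records of researchers at Nanjing University of Technology (南京工业大学); to improve LDA's performance, data on scholars from other universities was later added as a supplement.
The scholar topic tag extraction algorithm for the platform builds mainly on TF-IDF and adds a step that maps keywords to topic tags. Evaluation on a manually annotated test set indicates that the tags extracted this way have a suitable granularity and are reasonably representative. Although there is still considerable room for improvement, the approach is a viable one.
Keywords: scientific research collaborative innovation; collaborative innovation; word segmentation; topic extraction; keyword extraction; LDA; TF-IDF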
Research on Topic Tag Extraction of Scholars in a Collaborative Innovation Platform for Scientific Research
Abstract
As information technology and the Internet mature, the job market increasingly demands talent with strong all-round abilities, and scientific research is gradually moving toward multidisciplinary collaboration. Information disclosure is the trend of the times; however, extracting useful information from massive, disordered data is of great practical significance.
In reality, there are barriers to information transmission in colleges and universities. Departments, administrative staff and teaching staff, leadership and newly appointed teachers, and students do not know enough about teachers' research fields. Scholar and research information on the official websites of the various departments lacks maintenance and lags seriously behind. The Art of War says: know your enemy and know yourself, and you will never be defeated. It is therefore imperative to integrate the information of the whole university and create a university-wide collaborative innovation platform for scientific research, and how to extract scholars' field labels on the platform is one of its core issues.
As our university develops, more and more excellent young teachers join us. If each scholar's research field labels can be extracted quickly from existing information without human judgment, the manual workload will be greatly reduced and the information will be more timely.
This paper studies the scholar topic tag extraction algorithm using TF-IDF and LDA, the most practical keyword extraction algorithms at present, based on the Python programming language. The data set was initially built from the paper information of researchers at Nanjing University of Technology; later, for the sake of LDA's performance, data on scholars from other universities was introduced as a supplement.
The scholar topic tag extraction algorithm for the collaborative innovation platform is mainly based on TF-IDF, with an added algorithm that maps keywords to topic tags. Testing on a manually annotated test set shows that the topic tags extracted by this method are of moderate granularity and reasonably representative. Although there is still much room for improvement, it remains an effective approach.
Keywords: collaborative innovation; word segmentation; topic extraction; keyword extraction; LDA; TF-IDF
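The approach summarized in the abstract (TF-IDF keyword scoring followed by a keyword-to-tag mapping) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the input is assumed to be already segmented into tokens, the toy documents and the `keyword_to_tag` dictionary are hypothetical, and the function names `tfidf_keywords` and `map_to_tags` are introduced here for illustration only.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Plain TF-IDF: relative term frequency times log inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


def map_to_tags(keywords, keyword_to_tag):
    """Map keywords to coarser topic tags, keeping order and dropping duplicates."""
    tags = []
    for kw in keywords:
        tag = keyword_to_tag.get(kw)  # keywords without a mapping are skipped
        if tag is not None and tag not in tags:
            tags.append(tag)
    return tags


# Hypothetical toy data: one token list per scholar (assumed pre-segmented).
scholar_docs = [
    ["catalyst", "zeolite", "catalyst", "synthesis", "study"],
    ["lda", "topic", "model", "text", "study"],
    ["membrane", "separation", "membrane", "gas", "study"],
]
# Hypothetical hand-built mapping from fine-grained keywords to topic tags.
keyword_to_tag = {
    "catalyst": "Catalysis",
    "zeolite": "Catalysis",
    "lda": "Text Mining",
    "topic": "Text Mining",
    "membrane": "Membrane Science",
}

keywords = tfidf_keywords(scholar_docs, top_k=2)
tags = [map_to_tags(kws, keyword_to_tag) for kws in keywords]
```

In this sketch a term that occurs in every document, such as "study", receives a score of zero and so never ranks above a term that is specific to one scholar. The mapping step is what coarsens keywords into tags; how the thesis actually builds that mapping is described in its later chapters, not here.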
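Contents
Abstract (Chinese)
Abstract
Contents
Chapter 1 Introduction
1.1 Research Background and Objectives
1.2 Basic Theory and Methods
1.2.1 Introduction to Machine Learning
1.2.2 Introduction to Data Mining
1.3 Work of This Thesis
1.4 Structure of This Thesis
Chapter 2 Literature Review and Key Technologies
2.1 Research Status of Information Extraction
2.1.1 Word Segmentation
2.1.2 Keyword Extraction
2.1.3 Topic Extraction
2.2 Key Technologies for Topic Extraction
2.2.1 TF-IDF
2.2.2 LDA
2.3 Summary
Chapter 3 Scholar Topic Extraction Based on TF-IDF
3.1 The TF-IDF Algorithm
3.2 Data Set
3.3 Method and Analysis
3.4 Summary
Chapter 4 Scholar Topic Extraction Based on LDA
4.1 The LDA Topic Model
4.2 Data Set
4.3 Method and Analysis
4.4 Experimental Evaluation
4.4.1 Evaluation Metrics
4.4.2 Test Set Construction
4.4.3 Testing and Evaluation
4.5 Summary
Chapter 5 Conclusion and Outlook
5.1 Summary and Discussion
5.2 Limitations
5.3 Outlook
Acknowledgments
References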
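Chapter 1 Introduction
1.1 Research Background and Objectives
In September of last year, General Secretary Xi Jinping stated that strengthening the translation of research results into practice is a key part of building the national collaborative innovation system. Several policy documents issued in 2017 likewise proposed that China promote the establishment of a number of science and technology collaborative innovation platforms integrating industry, universities, and research institutes. In recent years, the opening and sharing of government data in China has advanced steadily: governments at every level have strengthened their government information systems, used big data technology to improve the integration of government information, and worked toward cross-department sharing of data resources. The General Office of the State Council's notice issuing the Implementation Plan for the Integration and Sharing of Government Information Systems (《国务院办公厅关于印发政务信息系统整合共享实施方案》) explicitly calls for strengthening the cataloguing of data and the construction of sharing and exchange platforms. [1]
Collaborative innovation provides important guidance for shifting the country's mode of economic development toward one driven mainly by innovation [2], and it plays an important role in advancing national science and technology and the research capacity of universities. Only when university research teams in China innovate collaboratively, and scientific and technological research moves from isolation to interaction, can they deliver high-quality collaborative research results for China's economic and social development.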