Total length of thesis: 30,559 characters
Abstract
With the rapid development of Internet technology, terminals of all kinds generate network data on a massive scale, and the quality of that data varies widely. Extracting useful information from such massive data and mining the value behind it is of far-reaching significance. Traditional manual classification can no longer meet current needs, and machine-learning-based text classification has gradually become the mainstream approach, with wide real-world applications such as public opinion monitoring, news classification, spam filtering, and financial fraud detection. Compared with traditional methods, the XGBoost-based text classification proposed and validated in this thesis achieves good results. The main research work is as follows:
(1) Data analysis and collection. Business goals are identified, a sound data collection plan is designed around them, and a Scrapy crawler is used to extract useful information from unstructured data and store it locally.
(2) Data preprocessing. This mainly comprises data cleaning, word segmentation, data arrangement, data transformation, and feature selection. The Chinese word segmentation stage includes handling stop words, ambiguous words, single characters, and named entities.
(3) Text classification model construction. Several classification algorithms are compared experimentally, and the XGBoost model is ultimately selected; the experimental results show that it achieves 97% accuracy, indicating a high-performing text classifier.
(4) Application of text classification. Classifier models generally perform well on closed test sets but poorly on open network data. When applied to randomly selected open news texts, the classifier in this thesis still generalizes well.
Keywords: text classification, network data, XGBoost, Chinese word segmentation, feature dimension reduction
Contents
Abstract Ⅰ
Abstract (English) Ⅱ
Chapter 1 Introduction 1
1.1 Background and significance of the topic 1
1.2 Domestic and international research status 2
1.3 Organization of the thesis 4
Chapter 2 Related technologies 5
2.1 Scrapy crawler 5
2.2 Chinese word segmentation 6
2.2.1 Research progress in Chinese word segmentation 7
2.2.2 Uses of Chinese word segmentation 7
2.2.3 Characteristics and difficulties of Chinese word segmentation 7
2.2.4 Common Chinese word segmentation tools 8
2.3 Text classification 9
2.3.1 Overview of text classification 9
2.3.2 Principles of text classification 10
2.3.3 Classification model evaluation 10
Chapter 3 Data collection and storage 12
3.1 Preparation 12
3.1.1 Data structure and collection strategy 12
3.1.2 Target business analysis 13
3.2 Database design and implementation 14
3.2.1 Logical model 15
3.2.2 Data table descriptions 15
3.2.3 Network data storage 17
3.2.4 Analysis of crawler results 19
Chapter 4 Data preprocessing 20
4.1 Preprocessing architecture design 20
4.2 News text data cleaning 21
4.3 News text word segmentation 23
4.4 News text feature construction 24
4.5 News text feature selection 26
Chapter 5 News text classification with XGBoost 27
5.1 Selecting a classification algorithm 27
5.2 Building the news classifier with XGBoost 28
5.2.1 Experimental environment 28
5.2.2 Classifier construction 29
5.2.3 Analysis of model results 30
5.3 Application to online news text classification 31
Chapter 6 Conclusion and outlook 33
6.1 Conclusion 33
6.2 Outlook 33
References 34
Appendix 36