Total thesis length: 33,828 characters
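Abstract (translated from Chinese)
With the spread and application of information technology, computers have become part of every field of human activity. Data of all kinds are collected and stored, and the era of big data has arrived. As data accumulate rapidly, today's data sets have become too large and complex to handle manually. Real-world data, however, often contain all kinds of errors, and a great deal of time must be spent preprocessing them before analysis; exploratory data mining and data cleaning account for more than 80% of the data mining process. This thesis systematically reviews the challenges of analyzing data, data mining and cleaning methods, and existing data cleaning methods and tools, and completes a case study.
Exploratory data mining is an important and extensive field. Gaining a full picture of its current state requires learning a variety of measures, statistical graphs, statistical properties, computational criteria, and more. Data quality can be ensured through data cleaning and the application of data quality metrics, which is an iterative process. From the many existing data cleaning methods and software tools, this thesis selects three cleaning methods and five specialized software tools for presentation. After studying EDM methods and data cleaning techniques, I applied them to process and analyze the production and sales data of a building materials company: the scattered production and sales data were consolidated, charts were used to present the data more clearly and intuitively, and the data were then analyzed from several angles.
Keywords: exploratory data mining, data quality, data cleaning, big data era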
Abstract
With the popularization and application of information technology, computers have been integrated into every field of human activity, and data of all types are being collected and stored, ushering in the era of big data. With the rapid accumulation of data, today's data sets have become so large and complex that they cannot be handled manually. However, real-world data often contain various errors, and a lot of time must be spent preprocessing the data before analysis. Moreover, exploratory data mining and data cleaning account for more than 80% of the data mining process. This thesis systematically reviews the challenges of analyzing data, data mining and cleaning methods, and existing data cleaning methods and tools, and completes a case study.
Exploratory data mining is an important and extensive field. Mastering the current state of this field requires learning a variety of metrics, statistical graphs, statistical properties, computational criteria, and so on. Data quality can be ensured through data cleaning and the application of data quality metrics, a process that needs to be repeated. This thesis introduces three cleaning methods and five specialized software tools selected from the existing data cleaning methods and software. After learning EDM methods and data cleaning techniques, I applied them to process and analyze the production and sales data of a building materials company. The scattered production and sales data were consolidated, charts were used to show the data more clearly and intuitively, and the data were finally analyzed from various aspects.
KEY WORDS: exploratory data mining, data cleansing, data quality, big data era
Contents
Chapter 1 Introduction
1.1 Data Preprocessing
1.2 Key Problems and Difficulties
1.3 Methods and Tools
1.3.1 EDM Methods
1.3.2 Data Cleaning
1.4 Summary
Chapter 2 Exploratory Data Mining
2.1 Introduction
2.2 EDM: Exploratory Data Mining
2.3 EDM Summaries
2.4 Descriptive Statistical Analysis
2.4.1 Measures of Centrality
2.4.2 Measures of Dispersion
2.4.3 Relationships
2.5 Statistical Graphs
2.6 Properties of EDM Summaries
2.6.1 Statistical Properties
2.6.2 Computational Criteria
Chapter 3 Data Quality and Data Cleaning
3.1 Introduction
3.2 Data Quality Problems
3.2.1 Single-Source Problems
3.2.2 Multi-Source Problems
3.3 Data Cleaning Approaches
3.3.1 Data Analysis
3.3.2 Defining Data Transformations
3.3.3 Conflict Resolution
Chapter 4 Specialized Data Cleaning Software
4.1 AJAX
4.2 FraQL
4.3 Potter's Wheel
4.4 ARKTOS
4.5 IntelliClean
Chapter 5 Case Study
5.1 Data Mining and Data Cleaning
5.2 Data Analysis
5.2.1 Overall Situation
5.2.2 Detailed Analysis
Chapter 6 Conclusion and Outlook
Acknowledgements
References
Chapter 1 Introduction