Total thesis length: 51,698 characters
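Abstract
With the rapid development of computer and communication technology, data volumes in every sector of society are growing sharply, and much of this data is streaming data of various types. Unlike traditional static data stored on disk, streaming data is a new class of data object: continuous, ordered, rapidly changing, and massive. Examples include customer information, transaction records, stock prices, and road traffic monitoring data. Because data streams are so widely used, streaming data has become an important research subject, and updating clusters over streaming data is now an active research topic. Cluster analysis is an important data mining method and many clustering algorithms have appeared in recent years, but the characteristics of streaming data mean that traditional clustering algorithms cannot be applied to it directly. This thesis therefore studies variable precision rough sets together with clustering algorithms and combines the two: it proposes a distance-based variable precision extended rough set model, derives the upper and lower approximation sets, clusters on that basis, and keeps updating the clusters as streaming data arrives.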
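Classical rough sets are overly sensitive to noise when partitioning objects into equivalence classes. To address this, the thesis improves and extends the rough set by introducing a variable precision threshold. First, for two distinct objects, the difference in each attribute value is divided by the maximum difference observed for that attribute, which removes the units of measurement. Next, the distance between the two objects is taken as the square root of the arithmetic mean of the squared normalized attribute differences. Subtracting this distance from 1 gives the indiscernibility possibility of the two objects. With a variable precision threshold set, the indiscernibility class of each object under the condition attributes is obtained, and the upper and lower approximation sets of the rough set are then defined. An extended rough set built this way is better grounded statistically, and it removes the limitation of the equivalence relation, under which two objects count as indiscernible only if all of their attributes are identical.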
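The thesis introduces the variable precision extended rough set into clustering to build a distance-based variable precision extended rough set clustering algorithm. First, the indiscernibility classes under the condition attribute set and under the decision attribute are computed separately. If an object's indiscernibility class under the condition attribute set is a subset of its indiscernibility class under the decision attribute, the object is a lower-approximation cluster point; otherwise it is a boundary cluster point. The two kinds of points are then handled differently. For a lower-approximation cluster point, all other lower-approximation cluster points in its condition-attribute indiscernibility class are searched, and the point is merged with the one whose condition-attribute indiscernibility class has the largest intersection with its own; if no such point exists, it forms a new cluster by itself. For a boundary cluster point, its condition-attribute indiscernibility class is searched for the object that has the same decision attribute and the highest indiscernibility possibility, and the two are merged into one cluster; otherwise the point forms a new cluster by itself. Next, a center point is computed for each cluster from the attribute values of its objects, and the distances between center points are calculated. The smallest center-to-center distance is located. If it is less than or equal to a preset threshold γ, and the average distance from the points of the merged cluster to its new center is less than or equal to a given threshold α, the two clusters are merged into a new cluster. The centers and the distances between them are then recomputed. The algorithm ends, yielding the final clustering, when every center-to-center distance exceeds γ, or when a center-to-center distance is at most γ but the average distance exceeds α.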
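In practice, many systems produce data continuously. For a newly arrived data object, the thesis first applies the variable precision extended rough set model to decide whether it is a lower-approximation cluster point or a boundary cluster point, and then uses the clustering algorithm above to determine which object's cluster it should join, or whether it should form a new cluster by itself. If the new object is to join an existing cluster, the average distance from the objects of the merged cluster to the new center is computed. If this average distance is less than or equal to the given threshold α, the object is merged. If it is greater than α, and the cluster contains an original object whose distance to the new center (the largest such distance) exceeds the new object's distance to the new center, the new object replaces that farthest object and the average distance is recomputed. If the average distance is now less than or equal to α, the new object joins the cluster and the replaced object forms a new cluster by itself. If the average distance is still greater than α, the merge is rejected and the new object forms a new cluster by itself.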
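Finally, the thesis presents an application case to show the practical value of the algorithm. Official sample data provided by Tmall stores are compiled, and the rough-set-based knowledge updating method for streaming data is applied to these samples: the stores are clustered initially, the clusters are adjusted, and the clustering is updated as data for new stores arrives. The clustering results are analyzed to provide decision support for the parties concerned.
Keywords: variable precision rough set; cluster analysis; streaming data; knowledge updating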
Knowledge Updating Method of Streaming Data Based on Rough Set
And Its Application Research
Abstract
With the rapid development of computer and communication technologies, the amount of data in every field of society is growing sharply, and much of it is streaming data of various types. Unlike traditional static data stored on disk, streaming data is a new type of data object that is continuous, ordered, fast-changing, and massive, such as customer information, transaction data, stock prices, and road traffic monitoring data. The wide use of data streams has made streaming data an important research topic, and updating clusters over streaming data is now an active area of research. Cluster analysis is an important data mining method, and many clustering algorithms have appeared in recent years. However, the characteristics of streaming data mean that traditional clustering algorithms cannot be applied to it directly. This thesis therefore focuses on variable precision rough sets and clustering algorithms and combines the two: it proposes a distance-based variable precision extended rough set model to obtain the upper and lower approximation sets, clusters on this basis, and continuously updates the clustering as streaming data arrives.
The classical rough set is overly sensitive to noise when partitioning objects into equivalence classes. To address this shortcoming, this thesis improves and extends the rough set by means of a variable precision threshold. First, the difference between two objects in each attribute is divided by the maximum difference for that attribute, which makes the values dimensionless. Then the distance between the two objects is computed as the square root of the arithmetic mean of the squared normalized attribute differences. The indiscernibility possibility of the two objects is obtained by subtracting this distance from 1. A variable precision threshold is set to obtain the indiscernibility class of each object under the condition attributes, and the upper and lower approximation sets of the rough set are then defined. The resulting extended rough set is better grounded statistically, and it overcomes a limitation of the equivalence relation, under which objects are regarded as indiscernible only if all of their attributes are identical.
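The distance and indiscernibility steps described above can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the thesis's own code: the function names, the threshold name `beta`, the interpretation of "maximum difference" as the maximum minus the minimum over the data set, and the choice that an object joins a class when its possibility is greater than or equal to the threshold are all assumptions.

```python
from math import sqrt


def attribute_ranges(objects):
    """Largest difference (max - min) of each attribute over all objects."""
    return [max(col) - min(col) for col in zip(*objects)]


def distance(x, y, ranges):
    """Square root of the mean of the squared, range-normalized differences."""
    squared = [
        ((a - b) / r) ** 2 if r > 0 else 0.0  # a constant attribute adds nothing
        for a, b, r in zip(x, y, ranges)
    ]
    return sqrt(sum(squared) / len(squared))


def indiscernibility(x, y, ranges):
    """Indiscernibility possibility: 1 minus the normalized distance."""
    return 1.0 - distance(x, y, ranges)


def indiscernibility_classes(objects, beta):
    """For each object, the indices of objects whose possibility is >= beta.

    The name `beta` and the >= comparison are assumptions of this sketch.
    """
    ranges = attribute_ranges(objects)
    return [
        {j for j, y in enumerate(objects) if indiscernibility(x, y, ranges) >= beta}
        for x in objects
    ]


objects = [(1.0, 10.0), (1.2, 11.0), (5.0, 30.0)]
print(indiscernibility_classes(objects, beta=0.9))
```

For these three hypothetical objects the first two lie at a normalized distance of 0.05 from each other, so with `beta=0.9` they fall into the same indiscernibility class while the third stands alone: `[{0, 1}, {0, 1}, {2}]`.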
The extended model derives indiscernibility classes directly from the data, so the results are objective and quick to obtain. This thesis introduces the variable precision extended rough set into clustering to construct a distance-based variable precision extended rough set clustering algorithm. First, the indiscernibility classes under the condition attribute set and under the decision attribute are obtained separately. If a data object's indiscernibility class under the condition attribute set is a subset of its indiscernibility class under the decision attribute, the object is a lower-approximation cluster point; otherwise, it is a boundary cluster point. The two kinds of points are then processed. For a lower-approximation cluster point, all other lower-approximation cluster points in its condition-attribute indiscernibility class are searched, and the point is merged with the one whose condition-attribute indiscernibility class has the largest intersection with its own; otherwise, it becomes a new cluster by itself. For a boundary cluster point, its condition-attribute indiscernibility class is searched for the object with the same decision attribute and the largest indiscernibility possibility, and the two are merged into the same cluster; otherwise, the point becomes a new cluster by itself. Then the center point of each cluster is obtained from the attribute values of its objects, and the distances between center points are calculated. The minimum distance between center points is located. If this distance is less than or equal to a predetermined threshold γ, and the average distance from the points of the merged cluster to its new center is less than or equal to another given threshold α, the two clusters are merged into a new cluster. The center points and the distances between them are then recalculated. The algorithm ends, giving the final clustering result, when all center distances are greater than γ, or when a center distance is less than or equal to γ but the average distance is greater than α.
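A minimal sketch of two steps from this paragraph, the lower-approximation/boundary test and the center-based merging loop, is given below. It assumes the indiscernibility classes are already available as Python sets, and it uses plain Euclidean distance (`math.dist`) as a stand-in for the thesis's normalized distance. The function names are hypothetical and the pairwise pre-merging of cluster points is omitted.

```python
from itertools import combinations
from math import dist  # plain Euclidean distance, a stand-in for this sketch


def classify(cond_classes, dec_classes):
    """Label an object 'lower' if its condition class is a subset of its decision class."""
    return [
        "lower" if cond <= dec else "boundary"
        for cond, dec in zip(cond_classes, dec_classes)
    ]


def center(cluster):
    """Attribute-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def mean_distance_to_center(cluster):
    c = center(cluster)
    return sum(dist(p, c) for p in cluster) / len(cluster)


def merge_clusters(clusters, gamma, alpha):
    """Merge the closest pair of clusters until a threshold stops the process."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        centers = [center(c) for c in clusters]
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: dist(centers[pair[0]], centers[pair[1]]),
        )
        if dist(centers[i], centers[j]) > gamma:
            break  # every pair of centers is farther apart than gamma
        merged = clusters[i] + clusters[j]
        if mean_distance_to_center(merged) > alpha:
            break  # closest pair is within gamma but too loose once merged
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


print(classify([{0, 1}, {2}], [{0, 1, 2}, {0}]))
initial = [[(0, 0), (0, 1)], [(1, 0), (1, 1)], [(10, 10)]]
print(merge_clusters(initial, gamma=2.0, alpha=1.0))
```

With the hypothetical data above, the two nearby clusters are merged because their centers are 1 apart and the merged average distance is about 0.71, while the distant singleton stays separate. Stopping as soon as the closest pair fails the α test follows the termination rule as stated in the abstract; a variant that tries the next-closest pair would be a different algorithm.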
In practice, many systems produce data continuously. For a newly added data object, this thesis first applies the variable precision extended rough set model to determine whether it is a lower-approximation cluster point or a boundary cluster point, and then uses the clustering algorithm above to find the cluster it should be merged into, or to decide that it should form a new cluster by itself. If the new object is to be merged into an existing cluster, the average distance from the objects of the merged cluster to the new center point is calculated. If this average distance is less than or equal to the given threshold α, the merge takes place. If the average distance is greater than α, and the largest distance from an original object in the cluster to the new center point is greater than the distance from the new object to the new center point, the new object replaces the farthest object and the average distance is recalculated. If the average distance is then less than or equal to α, the new object is merged into the cluster and the replaced object becomes a new cluster by itself. If the average distance is still greater than α, the merge is rejected and the new object becomes a new cluster by itself.
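The update rule for a newly arrived object can be sketched as follows. The sketch assumes the target cluster has already been chosen by the clustering step, uses plain Euclidean distance (`math.dist`) in place of the thesis's normalized distance, and treats the case the abstract does not spell out (no original object lies farther from the new center than the new object) as "the new object forms its own cluster". The function names are hypothetical.

```python
from math import dist  # plain Euclidean distance, a stand-in for this sketch


def center(cluster):
    """Attribute-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def mean_distance_to_center(cluster):
    c = center(cluster)
    return sum(dist(p, c) for p in cluster) / len(cluster)


def update_cluster(cluster, new_point, alpha):
    """Return (updated cluster, list of newly created singleton clusters)."""
    merged = cluster + [new_point]
    if mean_distance_to_center(merged) <= alpha:
        return merged, []  # plain merge keeps the cluster tight enough

    c = center(merged)
    # Index of the original object farthest from the new center.
    far = max(range(len(cluster)), key=lambda k: dist(cluster[k], c))
    if dist(cluster[far], c) > dist(new_point, c):
        swapped = cluster[:far] + cluster[far + 1:] + [new_point]
        if mean_distance_to_center(swapped) <= alpha:
            # New object takes the place of the farthest one, which splits off.
            return swapped, [[cluster[far]]]

    # Merge rejected: the new object becomes a cluster by itself.
    return cluster, [[new_point]]


cluster = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
print(update_cluster(cluster, (0.5, 0.0), alpha=1.0))
```

In this hypothetical run a plain merge would give an average distance of about 1.31, above α = 1.0, so the outlying object `(4.0, 0.0)` is swapped out and becomes its own cluster while the new object joins the remaining two.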
Finally, an application case is presented to illustrate the practical value of the algorithm. Official sample data provided by Tmall stores are compiled, and the rough-set-based knowledge updating method for streaming data is applied to these samples to complete the initial clustering of the stores and the cluster adjustment; the clustering is then updated when data for new stores arrives. The clustering results are analyzed to provide decision support for the parties concerned.
Keywords: Variable Precision Rough Set; Clustering Analysis; Streaming Data; Knowledge Updating
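Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Research Background
1.2 Research Purpose and Significance
1.3 Research Content of This Thesis
1.4 Innovations and Difficulties
Chapter 2 Literature Review
2.1 Fundamentals of Rough Set Theory
2.1.1 The Concept of Rough Sets
2.1.2 Development of Rough Set Theory and a Survey of Research in China and Abroad
2.2 Clustering Algorithms
2.2.1 The Concept of Clustering Algorithms
2.2.2 Main Clustering Algorithms
2.3 Research Focus of the Thesis
Chapter 3 Distance-Based Variable Precision Extended Rough Set Model
3.1 Model Construction and Properties
3.2 Algorithm Description
3.3 Worked Example
Chapter 4 Distance-Based Variable Precision Extended Rough Set Clustering Algorithm
4.1 Model Construction and Properties
4.2 Algorithm Description
4.3 Worked Example
Chapter 5 Clustering Update Algorithm for Streaming Data
5.1 Model Construction and Properties
5.2 Algorithm Description
5.3 Worked Example
Chapter 6 Case Application: Cluster Analysis of Online Merchandise Operations
6.1 Data Sources
6.2 Data Preprocessing
6.3 Data Analysis
6.4 Recommendation Strategies
6.4.1 For Taobao Store Sellers
6.4.2 For Consumers
Chapter 7 Conclusion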