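Intelligent and Distributed Computing Laboratory
  Research and Implementation of Data Reduction Techniques in Data Mining
Name Zhao Ping (赵萍)
Thesis defense date 2005.05.11
Thesis submission date 2005.05.13
Thesis level Master's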
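Chinese title 数据挖掘中归约技术的研究与实现
English title Research and Implementation on Data Reduction Technology in Data Mining
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords data mining;data preprocessing;principal component analysis;data reduction;attribute reduction
English keywords Data Mining;Data Preprocessing;PCA;Data Reduction;Attribute Reduction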
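Chinese abstract Data mining emerged to overcome the shortcomings of traditional analysis methods and to support the analysis and processing of large-scale data. However, most current research on data mining focuses on mining algorithms and neglects data preprocessing. Oversized data sets and excessive mining time are an increasingly serious problem, so effectively reducing the original data set and improving mining efficiency has become an important issue. Principal component analysis (PCA) is a statistical concept whose main idea is to project a high-dimensional data space onto a low-dimensional feature space by extracting the principal components; the principal-component variables in the feature space retain the characteristic information of the original variables while removing redundant information. Rough set theory is a mathematical tool for characterizing incompleteness and uncertainty, and it can effectively analyze and process imprecise, inconsistent, and incomplete information. The most widely used rough-set-based attribute reduction algorithms at present are the discernibility matrix algorithm and the greedy algorithm. However, neither is very effective when the data set is large or when high reduction precision is required. Building on these theories, this thesis proposes a heuristic algorithm. The heuristic algorithm draws on the idea of PCA, uses an iterative algorithm to obtain the principal components of the data set, and combines this with an improvement to the discernibility matrix algorithm to obtain a minimal reduct of the attributes. It uses the contribution rate of each attribute to indicate how important a condition attribute is to the decision attribute, and it achieves some improvement in the selection accuracy of the reduct and in time complexity. The improved attribute reduction algorithm can be used for data preprocessing and can also play a role in later data mining. Finally, on the basis of this research, the author took part in developing the decision support system of the State Administration of Foreign Exchange and applied the algorithm to the system's data processing. Experiments show that it can effectively compress the data size and select attributes sensibly to process data more efficiently, and that it yields fairly satisfactory results under the off-site supervision mode.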
English abstract Data mining (DM) emerged to remedy the shortcomings of traditional analysis methods and to handle the analysis and processing of large volumes of data. At present, most discussion emphasizes DM algorithms and neglects research on data preprocessing. The size of data sets and the long duration of the mining process have gradually become a serious problem, so how to reduce the original data sets and how to increase DM efficiency have also become real problems. Principal component analysis (PCA) is a concept from statistics. Its main idea is to map data from a high-dimensional data space to a low-dimensional space by extracting the principal components. The principal components in the feature space preserve the characteristic information of the original variables and remove redundancy. Rough set theory is a mathematical tool for describing the incompleteness and uncertainty of information. The discernibility matrix algorithm and the greedy algorithm, both based on rough sets, are widely applied in data reduction. However, these algorithms are usually inefficient when dealing with large data sets or with sets that require high-precision reduction. Based on these theories, a new heuristic algorithm is proposed to obtain the attribute reduction. It uses the idea of PCA and an iterative algorithm to obtain the reduction of data sets by improving the discernibility matrix algorithm. It also expresses the importance of condition attributes through their contribution rate, which improves selection accuracy and makes attribute selection more efficient. During the development of SAFE-MIDSS, the algorithm was applied to the system's data processing and was shown to reduce data and select attributes efficiently, with satisfactory results in the off-site supervision mode.
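
For readers unfamiliar with the rough-set baseline named in the abstract, the following is a minimal sketch of the classical discernibility (discernible) matrix together with a simple greedy reduct search. It is not the thesis's improved heuristic, which the abstract does not specify in detail. The decision table, the attribute names `a`, `b`, `c`, `d`, and the function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def discernibility_matrix(rows, cond_attrs, decision):
    """Return the non-empty entries of the classical discernibility matrix.

    Each entry is the set of condition attributes on which two objects
    with different decision values differ.
    """
    entries = []
    for x, y in combinations(rows, 2):
        if x[decision] == y[decision]:
            continue  # pairs with the same decision need not be told apart
        diff = frozenset(attr for attr in cond_attrs if x[attr] != y[attr])
        if diff:  # an empty set would mean an inconsistent pair; skip it
            entries.append(diff)
    return entries


def core(entries):
    """Attributes that appear as singleton entries belong to every reduct."""
    return {next(iter(e)) for e in entries if len(e) == 1}


def greedy_reduct(entries, cond_attrs):
    """Start from the core, then keep adding the attribute that covers the
    most still-uncovered entries. This is a heuristic: it yields an attribute
    set that discerns every pair, not necessarily a minimal one."""
    reduct = set(core(entries))
    remaining = [e for e in entries if not (e & reduct)]
    while remaining:
        best = max(cond_attrs, key=lambda attr: sum(attr in e for e in remaining))
        reduct.add(best)
        remaining = [e for e in remaining if best not in e]
    return reduct


# Hypothetical decision table: condition attributes a, b, c; decision attribute d.
# Attribute c mirrors a, so it carries no extra information.
rows = [
    {"a": 1, "b": 0, "c": 0, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "no"},
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
]
cond_attrs = ["a", "b", "c"]

entries = discernibility_matrix(rows, cond_attrs, "d")
core_attrs = core(entries)
reduct = greedy_reduct(entries, cond_attrs)

print(sorted(core_attrs))  # ['b']
print(sorted(reduct))      # ['a', 'b']
```

In this toy table, `b` alone separates two of the object pairs, so it is in the core, while `a` and `c` are interchangeable and the greedy step keeps only one of them. Building the matrix compares every pair of objects, which is quadratic in the number of objects and is consistent with the abstract's remark that this approach becomes inefficient on large data sets.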