A parallel mining algorithm based on Hadoop architecture
ZENG Jun
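Abstract:
Based on the Hadoop architecture, a parallel decision tree mining algorithm is proposed for knowledge mining across large datasets. The SPRINT parallel mining algorithm, including its frequent itemset computation, is implemented under Hadoop with the MapReduce parallel programming model, which addresses the low efficiency and high time cost of mining large datasets. The SPRINT algorithm partitions the original dataset and sends the resulting blocks to different Map processes for parallel computation, so that the storage and computing resources of the system are used effectively. The MapReduce compute nodes then aggregate the mining results, which reduces the volume of intermediate result data and significantly shortens the parallel mining time. Parallelization experiments on SPRINT show that the SPRINT parallel mining algorithm under the Hadoop architecture has good scalability and cluster speedup.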
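The partition, Map, and aggregate flow described in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the function names (`partition`, `map_partition`, `reduce_counts`, `gini_split`) are hypothetical, the attribute is assumed to be categorical, and plain Python functions stand in for Hadoop Map and Reduce tasks. The split criterion is the Gini index, which SPRINT uses to evaluate candidate splits.

```python
from collections import Counter
from functools import reduce


def partition(records, n_blocks):
    """Split the dataset into n_blocks roughly equal blocks (stand-in for HDFS splits)."""
    return [records[i::n_blocks] for i in range(n_blocks)]


def map_partition(records, attr_index):
    """Map step: count (attribute value, class label) pairs within one block."""
    counts = Counter()
    for *features, label in records:  # the last field is assumed to be the class label
        counts[(features[attr_index], label)] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-block counters into one global class histogram."""
    return reduce(lambda a, b: a + b, partials, Counter())


def gini(class_counts):
    """Gini impurity of a class-count histogram."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts.values())


def gini_split(histogram):
    """Weighted Gini index when splitting on each distinct attribute value."""
    by_value = {}
    for (value, label), n in histogram.items():
        by_value.setdefault(value, Counter())[label] += n
    total = sum(histogram.values())
    return sum(sum(c.values()) / total * gini(c) for c in by_value.values())


if __name__ == "__main__":
    # Toy records: (outlook, class label); the values are illustrative only.
    data = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
    blocks = partition(data, 2)
    histogram = reduce_counts(map_partition(b, 0) for b in blocks)
    print(gini_split(histogram))
```

Because each Map step emits only a small histogram of counts rather than the records themselves, the data passed to the Reduce step stays compact, which mirrors the abstract's point about reducing intermediate result data.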
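Keywords: mining algorithm; Hadoop architecture; SPRINT; parallelization; decision tree; MapReduce
Foundation: Science and Technology Project of Chongqing Municipal Education Commission: Research on parallel mining of big data under Hadoop architecture (KJ15012021); Chunhui Plan project: Preliminary application of big data on an Internet of Things intelligent agriculture platform (S2016038)
Author: ZENG Jun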
DOI: 10.16652/j.issn.1004-373x.2018.01.026