Design and implementation of a distributed big data analysis and modeling system based on Spark
徐时芳 (XU Shifang), 罗晓宾 (LUO Xiaobin), 陈阳华 (CHEN Yanghua)
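Abstract:
To address the challenges that distributed big data poses for data storage, cleaning, transformation, aggregation, mining, and analysis, a distributed big data analysis and modeling system based on Spark is designed and implemented. The system comprises five modules (data acquisition, data storage, data analysis, data management, and data application), which together provide adaptive acquisition of structured, semi-structured, and unstructured data as well as offline and online analysis and processing. A management and control platform coordinates the operation of the system. The hardware and software implementation and the modeling test results show that the proposed system clusters fault diagnosis data from a concrete scenario effectively and accurately, and that it meets the speed and accuracy requirements of big data processing.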
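The abstract reports clustering of fault diagnosis data but does not name the clustering algorithm or show any code. As an illustration only, the sketch below assumes k-means and uses plain Python lists in place of Spark partitions: each partition computes partial sums for its nearest centroids (a map step), and the partial results are then merged into new centroids (a reduce step). The function names `assign_partition`, `merge`, and `kmeans` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from math import dist


def assign_partition(points, centroids):
    """Map step: per-partition coordinate sums and counts, keyed by nearest centroid."""
    dim = len(centroids[0])
    partial = defaultdict(lambda: [[0.0] * dim, 0])
    for p in points:
        # Index of the centroid closest to p (Euclidean distance).
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        sums = partial[idx][0]
        for d, value in enumerate(p):
            sums[d] += value
        partial[idx][1] += 1
    return dict(partial)


def merge(partials, centroids):
    """Reduce step: combine the partial sums and recompute each centroid."""
    dim = len(centroids[0])
    totals = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for part in partials:
        for idx, (sums, n) in part.items():
            for d in range(dim):
                totals[idx][d] += sums[d]
            counts[idx] += n
    updated = []
    for i, old in enumerate(centroids):
        if counts[i] == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append(tuple(old))
        else:
            updated.append(tuple(t / counts[i] for t in totals[i]))
    return updated


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds over the list of partitions."""
    for _ in range(iterations):
        partials = [assign_partition(part, centroids) for part in partitions]
        centroids = merge(partials, centroids)
    return centroids
```

In a Spark deployment the per-partition step would typically run on the executors and only the small partial sums would travel to the driver, which is why the computation is split this way. A production system would more likely rely on a library implementation than on hand-written code like this.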
Keywords: distributed big data; Spark; data analysis; data modeling; unstructured data; fault diagnosis
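Foundation: 2017 Joint Fund Program of the Science and Technology Department of Guizhou Province, the Qiannan Prefecture Bureau of Science, Technology and Intellectual Property, and Qiannan Normal University for Nationalities (Grant No. 黔南科合社字(2017)95号)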
DOI: 10.16652/j.issn.1004-373x.2018.20.042