A comparison on big data fuzzy K-means algorithm based on MapReduce and Spark

doi:10.3969/j.issn.1000-1565.2020.04.015

Journal of Hebei University (Natural Science Edition), 2020, Vol. 40, Issue 4: 433-440.

ZHAI Junhai¹, TIAN Shi¹, ZHANG Sufang², WANG Mohan¹, SONG Dandan¹

1. Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, China; 2. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China

Received: 2019-09-09 Online: 2020-07-25 Published: 2020-07-25
Corresponding author: ZHANG Sufang (b. 1966), female, from Lixian, Hebei; associate professor at the Hebei Branch of China Meteorological Administration Training Centre; research interest: machine learning. E-mail: mczsf@hbu.cn
About the author: ZHAI Junhai (b. 1964), male, from Yixian, Hebei; professor at Hebei University, Ph.D.; research interests: cloud computing and big data processing, machine learning. E-mail: mczjh@hbu.cn
Funding:
Key Research and Development Program of Hebei Province (19210310D); Natural Science Foundation of Hebei Province (F2017201026); Hebei Province Postgraduate Professional Degree Teaching Case Library Construction Project (KCJSZ2018009); Hebei University Postgraduate Innovation Project (hbu2019ss077); Fifth Batch of Education and Teaching Reform Research Projects of the Industrial and Commercial College, Hebei University (JX201820)

Abstract

Two big data fuzzy K-means algorithms, one based on Hadoop MapReduce and one based on Spark, are compared in both principle and experiment, and the advantages and disadvantages of the two open-source big data platforms are summarized. Because fuzzy K-means is an iterative algorithm that must repeatedly process part of the data to obtain the final clustering result, the two algorithms are compared on five aspects: running time, number of task synchronizations, number of files, fault tolerance, and resource consumption. The conclusions are a useful reference for researchers in related fields, especially those working on big data machine learning.

Key words: big data, machine learning, clustering algorithm, fuzzy clustering algorithm, iterative algorithm
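
To make the iterative structure mentioned in the abstract concrete, the following is a minimal pure-Python sketch of one common formulation of fuzzy K-means (the standard fuzzy c-means update rules), split into a map phase and a reduce phase. It is an illustration under stated assumptions, not the authors' implementation: the function names (`memberships`, `map_partition`, `combine`, `fuzzy_kmeans`), the fuzzifier default `m=2.0`, and the stopping rule are all hypothetical choices, and plain Python lists stand in for data partitions on a cluster.

```python
from functools import reduce


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in every cluster (standard fuzzy c-means rule)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5 for center in centers]
    # A point that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]


def map_partition(points, centers, m=2.0):
    """Map phase: per-cluster partial sums for one partition of the data."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    weights = [0.0] * len(centers)
    for point in points:
        for j, u in enumerate(memberships(point, centers, m)):
            w = u ** m
            weights[j] += w
            for d in range(dim):
                sums[j][d] += w * point[d]
    return sums, weights


def combine(a, b):
    """Reduce phase: add two partial results element-wise."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    weights = [x + y for x, y in zip(a[1], b[1])]
    return sums, weights


def fuzzy_kmeans(partitions, centers, m=2.0, tol=1e-6, max_iter=100):
    """Repeat map + reduce until no center coordinate moves by tol or more."""
    for iteration in range(1, max_iter + 1):
        partials = [map_partition(part, centers, m) for part in partitions]
        sums, weights = reduce(combine, partials)  # one synchronization per iteration
        # Keep the old center if a cluster received no weight at all.
        new_centers = [
            [s / w for s in row] if w > 0.0 else list(old)
            for row, w, old in zip(sums, weights, centers)
        ]
        shift = max(abs(n - o) for nc, oc in zip(new_centers, centers) for n, o in zip(nc, oc))
        centers = new_centers
        if shift < tol:
            break
    return centers, iteration


if __name__ == "__main__":
    parts = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
    print(fuzzy_kmeans(parts, [(1.0, 1.0), (9.0, 9.0)]))
```

Every pass through the loop needs the combined result of all partitions before the next pass can start, which is why an iterative algorithm of this kind stresses synchronization and intermediate storage. In general terms, a MapReduce implementation would typically run each iteration as a separate job whose output is written to files, whereas Spark can keep the input data cached in memory across iterations.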

CLC number: TP181

ZHAI Junhai, TIAN Shi, ZHANG Sufang, WANG Mohan, SONG Dandan. A comparison on big data fuzzy K-means algorithm based on MapReduce and Spark[J]. Journal of Hebei University (Natural Science Edition), 2020, 40(4): 433-440.
