A comparison on big data fuzzy K-means algorithm based on MapReduce and Spark

doi:10.3969/j.issn.1000-1565.2020.04.015

Abstract

Two big data fuzzy K-means algorithms, one based on Hadoop and one based on Spark, are compared in principle and in experiment, and the advantages and disadvantages of the two open-source big data platforms are summarized. Because fuzzy K-means is an iterative algorithm, the data must be processed repeatedly to obtain the final clustering result. Accordingly, the two algorithms are compared on five aspects: running time, number of task synchronizations, number of files, fault tolerance, and resource consumption. Some valuable conclusions are obtained, which can be helpful to researchers in related fields, especially those engaged in the study of big data machine learning.
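To make the iterative structure concrete, the following is a minimal single-machine sketch of fuzzy K-means written in a map/reduce style. It is not the authors' implementation: the function names (`memberships`, `map_point`, `reduce_partials`, `fuzzy_kmeans`), the fuzzifier default `m=2.0`, and the stopping rule are illustrative assumptions, and the update formulas are the standard fuzzy c-means ones.

```python
import math
from functools import reduce


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in each cluster (standard FCM update)."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exp for dk in dists) for dj in dists]


def map_point(point, centers, m=2.0):
    """'Map' step: one point -> per-cluster (weighted coordinate sum, weight)."""
    u = memberships(point, centers, m)
    return [([(uj ** m) * x for x in point], uj ** m) for uj in u]


def reduce_partials(a, b):
    """'Reduce' step: add two lists of per-cluster partial sums."""
    return [([x + y for x, y in zip(sa, sb)], wa + wb)
            for (sa, wa), (sb, wb) in zip(a, b)]


def fuzzy_kmeans(points, centers, m=2.0, tol=1e-6, max_iter=100):
    """Repeat map + reduce until no center moves more than tol (assumed rule)."""
    for iteration in range(1, max_iter + 1):
        partials = [map_point(p, centers, m) for p in points]
        totals = reduce(reduce_partials, partials)
        new_centers = [tuple(s / w for s in sums) for sums, w in totals]
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iteration


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    final_centers, n_iter = fuzzy_kmeans(data, [(0, 0), (10, 10)])
    print(final_centers, n_iter)
```

Every pass of the loop needs the full data set plus the centers from the previous pass. On Hadoop MapReduce, each such pass is typically a separate job that reads its input from and writes its output to the distributed file system, whereas Spark can generally keep the data set cached in memory between passes; this is the kind of difference the five comparison aspects are meant to capture.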

Key words: big data, machine learning, clustering algorithm, fuzzy clustering algorithm, iterative algorithm

CLC Number: TP181

ZHAI Junhai, TIAN Shi, ZHANG Sufang, WANG Mohan, SONG Dandan. A comparison on big data fuzzy K-means algorithm based on MapReduce and Spark[J]. Journal of Hebei University (Natural Science Edition), 2020, 40(4): 433-440.
