Journal of Hebei University (Natural Science Edition) ›› 2020, Vol. 40 ›› Issue (4): 433-440. DOI: 10.3969/j.issn.1000-1565.2020.04.015

A comparison on big data fuzzy K-means algorithm based on MapReduce and Spark

ZHAI Junhai1, TIAN Shi1, ZHANG Sufang2, WANG Mohan1, SONG Dandan1   

  1. Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, China; 2. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China
  • Received: 2019-09-09 Online: 2020-07-25 Published: 2020-07-25

Abstract: Two big data fuzzy K-means algorithms, one based on Hadoop and the other on Spark, are compared in principle and in experiment, and the advantages and disadvantages of the two open-source big data platforms are summarized. Because fuzzy K-means is an iterative algorithm, some data must be processed repeatedly to obtain the final clustering results. Accordingly, the two algorithms are compared on five aspects: running time, number of task synchronizations, number of files, fault tolerance, and resource consumption. Some valuable conclusions are obtained, which can be helpful to researchers in related fields, especially those engaged in the study of big data machine learning.
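
To illustrate why the iterative nature of fuzzy K-means matters for the platform comparison, the following is a minimal single-machine sketch using the standard fuzzy c-means update rules. It is not the authors' MapReduce or Spark code: the function names (`memberships`, `update_centers`, `fuzzy_kmeans`), the fuzzifier `m = 2.0`, the tolerance, and the toy data are all assumptions made for illustration. The point it shows is that every iteration needs one full pass over the data followed by a combine step, which is the unit of work a distributed platform has to repeat.

```python
from math import dist

def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in every cluster (standard fuzzy c-means rule)."""
    d = [dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if 0.0 in d:
        hits = [1.0 if x == 0.0 else 0.0 for x in d]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [x ** -power for x in d]
    total = sum(inv)
    return [v / total for v in inv]

def update_centers(points, centers, m=2.0):
    """One full pass over the data: the work repeated in every iteration."""
    k, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    # Map-like step: each point contributes weighted partial sums.
    for p in points:
        u = memberships(p, centers, m)
        for j in range(k):
            w = u[j] ** m
            den[j] += w
            for a in range(dim):
                num[j][a] += w * p[a]
    # Reduce-like step: combine the sums into new centers
    # (keep the old center if a cluster received no weight).
    return [
        tuple(num[j][a] / den[j] for a in range(dim)) if den[j] > 0 else tuple(centers[j])
        for j in range(k)
    ]

def fuzzy_kmeans(points, centers, m=2.0, tol=1e-6, max_iter=100):
    """Repeat the pass until the centers stop moving; returns (centers, iterations)."""
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_centers = update_centers(points, centers, m)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iterations

# Toy data (assumed): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, iterations = fuzzy_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, iterations)
```

In this sketch each call to `update_centers` corresponds to one synchronization point: all per-point contributions must be combined before the next iteration can start. How the input data and the intermediate centers are kept between those passes (on disk or in memory) is where the two platforms differ.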

Key words: big data, machine learning, clustering algorithm, fuzzy clustering algorithm, iterative algorithm
