Journal of Hebei University (Natural Science Edition), 2020, Vol. 40, Issue 4: 433-440. DOI: 10.3969/j.issn.1000-1565.2020.04.015

A comparison of big data fuzzy K-means algorithms based on MapReduce and Spark

翟俊海1, 田石1, 张素芳2, 王谟瀚1, 宋丹丹1   

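  • Received: 2019-09-09 Published: 2020-07-25 Online: 2020-07-25
  • Corresponding author: ZHANG Sufang (born 1966), female, from Lixian, Hebei; associate professor at the Hebei Branch of China Meteorological Administration Training Centre; main research interest: machine learning. E-mail: mczsf@hbu.cn
  • About the author: ZHAI Junhai (born 1964), male, from Yixian, Hebei; professor at Hebei University, Ph.D.; main research interests: cloud computing and big data processing, and machine learning. E-mail: mczjh@hbu.cn
  • Funding:
    Key Research and Development Program of Hebei Province (19210310D); Natural Science Foundation of Hebei Province (F2017201026); Hebei Province Postgraduate Professional Degree Teaching Case Library Construction Project (KCJSZ2018009); Hebei University Postgraduate Innovation Project (hbu2019ss077); Fifth Batch of Education and Teaching Reform Research Projects of the Industrial and Commercial College, Hebei University (JX201820)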

A comparison of big data fuzzy K-means algorithms based on MapReduce and Spark

ZHAI Junhai1, TIAN Shi1, ZHANG Sufang2, WANG Mohan1, SONG Dandan1   

  1. Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, China; 2. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China
  • Received: 2019-09-09 Online: 2020-07-25 Published: 2020-07-25

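Abstract: Big data fuzzy K-means algorithms based on MapReduce and Spark are analyzed and compared from both the theoretical and the experimental side, and the strengths and weaknesses of the two open-source big data platforms are summarized. Because fuzzy K-means is an iterative algorithm that must repeatedly operate on part of the data to obtain the final clustering result, the comparison focuses on five aspects: algorithm execution time, number of synchronizations, number of files, fault tolerance, and resource consumption. The conclusions are of considerable reference value to researchers working on big data.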

Keywords: big data, machine learning, clustering algorithm, fuzzy clustering algorithm, iterative algorithm

Abstract: Two big data fuzzy K-means algorithms, one based on Hadoop MapReduce and one on Spark, are compared both in principle and experimentally, and the advantages and disadvantages of the two open-source big data platforms are summarized. Because fuzzy K-means is an iterative algorithm, part of the data must be processed repeatedly to obtain the final clustering result. Accordingly, the two algorithms are compared on five aspects: running time, number of task synchronizations, number of files, fault tolerance, and resource consumption. The conclusions can serve as a useful reference for researchers in related fields, especially those working on big data machine learning.

Key words: big data, machine learning, clustering algorithm, fuzzy clustering algorithm, iterative algorithm
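
To make the iterative structure mentioned in the abstract concrete, the following is a minimal single-machine sketch of fuzzy K-means in its standard textbook formulation (often called fuzzy c-means). It is not the authors' MapReduce or Spark implementation. The function name `fuzzy_kmeans`, its parameters, and the toy data are assumptions introduced only for illustration. Every pass of the loop scans the whole data set twice, once to update memberships and once to update centers, which is why the cost of repeated passes matters when comparing platforms.

```python
import random


def fuzzy_kmeans(points, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Illustrative fuzzy K-means on a list of equal-length numeric tuples.

    m > 1 is the fuzzifier. Returns (centers, memberships, iterations);
    memberships are those computed in the last pass, before the final
    center update.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Step 1: update the membership of every point in every cluster.
        memberships = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                     for c in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with one or more centers: share the
                # full membership among those centers only.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(hits)
                row = [h / total for h in hits]
            else:
                # u_ij = d_ij^(-2/(m-1)) / sum_l d_il^(-2/(m-1))
                inv = [d ** -exponent for d in dists]
                total = sum(inv)
                row = [v / total for v in inv]
            memberships.append(row)

        # Step 2: recompute each center as a mean weighted by u_ij^m.
        new_centers = []
        for j in range(k):
            weights = [row[j] ** m for row in memberships]
            wsum = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / wsum
                 for d in range(dim)]
            )

        # Step 3: stop once no center coordinate moved by more than tol.
        shift = max(abs(a - b)
                    for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships, iteration


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of four points.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    centers, memberships, n_iter = fuzzy_kmeans(data, k=2)
    print(sorted(centers), n_iter)
```

In a distributed setting, steps 1 and 2 are the parts that would be spread over the cluster, with one synchronization per pass. How each platform stores the data between passes is the kind of difference the comparison in the abstract addresses.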

CLC number: