河北大学学报(自然科学版) ›› 2019, Vol. 39 ›› Issue (2): 201-210.DOI: 10.3969/j.issn.1000-1565.2019.02.014

• • 上一篇    下一篇

基于Spark和SimHash的大数据K-近邻分类算法

翟俊海1,沈矗1,张素芳2,王婷婷1   

  • 收稿日期:2018-10-15 出版日期:2019-03-25 发布日期:2019-03-25
  • 通讯作者: 张素芳(1966—),女,河北蠡县人,中国气象局气象干部培训学院副教授,主要从事机器学习方向研究.E-mail:mczsf@126.com
  • 作者简介:翟俊海(1964—),男,河北易县人,河北大学教授,博士,主要从事云计算与大数据处理和机器学习方向研究. E-mail:mczjh@hbu.cn
  • 基金资助:
    河北省自然科学基金资助项目(F2017201026);河北大学自然科学研究计划项目(799207217071);河北大学研究生创新项目(hbu2018ss47);河北省研究生专业学位教学案例库建设项目(KCJSZ2018009)

K-nearest neighbor algorithm for big data classification based on Spark and SimHash

ZHAI Junhai1, SHEN Chu1, ZHANG Sufang2, WANG Tingting1   

  1. 1. Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, College of Mathematics and Information Science, Hebei University, Baoding 071002, China; 2. Hebei Branch of ChinaMeteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China
  • Received:2018-10-15 Online:2019-03-25 Published:2019-03-25

摘要: 在笔者之前的工作中,提出了一种基于MapReduce和SimHash的大数据K-近邻算法(H-MR-K-NN).虽然该算法能够有效解决大数据K-近邻算法的计算效率问题,运行时间远远低于基于MapReduce的K-近邻(MR-K-NN)所用的运行时间.然而,用MapReduce处理大数据时,需要从磁盘读取数据,再将中间结果写回磁盘,导致系统的I/O开销极大,这大大降低了MapReduce的效率.与MapReduce不同,Spark是一种基于内存的计算框架,它将数据第1次从磁盘读入内存,生成一种抽象的内存对象RDD(resilient distributed datasets).此后,Spark只操作内存中的RDD,计算过程只涉及内存读写,因此大幅提升了数据处理效率.基于这一事实,对算法H-MR-K-NN进行了改进,提出了一种改进的算法(简记为H-Spark-K-NN),可以进一步提高大数据K-近邻分类的运行效率.

关键词: 内存计算框架, K-近邻, 哈希技术, 分类算法, 大数据集

Abstract: In our previous work, based on MapRecuce and SimHash, we proposed a K-Nearest Neighbor(K-NN)algorithm denoted by H-MR-K-NN for big data classification. Although H-MR-K-NN can effectively solve the computational efficiency problem of K-NN for big data classification, and the running time of H-MR-K-NN is far lower than the one of MapRecuce K-NN denoted by MR-K-NN. However, when one handles big data with MapReduce, it is inevitable to read data from disk and to write the intermediate result back to disk, which results in huge I/O overhead and greatly degenerates the efficiency of MapReduce. Different from MapReduce, Spark is a memory-based computing framework that reads data into memory from disk first and then generates an abstract memory object RDD(resilient distributed datasets). Therefore,Spark only manipulates RDD in memory, and the- DOI:10.3969/j.issn.1000-1565.2019.02.014基于Spark和SimHash的大数据K-近邻分类算法翟俊海1,沈矗1,张素芳2,王婷婷1 (1. 河北大学 数学与信息科学学院 河北省机器学习与计算智能重点实验室,河北保定 071002;2. 中国气象局气象干部培训学院 河北分院,河北 保定 071000)摘 要: 在笔者之前的工作中,提出了一种基于MapReduce和SimHash的大数据K-近邻算法(H-MR-K-NN).虽然该算法能够有效解决大数据K-近邻算法的计算效率问题,运行时间远远低于基于MapReduce的K-近邻(MR-K-NN)所用的运行时间.然而,用MapReduce处理大数据时,需要从磁盘读取数据,再将中间结果写回磁盘,导致系统的I/O开销极大,这大大降低了MapReduce的效率.与MapReduce不同,Spark是一种基于内存的计算框架,它将数据第1次从磁盘读入内存,生成一种抽象的内存对象RDD(resilient distributed datasets).此后,Spark只操作内存中的RDD,计算过程只涉及内存读写,因此大幅提升了数据处理效率.基于这一事实,对算法H-MR-K-NN进行了改进,提出了一种改进的算法(简记为H-Spark-K-NN),可以进一步提高大数据K-近邻分类的运行效率.关键词:内存计算框架;K-近邻;哈希技术;分类算法;大数据集中图分类号:TP181 文献标志码:A 文章编号:1000-1565(2019)02-0201-10K-nearest neighbor algorithm for big data classification based onSpark and SimHashZHAI Junhai1, SHEN Chu1, ZHANG Sufang2, WANG Tingting1(1. Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, College of Mathematics and Information Science, Hebei University, Baoding 071002, China; 2. Hebei Branch of ChinaMeteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China)Abstract:In our previous work, based on MapRecuce and SimHash, we proposed a K-Nearest Neighbor(K-NN)algorithm denoted by H-MR-K-NN for big data classification. Although H-MR-K-NN can effectively solve the computational efficiency problem of K-NN for big data classification, and the running time of H-MR-K-NN is far lower than the one of MapRecuce K-NN denoted by MR-K-NN. However, when one handles big data with MapReduce, it is inevitable to read data from disk and to write the intermediate result back to disk, which results in huge I/O overhead and greatly degenerates the efficiency of MapReduce. Different from MapReduce, Spark is a memory-based computing framework that reads data into memory from disk first and then generates an abstract memory object RDD(resilient distributed datasets). Therefore,Spark only manipulates RDD in memory, and the- 收稿日期:2018-10-15 基金项目:河北省自然科学基金资助项目(F2017201026);河北大学自然科学研究计划项目(799207217071);河北大学研究生创新项目(hbu2018ss47);河北省研究生专业学位教学案例库建设项目(KCJSZ2018009) 第一作者:翟俊海(1964—),男,河北易县人,河北大学教授,博士,主要从事云计算与大数据处理和机器学习方向研究.E-mail:mczjh@hbu.cn 通信作者:张素芳(1966—),女,河北蠡县人,中国气象局气象干部培训学院副教授,主要从事机器学习方向研究.E-mail:mczsf@126.com第2期翟俊海等:基于Spark和SimHash的大数据K-近邻分类算法calculation process involves only memory reads and writes. Consequently, it greatly improves efficiency of data processing. Based on this fact, we improved the algorithm H-MR-K-NN, and proposed an improved algorithm denoted shortly by H-Spark-K-NN which can further improve the efficiency of K-NN for big data classification.

Key words: memory computing framework, K-nearest neighbor, hash technology, classification algorithms, big data sets

中图分类号: