Journal of Hebei University (Natural Science Edition) ›› 2019, Vol. 39 ›› Issue (2): 201-210.DOI: 10.3969/j.issn.1000-1565.2019.02.014


K-nearest neighbor algorithm for big data classification based on Spark and SimHash

ZHAI Junhai1, SHEN Chu1, ZHANG Sufang2, WANG Tingting1   

  1. Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, College of Mathematics and Information Science, Hebei University, Baoding 071002, China; 2. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China
  • Received: 2018-10-15  Online: 2019-03-25  Published: 2019-03-25
  • Supported by: the Natural Science Foundation of Hebei Province (F2017201026); the Natural Science Research Program of Hebei University (799207217071); the Postgraduate Innovation Funding Project of Hebei University (hbu2018ss47); the Hebei Province Postgraduate Professional Degree Teaching Case Library Construction Project (KCJSZ2018009)
  • First author: ZHAI Junhai (b. 1964), male, from Yi County, Hebei; professor and Ph.D. at Hebei University; research interests: cloud computing, big data processing, and machine learning. E-mail: mczjh@hbu.cn
  • Corresponding author: ZHANG Sufang (b. 1966), female, from Li County, Hebei; associate professor at the China Meteorological Administration Training Centre; research interest: machine learning. E-mail: mczsf@126.com

Abstract: In our previous work, based on MapReduce and SimHash, we proposed a K-Nearest Neighbor (K-NN) algorithm, denoted H-MR-K-NN, for big data classification. H-MR-K-NN effectively solves the computational efficiency problem of K-NN on big data, and its running time is far lower than that of MapReduce-based K-NN (MR-K-NN). However, when handling big data with MapReduce, it is inevitable to read data from disk and to write intermediate results back to disk, which incurs a huge I/O overhead and greatly degrades the efficiency of MapReduce. Unlike MapReduce, Spark is a memory-based computing framework: it reads data from disk into memory once and generates an abstract in-memory object, the RDD (resilient distributed dataset). Thereafter, Spark manipulates only RDDs in memory, so the computation involves only memory reads and writes, which greatly improves the efficiency of data processing. Based on this fact, we improved the algorithm H-MR-K-NN and propose an improved algorithm, denoted H-Spark-K-NN, which further improves the efficiency of K-NN classification on big data.
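The abstract combines two ingredients: SimHash fingerprints, which place similar samples into the same bucket so that a query only needs to be compared against a small candidate set, and Spark's cached RDDs, which keep those repeated passes in memory instead of on disk. The paper does not give implementation details, so the following PySpark sketch is only illustrative: the signature width, the MD5-based bit hash, the prefix bucketing, the toy random data, and all function names are assumptions of this sketch, not the authors' H-Spark-K-NN code.

import hashlib
import numpy as np
from pyspark import SparkContext

NUM_BITS = 64   # assumed SimHash signature width
PREFIX = 16     # assumed bucket key: leading bits of the signature
K = 5           # number of nearest neighbors

def simhash(features):
    """Compute a SimHash signature of a dense feature vector.

    Every dimension votes on every bit, weighted by its value and by a
    sign pattern derived from hashing the dimension index.
    """
    acc = np.zeros(NUM_BITS)
    for i, w in enumerate(features):
        h = int.from_bytes(hashlib.md5(str(i).encode()).digest()[:8], "big")
        for b in range(NUM_BITS):
            acc[b] += w if (h >> b) & 1 else -w
    bits = 0
    for b in range(NUM_BITS):
        if acc[b] > 0:
            bits |= 1 << b
    return bits

sc = SparkContext(appName="H-Spark-K-NN-sketch")

# Toy training data as (label, feature-vector) pairs; cached so that,
# unlike MapReduce, repeated passes read from memory rather than disk.
train = sc.parallelize([(y % 2, np.random.rand(10)) for y in range(1000)]).cache()

# Key every training sample by the leading bits of its SimHash signature,
# so that near-duplicate vectors tend to fall into the same bucket.
buckets = train.map(lambda p: (simhash(p[1]) >> (NUM_BITS - PREFIX), p)) \
               .groupByKey().mapValues(list).cache()

def classify(x):
    """Approximate K-NN: search only the query's SimHash bucket."""
    key = simhash(x) >> (NUM_BITS - PREFIX)
    cands = buckets.lookup(key)           # driver-side lookup of one bucket
    cands = cands[0] if cands else []
    cands.sort(key=lambda p: float(np.linalg.norm(p[1] - x)))
    top = [label for label, _ in cands[:K]]
    return max(set(top), key=top.count) if top else None

print(classify(np.random.rand(10)))

The .cache() calls are the point of contact with the abstract's argument: once the bucketed RDD is materialized in memory, every subsequent query touches only memory, avoiding exactly the disk I/O that dominates the MapReduce version.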

Key words: in-memory computing framework, K-nearest neighbor, hashing technique, classification algorithm, big data sets

CLC Number: TP181  Document code: A  Article ID: 1000-1565(2019)02-0201-10