Experimental comparison of two acceleration approaches for K-nearest neighbors

doi:10.3969/j.issn.1000-1565.2016.06.013

Abstract

K-NN (K-nearest neighbors) is a well-known data mining algorithm with a wide range of applications. Its idea is simple and it is easy to implement. The computational time and space complexity of K-NN are both O(n), where n is the number of instances in the training set. On larger training sets, and especially on big data sets, K-NN becomes very inefficient and may even be impracticable. This paper experimentally compares two approaches for accelerating K-NN on 8 data sets: CNN (condensed nearest neighbor) and MapReduce-based K-NN. Specifically, K-NN is implemented with MapReduce in a Hadoop environment and compared experimentally with CNN on the 8 data sets. Some valuable conclusions are obtained, which may be useful to researchers in related fields.
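The three ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the squared Euclidean distance, the toy data, and the map/reduce split (each partition emits its local k nearest candidates, and a reducer merges them and votes) are all assumptions made for illustration, and the MapReduce part is simulated in-process rather than run on Hadoop.

```python
import heapq
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    """Brute-force K-NN: one distance per training instance, hence O(n) per query."""
    nearest = heapq.nsmallest(k, train, key=lambda inst: sq_dist(inst[0], query))
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def condense(train):
    """CNN in the spirit of Hart's rule: keep a subset that 1-NN-classifies train correctly."""
    kept = [0]  # indices of stored instances; start from the first one
    changed = True
    while changed:
        changed = False
        for i, (features, label) in enumerate(train):
            if i in kept:
                continue
            store = [train[j] for j in kept]
            if knn_predict(store, features, k=1) != label:
                kept.append(i)  # misclassified, so it must be stored
                changed = True
    return [train[j] for j in kept]

def map_partition(partition, query, k):
    """Map step (hypothetical design): emit the k nearest (distance, label) pairs of one partition."""
    candidates = ((sq_dist(f, query), label) for f, label in partition)
    return heapq.nsmallest(k, candidates, key=lambda c: c[0])

def reduce_candidates(candidate_lists, k):
    """Reduce step (hypothetical design): merge local candidates, keep the global k, vote."""
    merged = (c for lst in candidate_lists for c in lst)
    top = heapq.nsmallest(k, merged, key=lambda c: c[0])
    return Counter(label for _, label in top).most_common(1)[0][0]

# Toy data: two well-separated classes (illustrative only).
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.9, 5.2), "b")]

condensed = condense(train)
partitions = [train[:3], train[3:]]
query = (4.8, 5.0)
mr_label = reduce_candidates([map_partition(p, query, 3) for p in partitions], 3)
```

The two approaches reduce the cost in different ways: CNN shrinks n by discarding instances that the stored subset already classifies correctly, while the MapReduce variant keeps all n instances but spreads the distance computations across partitions.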

Key words: K-nearest neighbors, data mining, MapReduce, Hadoop

CLC Number: TP18

ZHAI Junhai, WANG Tingting, ZHANG Mingyang, WANG Yaoda, LIU Mingming. Experimental comparison of two acceleration approaches for K-nearest neighbors [J]. Journal of Hebei University (Natural Science Edition), 2016, 36(6): 650-656.
