Journal of Hebei University (Natural Science Edition) ›› 2016, Vol. 36 ›› Issue (6): 650-656. DOI: 10.3969/j.issn.1000-1565.2016.06.013


  • About the first author: ZHAI Junhai (1964—), male, from Yixian, Hebei; professor and PhD at Hebei University; his main research interests are machine learning and data mining. E-mail: mczjh@126.com
  • Funding:
    National Natural Science Foundation of China (71371063); Key Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD20131028); Graduate Innovation Project of Hebei University (X2016059)

Experimental comparison of two acceleration approaches for K-nearest neighbors

ZHAI Junhai,WANG Tingting,ZHANG Mingyang,WANG Yaoda,LIU Mingming   

  1. College of Mathematics and Information Science, Hebei University, Baoding 071002, China
  • Received:2016-07-11 Online:2016-11-25 Published:2016-11-25


Abstract: K-NN (K-nearest neighbors) is a well-known data mining algorithm with a wide range of applications. The idea of K-NN is simple and easy to implement, and both its time and space complexity are O(n), where n is the number of instances in the training set. When the training set is large, and especially on big data sets, K-NN becomes very inefficient or even impracticable. This paper experimentally compares two acceleration approaches for K-NN: the condensed nearest neighbor (CNN) method and a MapReduce-based K-NN. Specifically, K-NN was implemented with MapReduce in a Hadoop environment and compared with CNN on 8 data sets. Some valuable conclusions are obtained, which may be useful for researchers in related fields.
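The O(n) per-query cost and the CNN baseline discussed in the abstract can be illustrated with a minimal sketch (plain Python; the function names, the brute-force distance scan, and the Hart-style condensing loop are illustrative assumptions, not code from the paper):

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    Every query scans all n training instances: the O(n) cost noted above."""
    # Squared Euclidean distance from the query to every training instance.
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_X]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among the labels of the k nearest neighbors.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cnn_condense(train_X, train_y):
    """Hart-style condensed nearest neighbor: grow a small store by adding
    only the instances that 1-NN over the current store misclassifies."""
    keep = [0]                      # seed the store with the first instance
    changed = True
    while changed:                  # repeat until a full pass adds nothing
        changed = False
        for i in range(len(train_X)):
            kept_X = [train_X[j] for j in keep]
            kept_y = [train_y[j] for j in keep]
            if knn_predict(kept_X, kept_y, train_X[i], k=1) != train_y[i]:
                keep.append(i)      # misclassified: add it to the store
                changed = True
    return keep                     # indices of the condensed training set
```

On well-separated classes, `cnn_condense` keeps only a few prototypes per class, which is the storage and query-time reduction that CNN trades against some loss of accuracy on the original training set.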

Key words: K-nearest neighbors, data mining, MapReduce, Hadoop
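The MapReduce-based K-NN compared in this paper can be sketched as a single-process simulation of the map and reduce phases (the partitioning scheme and function names are illustrative assumptions; a real implementation would use Hadoop Mapper/Reducer classes over HDFS splits):

```python
from collections import Counter

def mapreduce_knn(partitions, query, k=3):
    """Simulate MapReduce K-NN: each 'mapper' scans one partition of the
    training set and emits its local k nearest neighbors; one 'reducer'
    merges those candidates into the global k nearest and votes."""
    candidates = []
    for part_X, part_y in partitions:   # map phase: one partition per mapper
        local = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, query)), y)
            for x, y in zip(part_X, part_y)
        )[:k]                           # local top-k (distance, label) pairs
        candidates.extend(local)
    # Reduce phase: the global k nearest must be among the local top-k lists.
    top_k = sorted(candidates)[:k]
    return Counter(y for _, y in top_k).most_common(1)[0][0]
```

The correctness of the merge rests on a simple observation: any globally k-nearest neighbor is also among the k nearest within its own partition, so the reducer never misses a candidate, while each mapper only touches its own share of the n training instances.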

CLC number: