Journal of Hebei University (Natural Science Edition) ›› 2016, Vol. 36 ›› Issue (6): 650-656. DOI: 10.3969/j.issn.1000-1565.2016.06.013


  • About the first author: ZHAI Junhai (1964—), male, from Yixian, Hebei; professor and PhD at Hebei University; his main research interests are machine learning and data mining. E-mail: mczjh@126.com
  • Funding:
    National Natural Science Foundation of China (71371063); Key Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD20131028); Graduate Innovation Project of Hebei University (X2016059)

Experimental comparison of two acceleration approaches for K-nearest neighbors

ZHAI Junhai,WANG Tingting,ZHANG Mingyang,WANG Yaoda,LIU Mingming   

  1. College of Mathematics and Information Science, Hebei University, Baoding 071002, China
  • Received:2016-07-11 Online:2016-11-25 Published:2016-11-25


Abstract: K-NN (K-nearest neighbors) is a well-known data mining algorithm with a wide range of applications. The idea of K-NN is simple and easy to implement, and both its time and space complexity are O(n), where n is the number of instances in the training set. When the training set is large, and especially on big data sets, K-NN becomes very inefficient or even impracticable. This paper experimentally compares two acceleration approaches for K-NN: the condensed nearest neighbor (CNN) method and a MapReduce-based K-NN. Specifically, K-NN was implemented with MapReduce in a Hadoop environment and compared with CNN on 8 data sets. Some valuable conclusions are obtained, which may be useful for researchers in related fields.
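The O(n) per-query cost and the CNN baseline discussed in the abstract can be illustrated with a minimal sketch (plain Python; the function names, the brute-force distance scan, and the Hart-style condensing loop are illustrative assumptions, not code from the paper):

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    Every query scans all n training instances: the O(n) cost noted above."""
    # Squared Euclidean distance from the query to every training instance.
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_X]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among the labels of the k nearest neighbors.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cnn_condense(train_X, train_y):
    """Hart-style condensed nearest neighbor: grow a small store by adding
    only the instances that 1-NN over the current store misclassifies."""
    keep = [0]                      # seed the store with the first instance
    changed = True
    while changed:                  # repeat until a full pass adds nothing
        changed = False
        for i in range(len(train_X)):
            kept_X = [train_X[j] for j in keep]
            kept_y = [train_y[j] for j in keep]
            if knn_predict(kept_X, kept_y, train_X[i], k=1) != train_y[i]:
                keep.append(i)      # misclassified: add it to the store
                changed = True
    return keep                     # indices of the condensed training set
```

On well-separated classes, `cnn_condense` keeps only a few prototypes per class, which is the storage and query-time reduction that CNN trades against some loss of accuracy on the original training set.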

Key words: K-nearest neighbors, data mining, MapReduce, Hadoop
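The MapReduce-based K-NN compared in this paper can be sketched as a single-process simulation of the map and reduce phases (the partitioning scheme and function names are illustrative assumptions; a real implementation would use Hadoop Mapper/Reducer classes over HDFS splits):

```python
from collections import Counter

def mapreduce_knn(partitions, query, k=3):
    """Simulate MapReduce K-NN: each 'mapper' scans one partition of the
    training set and emits its local k nearest neighbors; one 'reducer'
    merges those candidates into the global k nearest and votes."""
    candidates = []
    for part_X, part_y in partitions:   # map phase: one partition per mapper
        local = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, query)), y)
            for x, y in zip(part_X, part_y)
        )[:k]                           # local top-k (distance, label) pairs
        candidates.extend(local)
    # Reduce phase: the global k nearest must be among the local top-k lists.
    top_k = sorted(candidates)[:k]
    return Counter(y for _, y in top_k).most_common(1)[0][0]
```

The correctness of the merge rests on a simple observation: any globally k-nearest neighbor is also among the k nearest within its own partition, so the reducer never misses a candidate, while each mapper only touches its own share of the n training instances.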

CLC number: