Journal of Hebei University (Natural Science Edition) ›› 2018, Vol. 38 ›› Issue (3): 299-308. DOI: 10.3969/j.issn.1000-1565.2018.03.011
ZHANG Sufang1, ZHAI Junhai2, WANG Cong2, SHEN Chu2, ZHAO Chunling2
Received:
2017-12-23
Published:
2018-05-25
Online:
2018-05-25
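Corresponding author:
ZHAI Junhai (翟俊海, b. 1964), male, native of Yi County, Hebei; professor at Hebei University, Ph.D.; research interests: machine learning and data mining. E-mail: mczjh@126.com
About the author:
ZHANG Sufang (张素芳, b. 1966), female, native of Li County, Hebei; associate professor at the Hebei Branch of the China Meteorological Administration Training Centre; research interest: machine learning. E-mail: mczsf@126.com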
ZHANG Sufang, ZHAI Junhai, WANG Cong, SHEN Chu, ZHAO Chunling. Big data and big data machine learning (大数据与大数据机器学习) [J]. Journal of Hebei University (Natural Science Edition), 2018, 38(3): 299-308.