Big data and big data machine learning

doi:10.3969/j.issn.1000-1565.2018.03.011

Abstract

The era of big data has arrived. Big data refers to data that is usually characterized by five features: volume, variety, velocity, veracity, and value. In recent years, big data research has been the hottest topic in the field of information processing and has drawn great attention from industry, academia, and governments, because great value can be found in big data. Mining that value to support decision making by companies and governments is of great significance. Big data poses many challenges to traditional machine learning, and these challenges can be analyzed in terms of the five features of big data or from five different views. This paper first introduces the concept of big data and carefully analyzes the connotations of the five features, and then focuses on the challenges that big data brings to machine learning and their possible solutions. The paper should be helpful to researchers in related fields, especially those engaged in the study of big data machine learning.

Authors: ZHANG Sufang¹, ZHAI Junhai², WANG Cong², SHEN Chu², ZHAO Chunling²

Affiliations: 1. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding 071000, China; 2. Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province, College of Mathematics and Information Science, Hebei University, Baoding 071002, China

Document code: A

Article ID: 1000-1565(2018)03-0299-10

Received: 2017-12-23

Funding: National Natural Science Foundation of China (71371063); Natural Science Foundation of Hebei Province (F2017201026); Natural Science Research Program of Hebei University (799207217071); Postgraduate Innovation Fund of Hebei University (hbu2018ss47); Undergraduate Innovation Training Program of Hebei University (2017071)

First author: ZHANG Sufang (born 1966), female, from Lixian, Hebei; associate professor at the Hebei Branch of China Meteorological Administration Training Centre; research interest: machine learning. E-mail: mczsf@126.com

Corresponding author: ZHAI Junhai (born 1964), male, from Yixian, Hebei; professor at Hebei University, Ph.D.; research interests: machine learning and data mining. E-mail: mczjh@126.com

Key words: big data, machine learning, cloud computing, decision making

CLC Number: TP181

ZHANG Sufang, ZHAI Junhai, WANG Cong, SHEN Chu, ZHAO Chunling. Big data and big data machine learning[J]. Journal of Hebei University (Natural Science Edition), 2018, 38(3): 299-308.
