Research on archives text classification based on Naive Bayes

doi:10.3969/j.issn.1000-1565.2018.05.016

Abstract

This paper applies the Naive Bayesian classification algorithm to the archive data resources of Gansu Province in order to classify archive resources automatically. Based on the characteristics of the archive data, feature attributes suited to the archive text were selected, with the TFIDF algorithm used for feature attribute selection. The experimental results show that the classification model is suitable for archival text resources and that it achieves automatic classification of archives. Compared with the traditional Naive Bayesian classification method, the proposed model improves classification efficiency on archives by 1% to 2%, making it a more effective classification model for archives.
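The pipeline summarized above (TFIDF-based feature selection followed by Naive Bayesian classification) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, corpus, or parameters, so everything here is an assumption: the toy documents and class labels are invented, the tokens are English placeholders (real archive text in Chinese would first need word segmentation), features are ranked by summed TF-IDF weight, and the classifier is a standard multinomial Naive Bayes with add-one smoothing rather than the authors' specific model.

```python
import math
from collections import Counter, defaultdict

def tfidf_scores(docs):
    """Sum each term's TF-IDF weight over all documents (assumed ranking rule)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = defaultdict(float)
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n / df[term])  # 0 for terms present in every document
            scores[term] += tf * idf
    return scores

def select_features(docs, k):
    """Keep the k terms with the highest summed TF-IDF weight."""
    scores = tfidf_scores(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes restricted to vocab, with add-one smoothing."""
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc if t in vocab)
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        log_prior = math.log(class_docs[c] / len(docs))
        log_like = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[c] = (log_prior, log_like)
    return model

def classify(model, vocab, doc):
    """Return the class with the highest log posterior; out-of-vocab terms are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in doc if t in vocab)
    return max(sorted(model), key=score)

# Hypothetical toy corpus: two invented archive categories.
docs = [
    ["archive", "personnel", "appointment", "notice", "staff"],
    ["archive", "staff", "transfer", "personnel", "record"],
    ["archive", "budget", "audit", "finance", "report"],
    ["archive", "finance", "invoice", "budget", "notice"],
]
labels = ["personnel", "personnel", "finance", "finance"]

vocab = select_features(docs, k=11)   # "archive" occurs everywhere, so it scores 0
model = train_nb(docs, labels, vocab)
print(classify(model, vocab, ["archive", "budget", "invoice"]))
print(classify(model, vocab, ["staff", "appointment", "archive"]))
```

In this sketch the term "archive" appears in every document, so its inverse document frequency is zero and it drops out of the selected vocabulary; this is the kind of uninformative term that TF-IDF selection is meant to filter before the classifier is trained.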

Key words: archives text resource, file feature, text classification, Naive Bayesian classification

CLC Number: TP391.1

LIU Peixin, YU Hongzhi, XU Tao. Research on archives text classification based on Naive Bayes[J]. Journal of Hebei University (Natural Science Edition), 2018, 38(5): 549-554.
