Journal of Hebei University (Natural Science Edition) ›› 2018, Vol. 38 ›› Issue (5): 549-554. DOI: 10.3969/j.issn.1000-1565.2018.05.016

Research on archive classification based on Naive Bayes

LIU Peixin, YU Hongzhi, XU Tao

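  • Received: 2017-06-05 Published: 2018-09-25 Released: 2018-09-25
  • Corresponding author: XU Tao (born 1987), male, from Guang'an, Sichuan; associate professor at Northwest Minzu University, Ph.D.; main research area: natural language processing. E-mail: 395373512@qq.com
  • About the author: LIU Peixin (born 1994), male, from Hengshui, Hebei; master's student at Northwest Minzu University; main research area: natural language processing. E-mail: 421642185@qq.com
  • Supported by:
    Archive Resource Mining Platform of the Gansu Provincial Archives Bureau (Gan Dang Fa [2016] No. 71)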

Research on archives text classification based on Naive Bayes

LIU Peixin, YU Hongzhi, XU Tao

  1. Key Laboratory of China's Ethnic Languages and Information Technology, Northwest Minzu University, Lanzhou 730030, China
  • Received:2017-06-05 Online:2018-09-25 Published:2018-09-25

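Abstract: This study analyzes the data resources of the Gansu Provincial Archives Bureau and combines that analysis with the Naive Bayes classification algorithm to study the classification of archive resources. Based on the characteristics of the archive data, the TFIDF (term frequency-inverse document frequency) algorithm is used to select attributes that match the topics of the archive texts. Experiments on sample data show that the classification model is suitable for classifying archive text resources and achieves automatic classification of archive resources. Compared with the traditional Naive Bayes classification method, the proposed model improves classification efficiency on archive resources by 1% to 2%.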

Keywords: archive text resources, archive features, text classification, Naive Bayes classifier

Abstract: This paper analyzes the data resources of the Gansu Provincial Archives Bureau and combines them with the Naive Bayes classification algorithm to study the classification of archive resources. According to the characteristics of the archive data, the TFIDF (term frequency-inverse document frequency) algorithm is used to select feature attributes that match the topics of the archive texts. The experimental results show that the classification model is suitable for classifying archival text resources and that it achieves automatic classification of archives. Compared with the traditional Naive Bayes classification method, the proposed model improves classification efficiency on archive resources by 1% to 2%, which makes it a more effective classification model for archives.

Key words: archive text resources, archive features, text classification, Naive Bayes classifier
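
The pipeline summarized in the abstract (TFIDF-based attribute selection followed by Naive Bayes classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the TFIDF scores are aggregated, how many attributes are kept, or which smoothing is used. The sketch below assumes that TFIDF scores are summed over documents, that the top `k` terms are kept, and that a multinomial model with add-one smoothing is used. The function names, the toy documents, and the categories `personnel` and `finance` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_select(docs, k):
    """Return the k terms with the highest TF-IDF score summed over all documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = defaultdict(float)
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_nb(docs, labels, vocab):
    """Fit a multinomial Naive Bayes model restricted to the selected terms."""
    class_sizes = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc if t in vocab)
    model = {}
    for label, n in class_sizes.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n / len(docs))
        # Add-one (Laplace) smoothing; an assumption, not stated in the abstract.
        log_like = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict(model, doc):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        log_prior, log_like = model[label]
        # Terms outside the selected vocabulary are ignored.
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)


# Hypothetical, pre-tokenized toy documents (real archive text would need
# Chinese word segmentation first).
docs = [
    ["staff", "appointment", "notice", "staff"],
    ["staff", "transfer", "notice"],
    ["budget", "audit", "notice", "budget"],
    ["budget", "expense", "report"],
]
labels = ["personnel", "personnel", "finance", "finance"]

vocab = tfidf_select(docs, k=7)
model = train_nb(docs, labels, vocab)
print(sorted(vocab))
print(predict(model, ["staff", "transfer", "report"]))
print(predict(model, ["budget", "audit", "notice"]))
```

In this toy data the term `notice` appears in three of the four documents, so its TFIDF score is the lowest and it is dropped from the vocabulary. That is the intended effect of the selection step: terms spread across categories carry little topical signal. The sketch does not reproduce the 1% to 2% improvement reported in the abstract.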

CLC Number: