Journal of Hebei University (Natural Science Edition) ›› 2017, Vol. 37 ›› Issue (1): 108-112. DOI: 10.3969/j.issn.1000-1565.2017.01.016


Research on document similarity based on term synonymy

ZHANG Xizhong1, XU Jianmin2

  1. Institute of Information Technology, Baoding Education Examinations Authority, Baoding 071000, China; 2. School of Computer Science and Technology, Hebei University, Baoding 071002, China
  • Received: 2016-10-10 Online: 2017-01-25 Published: 2017-01-25
  • Corresponding author: XU Jianmin (b. 1966), male, born in Guantao, Hebei; professor at Hebei University, Ph.D.; research interests: information retrieval and uncertain information processing. E-mail: hbuxjm@hbu.edu.cn
  • About the author: ZHANG Xizhong (b. 1966), male, born in Hengshui, Hebei; senior engineer at Baoding Education Examinations Authority; research interests: information management and management information systems. E-mail: 13903223288@139.com
  • Supported by:
    Natural Science Foundation of Hebei Province (F2015201142); Social Science Foundation of Hebei Province (HB15SH064)

Abstract: Document similarity algorithms based on the vector space model (VSM) assume that feature terms are mutually orthogonal. When two documents are described with different terms that have similar meanings, VSM therefore cannot accurately reflect their similarity. To address this problem, the paper exploits synonym relations between words: building on definitions and algorithms for the similarity between a term and a term group and between two term groups, it gives a method for computing document similarity based on word similarity relations. The experiments use scientific literature documents and news reports as test collections and compare the classification performance of the new method with that of the VSM algorithm. The results show that the new method can improve the accuracy of document classification.

Key words: synonyms, term similarity, document similarity
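
The abstract names the building blocks (term-to-term-group similarity and term-group-to-term-group similarity) but does not state their formulas. The following is only a minimal sketch of the general idea, not the authors' definitions: the synonym table `SYNONYM_SIM`, the use of the best match for a term against a group, the two-way averaging in `group_sim`, and all function names are assumptions made for illustration. It contrasts a plain VSM cosine with a synonym-aware score on two documents that share no surface terms.

```python
import math
from collections import Counter

# Hypothetical synonym table (illustrative values, not from the paper):
# an unordered term pair maps to a similarity in [0, 1].
SYNONYM_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}

def term_sim(a, b):
    """Term-to-term similarity: 1 if identical, table value if listed, else 0."""
    if a == b:
        return 1.0
    return SYNONYM_SIM.get(frozenset((a, b)), 0.0)

def term_to_group_sim(term, group):
    """Term-to-group similarity, assumed here to be the best match in the group."""
    return max((term_sim(term, g) for g in group), default=0.0)

def group_sim(g1, g2):
    """Group-to-group similarity, assumed here to be the mean of the
    average best-match scores taken in both directions."""
    if not g1 or not g2:
        return 0.0
    s12 = sum(term_to_group_sim(t, g2) for t in g1) / len(g1)
    s21 = sum(term_to_group_sim(t, g1) for t in g2) / len(g2)
    return (s12 + s21) / 2

def cosine_sim(d1, d2):
    """Plain VSM cosine on raw term counts; distinct terms are orthogonal."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

doc_a = ["fast", "car"]
doc_b = ["quick", "automobile"]

print("VSM cosine:", cosine_sim(doc_a, doc_b))
print("synonym-aware:", group_sim(set(doc_a), set(doc_b)))
```

Because the two documents have no term in common, the cosine score is zero, while the synonym-aware score is high under the assumed table. Term weighting (for example TF-IDF) and the source of the synonym relations are left out of this sketch.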

CLC number: