Journal of Hebei University (Natural Science Edition) ›› 2017, Vol. 37 ›› Issue (6): 652-661. DOI: 10.3969/j.issn.1000-1565.2017.06.013

A retrieval model of scientific documents based on mathematical expression features

TIAN Xuedong, CUI Xiaojuan

  1. College of Computer Science and Technology, Hebei University, Baoding 071002, China
  • Received: 2017-09-18 Published: 2017-11-25
  • About the author: TIAN Xuedong (born 1963), male, from Tianjin; professor at Hebei University, Ph.D.; research interests: pattern recognition and information retrieval. E-mail: xuedong_tian@126.com
  • Funding:
    National Natural Science Foundation of China (61375075); Key Project of Science and Technology Research of Higher Education Institutions of Hebei Province, Hebei Provincial Department of Education (ZD2017208)

Abstract: Existing full-text retrieval techniques mostly target textual information, while the retrieval of scientific documents whose main components are mathematical expressions is still at an exploratory stage. To let users conveniently retrieve scientific documents using mathematical formulas as the query language, a scientific document retrieval model based on mathematical expression features is proposed. First, each formula is parsed into a binary tree to obtain its subformulas, and the expression together with its subformulas is used to construct retrieval feature vectors. In the indexing phase, a hierarchically structured index table is built from the extracted document feature vectors. In the matching phase, the document vectors are weighted by tf-idf, and the cosine similarity between the query vector and each document vector is computed to produce a ranked list of documents. The experimental data comprise 5 017 scientific documents containing 96 362 mathematical formulas, drawn from journals in different fields, academic websites, and public data sets. The average retrieval time was 0.428 s, which indicates that the proposed model achieves its goal of relatively efficient scientific document retrieval.

Key words: scientific documents, mathematical expressions, retrieval, indexing, matching, binary tree, features

CLC number: TP391    Document code: A    Article ID: 1000-1565(2017)06-0652-10
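
The pipeline summarized in the abstract (binary-tree parsing, subformula features, tf-idf weighting, cosine ranking) can be illustrated with a minimal sketch. This is not the authors' implementation. The `Node` class, the parenthesized serialization, the choice to treat only operator-rooted subtrees as features, the smoothed idf formula, and the use of raw counts for the query vector are all assumptions made for illustration. The hierarchical index table is omitted because the abstract does not describe its layout.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical binary expression-tree node: an operator with children, or a leaf symbol."""
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def to_string(node: Node) -> str:
    """Serialize a subtree in fully parenthesized infix form."""
    if node.left is None and node.right is None:
        return node.label
    left = to_string(node.left) if node.left else ""
    right = to_string(node.right) if node.right else ""
    return f"({left}{node.label}{right})"


def subformulas(node: Optional[Node]) -> list:
    """Return the formula and every subformula rooted at an operator node (assumed feature set)."""
    if node is None or (node.left is None and node.right is None):
        return []
    return [to_string(node)] + subformulas(node.left) + subformulas(node.right)


def tfidf_vectors(docs: dict) -> dict:
    """Build one sparse tf-idf vector per document from the subformulas of its formulas."""
    counts = {
        name: Counter(f for tree in trees for f in subformulas(tree))
        for name, trees in docs.items()
    }
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    # Smoothed idf (an assumption; the abstract does not give the exact formula).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return {name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()}


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query: Node, docs: dict) -> list:
    """Rank documents by cosine similarity to the query formula, best match first."""
    query_vec = dict(Counter(subformulas(query)))  # raw counts for the query (assumption)
    vectors = tfidf_vectors(docs)
    scores = [(name, cosine(query_vec, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy corpus: doc_a contains a + b*c, doc_b contains x - y; the query is b*c.
docs = {
    "doc_a": [Node("+", Node("a"), Node("*", Node("b"), Node("c")))],
    "doc_b": [Node("-", Node("x"), Node("y"))],
}
ranked = rank(Node("*", Node("b"), Node("c")), docs)
print(ranked)
```

In this toy corpus the query `b*c` matches a subformula of `doc_a` only, so `doc_a` ranks first and `doc_b` scores zero. Serializing subtrees as strings only finds exact structural matches; handling renamed variables or commutative operands would require a normalization step that is not sketched here.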