Text similarity algorithm combining semantics based on vector space model
FENG Gaolei (冯高磊), GAO Songfeng (高嵩峰)
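Abstract:
The vector space model (VSM) method ignores word semantics and the structural relations among words, and does not consider the actual meaning that words express. To address these shortcomings, a new text similarity calculation method is proposed. The method integrates semantic similarity computation into the VSM-based text similarity algorithm, and obtains the final text similarity as a weighted combination of the semantic similarity and the VSM similarity. Experimental results show that the recall of the proposed algorithm improves to varying degrees over both the VSM method and existing semantic similarity algorithms, which demonstrates the effectiveness of the algorithm.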
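The weighted combination described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the weight name `alpha` and its default of 0.5, the use of raw term frequencies for the VSM vectors, the best-match averaging used as the text-level semantic score, and the toy word-level similarity function `toy_word_sim` (which stands in for whatever lexical resource supplies word similarities).

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """VSM similarity: cosine between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity(tokens_a, tokens_b, word_sim):
    """Assumed text-level semantic score: for each word, take its best
    match in the other text, average, and symmetrize both directions."""
    if not tokens_a or not tokens_b:
        return 0.0

    def directed(src, dst):
        return sum(max(word_sim(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(tokens_a, tokens_b) + directed(tokens_b, tokens_a))


def combined_similarity(tokens_a, tokens_b, word_sim, alpha=0.5):
    """Weighted sum of semantic and VSM similarity (alpha is hypothetical)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    sem = semantic_similarity(tokens_a, tokens_b, word_sim)
    vsm = cosine_similarity(tokens_a, tokens_b)
    return alpha * sem + (1.0 - alpha) * vsm


def toy_word_sim(w1, w2):
    """Toy stand-in for a lexicon-based word similarity (illustrative only)."""
    if w1 == w2:
        return 1.0
    synonyms = {frozenset(("car", "automobile"))}
    return 0.8 if frozenset((w1, w2)) in synonyms else 0.0


if __name__ == "__main__":
    a = ["red", "car"]
    b = ["red", "automobile"]
    print(cosine_similarity(a, b))
    print(combined_similarity(a, b, toy_word_sim, alpha=0.5))
```

In this toy pair the two texts share only one surface term, so the pure VSM score is low, while the semantic term credits the synonym pair and raises the combined score. How the weight should be chosen is not stated in the abstract, so `alpha` is left as a free parameter here.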
Keywords: text similarity; vector space model; semantics; term frequency; recall; feature term
DOI: 10.16652/j.issn.1004-373x.2018.11.035