Minimizing Cost Filtering Algorithm for Spam E-mail Based on Bayesian
Xie Jinjing (谢金晶), Zhang Yibin (张艺濒)
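Abstract:
Two kinds of error carry a cost in spam filtering: false positives, where legitimate mail is classified as spam, and false negatives, where spam is classified as legitimate mail. To reduce the loss caused by both, this paper first proposes a new evaluation function built on two existing evaluation functions for text feature extraction, expected cross entropy and mutual information. The new function extracts more representative e-mail feature vectors. On this basis, the paper proposes a loss-reducing spam filtering method based on the Bayes formula. Simulation tests show that the new method, using the new evaluation function, effectively lowers both the false-positive rate and the false-negative rate.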
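The abstract does not state the new evaluation function or the exact loss model, so neither is reproduced here. As a rough illustration only, the sketch below shows the two standard building blocks the abstract names: the textbook expected cross entropy score for a token, and a minimum-expected-loss decision on top of a naive Bayes posterior. The function names, the Laplace smoothing, the toy corpus, and the cost values `cost_fp` and `cost_fn` are all assumptions made for this example, not details from the paper.

```python
import math
from collections import Counter

def train(messages):
    """Count documents per class and, per class, how many documents contain each token.

    messages: iterable of (tokens, label) pairs, label is "spam" or "ham".
    Both classes must appear at least once for the functions below to work.
    """
    doc_counts = Counter()
    token_counts = {"spam": Counter(), "ham": Counter()}
    for tokens, label in messages:
        doc_counts[label] += 1
        token_counts[label].update(set(tokens))
    return doc_counts, token_counts

def expected_cross_entropy(token, doc_counts, token_counts):
    """Textbook ECE score: P(t) * sum_c P(c|t) * log(P(c|t) / P(c))."""
    total = sum(doc_counts.values())
    with_token = sum(token_counts[c][token] for c in doc_counts)
    if with_token == 0:
        return 0.0
    score = 0.0
    for c in doc_counts:
        p_c = doc_counts[c] / total
        p_c_given_t = token_counts[c][token] / with_token
        if p_c_given_t > 0:
            score += p_c_given_t * math.log(p_c_given_t / p_c)
    return (with_token / total) * score

def posterior_spam(tokens, doc_counts, token_counts):
    """P(spam | tokens) via the Bayes formula with a naive independence assumption.

    Only tokens present in the message are used; Laplace smoothing (+1 / +2)
    is an assumption of this sketch.
    """
    total = sum(doc_counts.values())
    log_score = {}
    for c in ("spam", "ham"):
        log_score[c] = math.log(doc_counts[c] / total)
        for t in set(tokens):
            log_score[c] += math.log((token_counts[c][t] + 1) / (doc_counts[c] + 2))
    # Normalise in a numerically stable way.
    top = max(log_score.values())
    weight = {c: math.exp(s - top) for c, s in log_score.items()}
    return weight["spam"] / (weight["spam"] + weight["ham"])

def classify(tokens, doc_counts, token_counts, cost_fp=5.0, cost_fn=1.0):
    """Pick the label with the smaller expected loss.

    cost_fp: hypothetical cost of marking legitimate mail as spam.
    cost_fn: hypothetical cost of letting spam through.
    """
    p_spam = posterior_spam(tokens, doc_counts, token_counts)
    loss_if_spam = cost_fp * (1.0 - p_spam)  # wrong only if the mail is legitimate
    loss_if_ham = cost_fn * p_spam           # wrong only if the mail is spam
    return "spam" if loss_if_spam < loss_if_ham else "ham"

# Toy corpus, invented for illustration.
TOY_MESSAGES = [
    (["cheap", "pills", "offer"], "spam"),
    (["cheap", "offer", "now"], "spam"),
    (["meeting", "agenda", "now"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

if __name__ == "__main__":
    docs, toks = train(TOY_MESSAGES)
    print(expected_cross_entropy("cheap", docs, toks))
    print(classify(["cheap", "offer"], docs, toks, cost_fp=1.0))
    print(classify(["cheap", "offer"], docs, toks, cost_fp=20.0))
```

On this toy corpus, raising `cost_fp` from 1 to 20 flips the label for `["cheap", "offer"]` from spam to legitimate. This is the intended effect of a loss-weighted decision: the filter accepts more missed spam in exchange for fewer blocked legitimate messages.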
Keywords: Bayes formula; evaluation function; minimum loss; spam e-mail
Foundation: Supported by the Natural Science Foundation of Hubei Province (2005ABA238)
Author: Xie Jinjing (谢金晶), Zhang Yibin (张艺濒)