Filtering Spam Email with Naive Bayes
The Naive Bayes Classifier (NBC) is rooted in classical mathematical theory, giving it a solid mathematical foundation and stable classification performance. The NBC model also needs very few estimated parameters, is not very sensitive to missing data, and is algorithmically simple. It is called "naive" because the entire formalization makes only the most primitive, simplest assumption: that the features are mutually independent given the class. Naive Bayes remains effective even with little data and can handle multi-class problems.
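Concretely, for a document represented by words w_1, ..., w_n, the classifier scores each class by its prior probability times the per-word likelihoods; the independence assumption is exactly what lets the joint likelihood factor into a product:

\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)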
A detailed explanation of the naive Bayes algorithm: https://boywithacoin.cn/article/fen-lei-su...
The concrete workflow for filtering spam email:
- Collect the data; the dataset is available at https://github.com/Freen247/database/tree/...
- Parse the text files into token vectors
- Inspect the tokens to verify the parsing is correct
- Train / test / use the algorithm
Since the program relies on third-party libraries, we first need to install the dependencies (the code below also imports numpy):
pip install feedparser numpy
0x00 Implementing the Word-List-to-Vector Conversion
Taking an object-oriented approach, we build a Bayes class:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import sys, os
import re
import random
import numpy as np

class Bayes():
    def __init__(self,
                 # directory of the current file
                 absPath: str = os.path.dirname(os.path.abspath(__file__)),
                 ):
        self.absPath = absPath
Create a function that returns a list of all the unique words appearing in the documents:
    # collect all documents into a list of unique words
    def createVocabList(self,
                        dataSet: list,  # the source data: a list of tokenized documents
                        ) -> list:  # the deduplicated vocabulary list
        vocabSet = set([])  # create an empty set; a 'set' is a collection without duplicate words
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # create a union of the two sets
        return list(vocabSet)
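As a quick sanity check (run once the class is complete; the toy documents here are made up purely for illustration):

b = Bayes()
docs = [['my', 'dog', 'has', 'flea', 'problems'],
        ['stop', 'posting', 'stupid', 'garbage'],
        ['my', 'dalmation', 'is', 'so', 'cute']]
vocab = b.createVocabList(docs)
print(len(vocab))  # 13 unique words ('my' is deduplicated); order is arbitrary because sets are unordered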
We also need a function that takes the vocabulary and the words we want to check as input and builds a feature for each word; given a document, it converts that document into a word vector.
    # determine whether each word appears in the document
    def setOfWords2Vec(self,
                       vocabList: list,  # the vocabulary
                       inputSet: list,  # the words to detect
                       ) -> list:  # the resulting word vector
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("the word: %s is not in my vocabulary!" % word)
        # return a document vector of 1/0 flags indicating whether each word appeared in the input document
        return returnVec
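Continuing the toy example above, a document becomes a 0/1 vector aligned with the vocabulary:

vec = b.setOfWords2Vec(vocab, ['my', 'dog', 'is', 'stupid'])
print(sum(vec))  # 4: all four words are present in the toy vocabulary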
0x01 Implementing the Bayes Classifier Training Function
The naive Bayes classifier training function:
    # naive Bayes classifier training function
    def trainNB0(self, trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1
        # initialize counts to 1 and denominators to 2 (Laplace smoothing),
        # so an unseen word cannot zero out an entire product of probabilities
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):  # iterate through all documents
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # take logs so that multiplying many small probabilities becomes a sum and avoids underflow
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive
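spamTest() below also calls self.classifyNB, which this post never shows. Here is a minimal sketch following the standard formulation: because trainNB0 returns log probabilities, multiplying likelihoods turns into summing logs, and the word vector masks in just the words that actually occur:

    # classify a word vector by comparing the two classes' log-posterior scores
    def classifyNB(self, vec2Classify, p0Vec, p1Vec, pClass1):
        p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)  # log P(w|spam) + log P(spam)
        p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)  # log P(w|ham) + log P(ham)
        return 1 if p1 > p0 else 0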
0x02 Implementing the Spam Test Function
Use spamTest() to automate the evaluation of the Bayes spam classifier. It imports the text files in the spam and ham folders and parses them into word lists. There are 50 emails in this example, 10 of which are randomly selected as the test set; the probabilities the classifier needs are computed using only the documents in the training set. This process of randomly selecting part of the data for training while holding out the remainder for testing is known as hold-out cross validation.
    # filtering email: training + testing
    def spamTest(self):
        docList = []; classList = []; fullText = []
        # iterate through all the test files: 25 spam + 25 ham, 50 in total
        for i in range(1, 26):
            wordList = self.textParse(open(self.absPath + '/email/spam/%d.txt' % i, "rb").read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = self.textParse(open(self.absPath + '/email/ham/%d.txt' % i, "rb").read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = self.createVocabList(docList)
        trainingSet = list(range(50))
        testSet = []
        # randomly move 10 document indices from the training set into the test set
        for i in range(10):
            # random.uniform(x, y) returns a random float between x and y
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(self.setOfWords2Vec(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = self.trainNB0(np.array(trainMat), np.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = self.setOfWords2Vec(vocabList, docList[docIndex])
            if self.classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                # print the misclassified document (docList, not vocabList)
                print("misclassified: %s" % docList[docIndex])
        print('error rate:', float(errorCount) / len(testSet))
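spamTest() also relies on self.textParse, another helper the post doesn't show, to turn raw email text into tokens. A minimal sketch, assuming the common approach of splitting on non-word characters, discarding very short strings, and lowercasing:

    # split raw text into lowercase tokens, dropping strings shorter than 3 characters
    def textParse(self, bigString: str) -> list:
        listOfTokens = re.split(r'\W+', bigString)
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]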
Running the final function produces the following output:
if __name__ == "__main__":
    test = Bayes()
    test.spamTest()
misclassified: scifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300
misclassified: tended in the latest release. this includes:
error rate: 0.2
This work is licensed under a CC license; reproduction must credit the author and link back to this article.