Filtering Spam Email with Naive Bayes

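The naive Bayes classifier (Naive Bayes Classifier, NBC) is rooted in classical mathematical theory, which gives it a solid mathematical foundation and stable classification performance. An NBC model needs few estimated parameters, is not very sensitive to missing data, and is algorithmically simple. It is called “naive” because the formulation makes only the most basic and simplest assumption: that the features are conditionally independent given the class. Naive Bayes remains effective when data is scarce and can handle multi-class problems.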

A detailed explanation of the naive Bayes algorithm: https://boywithacoin.cn/article/fen-lei-su...

Spam email filtering: the workflow

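Because the program relies on third-party libraries, install the dependencies first: pip install feedparser numpy (numpy is needed by the training code below).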

0x00 Converting word lists to vectors

Using an object-oriented approach, define a Bayes class:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import os
import random
import re

import numpy as np


class Bayes():
    def __init__(self,
    absPath:"Directory of the current file" = os.path.dirname(os.path.abspath(__file__)),
    ):
        # base directory used to locate the email/ data folders
        self.absPath = absPath

Create a function that returns a list of the unique words that appear across all documents:

    # build a list of the unique words that appear across all documents
    def createVocabList(self,
    dataSet:dict(type="", help = "the source data"),
    )->dict(type=list, help = "Deduplicated list"):
        vocabSet = set()  # an empty set; a set never holds duplicate elements
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union of the two sets
        return list(vocabSet)

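We also need a function that takes the vocabulary list and the words to check as input and builds one feature for each vocabulary word. Given a document, it converts that document into a word vector.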


    # mark whether each vocabulary word appears in the document
    def setOfWords2Vec(self,
    vocabList:dict(type="", help="a glossary"),
    inputSet:dict(type="", help="The words you want to detect"),
    )->dict(type="", help="Word vector"):
        returnVec = [0]*len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("Word %s is not in my vocabulary!" % word)
        # document vector: 1 if the word appears in the input document, else 0
        return returnVec
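
To make the two helpers concrete, here is a minimal standalone sketch of the same idea outside the class. The names create_vocab_list and set_of_words_to_vec and the toy documents are made up for illustration, and the vocabulary is sorted here only so the result is reproducible, which the original code does not do.

```python
def create_vocab_list(data_set):
    """Return the sorted unique words across all tokenized documents."""
    vocab = set()
    for document in data_set:
        vocab |= set(document)
    return sorted(vocab)  # sorted only for reproducibility (assumption)


def set_of_words_to_vec(vocab_list, input_set):
    """Return a 0/1 vector marking which vocabulary words occur in input_set."""
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in index:  # words outside the vocabulary are ignored
            vec[index[word]] = 1
    return vec


# Toy documents, already split into words (hypothetical data)
docs = [["cheap", "pills", "buy", "now"],
        ["meeting", "at", "noon"],
        ["buy", "cheap", "now"]]
vocab = create_vocab_list(docs)
vector = set_of_words_to_vec(vocab, ["buy", "now", "unknown"])
print(vocab)
print(vector)
```

The vector has one slot per vocabulary word, so every document maps to a vector of the same length regardless of how many words it contains.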

0x01 Implementing the Bayes classifier training function

The training function for the naive Bayes classifier:

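    # naive Bayes classification training function
    # (assumes `import numpy as np` at the top of the file)
    def trainNB0(self,trainMatrix,trainCategory):
        numTrainDocs=len(trainMatrix)
        numWords=len(trainMatrix[0])
        # prior probability of class 1 (spam)
        pAbusive=sum(trainCategory)/float(numTrainDocs)
        # start counts at 1 and denominators at 2.0 (Laplace smoothing)
        # so that an unseen word never yields a zero probability
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):  # iterate through all documents
            if trainCategory[i]==1:
                p1Num+=trainMatrix[i]
                p1Denom+=sum(trainMatrix[i])
            else:
                p0Num+=trainMatrix[i]
                p0Denom+=sum(trainMatrix[i])

        # take logs so that later products of small probabilities become sums
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive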
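
trainNB0 returns log-probability vectors, and spamTest later calls a classifyNB method that this post does not show. The following dependency-free sketch shows how training and classification might fit together, assuming the usual naive Bayes decision rule: sum the log-likelihoods of the words that are present, add the log prior, and pick the class with the larger score. The names train_nb and classify_nb and the toy data are hypothetical, and plain lists stand in for the numpy arrays.

```python
import math


def train_nb(train_matrix, train_category):
    """Estimate per-word log-likelihoods for classes 0 and 1, plus P(class 1)."""
    num_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_class1 = sum(train_category) / num_docs
    # Laplace smoothing, as in trainNB0: counts start at 1, denominators at 2.0
    counts = {0: [1.0] * num_words, 1: [1.0] * num_words}
    denoms = {0: 2.0, 1: 2.0}
    for vec, label in zip(train_matrix, train_category):
        for j, value in enumerate(vec):
            counts[label][j] += value
        denoms[label] += sum(vec)
    p0_vec = [math.log(c / denoms[0]) for c in counts[0]]
    p1_vec = [math.log(c / denoms[1]) for c in counts[1]]
    return p0_vec, p1_vec, p_class1


def classify_nb(vec, p0_vec, p1_vec, p_class1):
    """Return 1 if the class-1 log score is higher, else 0.

    Assumes 0 < p_class1 < 1, i.e. both classes occur in the training data.
    """
    score1 = sum(v * p for v, p in zip(vec, p1_vec)) + math.log(p_class1)
    score0 = sum(v * p for v, p in zip(vec, p0_vec)) + math.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0


# Toy data: columns stand for the words ["buy", "cheap", "meeting", "noon"]
train_matrix = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
train_category = [1, 1, 0, 0]  # 1 = spam, 0 = ham
p0_vec, p1_vec, p_spam = train_nb(train_matrix, train_category)
spam_guess = classify_nb([1, 1, 0, 0], p0_vec, p1_vec, p_spam)
ham_guess = classify_nb([0, 0, 1, 1], p0_vec, p1_vec, p_spam)
print(spam_guess, ham_guess)
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on long documents.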

0x02 Implementing the spam test function

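spamTest() automates the evaluation of the Bayes spam classifier. It loads the text files under the spam and ham folders and parses each one into a word list. The code reads 25 files from each folder, 50 emails in total, of which 10 are randomly selected as the test set; the probabilities the classifier needs are computed using only the documents in the training set. Randomly choosing one part of the data as the training set and keeping the rest as the test set is called hold-out cross-validation.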
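
The hold-out split can be sketched on its own. This sketch uses random.sample, which is a different call from the one in spamTest() but likewise draws distinct indices; the function name hold_out_split and the seed parameter are assumptions added for illustration.

```python
import random


def hold_out_split(num_docs, num_test, seed=None):
    """Split range(num_docs) into disjoint training and test index lists."""
    rng = random.Random(seed)  # a seed makes the split repeatable
    test_set = rng.sample(range(num_docs), num_test)  # distinct indices
    held_out = set(test_set)
    training_set = [i for i in range(num_docs) if i not in held_out]
    return training_set, test_set


training_set, test_set = hold_out_split(50, 10, seed=0)
print(len(training_set), len(test_set))
```

A single random split gives a noisy error estimate, so it is common to repeat the split several times and average the error rates.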

spamTest()

    # filter email: train on one part of the data, test on the rest
    def spamTest(self):
        docList=[]; classList=[]; fullText=[]
        # read files 1.txt to 25.txt from each folder (50 emails in total)
        for i in range(1,26):
            with open(self.absPath+'/email/spam/%d.txt' % i, "rb") as f:
                wordList = self.textParse(f.read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)  # 1 = spam
            with open(self.absPath+'/email/ham/%d.txt' % i, "rb") as f:
                wordList = self.textParse(f.read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)  # 0 = ham
        vocabList=self.createVocabList(docList)

        trainingSet = list(range(50))
        testSet=[]

        # move 10 randomly chosen documents from the training set to the test set
        for i in range(10):
            # randrange never returns len(trainingSet), so the index is always valid
            randIndex=random.randrange(len(trainingSet))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat=[]; trainClasses=[]
        for docIndex in trainingSet:
            trainMat.append(self.setOfWords2Vec(vocabList,docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V,p1V,pSpam=self.trainNB0(np.array(trainMat),np.array(trainClasses))
        errorCount=0
        for docIndex in testSet:
            wordVector=self.setOfWords2Vec(vocabList,docList[docIndex])
            if self.classifyNB(np.array(wordVector),p0V,p1V,pSpam)!=classList[docIndex]:
                errorCount+=1
                # print the misclassified document itself, not a vocabulary entry
                print("Misclassified: %s" % " ".join(docList[docIndex]))
        print('Error rate:',float(errorCount)/len(testSet))
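
spamTest() relies on a textParse method that this post does not show. One plausible version, offered here as an assumption and not as the author's code, splits the text on runs of non-word characters, lowercases each token, and drops very short tokens. The name text_parse and the length cutoff of 2 are both assumptions.

```python
import re


def text_parse(big_string):
    """Split raw text into lowercase tokens longer than two characters."""
    tokens = re.split(r"\W+", big_string)  # split on runs of non-word characters
    return [tok.lower() for tok in tokens if len(tok) > 2]


words = text_parse("Hi Bob, BUY cheap pills now!!! Visit http://example.com")
print(words)
```

Dropping short tokens removes fragments such as "Hi" that carry little signal, at the cost of also discarding short but meaningful words.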

Running the function produces output like the following:

if __name__ == "__main__":
    test = Bayes()
    test.spamTest()
Misclassified:
scifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300
Misclassified: tended in the latest release. this includes:

Error rate: 0.2
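This work is licensed under a CC license; reposts must credit the author and link to this article.
This article was first published on my blog, Stray_Camel (^U^)ノ~YO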