Joeyos's Blog

朴素贝叶斯的一般过程
从文本中构建词向量
测试分类函数
切分文本
测试算法(交叉验证)

朴素贝叶斯的一般过程

收集数据：可以使用任何方法
准备数据：需要数值型或者布尔型
分析数据：有大量特征时，绘制特征作用不大，此时使用直方图效果更好
训练算法：计算不同的独立特征的条件概率
测试算法：计算错误率
使用算法

从文本中构建词向量

def loadDataSet():文档数据来源于斑点狗爱好者论坛，类别有侮辱性和非侮辱性词汇，由人工标注，1代表侮辱性，0代表正常言论，这些标注信息用于训练程序，以便自动检测侮辱性留言。
def createVocabList(dataSet):创建一个空集，存储不重复词汇列表
def setOfWords2Vec(vocabList, inputSet):输入一个文档，创建一个词汇表长的向量，输出向量1,0表示该词汇在文档中是否出现。

检查词表，发现词表无重复单词，目前词表未排序，需要的话可以排序。

import bayes
listOPosts,listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)
print(myVocabList)

['worthless', 'take', 'licks', 'has', 'please', 'not', 'so', 'love', 'posting', 'is', 'stupid', 'problems', 'how', 'dalmation', 'maybe', 'garbage', 'I', 'flea', 'stop', 'him', 'mr', 'cute', 'steak', 'dog', 'help', 'ate', 'food', 'my', 'park', 'quit', 'buying', 'to']

bayes.setOfWords2Vec(myVocabList,listOPosts[0])#索引为0，cute

[0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0]

测试分类函数

bayes.testingNB()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1

切分文本

mySent='This book is the best book on Python or M.L. I have ever laid'
mySent.split()

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M.L.',
 'I',
 'have',
 'ever',
 'laid']

可以看到，切分的还不错，但标点符号也切进去了。可以使用正则化，其中分隔符是除单词、数字外的任意字符串。

import re
regEx = re.compile('\\W*')
listOfTokens = regEx.split(mySent)
listOfTokens

F:\Anaconda3.5.2\lib\site-packages\ipykernel_launcher.py:3: FutureWarning: split() requires a non-empty pattern match.
  This is separate from the ipykernel package so we can avoid doing imports until





['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid']

去掉里面的空字符串，可以计算字符串的长度，只返回长度大于0的字符串：

[tok for tok in listOfTokens if len(tok) > 0]

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid']

可以将字符串全部转换为小写(.lower)或者大写(.upper)：

[tok.lower() for tok in listOfTokens if len(tok) > 0]

['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid']

下载email.zip

emailText = open('email/ham/6.txt').read()
listOfTokens = regEx.split(emailText)
#listOfTokens
#很多单词

F:\Anaconda3.5.2\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: split() requires a non-empty pattern match.

测试算法(交叉验证)

函数textParse()接受一个大字符串并将其解析为字符串列表，该函数去掉少于两个字符的字符串，并将所有字符串转换为小写。

函数spamTest()对贝叶斯垃圾邮件分类器进行自动化处理，导入spam与ham下的文本文件，并将它们解析为词列表。

接下来构建一个测试集与一个训练集，两个集合中的邮件都是随机选出的。此处有50份邮件，随机选10份作为测试集。

bayes.spamTest()

文件中有非法字符，不知道是哪个文件…….

F:\Anaconda3.5.2\lib\re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)



---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-68-823d79549753> in <module>()
----> 1 bayes.spamTest()


C:\Users\Administrator\Desktop\bayes.py in spamTest()
     95     for i in range(1,26):
     96         wordList = textParse(open('email/spam/%d.txt' % i).read())
---> 97         docList.append(wordList)
     98         fullText.extend(wordList)
     99         classList.append(1)


UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence

Python实现朴素贝叶斯分类

朴素贝叶斯的一般过程

从文本中构建词向量

测试分类函数

切分文本

测试算法(交叉验证)

Similar Posts

Comments