What this article covers

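UC Irvine maintains a large repository of datasets. This article uses the Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) for its experiment. The task is to implement a naive Bayes classifier for classification. In each trial, 70% of the instances are randomly selected for training and the rest are used for testing. The trial is repeated 10 times and the average accuracy is computed. Because the features are continuous variables, you may want to use a Gaussian model in the probability calculations.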
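
As a rough sketch of what using a Gaussian model in the probability calculations can mean, each feature is assumed to follow a normal distribution within each class. The symbols below (x_j for the j-th feature, y for the class, and μ and σ² for the per-class mean and variance) are notation chosen here for illustration and do not come from the original text.

```latex
\begin{aligned}
P(y \mid x_1, \dots, x_n) &\propto P(y) \prod_{j=1}^{n} p(x_j \mid y), \\
p(x_j \mid y) &= \frac{1}{\sqrt{2\pi\sigma_{y,j}^{2}}}
  \exp\!\left(-\frac{(x_j - \mu_{y,j})^{2}}{2\sigma_{y,j}^{2}}\right), \\
\hat{y} &= \arg\max_{y} \Big[ \log P(y) + \sum_{j=1}^{n} \log p(x_j \mid y) \Big].
\end{aligned}
```

The first line is the naive (conditional independence) assumption, the second is the Gaussian density used for each continuous feature, and the third is the resulting decision rule in log form.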

0x01 Experiment workflow

1.1 Load the Iris dataset

  • Import the dataset with load_iris from the sklearn.datasets module
# load the iris dataset 
from sklearn.datasets import load_iris
iris = load_iris()

1.2 Store the feature matrix and response vector

# store the feature matrix (X) and response vector (y) 
X = iris.data
y = iris.target
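
To make the two objects concrete, the following minimal sketch inspects their shapes and the names stored alongside them. It only uses attributes of the object returned by load_iris.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 samples, 4 features
print(y.shape)             # (150,): one class label per sample
print(iris.feature_names)  # names of the four measurements
print(iris.target_names)   # names of the three classes
```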

1.3 Split X and y into training and test sets

# splitting X and y into training and testing sets 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
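Usage of sklearn.model_selection.train_test_split
  • Purpose: split arrays or matrices into random training and test subsets

  • Syntax: sklearn.model_selection.train_test_split(*arrays, **options)

  • Commonly used arguments of train_test_split:

    • arrays: the objects to split; lists, NumPy arrays, or matrices that all have the same length.
    • test_size: can be given in two ways. (1) A float between 0.0 and 1.0, interpreted as the proportion of the dataset to put in the test set. (2) An integer, interpreted as the absolute number of test samples, which cannot exceed the number of samples in the dataset. If test_size is not given, it is set to the complement of train_size. If train_size is not given either, test_size defaults to 0.25.
    • train_size: analogous to test_size, but for the training set.
    • random_state: the seed that controls how the data is shuffled before it is split. Passing an integer makes the split reproducible; if it is left unset, the global numpy.random state is used, so each call can give a different split.
    • shuffle and stratify are used less often. shuffle (default True) controls whether the data is shuffled before splitting; stratify splits the data in a stratified fashion, preserving the class proportions of the array passed to it.
  • Return value: a list containing the training and test split of each input array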
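
The effect of random_state and stratify described above can be checked with a small sketch on the Iris data. The variable names (same_split, y_test_s, and so on) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# The same integer random_state gives an identical split on every call.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=1)
same_split = np.array_equal(X_test_a, X_test_b)
print("identical test sets:", same_split)

# stratify=y keeps the class proportions of y in both subsets.
_, _, _, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print("test samples per class:", np.bincount(y_test_s))
```

Because Iris has 50 samples per class, a stratified 30% test set contains the same number of samples from each class, whereas an unstratified split only does so by chance.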

train_test_split usage examples:

The dataset below has 4 columns and 12 rows.

  • Build the dataset with pandas
import pandas as pd
from sklearn.model_selection import train_test_split
namelist = pd.DataFrame({
"name" : ["Suzuki", "Tanaka", "Yamada", "Watanabe", "Yamamoto",
"Okada", "Ueda", "Inoue", "Hayashi", "Sato",
"Hirayama", "Shimada"],
"age": [30, 40, 55, 29, 41, 28, 42, 24, 33, 39, 49, 53],
"department": ["HR", "Legal", "IT", "HR", "HR", "IT",
"Legal", "Legal", "IT", "HR", "Legal", "Legal"],
"attendance": [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
})
namelist

        name  age  department  attendance
0     Suzuki   30          HR           1
1     Tanaka   40       Legal           1
2     Yamada   55          IT           1
3   Watanabe   29          HR           0
4   Yamamoto   41          HR           1
5      Okada   28          IT           1
6       Ueda   42       Legal           1
7      Inoue   24       Legal           0
8    Hayashi   33          IT           0
9       Sato   39          HR           1
10  Hirayama   49       Legal           1
11   Shimada   53       Legal           1
  • Set the test proportion to 0.3 (test_size=0.3) to separate the test set from the training set. With 12 rows, this gives 4 test rows and 8 training rows.
namelist_train, namelist_test = train_test_split(namelist, test_size=0.3)
namelist_train

        name  age  department  attendance
3   Watanabe   29          HR           0
5      Okada   28          IT           1
1     Tanaka   40       Legal           1
4   Yamamoto   41          HR           1
10  Hirayama   49       Legal           1
9       Sato   39          HR           1
11   Shimada   53       Legal           1
2     Yamada   55          IT           1
namelist_test

        name  age  department  attendance
7      Inoue   24       Legal           0
6       Ueda   42       Legal           1
8    Hayashi   33          IT           0
0     Suzuki   30          HR           1
  • Next, give the test set as an absolute number of rows: test_size=5
namelist_train, namelist_test = train_test_split(namelist, test_size=5)
namelist_test

        name  age  department  attendance
0     Suzuki   30          HR           1
1     Tanaka   40       Legal           1
6       Ueda   42       Legal           1
10  Hirayama   49       Legal           1
11   Shimada   53       Legal           1
  • Next, set the training proportion to 0.5 (train_size=0.5)
namelist_train, namelist_test = train_test_split(namelist, test_size=None, train_size=0.5)
namelist_train

        name  age  department  attendance
7      Inoue   24       Legal           0
6       Ueda   42       Legal           1
5      Okada   28          IT           1
3   Watanabe   29          HR           0
0     Suzuki   30          HR           1
9       Sato   39          HR           1
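  • The shuffle option: with shuffle=False the rows keep their original order, so the default test_size of 0.25 takes the last 3 of the 12 rows as the test set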
namelist_train, namelist_test = train_test_split(namelist, shuffle=False)
namelist_test

        name  age  department  attendance
9       Sato   39          HR           1
10  Hirayama   49       Legal           1
11   Shimada   53       Legal           1

1.4 Train a Gaussian model on the training set

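# training the model on training set

# import the Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# instantiate the classifier
gnb = GaussianNB()
# fit the model to the training data (fit is the training step)
gnb.fit(X_train, y_train)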
GaussianNB(priors=None, var_smoothing=1e-09)
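
To show roughly what fit estimates, here is a minimal NumPy sketch of a Gaussian naive Bayes classifier written by hand and compared with GaussianNB. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the names priors, means, variances, and log_joint are hypothetical, and the small variance smoothing that GaussianNB applies (var_smoothing) is left out.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

classes = np.unique(y_train)
# Per-class prior, per-class feature means, and per-class feature variances.
priors = np.array([np.mean(y_train == c) for c in classes])
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
variances = np.array([X_train[y_train == c].var(axis=0) for c in classes])


def log_joint(X):
    """Return log P(y) + sum_j log N(x_j; mean, var) per sample and class."""
    # (n_samples, 1, n_features) against (n_classes, n_features) broadcasts
    # to (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means
    log_lik = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    return np.log(priors) + log_lik.sum(axis=2)


# Predict the class with the largest log joint probability.
manual_pred = classes[np.argmax(log_joint(X_test), axis=1)]
sklearn_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

print("manual accuracy:", np.mean(manual_pred == y_test))
print("agreement with GaussianNB:", np.mean(manual_pred == sklearn_pred))
```

Working in log space avoids numerical underflow when many small densities are multiplied together.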

1.5 Predict on the test set

# making predictions on the testing set 
y_pred = gnb.predict(X_test)
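
Besides hard labels, the fitted model can also report class probabilities through predict_proba. The sketch below looks at the first five test samples; that slice is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
gnb = GaussianNB().fit(X_train, y_train)

proba = gnb.predict_proba(X_test[:5])  # shape (5, 3): one column per class
print(np.round(proba, 3))
print(proba.argmax(axis=1))            # index of the most probable class
print(gnb.predict(X_test[:5]))         # the predicted labels for the same rows
```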

1.6 Compare predicted and actual values

# comparing actual response values (y_test) with predicted response values (y_pred) 
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
Gaussian Naive Bayes model accuracy(in %): 93.33333333333333
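
A single accuracy figure hides which classes are confused with each other. As a possible complement, the sketch below prints a confusion matrix and a per-class report with sklearn.metrics; this goes beyond what the original experiment reports.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score for each class.
print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total number of test samples equals the accuracy.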

1.7 Complete code

# load the iris dataset 
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
Gaussian Naive Bayes model accuracy(in %): 93.33333333333333
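
The code above runs a single 70/30 split, while the experiment described at the start calls for 10 trials and an average accuracy. One way to do that is sketched below; using the seeds 0 to 9 as random_state is an arbitrary choice made here so that the run is reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(10):  # one random 70/30 split per trial
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

mean_accuracy = np.mean(accuracies)
print("accuracy per trial:", np.round(accuracies, 3))
print("mean accuracy over 10 trials (in %):", mean_accuracy * 100)
```

Averaging over several random splits gives a more stable estimate than a single split, whose result depends on which samples happen to land in the test set.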