
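10 Spatial Transformation of Data: Kernel Transformations

10.1 Background

10.1.1 Hyperplanes

A hyperplane is a subset whose number of degrees of freedom is one less than the dimension of the space that contains it; in an n-dimensional space it has dimension n-1.

In the n-dimensional space $F^n$, a hyperplane is the subset defined by $a _ 1 x _ 1 + ... + a _ n x _n =b$, where $a _ 1,...,a _ n \in F$ are constants that are not all zero.

It can also be written as $\mathbf{w\cdot x}+b=0$, where $\mathbf{w}$ and $\mathbf{x}$ are n-dimensional column vectors, $\mathbf{w}=[w _ 1,w _ 2,...,w _ n]^T,\mathbf{x}=[x _ 1,x _ 2,...,x _ n]^T$.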

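$\mathbf{w}$ can be viewed either as the normal vector of the hyperplane or as a parameter vector; it determines the orientation of the hyperplane.
$\mathbf{x}$ is a point on the hyperplane.
$b$ is a real number that fixes the offset of the hyperplane; the distance from the origin to the hyperplane is $|b|/\|\mathbf{w}\|$.
$\mathbf{w\cdot x}$ denotes the inner product of the vectors $\mathbf{w}$ and $\mathbf{x}$, and the result is a scalar.
The inner product of two vectors can be written as a matrix product, so $\mathbf{w\cdot x}=\mathbf{w}^T\mathbf{x}$, where $\mathbf{w}^T$ is the transpose of $\mathbf{w}$.
The hyperplane divides the space into 3 parts: the hyperplane itself, $\mathbf{w\cdot x}+b=0$; the region above it, $\mathbf{w\cdot x}+b>0$; and the region below it, $\mathbf{w\cdot x}+b<0$.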
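
The three regions can be checked numerically. The following is a minimal sketch, assuming a hypothetical line $x _ 1 + x _ 2 - 2 = 0$ in the plane (the values of `w` and `b` and the helper `side` are illustrative and do not come from the original notes):

```python
import numpy as np

# Hypothetical hyperplane in 2D: x1 + x2 - 2 = 0
w = np.array([1.0, 1.0])
b = -2.0


def side(x):
    """Return +1, 0 or -1 depending on the sign of w.x + b."""
    return int(np.sign(w @ x + b))


above = side(np.array([3.0, 3.0]))   # w.x + b = 4  > 0
on = side(np.array([1.0, 1.0]))      # w.x + b = 0
below = side(np.array([0.0, 0.0]))   # w.x + b = -2 < 0
print(above, on, below)

# Distance from the origin to the hyperplane: |b| / ||w||
dist_origin = abs(b) / np.linalg.norm(w)
print(dist_origin)
```

The distance is $|b|/\|\mathbf{w}\|$ rather than $b$ itself, so scaling $\mathbf{w}$ and $b$ by the same factor leaves the hyperplane unchanged.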

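10.1.2 Linear Classification

If a separating hyperplane can split the samples of two classes completely, the samples are said to be "linearly separable". In two-dimensional space a hyperplane is a straight line (one-dimensional), so an ellipse is not a separating hyperplane, and samples that can only be separated by an ellipse are not linearly separable.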

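10.1.3 Raising the Dimension

Raising the dimension means mapping the samples from the original low-dimensional input space into a high-dimensional feature space, so that the dimensionality of the data increases.

A problem that is not linearly separable can be handled by raising the dimension: find a suitable mapping that transforms the low-dimensional vector $\mathbf{x}$ into a high-dimensional vector $\mathbf{x}'$, then, in the high-dimensional space, take the inner product of $\mathbf{x'}$ with $\mathbf{w}$ and add b. This yields a separating hyperplane and a linear model that can be used for classification or regression, so a problem that is not linearly separable in the low-dimensional input space becomes linearly separable in the high-dimensional feature space.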
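
As an illustration of this idea, here is a small sketch in which two concentric rings of points, which no straight line can separate in 2D, become separable by a plane after a third coordinate is added. The mapping `lift` and the values of `w` and `b` are assumptions chosen for this example only:

```python
import numpy as np


def lift(X):
    """Hypothetical mapping: append the squared radius x1^2 + x2^2 as a third coordinate."""
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])


angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
inner = 0.5 * ring   # class -1, radius 0.5
outer = 2.0 * ring   # class +1, radius 2
X = np.vstack([inner, outer])
y = np.array([-1] * 8 + [1] * 8)

# In the lifted space the plane z = 1 separates the two rings:
# w = [0, 0, 1], b = -1, so w.x' + b = (x1^2 + x2^2) - 1
w = np.array([0.0, 0.0, 1.0])
b = -1.0
pred = np.sign(lift(X) @ w + b).astype(int)
print((pred == y).all())
```

The linear model in the lifted space corresponds to the circle $x _ 1^2 + x _ 2^2 = 1$ in the original plane, which is a nonlinear boundary there.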

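10.2 Introducing Kernel Functions

10.3.1 Definition of a Kernel Function

Let $\chi$ be the input space (a Euclidean space or a discrete set) and H the feature space (a Hilbert space, which can loosely be thought of as a higher-dimensional space). If there exists a mapping from $\chi$ to H, $f(\mathbf{x}):\chi \to H$, such that for all $\mathbf{x},\mathbf{y}\in \chi$ the function K satisfies $K(\mathbf{x},\mathbf{y})=f(\mathbf{x})\cdot f(\mathbf{y})$, then $K(\mathbf{x}, \mathbf{y})$ is called a kernel function.

Any symmetric positive semidefinite function can serve as a kernel function, that is, any symmetric function whose Gram matrix $[K(\mathbf{x} _ i,\mathbf{x} _ j)]$ is positive semidefinite ($\ge 0$) for every finite set of points.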
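
The positive semidefinite condition can be checked numerically on a sample of points. The sketch below uses the polynomial kernel described later in section 10.4.2 with $\gamma=1$, $c=1$, $d=2$; the helper name `poly_kernel`, the random points, and the tolerance are assumptions made for illustration:

```python
import numpy as np


def poly_kernel(x, y, gamma=1.0, c=1.0, d=2):
    """Polynomial kernel [gamma * (x . y) + c]^d."""
    return (gamma * np.dot(x, y) + c) ** d


rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))   # 6 arbitrary points in 3 dimensions

# Gram matrix: entry (i, j) is K(points[i], points[j])
gram = np.array([[poly_kernel(p, q) for q in points] for p in points])

is_symmetric = np.allclose(gram, gram.T)
min_eig = np.linalg.eigvalsh(gram).min()   # eigenvalues of a symmetric matrix
print(is_symmetric, min_eig >= -1e-9)      # small tolerance for rounding error
```

A check like this on one sample can only reveal a violation; it does not prove that a function is a valid kernel for every possible set of points.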

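10.3.3 Properties of Kernel Functions

The computation is carried out in the original space, which avoids the "curse of dimensionality", greatly reduces the amount of computation, and handles high-dimensional inputs efficiently.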

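10.4 Common Kernel Functions

Name	Description
Linear kernel	Applies no transformation to the data and has no parameters to set; fast; used for linearly separable data; suited to datasets whose dimensionality is large and comparable to the number of samples; the dimension can also be raised by hand before applying the linear kernel
Polynomial kernel	Closer to linear; well suited to image processing; its parameters can be tuned to obtain good results
Gaussian radial basis function (RBF) kernel	More strongly nonlinear; widely applicable; the default kernel of sklearn's SVC; suited to datasets with fairly low dimensionality and a moderate number of samples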

10.4.1 Linear Kernel

$K(\mathbf{x},\mathbf{y})=\mathbf{x}\cdot \mathbf{y}$

The linear kernel is the simplest kernel function; its mapping is the identity, $f(\mathbf{z})=\mathbf{z}$.

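10.4.2 Polynomial Kernel

$K(\mathbf{x},\mathbf{y})=\left [ \gamma (\mathbf{x}\cdot \mathbf{y})+c \right ]^d$

$\gamma > 0$ scales the inner product $(\mathbf{x}\cdot \mathbf{y})$; it is often set to 1 / (number of features), which is what gamma="auto" means in sklearn.
c is the constant term; when c>0 the kernel is called a non-homogeneous polynomial kernel.
d is the degree of the polynomial, usually d=2; if d is too high, the model becomes too complex and tends to overfit.
The feature space that corresponds to the polynomial kernel has dimension $C ^d _ {n+d}$, where n is the dimension of $\mathbf{x}$.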

A commonly used polynomial kernel

$K(\mathbf{x},\mathbf{y})=\left [ \gamma (\mathbf{x}\cdot \mathbf{y})+1 \right ]^2$

With $\gamma=1$:

$K(\mathbf{x},\mathbf{y})=\left [ (\mathbf{x}\cdot \mathbf{y})+1\right ]^2=(\sum^n _ {i=1} x _ i y _ i + 1)^2$

$={\color{Red}{\sum^n _ {i=1} x^2 _ i y^2 _ i}} + {\color{Blue}{\sum^n _ {i=2}\sum^{i-1} _ {j=1}(\sqrt 2 x _ i x _ j)(\sqrt 2 y _ i y _ j)}}+ {\color{Green}{\sum^n _ {i=1}(\sqrt 2 x _ i)(\sqrt 2 y _ i)}} + {\color{Purple}1}$

So $f(\mathbf{z})=\left[{\color{Red}{z^2 _ n,...,z^2 _ 1}}, {\color{Blue}{\sqrt 2 z _ n z _ {n-1},...,\sqrt 2 z _ 2 z _ 1}}, {\color{Green}{\sqrt 2 z _ n,...,\sqrt 2 z _ 1}}, {\color{Purple}{1}}\right]$

Using this kernel with the vectors $\mathbf{x}=[1, 2, 3, 4], \mathbf{y}=[5, 6, 7, 8]$, the original input space has dimension 4 and the mapped feature space has dimension $C ^2 _ {4+2}=15$. The code below verifies that $K(\mathbf{x},\mathbf{y})=f(\mathbf{x})\cdot f(\mathbf{y})$; the order of the coordinates of $f$ does not affect the inner product.

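import numpy as np


def f(Z):
    """
    Explicit feature map of the kernel [(x . y) + 1]^2 for a 1 x n row vector Z.
    The double loop over index pairs takes O(n^2) time.
    """
    Z1 = Z ** 2  # squared terms z_i^2
    Z_shape = np.shape(Z)[1] - 1
    Z0 = []
    for i in range(Z_shape, 0, -1):
        for j in range(i - 1, -1, -1):
            xy = Z[0, i] * Z[0, j] * 2 ** 0.5  # cross terms sqrt(2) * z_i * z_j
            Z0.append(xy)
    Z2 = np.array(Z0).reshape(1, -1)
    Z3 = Z * 2 ** 0.5  # linear terms sqrt(2) * z_i
    return np.hstack((Z1, Z2, Z3, [[1]]))  # the last entry is the constant term 1


X = np.array([[1, 2, 3, 4]])  # 4-dimensional row vector
Y = np.array([[5, 6, 7, 8]])
# Compute with the polynomial kernel
XY_poly = (X.dot(Y.T) + 1) ** 2
print("Result computed with the polynomial kernel:", XY_poly)
# Compute with the explicit mapping
X1 = f(X)
Y1 = f(Y)
print("Result computed with the mapping:", X1.dot(Y1.T))
print("Mapped value of X:\n", X1)
print("Mapped value of Y:\n", Y1)
print("Dimension of the original input space:", np.shape(X)[1])
print("Dimension of the mapped feature space:", np.shape(X1)[1])

Result computed with the polynomial kernel: [[5041]]
Result computed with the mapping: [[5041.]]
Mapped value of X:
 [[ 1.          4.          9.         16.         16.97056275 11.3137085
   5.65685425  8.48528137  4.24264069  2.82842712  1.41421356  2.82842712
   4.24264069  5.65685425  1.        ]]
Mapped value of Y:
 [[25.         36.         49.         64.         79.19595949 67.88225099
  56.56854249 59.39696962 49.49747468 42.42640687  7.07106781  8.48528137
   9.89949494 11.3137085   1.        ]]
Dimension of the original input space: 4
Dimension of the mapped feature space: 15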

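10.4.3 Gaussian Radial Basis Function (RBF) Kernel

$K(\mathbf{x},\mathbf{y})=e^{(-\frac{\left \|\mathbf{x}-\mathbf{y}\right \| ^2}{2\sigma^2})}$

The feature map that corresponds to this kernel sends the inputs into an infinite-dimensional space.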
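
Although the feature space is infinite-dimensional, the kernel value itself is cheap to evaluate in the input space. Below is a minimal sketch of the formula above; the helper name `rbf_kernel_value` and the chosen values of $\sigma$ are illustrative assumptions:

```python
import numpy as np


def rbf_kernel_value(x, y, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))


x = [1, 2, 3, 4]
y = [5, 6, 7, 8]                                 # ||x - y||^2 = 4 * 16 = 64

k_same = rbf_kernel_value(x, x)                  # identical points give 1
k_narrow = rbf_kernel_value(x, y, sigma=1.0)     # exp(-32), close to 0
k_wide = rbf_kernel_value(x, y, sigma=8.0)       # exp(-0.5)
print(k_same, k_narrow, k_wide)
```

sklearn parameterizes the same kernel as $e^{-\gamma\|\mathbf{x}-\mathbf{y}\|^2}$, so its `gamma` plays the role of $1/(2\sigma^2)$: a small $\sigma$ corresponds to a large `gamma`.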

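10.6 How SVM Works

SVM = Support Vector Machine

SVC = Support Vector Classification, a support vector machine used for classification

SVR = Support Vector Regression, a support vector machine used for regression analysis

Reference: "python机器学习 | SVM算法介绍及实现" (Python machine learning: an introduction to the SVM algorithm and its implementation)

10.6.7 Implementing a Linearly Separable SVM

Given a training set whose positive examples are $x _ 1 = (4, 3),x _ 2 = (3, 3)$ and whose negative example is $x _ 3 = (1, 1)$, use SVC from sklearn to find the support vectors, the number of support vectors, and the model parameters, and predict the classes of the points $(4, 5)$, $(0, 0)$ and $(1, 3)$.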

import numpy as np
from sklearn.svm import SVC  # import the SVC model
import matplotlib.pyplot as plt
import matplotlib as mpl

# Load the data
train_x = np.array([[4, 3], [3, 3], [1, 1]])
train_y = np.array([1, 1, -1])  # class label of each sample
print("Training set (last column is the label):\n", np.hstack((train_x, train_y.reshape(3, 1))))

# Create the SVC and train it
model = SVC(kernel="linear")  # instantiate with a linear kernel
model.fit(train_x, train_y)  # fit the model on the training data

# Predict
test_x = np.array([[4, 5], [0, 0], [1, 3]])
test_y = model.predict(test_x)
print("Predicted classes for [4, 5], [0, 0], [1, 3]:", test_y)

# Fitted attributes
w = model.coef_[0]  # normal vector w
a = -w[0] / w[1]  # slope of the separating line
b = model.intercept_
print("Support vectors:\n", model.support_vectors_)
print("Indices of the support vectors:", model.support_)
print("Number of support vectors per class:", model.n_support_)
# decision_function returns w.x + b; divide by ||w|| to get the geometric distance
print("Decision function values w.x + b for the training set:", model.decision_function(train_x))
print("Parameter (normal vector) w =", w)
print("Slope of the separating line a =", a)
print("Intercept of the separating hyperplane b:", b)  # constant term of the hyperplane
print("Coefficients:", model.coef_)  # feature weights; coef_ is only available with the linear kernel
print("Hyperplane equation: {}x + {}y {} = 0".format(w[0], w[1], b[0]))

# Plot
mpl.rcParams["font.sans-serif"] = ["Microsoft YaHei"]
mpl.rcParams['axes.unicode_minus'] = False
plt.figure()
plt.axis("equal")
for i in range(0, len(train_x)):
    plt.scatter(train_x[i][0], train_x[i][1], color="red", marker=["x", "o"][int(train_y[i] * 0.5 + 0.5)])
for i in range(0, len(test_x)):
    plt.scatter(test_x[i][0], test_x[i][1], color="blue", marker=["x", "o"][int(test_y[i] * 0.5 + 0.5)])
plt.plot(np.linspace(0, 4, 2), a * np.linspace(0, 4, 2) - b / w[1])
# Empty scatter calls serve as legend handles without drawing extra points at the origin
plt.legend(handles=[plt.scatter([], [], color="red", marker="o"), plt.scatter([], [], color="red", marker="x"),
                    plt.scatter([], [], color="blue", marker="o"), plt.scatter([], [], color="blue", marker="x")],
           labels=['training positive', 'training negative', 'test positive', 'test negative'], loc='best')
plt.show()

Training set (last column is the label):
 [[ 4  3  1]
 [ 3  3  1]
 [ 1  1 -1]]
Predicted classes for [4, 5], [0, 0], [1, 3]: [ 1 -1  1]
Support vectors:
 [[1. 1.]
 [3. 3.]]
Indices of the support vectors: [2 1]
Number of support vectors per class: [1 1]
Decision function values w.x + b for the training set: [ 1.5  1.  -1. ]
Parameter (normal vector) w = [0.5 0.5]
Slope of the separating line a = -1.0
Intercept of the separating hyperplane b: [-2.]
Coefficients: [[0.5 0.5]]
Hyperplane equation: 0.5x + 0.5y -2.0 = 0
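
The reported parameters can be checked by hand. The sketch below plugs the values w = [0.5, 0.5] and b = -2 from the output above into $\mathbf{w\cdot x}+b$ and converts the results into geometric distances; the variable names are chosen for this example only:

```python
import numpy as np

w = np.array([0.5, 0.5])   # normal vector reported in the output above
b = -2.0                   # intercept reported in the output above
train_x = np.array([[4, 3], [3, 3], [1, 1]])

scores = train_x @ w + b                         # w.x + b for each training point
distances = np.abs(scores) / np.linalg.norm(w)   # geometric distance to the hyperplane
margin_width = 2 / np.linalg.norm(w)             # distance between the two margin lines

print(scores)
print(distances)
print(margin_width)
```

The two support vectors (3, 3) and (1, 1) have scores +1 and -1, so they lie exactly on the margin lines, while (4, 3) lies farther away and does not influence the solution.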

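10.7 Nonlinear SVM and the Introduction of Kernel Functions

Reference: "核函数与非线性支持向量机(SVM)" (Kernel functions and nonlinear support vector machines)

10.7.2 Implementing a Nonlinear SVM

(1) Import the required libraries

# Import the required libraries:
import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles  # generates a dataset of two concentric circles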
import matplotlib as mpl

mpl.rcParams['font.sans-serif'] = ['SimHei']

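(2) Use the function plot_decision_boundary() to draw the scatter plot and the support vectors

def plot_decision_boundary (model, X, y, h=0.03, draw_SV=True, title='decision_boundary'):
    """
    Plot a classification dataset together with the decision regions of a model
    :param model: fitted classifier
    :param X: feature matrix with two columns
    :param y: class labels (integers starting at 0)
    :param h: step size of the grid
    :param draw_SV: whether to mark the support vectors
    :param title: title of the plot
    """
    X_min, X_max = X[:,0].min() - 1, X[:,0].max() + 1
    y_min, y_max = X[:,1].min() - 1, X[:, 1].max() + 1
    # Drawing the decision boundary needs a grid; np.meshgrid() builds the coordinate matrices.
    # Syntax: X, Y = numpy.meshgrid(x, y)
    # The inputs x, y are 1-D arrays with the horizontal and vertical grid coordinates;
    # the outputs X, Y are the coordinate matrices.
    xx, yy = np.meshgrid(np.arange(X_min, X_max, h),np.arange(y_min, y_max, h))
    # Predict the class of every point on the grid
    label_predict = model.predict(np.stack((xx.flat, yy.flat), axis=1))
    # Reshape the result so that it matches the grid
    label_predict = label_predict.reshape(xx.shape)
    plt.title(title)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())  # hide the axis ticks
    plt.yticks(())
    # contour() and contourf() both draw contour plots: contour() draws the contour lines,
    # contourf() fills the regions between them.
    plt.contourf(xx, yy, label_predict, alpha=0.5)  # fill each predicted class region with its own color
    markers = ['x', '^', 'o']
    colors = ['b', 'r', 'c'] # blue, red, cyan
    # For a 1-D array or list, np.unique removes the duplicates
    # and returns the distinct elements as an array sorted in ascending order
    classes = np.unique(y)
    # Draw a scatter plot for each class
    for label in classes:
        plt.scatter(X[y == label][:, 0], X[y == label][:, 1],
                    c=colors[label], s=60, marker=markers[label])
    # Mark the support vectors, using a different color for each of the two classes
    if draw_SV:
        SV = model.support_vectors_  # get the support vectors
        n = model.n_support_[0]  # number of support vectors of the first class
        plt.scatter(SV[:n, 0],SV[:n, 1], s=15, c='black', marker='o')
        plt.scatter(SV[n:, 0],SV[n:, 1], s=15, c='g', marker='o')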

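(3) Generate a simulated classification dataset and plot it

"""
make_circles:
    n_samples : int, optional (default=100)
        Total number of points generated. If odd, the inner circle has one more point than the outer circle.
    shuffle : bool, optional (default=True)
        Whether to shuffle the samples.
    noise : float or None (default=None)
        Standard deviation of the Gaussian noise added to the data.
    random_state : int, RandomState instance or None (default=None)
        Controls the random number generation for shuffling and noise. Pass an int for reproducible output across multiple function calls.
    factor : 0 < float < 1 (default=0.8)
        Scale factor between the inner and the outer circle.
"""
X, y = make_circles(200,factor=0.1,noise=0.1) # generate the sample points
plt.scatter(X[y == 0, 0], X[y == 0, 1], c='b', s=20, marker = 'x')  # first class
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='r', s=20, marker = '^')  # second class
plt.xticks(())
plt.yticks(())
plt.title('Dataset')
plt.show()  # display the dataset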

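(4) Build one SVM with a linear kernel and one with a degree-3 polynomial kernel, and plot the results

plt.figure(figsize=(12, 10), dpi=200)
# Classify with a linear kernel
model_linear = SVC(C=1.0, kernel='linear')  # instantiate with a linear kernel
model_linear.fit(X, y)  # fit the model on the training data

# Plot the decision boundary of the linear kernel
plt.subplot(2, 2, 1)
plot_decision_boundary(model_linear, X, y, title='Linear kernel')  # call the plotting function
print("Number of support vectors with the linear kernel:", model_linear.n_support_)

# Classify with a polynomial kernel
model_poly = SVC(C=1.0, kernel='poly', degree=3, gamma="auto") # instantiate with a polynomial kernel
model_poly.fit(X, y)  # fit the model on the training data
# Plot the decision boundary of the polynomial kernel
plt.subplot(2, 2, 2)
plot_decision_boundary(model_poly, X, y, title='Polynomial kernel')  # call the plotting function
print("Number of support vectors with the polynomial kernel:", model_poly.n_support_)
plt.show()

Number of support vectors with the linear kernel: [100 100]
Number of support vectors with the polynomial kernel: [100 100]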

(5) Call SVC() to build four SVMs with the Gaussian RBF kernel, with gamma set to 10, 1, 0.1 and 0.01 respectively, and plot the results

plt.figure(figsize=(12, 10), dpi=200)
# enumerate() wraps an iterable (such as a list, tuple or string) into a sequence of
# (index, item) pairs; it is typically used in a for loop.
for j, gamma in enumerate((10, 1, 0.1, 0.01)):
    plt.subplot(2, 2, j+1)
    model_rbf = SVC(C=1.0, kernel='rbf', gamma=gamma)  # Gaussian RBF kernel
    model_rbf.fit(X, y)
    # Call the plotting function
    plot_decision_boundary(model_rbf, X, y, title='RBF kernel, gamma=' + str(gamma))
    print("RBF kernel, gamma =", str(gamma), "number of support vectors:", model_rbf.n_support_)
plt.show()

RBF kernel, gamma = 10 number of support vectors: [30  7]
RBF kernel, gamma = 1 number of support vectors: [9 8]
RBF kernel, gamma = 0.1 number of support vectors: [96 96]
RBF kernel, gamma = 0.01 number of support vectors: [100 100]
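
One way to read these counts is to look at how gamma changes the similarity between different points. The sketch below is an illustration on a few arbitrary random points, not the circle dataset used above; the helper `rbf_gram` is hypothetical and follows sklearn's parameterization $e^{-\gamma\|\mathbf{x}-\mathbf{y}\|^2}$:

```python
import numpy as np


def rbf_gram(X, gamma):
    """Gram matrix of the RBF kernel exp(-gamma * ||x - y||^2) for the rows of X."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))            # 5 arbitrary points in the plane
off_diagonal = ~np.eye(5, dtype=bool)    # mask that skips K(x, x) = 1

means = []
for gamma in (10, 1, 0.1, 0.01):
    K = rbf_gram(pts, gamma)
    means.append(K[off_diagonal].mean())
    print("gamma=%-5s mean similarity between distinct points: %.3f" % (gamma, means[-1]))
```

With a large gamma each point is similar only to itself, so the boundary can wrap tightly around individual samples; with a very small gamma all points look nearly alike, and the model can barely tell the classes apart, which is consistent with almost every sample ending up as a support vector for gamma = 0.1 and 0.01 above.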

(6) Extension

from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1, 0.1, 0.01],'C': [0.1, 1, 10]},
                 {'kernel': ['linear'], 'C': [0.1, 1, 10]},
                 {'kernel': ['poly'],'gamma': [1, 0.1, 0.01],
                  'C': [0.1, 1, 10]}]
"""
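GridSearchCV() tunes the parameters automatically: given a grid of candidate parameters, it returns the best score and the parameters that achieve it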
https://blog.csdn.net/weixin_41988628/article/details/83098130
"""
model_grid = GridSearchCV(SVC(), tuned_parameters, cv=5)
model_grid.fit(X, y)
print("The best parameters are %s with a score of %0.2f"
      % (model_grid.best_params_, model_grid.best_score_))

The best parameters are {'C': 0.1, 'gamma': 1, 'kernel': 'rbf'} with a score of 1.00

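10.8 Comprehensive Example: Building a Classifier with SVM

Preparation: import the required modules

import numpy as np
from sklearn import svm
from sklearn.svm import SVC  # import the SVM classifier
from sklearn.model_selection import train_test_split  # utility for splitting into training and test sets
from sklearn.datasets import load_wine  # loader for the wine dataset
from time import time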

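(1) Load the dataset

The data must be in a format that the SVM implementation accepts; in the LIBSVM text format each sample is written as: [class label] [feature 1]:[value] [feature 2]:[value] ...

sklearn ships with the classic wine dataset, which is loaded with the load_wine() function.

wine dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html

Attribute	Value
Classes	3
Samples per class	[59,71,48]
Total samples	178
Dimensionality	13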

wine = load_wine()
wine_data = wine.data
wine_label = wine.target
wine_data, wine_label

(array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2]))

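(2) Data preprocessing

Use the StandardScaler class from the preprocessing module to standardize the data, so that features measured on very different scales do not distort the result.

Standardization rescales each feature to zero mean and unit variance. It is often used when indicators have to be compared or combined: it removes the units of the data and turns them into dimensionless values, so that indicators with different units or magnitudes can be compared and weighted.

from sklearn.preprocessing import StandardScaler

wine_data = StandardScaler().fit_transform(wine_data)  # standardize the data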
wine_data

array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,
         1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,
         1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,
         0.78858745,  1.39514818],
       ...,
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515,
        -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176,
        -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837,
        -1.42894777, -0.59516041]])

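(3) Split the data

Split the data into a training set and a test set in the ratio 80% : 20%.

Reference: "sklearn的train_test_split()各函数参数含义解释（非常全）" (a detailed explanation of the parameters of sklearn's train_test_split())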

wine_train, wine_test, wine_train_label, wine_test_label = \
train_test_split(wine_data, wine_label, test_size=0.2, random_state=100)
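
The walkthrough standardizes the whole dataset before splitting it. A common alternative, sketched here under the assumption that the test set should stay unseen during preprocessing, is to split first and let a pipeline fit the scaler on the training part only. The variable names (`X_raw`, `pipe`, and so on) and the use of `stratify` are choices made for this sketch, not part of the original code:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_wine(return_X_y=True)

# Split the unscaled data first; stratify keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=100, stratify=y_raw)

# The pipeline fits StandardScaler on the training data only,
# then applies the same scaling to whatever is passed to predict/score
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print("test accuracy: %.3f" % acc)
```

On a dataset this small the difference in accuracy is usually minor, but the pipeline makes it harder to leak test-set statistics into training by accident.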

(4) Train on the training set with the default SVM parameters to produce a trained model (the default rbf kernel is used as the example)

time0 = time()
model = SVC()
model.fit(wine_train, wine_train_label)
time1 = time()

(5) Results and analysis

def result_show_analyse(test,test_label):
    """
    Predict on the test set and analyse the results
    """
    # 1. Prediction results
    print("---------Test set results--------")
    test_pred = model.predict(test)
    print("True labels of the test set:\n", test_label)
    print("Predicted labels of the test set:\n", test_pred)
    # Count the predictions that match the true labels
    true = np.sum(test_pred == test_label)
    print("Number of correct predictions:", true)
    print("Number of wrong predictions:", test_label.shape[0] - true)
    print("Training time: %.6f s" % (time1 - time0))
    # 2. Analysis: accuracy, precision, recall, F1 score, Cohen’s Kappa coefficient
    print("---------Analysis of the test set results--------")
    print("Accuracy of the SVM on the wine data: %f"
              % (accuracy_score(test_label, test_pred)))
    # For multi-class problems, pass average="macro"
    print("Precision of the SVM on the wine data: %f"
              % (precision_score(test_label, test_pred, average="macro")))
    print("Recall of the SVM on the wine data: %f"
              % (recall_score(test_label, test_pred, average="macro")))
    print("F1 score of the SVM on the wine data: %f"
              % (f1_score(test_label, test_pred, average="macro")))
    print("Cohen’s Kappa coefficient of the SVM on the wine data: %f"
              % (cohen_kappa_score(test_label, test_pred)))
    print("Classification report of the SVM on the wine data:\n",
              classification_report(test_label, test_pred))
    # 3. Plot the predicted labels against the true labels
    print("---------Plot of the test set results--------")
    plt.plot(test_pred,'bo', label="predicted")
    plt.plot(test_label,'r*', label="true")
    plt.xlabel(r'Test sample',color='r', fontsize=18)
    plt.ylabel(r'Class label',color='r', fontsize=18, rotation=360)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.title('True and predicted classes on the test set', fontsize=18)
    plt.show()


# Import the libraries used by the function
from sklearn.metrics import accuracy_score,precision_score, \
        recall_score,f1_score,cohen_kappa_score
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

# Font settings for displaying Chinese characters in figures
from pylab import *  # star import; brings in mpl

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
result_show_analyse(wine_test,wine_test_label)  # call the result function

---------Test set results--------
True labels of the test set:
 [1 2 0 1 2 2 1 1 1 1 2 1 2 2 2 0 2 0 1 0 2 0 1 1 0 0 1 1 1 2 2 1 0 1 2 2]
Predicted labels of the test set:
 [1 2 0 1 1 2 1 1 1 1 2 1 2 2 2 0 2 0 1 0 2 0 1 1 0 0 1 1 1 2 2 1 0 1 2 2]
Number of correct predictions: 35
Number of wrong predictions: 1
Training time: 0.003162 s
---------Analysis of the test set results--------
Accuracy of the SVM on the wine data: 0.972222
Precision of the SVM on the wine data: 0.979167
Recall of the SVM on the wine data: 0.974359
F1 score of the SVM on the wine data: 0.975914
Cohen’s Kappa coefficient of the SVM on the wine data: 0.956938
Classification report of the SVM on the wine data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       0.94      1.00      0.97        15
           2       1.00      0.92      0.96        13

    accuracy                           0.97        36
   macro avg       0.98      0.97      0.98        36
weighted avg       0.97      0.97      0.97        36

---------Plot of the test set results--------

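Reference: "【Sklearn】sklearn.metrics中的评估方法" (evaluation methods in sklearn.metrics)

The class of interest is usually treated as the positive class and all other classes as the negative class. On the test set each prediction of the classifier is either correct or incorrect. The counts of the 4 possible cases are denoted:

Name	Description
tp (true positive)	a positive sample predicted as positive
fn (false negative)	a positive sample predicted as negative
fp (false positive)	a negative sample predicted as positive
tn (true negative)	a negative sample predicted as negative

Confusion matrix for class 0:

	Predicted as class 0	Predicted as not class 0
Actually class 0	tp = 8	fn = 0
Actually not class 0	fp = 0	tn = 28

precision	recall	f1-score
$P=\frac{tp}{tp+fp}=1$	$R=\frac{tp}{tp+fn}=1$	$\frac{2PR}{P+R}=1$

Confusion matrix for class 1:

	Predicted as class 1	Predicted as not class 1
Actually class 1	tp = 15	fn = 0
Actually not class 1	fp = 1	tn = 20

precision	recall	f1-score
$P=\frac{tp}{tp+fp}=\frac{15}{16}=0.9375$	$R=\frac{tp}{tp+fn}=1$	$\frac{2PR}{P+R}=\frac{30}{31}=0.9677$

Confusion matrix for class 2:

	Predicted as class 2	Predicted as not class 2
Actually class 2	tp = 12	fn = 1
Actually not class 2	fp = 0	tn = 23

precision	recall	f1-score
$P=\frac{tp}{tp+fp}=1$	$R=\frac{tp}{tp+fn}=\frac{12}{13}=0.923$	$\frac{2PR}{P+R}=\frac{24}{25}=0.96$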
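
The per-class counts above can be derived from a single multi-class confusion matrix. The sketch below assembles that matrix from the three tables (rows are true classes, columns are predicted classes) and recomputes the metrics with NumPy; the variable names are chosen for this example:

```python
import numpy as np

# Rows: true class 0, 1, 2; columns: predicted class 0, 1, 2
conf_mx = np.array([[8, 0, 0],
                    [0, 15, 0],
                    [0, 1, 12]])

tp = np.diag(conf_mx)                  # correctly predicted samples of each class
fp = conf_mx.sum(axis=0) - tp          # predicted as the class but belonging to another
fn = conf_mx.sum(axis=1) - tp          # belonging to the class but predicted as another
tn = conf_mx.sum() - tp - fp - fn      # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("tp:", tp, "fp:", fp, "fn:", fn, "tn:", tn)
print("precision:", precision.round(4))
print("recall:   ", recall.round(4))
print("f1:       ", f1.round(4))
print("macro averages:", precision.mean(), recall.mean(), f1.mean())
```

The macro averages are the plain means of the per-class values, which is what average="macro" requests in the sklearn metric functions.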

(6) Confusion matrix of the classification results and its plot

from sklearn import metrics


def cm_plot(y,yp):
    conf_mx = metrics.confusion_matrix(y, yp) # confusion matrix of the model on the test set
    print("Confusion matrix of the test set:\n",conf_mx)
    # Draw the confusion matrix with the Oranges colormap instead of cm.Greens
    # (see https://blog.csdn.net/weixin_51111267/article/details/122605388)
    plt.matshow(conf_mx,cmap=plt.cm.Oranges)
    plt.colorbar()# color scale
    for row in range(len(conf_mx)):
        for col in range(len(conf_mx)):
            # matshow puts the columns on the x axis and the rows on the y axis
            plt.annotate(conf_mx[row,col],xy=(col,row),horizontalalignment='center',
                         verticalalignment='center')
    plt.ylabel('True label')# axis label
    plt.xlabel('Predicted label')# axis label
    return plt


wine_test_pred=model.predict(wine_test)
cm_plot(wine_test_label, wine_test_pred).show()

Confusion matrix of the test set:
 [[ 8  0  0]
 [ 0 15  0]
 [ 0  1 12]]

10.9 Expert Tips

10.9.1 The SMO Algorithm

SMO (Sequential Minimal Optimization) is an optimization algorithm used to train SVMs; it trades some precision for speed.

10.9.3 Choosing a Kernel Function

For the Gaussian RBF kernel, the gamma value can be tuned by computing the accuracy for a range of values and plotting a learning curve.

# Accuracy obtained with different gamma values
score = []
gamma_range = np.logspace(-10, 1, 50) # 50 gamma values evenly spaced on a log scale from 1e-10 to 10
for i in gamma_range:
    model = SVC(kernel="rbf",gamma = i, cache_size=5000)
    model.fit(wine_train, wine_train_label)
    score_gamma = model.score(wine_test, wine_test_label)
    score.append(score_gamma)
print("Highest accuracy:",max(score))
print("Corresponding gamma value:", gamma_range[score.index(max(score))])
plt.xlabel("gamma value")
plt.ylabel("accuracy")
plt.title("Learning curve for gamma")
plt.plot(gamma_range,score)
plt.show()

Highest accuracy: 1.0
Corresponding gamma value: 0.020235896477251554

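10.9.4 Plotting Multi-class ROC Curves

Reference: "【小学生都会的机器学习】一个视频帮各位总结好了混淆矩阵、召回率、精准率、ROC等..." (a video summarizing the confusion matrix, recall, precision, ROC and related topics)

Reference: "ROC曲线绘制原理及如何用SPSS绘制ROC曲线" (how ROC curves are constructed and how to plot them in SPSS)

The closer the ROC curve is to the top-left corner, the better the model performs.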
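
To make the construction concrete, here is a small sketch that builds an ROC curve by hand for a binary problem: the samples are sorted by score, the threshold is lowered one sample at a time, and the true and false positive rates are recorded. The helper names and the toy labels and scores are made up for this example, and the sketch assumes that no two scores are tied:

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) arrays for 0/1 labels; a higher score means more positive."""
    order = np.argsort(-scores)            # sort from highest to lowest score
    labels = labels[order]
    tps = np.cumsum(labels)                # true positives as the threshold is lowered
    fps = np.cumsum(1 - labels)            # false positives as the threshold is lowered
    tpr = np.concatenate([[0.0], tps / labels.sum()])
    fpr = np.concatenate([[0.0], fps / (1 - labels).sum()])
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoid rule."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))


toy_labels = np.array([1, 1, 0, 1, 0, 0])
toy_scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.1])
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = auc_trapezoid(toy_fpr, toy_tpr)
print(toy_fpr)
print(toy_tpr)
print(toy_auc)
```

The area equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, here 8 out of 9 pairs. A curve that hugs the top-left corner therefore corresponds to an area close to 1.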

from itertools import cycle

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC


def plot_roc(test, test_label, test_pred):
    """
    :param test: feature data of the test samples (unused, kept for the call signature)
    :param test_label: true labels of the test samples
    :param test_pred: predicted labels of the test samples
    """
    class_num = len(np.unique(test_label))  # number of classes
    Y_pred = test_pred
    # Binarize the labels (one column per class)
    # Y_label: true labels, Y_pred: labels predicted by the learner
    # Note: hard predictions give each curve a single interior point;
    # scores from model.decision_function would give a finer curve
    Y_label = label_binarize(test_label, classes=[i for i in range(class_num)])
    Y_pred = label_binarize(Y_pred, classes=[i for i in range(class_num)])
    # Compute the ROC of each class
    fpr = dict()  # False Positive Rate (FPR)
    tpr = dict()  # True Positive Rate (TPR)
    roc_auc = dict()  # area under the ROC curve
    for i in range(class_num):
        fpr[i], tpr[i], _ = roc_curve(Y_label[:, i], Y_pred[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    # Compute the micro-average ROC curve and its area
    fpr["micro"], tpr["micro"], _ = roc_curve(Y_label.ravel(), Y_pred.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    # Compute the macro-average ROC curve and its area
    # Step 1: aggregate all false positive rates
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(class_num)]))

    # Step 2: interpolate all ROC curves at these points
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(class_num):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    # Step 3: average them and compute the AUC
    mean_tpr /= class_num
    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
    # Plot the ROC curve of one specific class, here the class with index 1
    plt.figure()
    lw = 2
    plt.plot(fpr[1], tpr[1], color="darkorange",
             lw=lw, label="ROC curve (area = %0.2f)" % roc_auc[1])
    plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate (FPR)")
    plt.ylabel("True Positive Rate (TPR)")
    plt.title("Receiver operating characteristic example")
    plt.legend(loc="lower right")
    plt.show()

    # Plot the ROC curves of all classes
    lw = 2  # line width
    plt.figure()
    plt.plot(fpr["micro"], tpr["micro"],
             label="micro-average ROC curve (area = {0:0.2f})"
                   "".format(roc_auc["micro"]),
             color="deeppink", linestyle=":", linewidth=4)

    plt.plot(fpr["macro"], tpr["macro"],
             label="macro-average ROC curve (area = {0:0.2f})"
                   "".format(roc_auc["macro"]),
             color="navy", linestyle=":", linewidth=4)
    colors = cycle(["aqua", "darkorange", "cornflowerblue"])
    for i, color in zip(range(class_num), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=lw,
                 label="ROC curve of class {0} (area = {1:0.2f})"
                       "".format(i, roc_auc[i]))

    plt.plot([0, 1], [0, 1], "k--", lw=lw)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate (FPR)")
    plt.ylabel("True Positive Rate (TPR)")
    plt.title('Some extension of Receiver operating characteristic '
              'to multi-class')
    plt.legend(loc="lower right")
    plt.show()


# Call the function that plots the ROC curves
model = SVC()  # instantiate the model with default parameters
model.fit(wine_train, wine_train_label)
wine_test_pred = model.predict(wine_test)
plot_roc(wine_test, wine_test_label, wine_test_pred)

10.10 Exercise: Build an SVM Classification Model on the iris Dataset

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

import numpy as np
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import matplotlib as mpl
from time import time

# (1) Load the dataset and separate the labels from the data
iris = load_iris()
iris_data = iris.data
iris_label = iris.target
# (2) Standardize the dataset
iris_data = StandardScaler().fit_transform(iris_data)
# (3) Split the dataset into a training set and a test set
iris_train, iris_test, iris_train_label, iris_test_label = \
train_test_split(iris_data, iris_label, test_size=0.2)
# (4) Build the SVM model
model = SVC()
model.fit(iris_train, iris_train_label)
iris_test_pred = model.predict(iris_test)
# (5) Print the test set predictions, evaluate the classifier, and print the report
print(classification_report(iris_test_label, iris_test_pred))
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
print("---------Plot of the test set results--------")
plt.plot(iris_test_pred,'bo', label="predicted")
plt.plot(iris_test_label,'r*', label="true")
plt.xlabel(r'Test sample',color='r', fontsize=18)
plt.ylabel(r'Class label',color='r', fontsize=18, rotation=360)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('True and predicted classes on the test set', fontsize=18)
plt.show()

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       0.92      0.92      0.92        13
           2       0.88      0.88      0.88         8

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30

---------Plot of the test set results--------
