
Deep Learning (1)

1. Introduction to Deep Learning

What is a Neural Network

For example, a linear "size → price" function for houses can be used to predict the price of a new data point:

$$size \rightarrow \bigcirc \rightarrow price$$

This function is a single neuron.
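
As a rough illustration (not from the course), such a neuron can be written as a tiny function; the weight, the bias, and the max(0, ·) clamp below are all made-up assumptions:

# A minimal sketch of a single "size -> price" neuron.
# w and b are made-up numbers; max(0, .) keeps the predicted price non-negative (ReLU-like).
def neuron(size, w=0.5, b=10.0):
    return max(0.0, w * size + b)

print(neuron(120.0))  # predicted price for a house of size 120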

Supervised Learning with Neural Networks

Supervised learning: each input is paired with a known output (label).

Structured vs. unstructured data

Structured Data: data with a database-like structure, such as housing records or user records.

Unstructured Data: raw data such as audio, images, and text.

Why is deep learning only just now taking off

  1. More data

  2. Faster computation

  3. Better algorithms

    For example, switching from Sigmoid to ReLU: in Sigmoid's flat (saturated) regions the gradient is nearly zero, so gradient descent becomes very slow, which costs a lot of training time.

2. Basics of Neural Network Programming

Binary Classification

In neural networks, we arrange the data matrix the opposite way from the usual machine-learning convention, as (n * m):

$$\left[ \begin{matrix} X^{(1)}_0 & X^{(2)}_0 & \cdots & X^{(m)}_0 \\ X^{(1)}_1 & X^{(2)}_1 & \cdots & X^{(m)}_1 \\ \vdots \\ X^{(1)}_n & X^{(2)}_n & \cdots & X^{(m)}_n \end{matrix} \right]$$

That is, each example is placed in its own column.

m is the total number of examples, n the total number of features.

Logistic Regression

In neural networks, logistic regression is usually computed as:

$$z = w^T x + b \\ \hat{y} = \sigma(z) = \sigma(w^T x + b)$$

rather than merging w and b into a single parameter $\theta$.

Cost Function

Also called the loss function.

More precisely, the loss function describes a single example, while the cost function describes the whole dataset.

A natural goal would be to make the squared error as small as possible:

$$\frac12 (\hat{y} - y)^2$$

However, this squared error makes the optimization non-convex for logistic regression, so instead the following loss is applied to each example:

$$Loss = -y \log \hat{y} - (1-y) \log(1-\hat{y})$$

When $y = 1$: $L = -\log \hat{y}$; if $\hat{y} = 1$ then $L = 0$, and as $\hat{y} \to 0$, $L \to \infty$. When $y = 0$: $L = -\log(1-\hat{y})$; as $\hat{y} \to 1$, $L \to \infty$, and if $\hat{y} = 0$ then $L = 0$.

So the cost function over the whole training set is:

$$Cost = -\frac1m \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$$
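
A minimal numpy sketch of this cost, assuming a holds the predictions y_hat and y the labels, both of shape (1, m) (the same layout used later in the assignments):

import numpy as np

def cross_entropy_cost(a, y):
    # a: predictions y_hat, shape (1, m); y: labels, shape (1, m)
    m = y.shape[1]
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m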

Gradient Descent

Recap:

$$\hat{y} = \sigma(w^T x + b), \quad \sigma(z) = \frac1{1 + e^{-z}}$$

$$J(w, b) = \frac1m \sum_{i=1}^m Loss(\hat{y}^{(i)}, y^{(i)}) = -\frac1m \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$$

This J is a convex function: it has a single minimum rather than many different local optima.

The gradient is the slope of J in each parameter dimension. A gradient descent step updates w as:

$$w = w - \alpha \cdot dw$$

where $w$ is the coefficient vector of the prediction function, $\alpha$ is the learning rate, and $dw$ is the derivative of $J$ with respect to $w$, i.e. the current slope of the cost function.

Our goal is to drive the cost function to its minimum, i.e. the point where $dw$ of this convex function approaches 0.

In practice, $dw$ approaching 0 shows up as two successive values of $w$ being nearly equal.

So the gradient descent code looks roughly like:

while True:
    last_w = w
    w = w - alpha * dw  # dw and db are recomputed from the current w and b each iteration
    b = b - alpha * db
    if abs(last_w - w) < 0.0001:  # 0.0001 is an arbitrary tolerance
        break

$dw$ denotes the ordinary derivative with respect to $w$; $\partial w$ denotes the partial derivative with respect to $w$, used when the function has several variables.

Derivatives

For $f(x) = 3x$, the derivative of the function is:

$$\frac{df(x)}{dx} = 3$$

The meaning of the derivative: when x increases by an infinitesimally small amount, f(x) increases by 3 times that amount.

Computation Graph

For example, J = 3(a + bc) expressed as a computation graph:

$$u = bc \\ v = a + u \\ J = 3v$$

Derivatives with a Computation Graph

This mainly illustrates the chain rule for derivatives:

$$\frac{dJ}{da} = 3 \\ \frac{dJ}{db} = 3c \\ \frac{dJ}{dc} = 3b$$
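
A quick numeric check of these chain-rule results (a = 5, b = 3, c = 2 are arbitrary values), comparing finite differences against the formulas above:

# Finite-difference check of dJ/da = 3, dJ/db = 3c, dJ/dc = 3b for J = 3(a + bc)
def J(a, b, c):
    u = b * c
    v = a + u
    return 3 * v

a, b, c, h = 5.0, 3.0, 2.0, 1e-6
print((J(a + h, b, c) - J(a, b, c)) / h)  # ~3
print((J(a, b + h, c) - J(a, b, c)) / h)  # ~3c = 6
print((J(a, b, c + h) - J(a, b, c)) / h)  # ~3b = 9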

Gradient Descent for Logistic Regression

The formulas are written above; we just need to work out dw.

Computation graph:

$$w, x, b \longrightarrow z = wx + b \longrightarrow \hat{y} = a = \sigma(z) \longrightarrow L(a, y)$$

Derivative of L with respect to a (i.e. y_hat, the sigmoid output):

$$Loss = -y \log \hat{y} - (1-y) \log(1-\hat{y}), \quad a = \hat{y} \\ \frac{dL}{da} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)}$$

Derivative of a with respect to z:

$$a = \sigma(z) = \frac1{1+e^{-z}} = (1+e^{-z})^{-1} \\ \frac{da}{dz} = -(1+e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1+e^{-z})^2} = a^2 \left(\frac1a - 1\right) = a - a^2$$

Derivative of z with respect to w:

$$z = wx + b \\ \frac{dz}{dw} = x$$

So the derivative of L with respect to w is:

$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw} = (a-y)x$$
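
A quick numeric sanity check of dL/dw = (a - y)x on a single example (the numbers are arbitrary):

import numpy as np

def loss(w, x, b, y):
    a = 1 / (1 + np.exp(-(w * x + b)))  # sigmoid(wx + b)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

w, x, b, y, h = 0.3, 2.0, -0.1, 1.0, 1e-6
a = 1 / (1 + np.exp(-(w * x + b)))
print((loss(w + h, x, b, y) - loss(w, x, b, y)) / h)  # numerical dL/dw
print((a - y) * x)                                    # analytic (a - y) x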

So the derivative of the cost function, written dw, is:

$$dw = \frac{dC}{dw} = \frac{1}{m} \sum_{i=1}^m (a^{(i)} - y^{(i)}) \cdot x^{(i)}$$

Vectorization

Whenever possible, avoid explicit for-loops.

In numpy, vectorized computation can replace explicit for loops, with a very large speedup.

For example, computing $z = w^T x + b$ (where w and x are both column vectors):

z = np.dot(w.T, x) + b  # w and x are (n, 1) column vectors, so use w.T

Another example: given

$$v = \left[ \begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix} \right]$$

compute

$$u = \left[ \begin{matrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_n} \end{matrix} \right]$$

With an explicit loop, the code is:

import math
u = np.zeros((n, 1))  # an n-by-1 column of zeros
for i in range(n):
    u[i] = math.exp(v[i])

But the vectorized version is simply:

import numpy as np
u = np.exp(v)

Vectorizing Logistic Regression

n is the number of features and m the number of examples.

w.shape is (n, 1); X.shape is (n, m); b.shape is (1, 1), i.e. b is a real number.

In numpy, adding a vector and a real number automatically expands the real number to match the vector. This operation is called broadcasting.

Code for computing $z = w^T X + b$:

z = np.dot(w.T, X) + b  # shape is (1, m)
a = sigmoid(z)  # a is y_hat

Code for computing dw:

dw = np.dot(X, (a - y).T) / m  # shape is (n, 1)

Broadcasting in Python

For example, adding an m * n matrix and a 1 * n matrix automatically expands the 1 * n matrix to m * n.

Example:

import numpy as np

# Apples Beef Eggs Potatoes
X = np.array([
    [56.0, 0.0, 4.4, 68.0],  # Carb (carbohydrates)
    [1.2, 104.0, 52.0, 8.0],  # Protein
    [1.8, 135.0, 99.0, 0.9]  # Fat
])
X.shape  # (3, 4)

cal = X.sum(axis=0)
cal.shape  # (4,)
print(X / cal)

cal = cal.reshape(1, 4)  # same values as the cal above
cal.shape  # (1, 4)
print(X / cal)

"""
[[0.94915254 0.         0.02831403 0.88426528]
 [0.02033898 0.43514644 0.33462033 0.10403121]
 [0.03050847 0.56485356 0.63706564 0.01170351]]
 """

To avoid strange bugs, make sure your arrays are not rank-1 arrays (i.e. avoid shapes of the form (n,)). For example, to create a column vector:

# a column vector
np.random.randn(5, 1)  # don't write: np.random.randn(5)

To guarantee an array has the expected shape, use:

assert(a.shape == (5, 1))
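
A short demonstration of why rank-1 arrays are error-prone (the transpose of a (5,) array does nothing, and its dot product with itself silently collapses to a scalar):

import numpy as np

a = np.random.randn(5)        # rank-1 array, shape (5,)
print(a.shape, a.T.shape)     # (5,) (5,) -- the transpose has no effect
print(np.dot(a, a).shape)     # () -- a scalar, not a (5, 5) matrix

b = np.random.randn(5, 1)     # a proper column vector
print(b.shape, b.T.shape)     # (5, 1) (1, 5)
print(np.dot(b, b.T).shape)   # (5, 5)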

3. One hidden layer Neural Network

Neural Network Representation

A network consists of an Input Layer, one or more Hidden Layers, and an Output Layer.

Different nodes of the input layer correspond to different features.

Notation:

Input layer: written $a^{[0]}$ (a stands for activation). Hidden layer: written $a^{[1]}$. Output layer: written $a^{[2]}$. This network has two layers in total (the input layer is not counted).

If this is unclear, refer to the diagram in the course.

The hidden layer has multiple nodes; each one computes its own unit for that layer:

$$z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1, \quad a^{[1]}_1 = \sigma(z^{[1]}_1) \\ z^{[1]}_2 = w^{[1]T}_2 x + b^{[1]}_2, \quad a^{[1]}_2 = \sigma(z^{[1]}_2) \\ z^{[1]}_3 = w^{[1]T}_3 x + b^{[1]}_3, \quad a^{[1]}_3 = \sigma(z^{[1]}_3) \\ z^{[1]}_4 = w^{[1]T}_4 x + b^{[1]}_4, \quad a^{[1]}_4 = \sigma(z^{[1]}_4)$$

So the output of the Hidden Layer is:

$$z^{[1]} = W^{[1]} x + b^{[1]} \\ a^{[1]} = \sigma(z^{[1]})$$

And the output of the Output Layer is:

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\ \hat{y} = a^{[2]} = \sigma(z^{[2]})$$
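
A minimal sketch of this forward pass for a single example, assuming 3 input features and 4 hidden units (arbitrary sizes):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h = 3, 4                          # arbitrary layer sizes
x = np.random.randn(n_x, 1)              # one example, shape (n_x, 1)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(1, n_h), np.zeros((1, 1))

a1 = sigmoid(np.dot(W1, x) + b1)         # hidden layer output, shape (n_h, 1)
y_hat = sigmoid(np.dot(W2, a1) + b2)     # output layer, shape (1, 1)
print(a1.shape, y_hat.shape)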

Vectorizing across multiple examples

$a^{[2](i)}$ denotes the prediction for the i-th training example; $a^{[2]}$ is $\hat{y}$.

What the parameters look like

First, for the Input Layer, $x_1, x_2, \dots$ are the n features of x. We set those aside for now.

Now we have a batch of data:

$$X = \left[ \begin{matrix} X^{(1)} & X^{(2)} & \cdots & X^{(m)} \end{matrix} \right]$$

The Hidden Layer has several sets of w and b, so:

$$z^{[1](1)} = W^{[1]} X^{(1)} + b^{[1]}$$

Here W holds several sets of coefficients; note that each row is one set:

$$W^{[1]} = \left[ \begin{matrix} w_1 & w_2 & \cdots & w_n \\ w_1 & w_2 & \cdots & w_n \\ \vdots \\ w_1 & w_2 & \cdots & w_n \end{matrix} \right]$$

Here $X^{(1)}$ is a single example with n features:

$$X^{(1)} = \left[ \begin{matrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{matrix} \right]$$

And b is also several sets of coefficients, one per row:

$$b^{[1]} = \left[ \begin{matrix} b \\ b \\ \vdots \\ b \end{matrix} \right]$$

Activation Functions

So far we have used Sigmoid as the activation function in both the Hidden Layer and the Output Layer.

Switching the Hidden Layer's activation to tanh is usually a better choice: it centers the outputs around 0, which makes them more useful for the next layer.

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

But tanh and Sigmoid share a drawback: when the input is very large or very small, their slope becomes tiny, so gradient descent is very slow.

A better approach is to use ReLU:

$$a = \max(0, z)$$

Learning with ReLU is much faster than with Sigmoid.

Conclusion: for binary classification, use ReLU as the Hidden Layer activation and Sigmoid as the Output Layer activation. For tasks that are not binary classification, do not use Sigmoid.

leaky ReLU: a = max(0.01z, z)
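
These activations are easy to write in numpy; a minimal sketch of all four:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)  # same as (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)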

Why does a neural network need a non-linear activation function

Because the composition of two linear functions is still linear.

For example, take the model above:

$$z^{[1]} = W^{[1]} x + b^{[1]} \\ a^{[1]} = \sigma(z^{[1]}) \\ z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\ \hat{y} = a^{[2]} = \sigma(z^{[2]})$$

If we replace Sigmoid with a linear (identity) activation such as y = x:

$$z^{[1]} = W^{[1]} x + b^{[1]} \\ a^{[1]} = z^{[1]} \\ z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\ \hat{y} = a^{[2]} = z^{[2]}$$

It follows that $a^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = W^{[2]}(W^{[1]} x + b^{[1]}) + b^{[2]}$, so $a^{[2]} = kx + l$, a linear function of x (with $k = W^{[2]} W^{[1]}$ and $l = W^{[2]} b^{[1]} + b^{[2]}$).

In that case, no matter how many hidden layers we add, they are pointless: the network still computes a linear function of its input. The one exception is a regression output (such as house-price prediction), where a linear output unit can make sense. In general, hidden layers must use non-linear activation functions.

Derivatives of activation functions

Sigmoid:

$$g(z) = \frac1{1 + e^{-z}} \\ g'(z) = -\frac1{(1 + e^{-z})^2} \cdot (-e^{-z}) = g(z)^2 \left(\frac1{g(z)} - 1\right) = g(z)(1 - g(z))$$

tanh:

$$g(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \\ g'(z) = 1 - \tanh(z)^2 = 1 - g(z)^2$$

ReLU:

$$g(z) = \max(0, z) \\ g'(z) = \begin{cases} 0, & z < 0 \\ 1, & z > 0 \\ \text{undefined}, & z = 0 \end{cases}$$

(In code, the z = 0 case is simply folded into one of the other two branches.)

leaky ReLU:

$$g(z) = \max(0.01z, z) \\ g'(z) = \begin{cases} 0.01, & z < 0 \\ 1, & z > 0 \\ \text{undefined}, & z = 0 \end{cases}$$

(Again, fold the z = 0 case into either branch in code.)
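
A minimal numpy sketch of these derivatives, following the formulas above (the z = 0 case for ReLU and leaky ReLU is folded into the z <= 0 branch):

import numpy as np

def d_sigmoid(z):
    g = 1 / (1 + np.exp(-z))
    return g * (1 - g)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)   # 0 for z <= 0, 1 for z > 0

def d_leaky_relu(z):
    return np.where(z > 0, 1.0, 0.01)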

Gradient Descent for Neural Networks

Now the Input Layer has $n^{[0]}$ features, the Hidden Layer has $n^{[1]}$ units, and the Output Layer has $n^{[2]}$ units.

So for w and b:

$w^{[1]}$ has shape $(n^{[1]}, n^{[0]})$ and $b^{[1]}$ has shape $(n^{[1]}, 1)$. To build intuition, set $n^{[1]} = 1$: then w is one row of $n^{[0]}$ columns, i.e. one coefficient per input feature, and b is the single bias for that row. In general there are $n^{[1]}$ such (w, b) sets, one per hidden unit.

Continuing:

$$w^{[1]}: (n^{[1]}, n^{[0]}) \\ b^{[1]}: (n^{[1]}, 1) \\ w^{[2]}: (n^{[2]}, n^{[1]}) \\ b^{[2]}: (n^{[2]}, 1)$$

So the Cost Function is:

$$J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}) = -\frac1m \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$$

The derivative of J with respect to $w^{[2]}$ (remember that a is y_hat):

Here J is the cost function; a is y_hat, the output of the Sigmoid activation; z is the activation's input; w is the variable we differentiate with respect to. In this derivation a means $a^{[2]}$.

$$dw^{[2]} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w^{[2]}} = \frac{a - y}{a(1-a)} \cdot a(1-a) \cdot a^{[1]} = (a^{[2]} - y) \cdot a^{[1]}$$

The full results as given in the lecture:

$$dz^{[2]} = a^{[2]} - y \\ dW^{[2]} = dz^{[2]} a^{[1]T} \\ db^{[2]} = dz^{[2]} \\ dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]}\prime(z^{[1]}) \\ dW^{[1]} = dz^{[1]} x^T \\ db^{[1]} = dz^{[1]}$$

Vectorized over the whole training set, in numpy terms:

$$dZ^{[2]} = A^{[2]} - Y \\ dW^{[2]} = \frac1m dZ^{[2]} A^{[1]T} \\ db^{[2]} = \frac1m np.sum(dZ^{[2]}, axis=1, keepdims=True) \\ dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]}\prime(Z^{[1]}) \\ dW^{[1]} = \frac1m dZ^{[1]} X^T \\ db^{[1]} = \frac1m np.sum(dZ^{[1]}, axis=1, keepdims=True)$$
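
A minimal numpy sketch of one gradient-descent step that follows these formulas, assuming tanh in the hidden layer, sigmoid in the output layer, and tiny made-up data (all sizes and values below are arbitrary):

import numpy as np

np.random.seed(0)
n0, n1, m, alpha = 3, 4, 5, 0.1
X = np.random.randn(n0, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1, b1 = np.random.randn(n1, n0) * 0.01, np.zeros((n1, 1))
W2, b2 = np.random.randn(1, n1) * 0.01, np.zeros((1, 1))

# forward propagation
Z1 = np.dot(W1, X) + b1
A1 = np.tanh(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = 1 / (1 + np.exp(-Z2))

# backward propagation, following the formulas above
dZ2 = A2 - Y
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)  # g'(Z1) = 1 - tanh(Z1)^2
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# parameter update
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2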

Random Initialization

The weight parameters of each layer's units should be randomly initialized (if they all started at zero, every unit in a layer would compute the same thing). For example:

w1 = np.random.randn(2, 2) * 0.01  # z = w1 * x + b feeds into sigmoid, so keep values small; otherwise gradients are tiny and descent is slow
b1 = np.zeros((2, 1))

4. Deep L-Layer Neural Network

Personal review of logistic regression

y_hat = a = sigmoid(z), z = w.T X + b

Cost = J = -1/m * sum(y * log(a) + (1 - y) * log(1 - a))

dw = dJ/dw = X (a - y).T / m

db = dJ/db = sum(a - y) / m

Find the w and b that minimize the Cost, i.e. iterate until the change in Cost approaches 0.

Given X and y, choose initial w and b and a learning rate alpha, then repeat:

w_last = w

b_last = b

w = w - alpha * dw

b = b - alpha * db

Compute y_hat and y_hat_last.

When Cost(y, y_hat) - Cost(y, y_hat_last) -> 0, we are done.

The w and b at that point are the optimal w and b.

Prediction:

y_predict = sigmoid(w.T X_test + b)

y_predict_int = np.array(y_predict > 0.5, dtype=int)  # predicted labels

accuracy = np.sum(y_predict_int == y_true) / y_true.shape[1]

Deep neural network notation

L: the number of layers (#layers). Note that this count includes the Output Layer (layer L) but not the Input Layer (layer 0).

n: the number of units (#units). E.g. $n^{[1]}$ is the number of neurons in layer 1.

a: the activation, which is also each layer's output. The Input Layer's (layer 0) a equals x, and the Output Layer's (layer L) a equals $\hat{y}$.

Forward propagation in a deep network

Loop over the layers, one layer at a time:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} \\ A^{[l]} = g^{[l]}(Z^{[l]})$$

Suppose a 5-layer neural network whose unit counts are [2 (Input), 3, 5, 4, 2, 1 (Output)].

Dimensions:

$$W^{[l]}: (n^{[l]}, n^{[l-1]}) \\ b^{[l]}: (n^{[l]}, 1) \\ A^{[l-1]} \text{ (the layer's input)}: (n^{[l-1]}, m) \\ Z^{[l]}: (n^{[l]}, m)$$
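
As a quick sanity check of these dimensions, here is a small sketch that builds random parameters for the example sizes above (the batch size m = 7 and the ReLU activation are arbitrary choices):

import numpy as np

layer_dims = [2, 3, 5, 4, 2, 1]
m = 7
A = np.random.randn(layer_dims[0], m)  # A[0] = X, shape (n0, m)
for l in range(1, len(layer_dims)):
    W = np.random.randn(layer_dims[l], layer_dims[l - 1])  # (n_l, n_{l-1})
    b = np.zeros((layer_dims[l], 1))                       # (n_l, 1)
    Z = np.dot(W, A) + b                                   # (n_l, m)
    A = np.maximum(0, Z)                                   # ReLU, shape unchanged
    print(l, W.shape, b.shape, Z.shape)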

Building blocks of deep neural networks

See the course figures. Chaining all the blocks together gives one iteration of gradient descent: a forward propagation followed by a backward propagation.

Hyperparameters

The auxiliary parameters that must be chosen while learning the model parameters, such as the learning rate (alpha), the number of iterations, the number of hidden layers (#hidden layers), and the number of hidden units (#hidden units).

Assignment 2: Cat image recognition with logistic regression

Code

from lr_utils import load_dataset
from matplotlib import pyplot as plt
import numpy as np

X_train_origin, y_train_origin, X_test_origin, y_test_origin, classes = load_dataset()

# show a sample image
plt.imshow(X_train_origin[12])
plt.show()

# reshape X_train from (209, 64, 64, 3) to (12288, 209)
X_train = X_train_origin.reshape((X_train_origin.shape[0], -1)).T
# reshape X_test from (50, 64, 64, 3) to (12288, 50)
X_test = X_test_origin.reshape((X_test_origin.shape[0], -1)).T

# the shapes of y_train and y_test were already handled inside load_dataset
y_train = y_train_origin
y_test = y_test_origin

# number of features
n_dim = X_test.shape[0]

# define w and b
w = np.zeros((n_dim, 1))
b = 0

# normalize the data: scale pixel values into [0, 1]
X_train = X_train / 255
X_test = X_test / 255


# Sigmoid
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# y_hat, i.e. a
def y_hat(w, X, b):
    assert X.shape[0] == w.shape[0]
    return sigmoid(w.T.dot(X) + b)


# cost function
def J(y, a):
    assert y.shape == a.shape
    m = y.shape[1]
    return -1 / m * np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))


# derivative of the cost with respect to w
def dw(X, y, a):
    assert y.shape == a.shape
    m = y.shape[1]
    return 1 / m * X.dot((a - y).T)


# derivative of the cost with respect to b
def db(y, a):
    assert y.shape == a.shape
    m = y.shape[1]
    return 1 / m * np.sum(a - y)


# forward propagation
def propagate(w, b, X, y):
    assert w.shape[0] == X.shape[0] and X.shape[1] == y.shape[1]
    z = np.dot(w.T, X) + b
    a = sigmoid(z)
    return a


# gradient descent to find the optimal w and b
def gradient_descent(X, y, w_init, b_init, n_iters=3000, eta=0.01, epsilon=1e-8):
    """
    梯度下降:m是样本个数,n是特征数量
    :param X: X训练集,shape为(n, m)
    :param y: y训练集,shape为(1, m)
    :param w_init: 起始的w值,shape为(n, 1)
    :param b_init: 起始的b值,int
    :param n_iters: 最大循环次数
    :param eta: 学习率
    :param epsilon: 认为是0的值
    :return:
    """
    assert X.shape[1] == y.shape[1] and X.shape[0] == w_init.shape[0]

    w = w_init
    b = b_init

    i = 0

    while i < n_iters:
        last_w = w
        last_b = b
        w = w - eta * dw(X, y, y_hat(w, X, b))
        b = b - eta * db(y, y_hat(w, X, b))
        i += 1
        v = J(y, y_hat(w, X, b)) - J(y, y_hat(last_w, X, last_b))
        if i % 100 == 0:
            print('iteration', i)
            print('cost change', v)
        if abs(v) < epsilon:
            break

    return w, b


w_w, b_b = gradient_descent(X_train, y_train, w, b)

def predict(X, y_true):
    y_predict = y_hat(w_w, X, b_b)
    return np.sum(np.array(y_predict > 0.5, dtype=int) == y_true) / y_true.shape[1]

print('accuracy is about', predict(X_test, y_test))

img_url = '/Users/goyave/Desktop/cat2.jpg'
img = plt.imread(img_url)

"""预测新图片"""

# resize the image from (500, 500, 3) to (64, 64, 3)
from skimage import transform
new_img = transform.resize(img, (64, 64, 3))
plt.imshow(new_img)
plt.show()
new_img = new_img.reshape(-1, 1)
print('shape', new_img.shape)
print(y_hat(w_w, new_img, b_b))

Assignment 3: Binary classification with one hidden layer

Code

# coding: utf-8

# In[2]:


import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
from testCases import *
from planar_utils import load_planar_dataset, load_extra_datasets, sigmoid, plot_decision_boundary

np.random.seed(1)


# In[6]:


X, Y = load_planar_dataset()

# Visualize the data:
plt.scatter(X[0, :], X[1, :], c=np.squeeze(Y), s=40, cmap=plt.cm.Spectral);  # squeeze Y so matplotlib accepts it as per-point colors


# In[8]:


# X.shape is (2, 400), Y.shape is (1, 400)
# the number of training dataset is 400, the number of features is 2
# means m = 400, n^[0] = 2


# In[15]:


# first see how logistic regression performs on this problem: 
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X.T, Y.T.reshape(-1))
print(clf.score(X.T, Y.T))  # 0.47

# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")


# In[16]:


# define the size of units
def layer_sizes(X, Y):
    n_x = X.shape[0]  # size of the Input Layer
    n_h = 4  # size of the Hidden Layer
    n_y = Y.shape[0]  # size of the Output Layer
    return (n_x, n_h, n_y)


# In[21]:


# initialize the model's parameters
def initialize_parameters(n_x, n_h, n_y):
    """
    RETURNS:
    params -- python dictionary containing your parameters:
        GIVEN X1 -- (n_x, m)
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1) 
        
        GIVEN X2 -- (n_h, m)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    np.random.seed(2)
    
    # sample from a standard normal distribution (mean 0, std 1), then scale by 0.01
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros(shape=(n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros(shape=(n_y, 1))
    
    params = {
        'W1': W1,
        'b1': b1,
        'W2': W2,
        'b2': b2
    }
    return params


# In[24]:


# one loop forward propagation
def forward_propagation(X, parameters):
    """
    Arguments:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Implement forward propagation to calculate A2
    A0 = X
    Z1 = np.dot(W1, A0) + b1  # shape (n_h, m)
    A1 = np.tanh(Z1)  # shape (n_h, m)
    Z2 = np.dot(W2, A1) + b2  # shape (n_y, m)
    A2 = sigmoid(Z2)  # shape (n_y, m), means (1, m)
    
    assert(A2.shape == (1, X.shape[1]))
    
    cache = {
        'Z1': Z1,
        'A1': A1,
        'Z2': Z2,
        'A2': A2
    }
    return A2, cache


# In[36]:


# cost function
def compute_cost(A2, Y, parameters):
    """
    Arguments:
    A2 -- the sigmoid output of the second activation, of shape (1, #examples)
    Y -- "true" labels vector of shape (1, #examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given equation
    """
    m = Y.shape[1]
    
    # retrieve W1 and b1
    W1 = parameters['W1']
    b1 = parameters['b1']
    
    # compute the cross-entropy cost
    cost = -np.sum(Y * (np.log(A2)) + (1 - Y) * (np.log(1 - A2))) / m
    
    cost = np.squeeze(cost)  # makes sure "cost" is the dimension we expect
    assert(isinstance(cost, float))
    
    return cost


# In[48]:


# one loop backward propagation
def backward_propagation(parameters, cache, X, Y):
    """
    Arguments:
    parameters -- python dictionary containing "W1", "b1", "W2" and "b2"
    cache -- python dictionary containing "Z1", "A1", "Z2" and "A2"
    X -- input data of shape (n_x, #examples)
    Y -- "true" labels vector of shape (n_y, #examples)
    
    Returns:
    grads -- python dictionary containing gradients with respect to different parameters
    """
    m = X.shape[1]
    
    W1 = parameters['W1']
    W2 = parameters['W2']
    A1 = cache['A1']
    A2 = cache['A2']
    
    # backward propagation: calculate dw1, db1, dw2, db2
    
    # Notation: dx means the derivative of the Loss with respect to x
    # note: dW and db are divided by m because the cost function has a leading 1/m, which carries through differentiation
    # -- dA2 = (A2 - Y) / ((1 - A2) * A2)
    # -- (derivative of A2 w.r.t. Z2) = A2 - A2^2  (sigmoid derivative)
    # -- dZ2 = dA2 * (derivative of A2 w.r.t. Z2) = A2 - Y
    
    dZ2 = A2 - Y
    # shape of dZ2 is (n_y, m); the derivative of Z2 w.r.t. W2, i.e. A1, has shape (n_h, m)
    # shape of dW2 is (n_y, n_h)
    dW2 = np.dot(dZ2, A1.T) / m  # dZ2 times (derivative of Z2 w.r.t. W2)
    # shape of db2 is (n_y, 1)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    
    # dZ1 is the tricky part:
    # np.power does not change the shape
    # dZ2 = A2 - Y, which is of shape (n_y, m)
    # derivative of Z2 w.r.t. A1 = W2, of shape (n_y, n_h)
    # derivative of A1 w.r.t. Z1 = 1 - A1^2 (tanh derivative), of shape (n_h, m)
    # dZ1 = (W2.T dot dZ2) * (derivative of A1 w.r.t. Z1), of shape (n_h, m)
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    # shape of dW1 is (n_h, n_x), shape of X is (n_x, m)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    
    grads = {
        'dW1': dW1,
        'db1': db1,
        'dW2': dW2,
        'db2': db2
    }
    return grads


# In[53]:


# gradient descent
def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Arguments:
    parameters -- python dictionary containing parameters
    grads -- python dictionary containing gradients
    
    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']
    
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    
    parameters = {
        'W1': W1,
        'b1': b1,
        'W2': W2,
        'b2': b2
    }
    return parameters


# In[61]:


# integrate parts
def nn_model(X, Y, n_h, num_iterations, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, #examples)
    Y -- labels of shape (n_y, #examples)
    n_h -- size of hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used by predict.
    """
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]
    
    # initialize parameters
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward Propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)
        
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)
        
        # Backward Propagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)
        
        # Gradient Descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads)
        
        if (print_cost and i % 1000 == 0):
            print("Cost after iteration %i: %f" % (i, cost))  # i(int), f(float)
    
    return parameters


# In[86]:


# Predictions
def predict(parameters, X):
    A2, cache = forward_propagation(X, parameters)
    predictions = np.array(A2 > 0.5, dtype=int)  # or use np.round(A2), which returns floats
    return predictions


# In[93]:


# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(X, Y, n_h = 4, num_iterations=10000, print_cost=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))


# In[97]:


predictions = predict(parameters, X)
print(np.sum(np.array(predictions == Y, dtype=int)) / Y.shape[1])

# the instructor writes it this way:
print ('Accuracy: %d' % float((np.dot(Y, predictions.T) + np.dot(1 - Y, 1 - predictions.T)) / float(Y.size) * 100) + '%')
# this computes: (count where both are 1 + count where both are 0) / total


# In[100]:


# Tuning hidden layer size to observe different behaviors of the model for various hidden layer sizes

plt.figure(figsize=(16, 32))
hidden_layer_sizes = [1, 2, 3, 4, 5, 20, 50]
for i, n_h in enumerate(hidden_layer_sizes):
    plt.subplot(5, 2, i + 1)
    plt.title('Hidden Layer of size %d' % n_h)
    parameters = nn_model(X, Y, n_h, num_iterations=5000)
    plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
    predictions = predict(parameters, X)
    accuracy = float((np.dot(Y, predictions.T) + np.dot(1 - Y, 1 - predictions.T)) / float(Y.size) * 100)
    print ("Accuracy for {} hidden units: {} %".format(n_h, accuracy))




Assignment 4: Building a deep neural network

Code

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)


def initialize_parameters(layer_dims):
    """
    Random Initialize Parameters.
    :param layer_dims: Python list containing the dimensions of each layer, including the Input Layer
                e.g. [3, 4, 5, 1], means 2 hidden layers
    :return: parameters -- Python dictionary containing parameters:
                W1 -- weight matrix of shape (n1, n0)
                b1 -- bias vector of shape (n1, 1)
                ..
                WL -- weight matrix of shape (nl, nl-1)
                bL -- bias vector of shape (nl, 1)
    """
    parameters = {}
    L_all = len(layer_dims)

    for l in range(1, L_all):
        parameters['W' + str(l)] = np.random.randn(
            layer_dims[l], layer_dims[l - 1]) / np.sqrt(
            layer_dims[l - 1])  # randn() returns the "standard normal" distribution
        parameters['b' + str(l)] = np.zeros(shape=(layer_dims[l], 1))

    return parameters


def forward_propagation(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1) -> LINEAR -> SIGMOID computation.
    :param X: feature data, numpy array of shape (n0, m), n0 is size of input layer
    :param parameters: Python list containing all Ws and bs
    :return: AL, caches:
                AL -- last post-activation value, that is y_hat
                caches -- Python list of caches containing:
                            every cache of ReLU forward (indexed from 1 to L-1)
                            the cache of Sigmoid forward (indexed L)
                            cache[0] is empty Python tuple
    """
    caches = []
    L = len(parameters) // 2

    Al_prev = X

    def relu(Z):
        A = np.maximum(0, Z)
        return A

    def sigmoid(Z):
        A = 1 / (1 + np.exp(-Z))
        return A

    # cache of Input Layer is empty Python tuple
    caches.append(())

    # Wl of shape (nl, n(l-1)), A(l-1) of shape (n(l-1), m), bl of shape (nl, 1), Zl of shape (nl, m)
    for l in range(1, L + 1):
        Wl = parameters['W' + str(l)]
        bl = parameters['b' + str(l)]
        Zl = np.dot(Wl, Al_prev) + bl
        linear_cache = (Al_prev, Wl, bl)
        activation_cache = Zl
        cache = (linear_cache, activation_cache)
        caches.append(cache)
        # layer L, that is last layer if l != L
        Al_prev = relu(Zl) if l != L else sigmoid(Zl)

    AL = Al_prev

    m = X.shape[1]
    assert AL.shape == (1, m)

    return AL, caches


def compute_cost(AL, Y):
    """
    Compute cost.
    :param AL: probability vector corresponding to label predictions, of shape (1, m)
    :param Y: "true" label vector, of shape (1, m)
                e.g. containing 0 if non-cat, 1 if cat
    :return: cross-entropy cost. Python float
    """
    assert AL.shape == Y.shape
    m = AL.shape[1]

    AL[AL == 0] = 0.00000001
    AL[AL == 1] = 0.99999999

    # cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    cost = (-1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
    return cost


def backward_propagation(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group.
    :param AL: probability vector corresponding to label predictions, of shape (1, m)
    :param Y: "true" label vector, of shape (1, m)
    :param caches: Python list of caches containing:
                            every cache of ReLU forward (indexed from 1 to L-1)
                            the cache of Sigmoid forward (indexed L)
    :return: gradients -- Python dictionary with the gradients
    """
    gradients = {}
    dAL = (1 - Y) / (1 - AL) - Y / AL
    L = len(caches) - 1
    m = AL.shape[1]

    def d_relu(dA, Z):
        dZ = np.array(dA, copy=True)
        dZ[Z <= 0] = 0
        return dZ

    def d_sigmoid(dA, Z):
        A = 1 / (1 + np.exp(-Z))
        dZ = (A - A * A) * dA
        return dZ

    dAl = dAL
    # Wl of shape (nl, n(l-1)), A(l-1) of shape (n(l-1), m), bl of shape (nl, 1), Zl of shape (nl, m)
    for l in range(L, 0, -1):
        linear_cache, activation_cache = caches[l]
        Al_prev, Wl, bl = linear_cache
        Zl = activation_cache
        dZl = d_relu(dAl, Zl) if l != L else d_sigmoid(dAl, Zl)

        gradients['dW' + str(l)] = np.dot(dZl, Al_prev.T) / m
        gradients['db' + str(l)] = np.sum(dZl, axis=1, keepdims=True) / m

        dAl = np.dot(Wl.T, dZl)

    return gradients


def update_parameters(parameters, gradients, learning_rate):
    """
    Update parameters using gradient descent.
    :param parameters: Python list containing all Ws and bs
    :param gradients: Python dictionary with the gradients
    :param learning_rate:
    :return: updated parameters
    """
    L = len(parameters) // 2

    for l in range(1, L + 1):
        parameters['W' + str(l)] = parameters['W' + str(l)] - gradients['dW' + str(l)] * learning_rate
        parameters['b' + str(l)] = parameters['b' + str(l)] - gradients['db' + str(l)] * learning_rate

    return parameters


def L_layer_model(X, Y, layer_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):
    """
    L layers neural network.
    :param X:
    :param Y:
    :param layer_dims:
    :param learning_rate:
    :param num_iterations:
    :param print_cost:
    :return: parameters -- parameters learnt by the model
    """
    costs = []
    parameters = initialize_parameters(layer_dims)

    for i in range(num_iterations):
        AL, caches = forward_propagation(X, parameters)
        cost = compute_cost(AL, Y)
        grads = backward_propagation(AL, Y, caches)
        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters


def score(X, Y, parameters):
    AL, caches = forward_propagation(X, parameters)
    AL_trans = np.array(AL > 0.5, dtype=int)
    return np.sum(AL_trans == Y) * 1.0 / Y.shape[1]


from dnn_app_utils_v2_2 import load_data

# train_x_orig of shape (209, 64, 64, 3), train_y of shape (1, 209)
train_x_orig, train_y, test_x_orig, test_y, classes = load_data()

# Reshape the training and test examples
train_x_flatten = train_x_orig.reshape(train_x_orig.shape[0],
                                       -1).T  # the "-1" makes reshape flatten the remaining dimensions
test_x_flatten = test_x_orig.reshape(test_x_orig.shape[0], -1).T

# Standardize data to have feature values between 0 and 1.
train_x = train_x_flatten / 255.
test_x = test_x_flatten / 255.

m_train = train_x_orig.shape[0]
num_px = train_x_orig.shape[1]
m_test = test_x_orig.shape[0]

layers_dims = [12288, 20, 7, 5, 1]  # 5-layer model

params = L_layer_model(train_x, train_y, layers_dims, num_iterations=3000, print_cost=True)
print(score(train_x, train_y, params))
print(score(test_x, test_y, params))