Convolutional networks, also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels.

Convolution

In its most general form, convolution is an operation on two functions of a real-valued argument. Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.

Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

$$s(t) = \int x(a)\,w(t-a)\,da$$

This operation is called convolution. The convolution operation is typically denoted with an asterisk:

$$s(t)=(x*w)(t)$$

In our example, w needs to be a valid probability density function, or the output is not a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future. In general, convolution is defined for any functions for which the above integral is defined, and may be used for other purposes besides taking weighted averages.
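
As a concrete instance (my own illustrative choice, not from the original text), take w to be the uniform density over the most recent T seconds, i.e. w(a) = 1/T for 0 ≤ a ≤ T and 0 otherwise. The convolution then reduces to an ordinary moving average of the last T seconds of readings:

$$s(t) = (x*w)(t) = \frac{1}{T}\int_{t-T}^{t} x(a)\,da$$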

In convolutional network terminology, the first argument to the convolution is often referred to as the input and the second argument as the kernel. The output is sometimes referred to as the feature map.

In our example, it might be more realistic to assume that our laser provides a measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

$$s(t)=(x*w)(t)=\sum_{a=-\infty}^{\infty}x(a)w(t-a)$$

We usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements.
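
Here is a small Python sketch of that finite summation applied to the smoothing example; the arrays x and w and the helper name discrete_convolution are made up for illustration, with x holding one noisy position reading per second and w a causal weighting function that favors recent measurements.

```python
import numpy as np

def discrete_convolution(x, w):
    """s(t) = sum_a x(a) * w(t - a), with both arrays treated as zero outside their bounds."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

x = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1])   # noisy position readings, one per second
w = np.array([0.5, 0.3, 0.2])                  # weights for ages 0, 1, 2 seconds (sum to 1)

print(discrete_convolution(x, w))
print(np.convolve(x, w))                       # NumPy's built-in full convolution agrees
```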

Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

$$S(i,j)=(I*K)(i,j) = \sum_m\sum_nI(m,n)K(i-m,j-n)$$

Convolution is commutative, meaning we can equivalently write:

$$S(i,j)=(I*K)(i,j) = \sum_m\sum_nI(i-m,j-n)K(m,n)$$

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n. (Note: if K has a small support, then in the second form m and n always range over the same fixed set of values, whereas in the first form their valid range depends on i and j, so the second form is easier to implement in code.)
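
The following Python sketch implements the second summation directly, in "valid" mode where the kernel stays entirely inside the image; the image I, the kernel K, and the helper name conv2d_valid are arbitrary choices for illustration, and the SciPy comparison at the end is only a sanity check.

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d_valid(I, K):
    """S(i, j) = sum_{m,n} I(i - m, j - n) K(m, n), restricted to fully valid positions."""
    kh, kw = K.shape
    out = np.zeros((I.shape[0] - kh + 1, I.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(kh):          # m and n always range over the kernel's fixed extent
                for n in range(kw):
                    out[i, j] += I[i + kh - 1 - m, j + kw - 1 - n] * K[m, n]
    return out

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_valid(I, K))
print(np.allclose(conv2d_valid(I, K), convolve2d(I, K, mode='valid')))  # True
```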

The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

$$S(i,j)=(I*K)(i,j) = \sum_m\sum_nI(i+m,j+n)K(m,n)$$
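
As a quick numerical check (assuming SciPy is available), cross-correlating an image with a kernel gives the same result as convolving it with the kernel flipped along both axes; the random arrays below are purely illustrative.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))

# Cross-correlation == convolution with a kernel flipped along both axes.
print(np.allclose(correlate2d(I, K, mode='valid'),
                  convolve2d(I, np.flip(K), mode='valid')))   # True
```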

Optimizations

Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations.

Convolutional networks typically have sparse interactions. This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency.

Parameter sharing refers to using the same parameter for more than one function in a model. In a convolutional neural net, each member of the kernel is used at every position of the input. (Note: both of these properties are a direct result of the kernel being scanned across the input, position by position.)
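
A rough back-of-the-envelope sketch (with made-up sizes) of what sparse interactions and parameter sharing buy: connecting a 1,000-sample input densely to a 1,000-unit output needs a million weights, while a convolutional layer with a width-9 kernel gets by with 9 weights shared across every position.

```python
import numpy as np

n = 1000

# A fully connected layer needs one weight per (input, output) pair.
dense_params = n * n
print(dense_params)        # 1000000 weights

# A convolutional layer with a kernel of width 9 reuses the same 9 weights at every
# output position, so each output also interacts with only 9 inputs.
kernel = np.random.randn(9)
print(kernel.size)         # 9 weights, shared across all positions
```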

In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in the same way.
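A small numerical check of this property, assuming SciPy (correlate2d plays the role of the convolution layer here, and the sizes are arbitrary): translating the input by one pixel translates the feature map by one pixel, apart from the column affected by np.roll's wraparound.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

shifted = np.roll(image, shift=1, axis=1)        # translate the input one pixel to the right

out = correlate2d(image, kernel, mode='valid')
out_shifted = correlate2d(shifted, kernel, mode='valid')

# Away from the wrapped border column, the two feature maps agree up to the same shift.
print(np.allclose(out_shifted[:, 1:], out[:, :-1]))   # True
```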

Pooling

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the \( L^2 \) norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.

In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.
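
A minimal pure-NumPy sketch of max pooling over non-overlapping 2x2 windows (the helper name and the toy input are my own, and the input height and width are assumed to be even):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Report the maximum within each non-overlapping 2x2 neighborhood."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 1],
              [0, 0, 5, 6],
              [1, 0, 7, 8]])
print(max_pool_2x2(x))
# [[4 1]
#  [1 8]]
```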

Basic Convolution Functions

The basic structure of a CNN consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. Convolutional layers and pooling layers usually alternate and may each appear multiple times.

A convolutional layer consists of multiple feature maps, and each feature map consists of multiple neurons. Each neuron is locally connected to the previous layer's feature map through a convolution kernel, which is a matrix of weights.
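
A minimal sketch of this alternating structure, assuming PyTorch and an MNIST-sized 1x28x28 input with 10 output classes; the channel counts and kernel sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolutional layer: 8 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # second convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # fully connected output layer
)

x = torch.randn(1, 1, 28, 28)     # one dummy input image
print(model(x).shape)             # torch.Size([1, 10])
```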
