A convolutional neural network (CNN) is a neural network architecture widely used in computer vision and image processing. In this article, we introduce the principles behind each type of layer in a CNN.
The complete code for this chapter can be found in .
Convolutions
Convolution in image processing is a mathematical operation used to modify or analyze images. It combines two functions, the image and the kernel, to produce a third function that expresses how one is modified by the other. Convolution is widely used for tasks such as edge detection, blurring, sharpening, noise reduction, and feature extraction in computer vision.
The mathematical definition is as follows. The symbol for the convolution operation is $*$.
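One common way to write the discrete 2D operation is sketched here. The symbols $I$ (image), $K$ (kernel), and $f$ (kernel size) are notation assumed for illustration, and texts differ in their conventions:

```latex
(I * K)(i, j) = \sum_{m=0}^{f-1} \sum_{n=0}^{f-1} I(i + m,\, j + n)\, K(m, n)
```

Strictly speaking, this unflipped form is cross-correlation; the textbook definition of convolution flips the kernel and uses $I(i - m, j - n)$ instead. CNN literature usually uses the unflipped form, which matches the sliding-window description in this article.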
We use the example in the figure below to explain the convolution operation. The figure has a 5 x 5 input image, a 3 x 3 kernel, and a 3 x 3 output image. Imagine a sliding window, the same size as the kernel, placed over the input image. Starting from the upper left of the input image, the values in the window are multiplied element-wise by the kernel and summed. The result is the upper-left value of the output image, shown in blue in the figure. Then the sliding window moves one position to the right and the same multiplication and summation are performed, giving the second value of the output image, shown in green. The window moves one more position to the right and repeats the computation. Since the window can then move no further right, it returns to the left edge and moves down one position. This repeats until the window can move neither right nor down.
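The sliding-window procedure described above can be sketched in plain Python. This is a minimal illustration, not the figure's actual data: the 5 x 5 image and the all-ones kernel are hypothetical values, and `conv2d` is a helper name chosen here.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    f = len(kernel)
    out_rows = len(image) - f + 1
    out_cols = len(image[0]) - f + 1
    output = [[0] * out_cols for _ in range(out_rows)]
    for i in range(out_rows):        # top row of the window
        for j in range(out_cols):    # left column of the window
            total = 0
            for m in range(f):
                for n in range(f):
                    # Element-wise multiply, then accumulate the sum
                    total += image[i + m][j + n] * kernel[m][n]
            output[i][j] = total
    return output


# Hypothetical 5 x 5 image holding the values 0..24 in row-major order
image = [[row * 5 + col for col in range(5)] for row in range(5)]
# Hypothetical 3 x 3 kernel of ones, so each output is the sum of its window
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

result = conv2d(image, kernel)
print(result)  # [[54, 63, 72], [99, 108, 117], [144, 153, 162]]
```

The loops are written out explicitly to mirror the description; real code would normally rely on a library routine instead.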
Kernel
Different kernels are used for different tasks. The kernel shown in the figure below is the Prewitt operator, which is used for edge detection.
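To see why such a kernel detects edges, here is a small sketch that applies a horizontal-gradient Prewitt kernel to a hypothetical image with a dark left side and a bright right side. The sign convention of the kernel and the pixel values are assumptions made for illustration.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding sliding window (the unflipped form used in CNNs)."""
    f = len(kernel)
    rows = len(image) - f + 1
    cols = len(image[0]) - f + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(f) for n in range(f))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Prewitt kernel for horizontal intensity changes (one common sign convention)
prewitt_x = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Hypothetical image: columns 0-1 are dark (0), columns 2-4 are bright (9)
image = [[0, 0, 9, 9, 9] for _ in range(5)]

edges = conv2d(image, prewitt_x)
print(edges)  # [[27, 27, 0], [27, 27, 0], [27, 27, 0]]
```

The response is large where the window straddles the dark-to-bright boundary and zero where the window covers a uniform region.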
The size of the output image can be calculated from the size of the input image and the size of the kernel.
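For a square input, the relationship can be written as follows, where $n$ is the input size and $f$ is the kernel size (symbols assumed here for illustration):

```latex
n_{\text{out}} = n - f + 1
```

For the 5 x 5 input and 3 x 3 kernel in the earlier example, this gives $5 - 3 + 1 = 3$, which matches the 3 x 3 output image.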
Padding
Notice that each convolution operation makes the output image smaller than its input. To counter this, we can first pad the input image to make it larger, and then perform the convolution. With a padding of one, a border of zeros one cell wide is added around the input image, as follows.
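Zero padding can be sketched as a small helper. The name `zero_pad` and the 2 x 2 example image are hypothetical.

```python
def zero_pad(image, p):
    """Return a copy of the image surrounded by a border of zeros p cells wide."""
    padded_width = len(image[0]) + 2 * p
    top = [[0] * padded_width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in image]
    bottom = [[0] * padded_width for _ in range(p)]
    return top + middle + bottom


image = [
    [1, 2],
    [3, 4],
]
padded = zero_pad(image, 1)
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```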
So when we use padding, the size of the output image can be calculated as follows.
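With padding $p$ added on every side, the input effectively grows by $2p$ in each dimension. Using $n$ for the input size and $f$ for the kernel size (symbols assumed here for illustration):

```latex
n_{\text{out}} = n + 2p - f + 1
```

For example, a 5 x 5 input with a 3 x 3 kernel and $p = 1$ gives $5 + 2 - 3 + 1 = 5$, so the output keeps the size of the input.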
Stride
The number of cells that the sliding window moves each time is called the stride. The following shows how the sliding window moves when the stride is two.
So when we use stride, the size of the output image can be calculated as follows.
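The general relationship, covering both padding and stride, is commonly written as $\lfloor (n + 2p - f) / s \rfloor + 1$. The sketch below wraps it in a hypothetical helper, `conv_output_size`, with parameter names assumed here: `n` for input size, `f` for kernel size, `p` for padding, and `s` for stride.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(5, 3))            # 3: no padding, stride 1
print(conv_output_size(5, 3, p=1))       # 5: padding 1 keeps the size
print(conv_output_size(5, 3, s=2))       # 2: stride 2
print(conv_output_size(7, 3, p=0, s=2))  # 3
```

The floor division reflects that the window stops as soon as the next step would run past the edge of the image.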
Convolutional Neural Networks
Now that we understand what convolutions are, we can look at how convolutional neural networks (CNNs) apply them to neural networks. A CNN has many layers, just like an ordinary neural network, as shown in the figure below. There are three main types of layers in a CNN, and we will introduce them one by one.
If you are not familiar with neural networks yet, please refer to the following article first.
Convolution Layers
A convolution layer contains several kernels, which we will call filters from here on. The input may be 3D data (height x width x channels), in which case each filter is also 3D with the same depth as the input. Convolving the input with one filter produces a 2D output. Since a convolution layer can contain several filters, we get one 2D output per filter. We then stack these 2D outputs together, so the convolution layer outputs 3D data, as shown below.
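The stacking of per-filter outputs can be sketched as follows. This is a minimal illustration that assumes a channels-first layout (`[channels][height][width]`), stride 1, no padding, and one bias per filter; the function name `conv_layer` and all values are hypothetical.

```python
def conv_layer(volume, filters, biases):
    """volume: [C][H][W]; filters: [K][C][f][f]; biases: [K].

    Returns K stacked 2D feature maps, each of size (H - f + 1) x (W - f + 1).
    """
    channels = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    f = len(filters[0][0])
    out_h, out_w = height - f + 1, width - f + 1
    output = []
    for filt, bias in zip(filters, biases):
        # One filter spans every input channel and yields a single 2D map
        feature_map = [
            [
                bias + sum(
                    volume[c][i + m][j + n] * filt[c][m][n]
                    for c in range(channels)
                    for m in range(f)
                    for n in range(f)
                )
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        output.append(feature_map)
    return output


# Hypothetical input: 3 channels of 4 x 4 ones
volume = [[[1] * 4 for _ in range(4)] for _ in range(3)]
# Two hypothetical 3 x 3 x 3 filters: all ones and all twos
filters = [
    [[[1] * 3 for _ in range(3)] for _ in range(3)],
    [[[2] * 3 for _ in range(3)] for _ in range(3)],
]
biases = [0, 1]

out = conv_layer(volume, filters, biases)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)         # (2, 2, 2): two filters give two stacked 2 x 2 maps
print(out[0][0][0])  # 27 = 3 channels * 9 cells * 1
print(out[1][1][1])  # 55 = 3 channels * 9 cells * 2 + bias 1
```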
In a convolution layer, the trainable parameters are the filters, which correspond to $W$ and $b$ in neural networks. In addition, we need to set the layer's hyperparameters introduced previously (filter size, padding, and stride), along with the number of filters. Since a convolution layer has both trainable parameters and hyperparameters, we must be careful about the dimensionality of each part.
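As a quick check on dimensionality, the number of trainable parameters in a convolution layer can be counted directly. The sketch assumes square filters and one bias per filter; `conv_param_count` is a hypothetical helper name.

```python
def conv_param_count(filter_size, in_channels, num_filters):
    """Weights (f * f * C_in per filter) plus one bias per filter."""
    weights = filter_size * filter_size * in_channels * num_filters
    biases = num_filters
    return weights + biases


print(conv_param_count(5, 1, 6))   # 156
print(conv_param_count(5, 6, 16))  # 2416
```

Note that the count does not depend on the input's height or width, only on the filter size and the channel counts.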
Pooling Layers
Compared with convolution layers, pooling layers are relatively simple. In a pooling layer, we set two hyperparameters, the filter size and the stride, and decide which kind of pooling to use. In the picture below, the filter size is 2 and the stride is 2. Imagine a 2 x 2 sliding window over the input. This pooling layer is a max pooling layer, so for each window position, the maximum value in the window is output. The window then moves right by the stride, that is, two cells, and outputs the maximum value of the new window.
Therefore, a pooling layer has only hyperparameters and no trainable parameters.
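Max pooling can be sketched in a few lines. The 4 x 4 input values are hypothetical, and `max_pool2d` is a helper name chosen here for illustration.

```python
def max_pool2d(image, size=2, stride=2):
    """Output the maximum of each size x size window, moving by stride."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool2d(image)
print(pooled)  # [[6, 4], [8, 9]]
```

Nothing in the function is learned: its behavior is fully determined by the two hyperparameters.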
Fully Connected Layers (FC layers)
Fully connected layers are like traditional neural networks, as shown below. Therefore, fully connected layers have trainable parameters $W$ and $b$.
The following formula may be easier to understand.
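One standard way to write a fully connected layer is sketched here. The symbols are assumptions for illustration: $x$ is the input vector, $g$ is the activation function, and $n_{\text{in}}$ and $n_{\text{out}}$ are the input and output sizes.

```latex
z = W x + b, \qquad a = g(z), \qquad
x \in \mathbb{R}^{n_{\text{in}}},\;
W \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}},\;
b \in \mathbb{R}^{n_{\text{out}}}
```

Every output element depends on every input element, which is why $W$ has $n_{\text{out}} \times n_{\text{in}}$ entries.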
Below is an example of a fully connected layer.
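As a numeric illustration, here is a sketch of a fully connected layer with 3 inputs and 2 outputs followed by ReLU. All weights, biases, and inputs are hypothetical values.

```python
def fully_connected(x, W, b):
    """Compute z = W x + b for one input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(z):
    """Element-wise max(0, z)."""
    return [max(0.0, v) for v in z]


x = [1.0, 2.0, 3.0]        # n x 1 input (n = 3)
W = [
    [1.0, 0.0, -1.0],      # weights for output 0
    [0.5, 0.5, 0.5],       # weights for output 1
]
b = [0.0, 1.0]

z = fully_connected(x, W, b)
a = relu(z)
print(z)  # [-2.0, 4.0]
print(a)  # [0.0, 4.0]
```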
In a CNN, before entering the fully connected layers, multi-dimensional data is converted into n x 1 data, an operation called flattening. Because n is typically large, the parameters $W$ and $b$ in fully connected layers are also very large. In a CNN like the one below, most of the trainable parameters are in the fully connected layers, while only a small part is in the filters of the convolution layers.
LeNet-5
LeNet-5 is a CNN architecture proposed by Yann LeCun et al. in 1998. It was designed for handwritten character recognition and achieves good results on the MNIST dataset. Its architecture and the hyperparameters of each layer are as follows.
- The input is a 32 x 32 x 1 grayscale image.
- Conv layer: filter size is 5, and there are 6 filters, stride is 1, and padding is 0.
- Pooling layer: filter size is 2, stride is 2.
- Conv layer: filter size is 5, and there are 16 filters, stride is 1, and padding is 0.
- Pooling layer: filter size is 2, stride is 2.
- FC layer: input size is 400 (16 x 5 x 5), output size is 120.
- FC layer: input size is 120, output size is 84.
- FC layer: input size is 84, output size is 10.
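The sizes in the list above can be checked with a short sketch that traces the spatial size through each layer and counts trainable parameters. It assumes one bias per filter and per FC output, and it ignores any normalization layers; `conv_out` is a hypothetical helper name.

```python
def conv_out(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


size = 32
size = conv_out(size, 5)        # conv 1: 32 -> 28
size = conv_out(size, 2, s=2)   # pool 1: 28 -> 14
size = conv_out(size, 5)        # conv 2: 14 -> 10
size = conv_out(size, 2, s=2)   # pool 2: 10 -> 5
flattened = 16 * size * size    # 16 x 5 x 5 = 400

# (filter weights + 1 bias) per filter, times the number of filters
conv_params = (5 * 5 * 1 + 1) * 6 + (5 * 5 * 6 + 1) * 16
# (inputs * outputs) weights + one bias per output, for each FC layer
fc_params = (flattened * 120 + 120) + (120 * 84 + 84) + (84 * 10 + 10)

print(flattened)    # 400
print(conv_params)  # 2572
print(fc_params)    # 59134
print(f"{fc_params / (conv_params + fc_params):.1%}")  # 95.8%
```

Under these assumptions, roughly 96% of the parameters sit in the fully connected layers.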
Next, we will use PyTorch to implement LeNet-5. First, we build each layer from the hyperparameters above. For the pooling layers, we choose max pooling. After each conv layer, we apply batch normalization to its output, and we use ReLU as the activation function. (The original LeNet-5 used average-pooling-style subsampling and tanh-style activations without batch normalization; the choices here are common modern substitutions.) Before entering the FC layers, we flatten the output of the previous layer. The resulting PyTorch code is quite compact.
```python
import torch
import torch.nn as nn


class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        # Conv block 1: 1 x 32 x 32 -> 6 x 28 x 28 -> pool -> 6 x 14 x 14
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(num_features=6),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Conv block 2: 6 x 14 x 14 -> 16 x 10 x 10 -> pool -> 16 x 5 x 5
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(num_features=16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Classifier: flatten to 400 values, then 400 -> 120 -> 84 -> 10
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=16 * 5 * 5, out_features=120),
            nn.ReLU(),
            nn.Linear(in_features=120, out_features=84),
            nn.ReLU(),
            nn.Linear(in_features=84, out_features=10),
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.fc(x)
        return x
```
PyTorch allows us to print the model’s architecture.
```python
model = LeNet5()
print(model)

# Output
# LeNet5(
#   (conv1): Sequential(
#     (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
#     (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
#     (2): ReLU()
#     (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
#   )
#   (conv2): Sequential(
#     (0): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
#     (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
#     (2): ReLU()
#     (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
#   )
#   (fc): Sequential(
#     (0): Flatten(start_dim=1, end_dim=-1)
#     (1): Linear(in_features=400, out_features=120, bias=True)
#     (2): ReLU()
#     (3): Linear(in_features=120, out_features=84, bias=True)
#     (4): ReLU()
#     (5): Linear(in_features=84, out_features=10, bias=True)
#   )
# )
```
Next, we will use the MNIST dataset to train our LeNet-5. PyTorch (through torchvision) provides built-in functions to load the MNIST dataset. Since LeNet-5 expects a 32 x 32 grayscale image as input, we create a transform that resizes each loaded MNIST image to 32 x 32. MNIST images are already grayscale. We then normalize the loaded images. You could use mean=0.5 and std=0.5 here, but mean=0.1307 and std=0.3081 are computed from the MNIST dataset itself, so normalizing with these values centers the data more accurately.
```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose(
    [
        transforms.Resize((32, 32)),   # MNIST images are 28 x 28
        transforms.ToTensor(),         # scales pixel values to [0, 1]
        transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
    ]
)

train_data = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

print(f'Train data: {len(train_data)}')
print(f'Test data: {len(test_data)}')
print(f'Image shape: {train_data[0][0].shape}')
print(f'Classes: {train_data.classes}')

train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_data, batch_size=64, shuffle=False)

# Output
# Train data: 60000
# Test data: 10000
# Image shape: torch.Size([1, 32, 32])
# Classes: ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four', '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']
```
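The normalization step applies `(x - mean) / std` to every pixel after the values have been scaled to the range [0, 1]. A minimal sketch of the arithmetic, using a hypothetical helper named `normalize`:

```python
MEAN, STD = 0.1307, 0.3081


def normalize(pixel, mean=MEAN, std=STD):
    """Shift and scale one pixel value that is already in [0, 1]."""
    return (pixel - mean) / std


black = normalize(0.0)
white = normalize(1.0)
print(f"{black:.4f}")  # -0.4242
print(f"{white:.4f}")  # 2.8215
```

Because most MNIST pixels are background (close to 0), the dataset mean is well below 0.5, which is why the dataset-specific constants differ from the generic 0.5 values.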
After preparing the dataset, we train our model. We use cross-entropy loss as the cost function and the Adam optimizer with a learning rate of 0.001. The data loaders split the dataset into batches of 64 images. During training, we pass each batch through the model's forward propagation, then run back propagation and update the parameters. One pass over all batches is an epoch, and we train for 10 epochs.
```python
import torch.optim as optim

cost = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.train()

epochs = 10
for epoch in range(epochs):
    running_loss = 0.0
    for batch, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = cost(outputs, targets)

        # Backward pass: clear old gradients, backpropagate, update parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # Average loss per batch over this epoch
    print(f"Epoch [{epoch + 1}/{epochs}], Loss: {running_loss / len(train_loader):.4f}")

# Output
# Epoch [1/10], Loss: 0.1646
# Epoch [2/10], Loss: 0.0545
# Epoch [3/10], Loss: 0.0421
# Epoch [4/10], Loss: 0.0341
# Epoch [5/10], Loss: 0.0287
# Epoch [6/10], Loss: 0.0251
# Epoch [7/10], Loss: 0.0230
# Epoch [8/10], Loss: 0.0186
# Epoch [9/10], Loss: 0.0170
# Epoch [10/10], Loss: 0.0157
```
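To make the cost function more concrete, here is a plain-Python sketch of cross-entropy loss for a single sample, computed from raw model outputs (logits). It illustrates the idea only and is not PyTorch's implementation; the function name and the example logits are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


# Untrained-like output: all 10 classes equally likely, so the loss is ln(10)
uniform = cross_entropy([0.0] * 10, target=3)
# Confident and correct prediction: loss close to 0
confident = cross_entropy([0.0, 0.0, 10.0], target=2)
# Confident but wrong prediction: large loss
wrong = cross_entropy([0.0, 0.0, 10.0], target=0)

print(f"{uniform:.4f}")  # 2.3026
print(confident < 0.001)  # True
print(wrong > 9.0)        # True
```

A loss near ln(10), about 2.30, is what a 10-class model produces when it guesses uniformly, which gives a useful baseline for reading the per-epoch losses.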
Now the model has been trained. We use the test dataset to test the accuracy of the model.
```python
model.eval()  # use running statistics in batch normalization

correct = 0
total = 0
with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)
        # The predicted class is the index of the largest output
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}%")

# Output
# Test Accuracy: 98.97%
```
PyTorch allows us to save a trained model to a file.

```python
torch.save(model, 'lenet_mnist.pt')
```
Finally, we can load the model file and use it for prediction. Here we run the loaded model on the test set again to confirm that the accuracy is unchanged.
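```python
# weights_only=False unpickles the full model object, so only load trusted files;
# the LeNet5 class must be defined or importable at load time.
_model = torch.load('lenet_mnist.pt', weights_only=False)
_model.to(device)
_model.eval()

# Reset the counters so earlier results are not counted again
correct = 0
total = 0
with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = _model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}%")

# Output
# Test Accuracy: 98.97%
```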
Conclusion
This article briefly introduced CNNs and their layers, and implemented a simple LeNet-5. I believe you now have a conceptual understanding of CNNs. CNNs have been used in fields such as image recognition and video analysis, where they have achieved quite good results compared with traditional methods.
References
- Andrew Ng, Deep Learning Specialization, Coursera.
- Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.