Convolutional Neural Networks (CNN)

Convolutional neural networks (CNNs) are a class of neural networks designed for computer vision and image processing. In this article, we will introduce the principles of the various layers in a CNN.

The complete code for this chapter can be found in .

Convolutions

Convolution in image processing is a mathematical operation used to modify or analyze images. It combines two functions (the image and a kernel) to produce a third function, which represents how the shape or characteristics of one function are modified by the other. Convolution is widely used in tasks such as edge detection, blurring, sharpening, noise reduction, and feature extraction in computer vision.

The mathematical definition is as follows. The symbol for the convolution operation is \ast.

\displaystyle C(x,y)=(I\ast K)(x,y)=\sum_{a=-k}^{k}\sum_{b=-k}^{k}K(a,b)\cdot I(x-a,y-b) \\\\ I(x,y):\text{The intensity of the pixel at position }(x,y)\text{ in the image.} \\\\ K(a,b):\text{The kernel value at position }(a, b). \\\\ k:\text{The radius of the kernel (e.g., for a }3\times3\text{ kernel, }k=1\text{; for a }5\times5\text{ kernel, }k=2\text{).}

We use the example in the figure below to explain the convolution operation. The figure shows a 5 x 5 input image, a 3 x 3 kernel, and a 3 x 3 output image. Imagine a sliding window the same size as the kernel moving over the input image. Starting from the upper left of the input image, the values in the window are element-wise multiplied by the kernel and summed up; this becomes the value at the upper left of the output image, shown in blue in the figure. Then, the window moves one position to the right, and the same multiply-and-sum is performed; this becomes the second value of the output image, shown in green. When the sliding window can no longer move to the right, it returns to the left edge and moves down one position. This repeats until the window can move neither right nor down.

Convolution Operation.
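
To make the operation concrete, here is a minimal sketch of the sliding-window computation in NumPy. Note that, like CNN frameworks, it does not flip the kernel, so strictly speaking it computes cross-correlation; true convolution also flips the kernel.

import numpy as np

def convolve2d(image, kernel):
    # Slide a kernel-sized window over the image; at each position,
    # element-wise multiply the window by the kernel and sum.
    h, w = image.shape
    f = kernel.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + f, x:x + f] * kernel)
    return out

image = np.random.rand(5, 5)
kernel = np.ones((3, 3)) / 9.0  # a simple box-blur kernel
print(convolve2d(image, kernel).shape)  # (3, 3)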

Kernel

Different kernels are used for different tasks. As shown in the figure below, the kernel is the Prewitt operator, which is used for edge detection.

Kernel.

The size of the output image can be calculated from the size of the input image and the size of the kernel.

\text{Input image}:w\times h \\\\ \text{Kernel}:f\times f \\\\ \text{Output image}:(w-f+1)\times(h-f+1)

Padding

Notice that every time the convolution operation is executed, the output image becomes smaller. To avoid this, we can first apply padding to the input image to make it larger, and then perform the convolution. When the padding is one, a one-pixel border of zeros is added around the input image, as follows.

Padding.
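
As a quick sketch, zero padding can be done with NumPy's np.pad (the 3 x 3 image here is just an illustration):

import numpy as np

image = np.arange(1, 10).reshape(3, 3)
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (5, 5): a border of zeros around the 3 x 3 image
print(padded)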

So when we use padding, the size of the output image can be calculated as follows.

\text{Input image}:w\times h \\\\ \text{Kernel}:f\times f \\\\ \text{Padding}:p \\\\ \text{Output image}:(w+2p-f+1)\times(h+2p-f+1)

Stride

The number of cells the sliding window moves each time is called the stride. The following shows how the sliding window moves when the stride equals two.

Stride.

So when we use stride, the size of the output image can be calculated as follows.

\text{Input image}:w\times h \\\\ \text{Kernel}:f\times f \\\\ \text{Padding}:p \\\\ \text{Stride}:s \\\\ \text{Output image}:\lfloor\frac{w+2p-f}{s}+1\rfloor \times \lfloor\frac{h+2p-f}{s}+1\rfloor
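
This formula is easy to verify with a small helper function (a sketch; the example values are arbitrary):

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1 along one dimension
    return (n + 2 * p - f) // s + 1

print(conv_output_size(5, 3))            # 3: no padding, stride 1
print(conv_output_size(5, 3, p=1))       # 5: padding preserves the size
print(conv_output_size(5, 3, p=1, s=2))  # 3: stride 2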

Convolutional Neural Networks

Now that we understand what convolutions are, we can look at how convolutional neural networks (CNNs) apply them. A CNN has many layers, just like a regular neural network, as shown in the figure below. There are three main types of layers in a CNN, and we will introduce them one by one.

A convolutional neural network.

Convolution Layers

A convolution layer contains several kernels; from here on, we will call them filters. The input may be 3D data. After a convolution operation with one filter, the output is 2D. However, a convolution layer can contain several filters, so convolving with each of them produces several 2D outputs. We then stack these 2D outputs together, so the convolution layer finally outputs 3D data, as shown below.

Conv Layer.

In a convolution layer, the training parameters are the filters, corresponding to W and b in neural networks. In addition, we also need to set the layer's hyperparameters, which were introduced previously. A convolution layer therefore contains both training parameters and hyperparameters, so we must be careful about the dimensionality of each part.

f^{[\ell]}:\text{filter size} \\\\ p^{[\ell]}:\text{padding} \\\\ s^{[\ell]}:\text{stride} \\\\ n_c^{[\ell]}:\text{number of filters} \\\\ \text{Each filter}:f^{[\ell]}\times f^{[\ell]}\times n_c^{[\ell-1]} \\\\ \text{Activations }a^{[\ell]}:n_h^{[\ell]}\times n_w^{[\ell]}\times n_c^{[\ell]} \\\\ \text{Weights }W^{[\ell]}:f^{[\ell]}\times f^{[\ell]}\times n_c^{[\ell-1]}\times n_c^{[\ell]} \\\\ \text{Bias }b^{[\ell]}:1\times1\times1\times n_c^{[\ell]} \\\\ \text{Input }a^{[\ell-1]}:n_h^{[\ell-1]}\times n_w^{[\ell-1]}\times n_c^{[\ell-1]} \\\\ \text{Output }a^{[\ell]}:n_h^{[\ell]}\times n_w^{[\ell]}\times n_c^{[\ell]} \\\\ n_h^{[\ell]}=\lfloor\frac{n_h^{[\ell-1]}+2p^{[\ell]}-f^{[\ell]}}{s^{[\ell]}}+1\rfloor \\\\ n_w^{[\ell]}=\lfloor\frac{n_w^{[\ell-1]}+2p^{[\ell]}-f^{[\ell]}}{s^{[\ell]}}+1\rfloor
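
These dimensions can be checked directly in PyTorch. The sketch below assumes a layer with a 6-channel input, 10 filters of size 3, stride 1, and padding 1 (arbitrary example values):

import torch
from torch import nn

conv = nn.Conv2d(in_channels=6, out_channels=10, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 6, 32, 32)  # a^[l-1]; PyTorch uses (batch, channels, height, width)
print(conv(x).shape)      # torch.Size([1, 10, 32, 32]) -> a^[l]
print(conv.weight.shape)  # torch.Size([10, 6, 3, 3])   -> W^[l]
print(conv.bias.shape)    # torch.Size([10])            -> b^[l]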

Pooling Layers

Compared with convolution layers, pooling layers are relatively simple. In a pooling layer, we need to set two hyperparameters and decide which type of pooling to use. In the figure below, the filter size is 2 and the stride is 2. Again, imagine a sliding window whose size equals the filter size. This is a max pooling layer, so the maximum value in the sliding window is output. Then the window moves to the right by the stride, that is, two positions, and again outputs the maximum value in the window.

Max pooling layer.

Therefore, a pooling layer only has hyperparameters and no training parameters. In other words, there is nothing to be trained in pooling layers.
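
We can confirm this in PyTorch (a minimal sketch with an arbitrary input size):

import torch
from torch import nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 6, 28, 28)
print(pool(x).shape)  # torch.Size([1, 6, 14, 14]): height and width are halved
print(sum(p.numel() for p in pool.parameters()))  # 0: no training parameters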

Fully Connected Layers (FC layers)

Fully connected layers work like a traditional neural network, as shown below. Therefore, fully connected layers have training parameters W and b.

Fully connected layer.

The following formula may be easier to understand.

a^{[\ell-1]}:n_h^{[\ell-1]}\times1 \\\\ W^{[\ell]}:n_h^{[\ell]}\times n_h^{[\ell-1]} \\\\ b^{[\ell]}:n_h^{[\ell]}\times1 \\\\ a^{[\ell]}:n_h^{[\ell]}\times1 \\\\ a^{[\ell]}=\sigma(W^{[\ell]}\cdot a^{[\ell-1]}+b^{[\ell]})

Below is an example of a fully connected layer.

An example of a fully connected layer.

In a CNN, before entering the fully connected layers, the multi-dimensional data is converted into n x 1 data; this is called flattening. As a result, the parameters W and b in the fully connected layers are very large. In a CNN, most of the training parameters are in the fully connected layers, while only a small part is in the filters of the convolution layers.
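
To ground this claim, here is a quick count using the layer sizes of LeNet-5, which we introduce next (a sketch; the counts follow from the formulas above):

from torch import nn

conv1 = nn.Conv2d(1, 6, kernel_size=5)   # 6*(5*5*1) + 6   = 156
conv2 = nn.Conv2d(6, 16, kernel_size=5)  # 16*(5*5*6) + 16 = 2,416
fc1 = nn.Linear(400, 120)                # 400*120 + 120   = 48,120
fc2 = nn.Linear(120, 84)                 # 120*84 + 84     = 10,164
fc3 = nn.Linear(84, 10)                  # 84*10 + 10      = 850

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv1) + count(conv2))           # 2,572 parameters in the conv layers
print(count(fc1) + count(fc2) + count(fc3))  # 59,134 parameters in the FC layers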

LeNet-5

LeNet-5 is a CNN architecture proposed by Yann LeCun in 1998. It was trained on the MNIST dataset and achieved good results. Its architecture and the hyperparameters of each layer are as follows.

  • The input is a 32 x 32 x 1 grayscale image.
  • Conv layer: filter size is 5, and there are 6 filters, stride is 1, and padding is 0.
  • Pooling layer: filter size is 2, stride is 2.
  • Conv layer: filter size is 5, and there are 16 filters, stride is 1, and padding is 0.
  • Pooling layer: filter size is 2, stride is 2.
  • FC layer: input size is 400 (16 x 5 x 5), output size is 120.
  • FC layer: input size is 120, output size is 84.
  • FC layer: input size is 84, output size is 10.
LeNet-5 (source: Gradient-Based Learning Applied to Document Recognition).

Next, we will use PyTorch to implement LeNet-5. First, based on the above hyperparameters, we build each layer. For the pooling layers, we choose max pooling. After each conv layer, we normalize the output with batch normalization, and we choose ReLU as the activation function. Before entering the FC layers, we flatten the output of the previous layer. The PyTorch implementation is quite streamlined.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(num_features=6),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0),
            nn.BatchNorm2d(num_features=16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=16 * 5 * 5, out_features=120),
            nn.ReLU(),
            nn.Linear(in_features=120, out_features=84),
            nn.ReLU(),
            nn.Linear(in_features=84, out_features=10),
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.fc(x)
        return x

PyTorch allows us to print the model’s architecture.

model = LeNet5()
model

# Output
LeNet5(
  (conv1): Sequential(
    (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv2): Sequential(
    (0): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=400, out_features=120, bias=True)
    (2): ReLU()
    (3): Linear(in_features=120, out_features=84, bias=True)
    (4): ReLU()
    (5): Linear(in_features=84, out_features=10, bias=True)
  )
)

Next, we will use the MNIST dataset to train our LeNet-5. PyTorch provides built-in functions to load MNIST. Since LeNet-5 expects a 32 x 32 grayscale input and MNIST images are 28 x 28, we create a transform that resizes the loaded images to 32 x 32. The MNIST images are already grayscale. Then, we normalize the loaded images. You could use mean=0.5 and std=0.5 here. However, mean=0.1307 and std=0.3081 are the statistics computed over the MNIST dataset itself, so using these values to normalize the images gives better results.

transform = transforms.Compose(
    [
        transforms.Resize((32, 32)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
    ]
)
train_data = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
print(f'Train data: {len(train_data)}')
print(f'Test data: {len(test_data)}')
print(f'Image shape: {train_data[0][0].shape}')
print(f'Classes: {train_data.classes}')

train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_data, batch_size=64, shuffle=False)

# Output
Train data: 60000
Test data: 10000
Image shape: torch.Size([1, 32, 32])
Classes: ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four', '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']

After preparing the dataset, we train our model. We use cross-entropy loss as the cost function and set the learning rate to 0.001. When loading the dataset, we split it into batches of 64 images. During training, we pass a batch through the model's forward propagation and then execute back propagation. After all batches have been processed, one epoch is complete; we train over the entire dataset for 10 epochs.

cost = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

model.train()
epochs = 10
for epoch in range(epochs):
    running_loss = 0.0

    for batch, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = cost(outputs, targets)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f"Epoch [{epoch + 1}/{epochs}], Loss: {running_loss / len(train_loader):.4f}")

# Output
Epoch [1/10], Loss: 0.1646
Epoch [2/10], Loss: 0.0545
Epoch [3/10], Loss: 0.0421
Epoch [4/10], Loss: 0.0341
Epoch [5/10], Loss: 0.0287
Epoch [6/10], Loss: 0.0251
Epoch [7/10], Loss: 0.0230
Epoch [8/10], Loss: 0.0186
Epoch [9/10], Loss: 0.0170
Epoch [10/10], Loss: 0.0157

Now the model has been trained. We use the test dataset to evaluate the model's accuracy.

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}%")

# Output
Test Accuracy: 98.97%

PyTorch allows us to export models into files.

torch.save(model, 'lenet_mnist.pt')
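
As an aside, a common alternative (not used in this chapter) is to save only the weights via the state_dict and rebuild the architecture at load time; the file name below is just an example.

torch.save(model.state_dict(), 'lenet_mnist_weights.pt')

_model = LeNet5()
_model.load_state_dict(torch.load('lenet_mnist_weights.pt'))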

Finally, we can load the model file and use it to predict other images.

_model = torch.load('lenet_mnist.pt', weights_only=False)
_model.eval()  # switch to evaluation mode
correct = 0    # reset the counters from the previous evaluation
total = 0

with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = _model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}%")

# Output
Test Accuracy: 98.97%

Conclusion

This article briefly introduced CNNs and their various layers, and we implemented a simple LeNet-5. You should now have a conceptual understanding of CNNs. CNNs have been used in fields such as image recognition and video analysis, and have achieved quite good results compared with traditional methods.
