
Convolutional Neural Network [W - 3]


Convolutional Neural Networks (CNNs) emerged from the study of the brain’s visual cortex, and they have been used in image recognition since the 1980s. In the last few years, as computational power and the amount of available training data have grown, CNN models have achieved strong performance on some visual tasks. CNNs are not restricted to visual perception; they are also successful at many other tasks. Computer vision covers several problems, such as image classification and object detection. We can train our own convolutional neural networks for image classification, and there are also pre-trained models, such as the YOLO algorithm for object detection.

Convolutional Neural Network

Artificial neural networks were inspired by how the human brain works. Just as the neurons in our brain are connected to each other and pass signals along, the neurons in a neural network are connected and pass data to one another. Each neuron in the network takes on a different task.

A convolutional neural network also has neurons, organized in sets, with each set assigned a different task. For each feature in the image, some neurons are responsible for detecting it. Let’s say we have an image of a cat.

During training, some neurons learn to detect the eyes, some the nose, and so on. These are called the features of the image. These neurons pass their results forward to another set of neurons that looks at the bigger picture: one set may detect the head, another the body, and so on. Then a final set of neurons combines all the previous results and passes the values to the dense layer, which decides whether the image is a cat or not.

A convolutional neural network has several kinds of layers. We will discuss them one by one.

Convolutional Layer:

The first layer is the convolutional layer. It takes the training images as input, along with other parameters that we will discuss later. This layer detects low-level features in the images. The following hidden layers receive its output and work on higher-level features. The arrangement of the layers and the choice of their parameters determine the accuracy of the model. Finding them is a matter of trial and error: we have to check which arrangement gives the best accuracy and the lowest loss.

Take hand-written digits as an example: we have the digits 0-9 as images of handwriting, and the task is to classify which digit each image shows. For each image, the sets of neurons try to find features and create a feature map from them.

Here the first set of neurons detects the edges in the image of a 9. The next sets receive that result and detect a loopy circle pattern in it. With a loopy circle pattern, an almost straight line and a small curved line, we can identify the image as the number 9.

(In practice the values in the matrix would be pixel intensities such as RGB values; 1 and -1 are used here for simplicity.)

Here we have three filters for the image of 9. The first detects the loopy pattern, the second the vertical line, and the third the diagonal line. We take the original image and apply a convolution operation, also called a filter operation, with each filter. Together, these filters detect the patterns in the image.

Convolution Operation or Filter operation:

In the convolution operation we take a 3 x 3 or 4 x 4 grid of the image (depending on the image dataset) and multiply each number in it by the corresponding number in the filter. We then take the average of these products and store it in the feature map. (A standard convolutional layer sums the products instead of averaging them; this only changes the scale of the result.) Sliding the filter across the whole image and repeating this step produces the complete feature map. With the feature map for the loopy pattern filter, the model can tell where the loopy pattern is. For “6”, the high values would be at the bottom of the feature map, since the loop is at the bottom of 6. If the image has two loops, as in “8”, the high values will be at both the top and the bottom of the feature map. This is how the filters detect the features of the image. The feature maps for all the features are then stacked like layers and passed to the next set of neurons, which works on the bigger features.
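The sliding-window computation described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `convolve2d`, the 5 x 5 toy image and the vertical-line filter are assumptions made for this example. The products are averaged, as in the description above, whereas a real convolutional layer would sum them.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each pixel in the window by the matching filter value.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            # Average the products, following the simplified description.
            row.append(total / (k * k))
        feature_map.append(row)
    return feature_map


# Toy 5 x 5 image (1 = ink, -1 = background) with a vertical line in the middle.
image = [
    [-1, -1, 1, -1, -1],
    [-1, -1, 1, -1, -1],
    [-1, -1, 1, -1, -1],
    [-1, -1, 1, -1, -1],
    [-1, -1, 1, -1, -1],
]

# Hypothetical 3 x 3 filter that responds to a vertical line.
vertical_filter = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
]

feature_map = convolve2d(image, vertical_filter)
for row in feature_map:
    print(row)
```

The middle column of the resulting 3 x 3 feature map holds the highest value, because that is where the window lines up exactly with the vertical line in the image.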

After that, the feature maps are flattened into a 1D array, which becomes the input to the fully connected neural network. The convolution part is called the feature extraction part, and the second part, where the dense layers are used, is called the classification part.
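Flattening and a single dense layer can be sketched as follows. The helper names `flatten` and `dense`, as well as the feature maps, weights and biases, are hypothetical values chosen for illustration; in a trained network the weights are learned.

```python
def flatten(feature_maps):
    """Turn a stack of 2D feature maps into a single 1D list."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


# Two toy 2 x 2 feature maps stacked together.
feature_maps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]

flat = flatten(feature_maps)  # 8 values in a 1D list

# Two output neurons, each with one weight per flattened input (made-up values).
weights = [
    [1.0] * 8,
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
]
biases = [0.0, 0.5]

scores = dense(flat, weights, biases)
print(flat)
print(scores)
```

Each dense neuron sees every value of every feature map, which is why this part is called fully connected.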

Pooling Layers:

The feature map output by the convolutional layer has a problem: it is location-dependent. During training, the network learns to associate the presence of a feature with a specific location in the input image, which can severely hurt performance. Instead, we want the feature map and the network to tolerate features appearing at different locations.

The basic procedure of pooling is very similar to the convolution operation. You select a filter size and slide the filter over the output feature map of the preceding convolutional layer, but instead of multiplying by filter values you summarize the values under the filter. The most commonly used filter size is 2×2, slid over the input with a stride of 2. There are basically two types of pooling layers used in CNNs:

Max Pooling

Here we take the largest value under the filter at each position of the feature map and create a smaller, condensed feature map. This example uses a 2 by 2 filter with stride = 2. The stride is the number of pixels the filter moves at each step.

Average Pooling

Average pooling, as the name suggests, takes the average of the values under the filter at each position of the feature map and creates another feature map. This example uses a 2 by 2 filter.
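Both pooling types can be sketched with one small function. This is an illustrative sketch only: the function name `pool2d`, its parameters and the 4 x 4 feature map are assumptions made for this example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling with a square window."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda values: sum(values) / len(values)

    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values under the window at this position.
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


feature_map = [
    [5, 1, 3, 4],
    [8, 2, 9, 2],
    [1, 3, 0, 1],
    [2, 2, 2, 4],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[8, 9], [3, 4]]
print(avg_pooled)  # [[4.0, 4.5], [2.0, 1.75]]
```

With a 2 by 2 window and a stride of 2, the 4 x 4 feature map shrinks to 2 x 2 in both cases; only the way each window is summarized differs.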

Benefits of Pooling Layers:

Reduces dimensions and computation

Reduces overfitting

Makes the model tolerant towards small variations and distortions
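The tolerance to small shifts can be illustrated with a toy example. The sketch below assumes a 2 by 2 max pooling window with a stride of 2; the function name `max_pool` and both feature maps are made up for illustration. The strong activations in `shifted` have each moved by one pixel, but they stay inside the same pooling window.

```python
def max_pool(feature_map, size=2):
    """Max pooling with a square window and a stride equal to the window size."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 7],
    [0, 0, 0, 0],
]

shifted = [
    [0, 9, 0, 0],  # the 9 moved one pixel to the right
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 7],  # the 7 moved one pixel down
]

same_output = max_pool(original) == max_pool(shifted)
print(max_pool(original))  # [[9, 0], [0, 7]]
print(same_output)         # True
```

The pooled output is identical for both inputs, and it has a quarter of the original values. The tolerance only covers shifts that stay within one window; a larger shift would move the activation into a neighbouring cell of the pooled map.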

Data Augmentation:

Convolutional layers do not by themselves handle rotated or zoomed images. If we simply rotate the image of 9, the model may no longer recognize it. Data augmentation addresses this by modifying the existing images to create more training images. It can also be used to reduce class imbalance in the dataset. Several operations are available, such as rotation, contrast and brightness changes, zooming and others. Data augmentation can increase the accuracy of the model, but training takes more computation because of the extra data.

The first one is the original image; applying data augmentation produced the other two images from it. A translation layer is used in this augmentation process, which moves the image along the X or Y direction.
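What a translation does to the pixels can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `translate`, its parameters and the 3 x 3 toy image are hypothetical, and pixels shifted outside the frame are simply dropped while the vacated area is filled with 0.

```python
def translate(image, dx=0, dy=0, fill=0):
    """Shift `image` by dx pixels to the right and dy pixels down, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            new_r, new_c = r + dy, c + dx
            # Keep only the pixels that still fall inside the frame.
            if 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = image[r][c]
    return shifted


image = [
    [1, 2, 0],
    [3, 4, 0],
    [0, 0, 0],
]

moved_right = translate(image, dx=1)  # shift along the X direction
moved_down = translate(image, dy=1)   # shift along the Y direction
print(moved_right)  # [[0, 1, 2], [0, 3, 4], [0, 0, 0]]
print(moved_down)   # [[0, 0, 0], [1, 2, 0], [3, 4, 0]]
```

Each shifted copy shows the same content at a different position, so it can be added to the training set with the same label as the original image.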

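With Keras, augmentation can be implemented as follows.

from tensorflow import keras
from tensorflow.keras import layers

# Each assignment is a separate example that applies one random transformation.
# From TensorFlow 2.6 on, these layers are also available as layers.RandomZoom, etc.
data_augmentation = keras.Sequential([
    layers.experimental.preprocessing.RandomZoom(0.3),
])

data_augmentation = keras.Sequential([
    layers.experimental.preprocessing.RandomContrast(0.9),
])

data_augmentation = keras.Sequential([
    layers.experimental.preprocessing.RandomRotation(0.2),
])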

Convolutional Neural Network Implementation:

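from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Convolution2D

def createModel():
    model = Sequential()
    # Three 5x5 convolutions with stride 2 extract low-level features
    # and shrink the 66x200 RGB input.
    model.add(Convolution2D(24, (5, 5), (2, 2), input_shape=(66, 200, 3), activation='elu'))
    model.add(Convolution2D(36, (5, 5), (2, 2), activation='elu'))
    model.add(Convolution2D(48, (5, 5), (2, 2), activation='elu'))
    # Two 3x3 convolutions (default stride 1) work on higher-level features.
    model.add(Convolution2D(64, (3, 3), activation='elu'))
    model.add(Convolution2D(64, (3, 3), activation='elu'))

    return model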
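To see how each convolutional layer in `createModel()` shrinks the image, the output sizes can be computed by hand. The sketch below assumes no padding, which is the Keras default (`padding='valid'`), and a stride of 1 where none is given; the helper name `conv_output_shape` and the `layers_spec` list are introduced here only for illustration.

```python
def conv_output_shape(height, width, kernel, stride):
    """Output height and width of a convolution without padding."""
    out_h = (height - kernel[0]) // stride[0] + 1
    out_w = (width - kernel[1]) // stride[1] + 1
    return out_h, out_w


# (filters, kernel size, stride) for the five layers of createModel().
layers_spec = [
    (24, (5, 5), (2, 2)),
    (36, (5, 5), (2, 2)),
    (48, (5, 5), (2, 2)),
    (64, (3, 3), (1, 1)),
    (64, (3, 3), (1, 1)),
]

height, width = 66, 200  # from input_shape=(66, 200, 3)
shapes = []
for filters, kernel, stride in layers_spec:
    height, width = conv_output_shape(height, width, kernel, stride)
    shapes.append((height, width, filters))

for shape in shapes:
    print(shape)
```

Under these assumptions the 66 x 200 x 3 input is reduced step by step to 1 x 18 x 64, which is the stack of feature maps that would be flattened before any dense layers.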

Computer Vision Use Cases:

We use computer vision in the devices we rely on every day, often without even knowing it. CNN models have far more impact on our lives than we might imagine.

**Self-Driving Cars:** The eyes of a self-driving car are essentially computer vision. For the car to work, it constantly needs images and video of its surroundings. The views captured from different angles by the cameras are classified and analyzed with CNN models so that the car can drive safely. Separate models, trained on huge datasets, detect traffic lights, pedestrians, roadblocks and other objects. Even human verification tasks are used to produce some of the training data for these CNN models.

**Virtual Reality and Augmented Reality:** Computer vision plays a significant role in augmented and mixed reality. This technology creates virtual objects that are not physically present but that we can still see. AR is used in many areas these days. We even have virtual assistants that can help with problems. CNNs are used to detect the user’s actions and activity, the surrounding place and other things.

**Health Care with CNN:** We have huge image datasets for different diseases, and CNN models have produced impressive results with them. We can predict cancer from skin disease images, and coronavirus infection and pneumonia from chest X-ray images. Eye diseases are also detected and classified with CNN models, and researchers continue working in this field to make the models more efficient.


This paper on computer vision discussed the implementation of convolutional neural networks and described the layers of a CNN and how they work. There are also pre-trained models such as the YOLO algorithm, VGGNet, ResNet and others. These models are pre-trained on large datasets, so we do not have to train them from scratch.


