Computer vision is one of the hottest topics in AI these days, with applications ranging from self-driving cars and keyless, cashless hotels to cardiac arrest detection, traffic monitoring, and more.

The area of computer science that deals with image classification, object detection, and video processing is called Computer Vision. For image classification and detection we use a Convolutional Neural Network [CNN]. We could also use an ANN [Artificial Neural Network] for image detection, but that has some drawbacks.

Disadvantages of using ANN for image classification -

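- Too much computation: every pixel is connected to every neuron in the first hidden layer, so the number of weights grows quickly with image size

- Treats nearby pixels the same as pixels that are far apart, ignoring the spatial structure of the image

- Sensitive to the location of an object in the image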
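To make the computation point concrete, here is a rough parameter count. It is only a sketch: the 224x224 RGB input, the 1,000-neuron dense layer, and the 32 filters of size 3x3 are illustrative assumptions, not values from this article.

```python
def dense_params(num_inputs, num_neurons):
    # One weight per (input, neuron) pair, plus one bias per neuron.
    return num_inputs * num_neurons + num_neurons


def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * num_filters


# Hypothetical sizes, chosen only for illustration.
height, width, channels = 224, 224, 3

dense_count = dense_params(height * width * channels, 1000)
conv_count = conv_params(3, 3, channels, 32)

print(dense_count)  # 150529000
print(conv_count)   # 896
```

The convolutional layer reuses the same small filters at every position of the image, which is why its parameter count does not depend on the image size.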

Different tasks of Computer vision -

- Object detection: Identifying objects of interest (cats, dogs, cars) in digital images or even videos

- Optical character recognition: Translating images of handwritten or typed text into a machine-encoded format

- Fingerprint recognition: Using the pattern information of a human fingerprint to compare a source fingerprint against a target fingerprint

Image classification vs Object Detection

In image classification, we assign the entire image to one of the classes.

In image classification with localization, we not only classify the image but also locate the object within it.

Then we have object detection. Say an image contains both a dog and a cat: here we try to detect the cat and the dog separately, each with its own location.

In image segmentation, we decide for each pixel whether it belongs to a certain object.

Convolutional Neural Network

A ConvNet usually has 3 types of layers:

1. Convolutional Layer (CONV)

2. Pooling Layer (POOL)

3. Dense Layer (DENSE)

1: Convolutional Layer:

The convolutional layer is the first layer in a CNN. Its input is a matrix with the dimensions of the image.

Next, we have kernels (filters). A kernel is a small matrix of weights. Each convolutional layer has multiple kernels stacked on top of each other; the stack forms a 3-dimensional matrix whose third dimension is the number of kernels.

Each kernel has its own bias, which is a scalar quantity.

Finally, the layer produces an output matrix, with one output channel per kernel.

For each position of the kernel on the image, each number in the kernel is multiplied by the corresponding number in the input matrix, and the products are summed to give the value at the corresponding position in the output matrix.

The same thing happens for each of the channels. The per-channel results are added together, and the bias of the respective filter is added to that sum; this forms the value at the corresponding position of the output matrix.
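The multiply-and-sum procedure described above can be sketched in plain Python. This is a minimal illustration rather than how a framework implements it: it assumes no padding and a stride of 1, and the function name `conv2d` and the toy values are made up for the example.

```python
def conv2d(image, kernels, biases):
    """Slide each kernel over the image (no padding, stride 1).

    image:   nested list shaped [height][width][channels]
    kernels: list of kernels, each shaped [k][k][channels]
    biases:  one scalar per kernel
    returns: nested list shaped [out_h][out_w][num_kernels]
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0])
    out_h, out_w = height - k + 1, width - k + 1
    output = [[[0.0] * len(kernels) for _ in range(out_w)] for _ in range(out_h)]
    for f, (kernel, bias) in enumerate(zip(kernels, biases)):
        for row in range(out_h):
            for col in range(out_w):
                total = 0.0
                # Multiply element-wise and sum over the window and all channels.
                for i in range(k):
                    for j in range(k):
                        for c in range(channels):
                            total += image[row + i][col + j][c] * kernel[i][j][c]
                # Add the bias of this filter to the summed value.
                output[row][col][f] = total + bias
    return output


# Toy 3x3 image with 2 channels: channel 0 holds 1..9, channel 1 is all ones.
image = [[[v, 1] for v in row] for row in [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
# A single 2x2 kernel of ones (one weight per channel) with bias 1.
kernels = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]]]
biases = [1]

feature_map = conv2d(image, kernels, biases)
print(feature_map)  # [[[17.0], [21.0]], [[29.0], [33.0]]]
```

For the top-left position, channel 0 contributes 1 + 2 + 4 + 5 = 12, channel 1 contributes 4, and the bias adds 1, giving 17.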

2: Pooling Layer:

There are two types of pooling:

1) Max Pooling

2) Average Pooling

The main purpose of a pooling layer is to reduce the size of the input tensor, and with it the number of parameters in the layers that follow. This:

- Helps reduce overfitting

- Extracts representative features from the input tensor

- Reduces computation and thus aids efficiency

In the case of Max Pooling, a kernel of size `n*n` (for example 2x2) is moved across the matrix, and for each position the **max value is taken** and put in the corresponding position of the output matrix.

In the case of Average Pooling, a kernel of size `n*n` is moved across the matrix, and for each position the average of all the values is taken and put in the corresponding position of the output matrix.
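Both kinds of pooling can be sketched with one small helper. The sketch assumes the window moves by its own size at each step (non-overlapping windows), which is a common choice but not something this article specifies; `pool2d` and the sample values are illustrative.

```python
def pool2d(matrix, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    pooled = []
    for row in range(0, len(matrix) - size + 1, size):
        pooled_row = []
        for col in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[row + i][col + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def average(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, average)

print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
```

A 4x4 input becomes a 2x2 output in both cases, so the layers that follow have a quarter as many values to process.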

3: Dense Layer or Fully Connected Layer:

A fully connected layer is simply a feed-forward layer in which every input is connected to every neuron. Fully connected layers form the last few layers in the network.

The input to the fully connected layer is the output from the final Pooling or Convolutional Layer, which is flattened and then fed into the fully connected layer.
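The flatten-then-dense step can be sketched as follows. The weights, biases, and the tiny 2x2x2 pooled output are made-up values for illustration, and no activation is applied here.

```python
def flatten(tensor):
    """Recursively flatten nested lists into one flat list."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


def dense(inputs, weights, biases):
    """weights[n] holds one weight per input for neuron n."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
            for neuron_weights, bias in zip(weights, biases)]


# Hypothetical output of the last pooling layer, shaped 2x2x2.
pooled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

flat = flatten(pooled)
# Two neurons, each with 8 weights (one per flattened input).
weights = [[1] * 8, [0.5] * 8]
biases = [0, 1]
outputs = dense(flat, weights, biases)

print(flat)     # [1, 2, 3, 4, 5, 6, 7, 8]
print(outputs)  # [36, 19.0]
```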

Example of a classification Model [cats vs dogs]:

Import Libraries

1. NumPy - For working with arrays and linear algebra.

2. Pandas - For reading and writing data.

3. Matplotlib - To display images.

4. TensorFlow Keras models - We need a model to predict, right!

5. TensorFlow Keras layers - Every neural network needs layers, and a CNN needs, well, a couple of them.

A CNN processes images with the help of matrices of weights known as filters. In the early layers, the filters detect low-level features such as vertical and horizontal edges. With each successive layer, the filters recognize progressively higher-level features.
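As a small illustration of edge detection, the sketch below applies two hand-written 3x3 filters to a toy image whose left half is dark and right half is bright. In a trained CNN the filter values are learned from data; the filters, the image, and the helper name `apply_filter` here are assumptions made for the example.

```python
def apply_filter(image, kernel):
    """Slide a square kernel over a single-channel image (no padding, stride 1)."""
    k = len(kernel)
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]


# Toy 4x6 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written filters that respond to left-to-right and top-to-bottom changes.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

vertical_response = apply_filter(image, vertical_edge)
horizontal_response = apply_filter(image, horizontal_edge)

print(vertical_response)    # [[0, 3, 3, 0], [0, 3, 3, 0]]
print(horizontal_response)  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The vertical-edge filter responds only where the window straddles the dark-to-bright boundary, while the horizontal-edge filter stays at zero because nothing changes from top to bottom.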

To compile the CNN, we use the Adam optimizer.

**Adaptive Moment Estimation (Adam)** is a method that computes an individual, adaptive learning rate for each parameter. For the loss function, we use binary cross-entropy, which compares each predicted probability to the actual class label (0 or 1) and penalizes the prediction according to how far it is from the expected value.
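To show what these two pieces compute, here is a minimal sketch of binary cross-entropy and of a single Adam update for one parameter. The hyperparameter values (`lr`, `beta1`, `beta2`, `eps`) are commonly used defaults and are assumptions here; a framework would do this for every weight in the network.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the step number (from 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Labels: 1 = dog, 0 = cat (an arbitrary encoding for this example).
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.6, 0.4])
wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident, 4), round(unsure, 4), round(wrong, 4))

new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(round(new_param, 6))
```

The loss grows as the predicted probabilities move away from the true labels, and the first Adam step moves the parameter by roughly the learning rate in the direction opposite to the gradient.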

- **Image Augmentation [optional]:**

Image augmentation is a method of applying different kinds of transformations to the original images, resulting in multiple transformed copies of the same image. The copies differ from each other in certain respects because of techniques such as shifting, rotating, and flipping.
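The transformations mentioned above can be sketched on a tiny image stored as a nested list. The helper names and the 2x3 sample image are illustrative assumptions; a real pipeline would normally use a library for this.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise (rows become columns)."""
    return [list(row) for row in zip(*image[::-1])]


def shift_right(image, pixels, fill=0):
    """Shift every row to the right, padding the left side with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]

# One original image yields several training samples.
augmented = [
    image,
    flip_horizontal(image),      # [[3, 2, 1], [6, 5, 4]]
    flip_vertical(image),        # [[4, 5, 6], [1, 2, 3]]
    rotate_90_clockwise(image),  # [[4, 1], [5, 2], [6, 3]]
    shift_right(image, 1),       # [[0, 1, 2], [0, 4, 5]]
]
print(len(augmented))  # 5
```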

- **Activation Function:**

The activation function is added to help the network learn complex patterns in the data. Its main purpose is to introduce non-linearity into the neural network.
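A quick way to see why non-linearity matters: without an activation, two stacked layers collapse into a single linear function. The sketch below uses ReLU and sigmoid as examples of activation functions (neither is named in this article), and the weights of the tiny network are arbitrary.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def two_layer(x, activation=None):
    """Tiny 1-input network with arbitrary weights:
    hidden = 2x - 1, output = 3 * hidden + 0.5."""
    hidden = 2.0 * x - 1.0
    if activation is not None:
        hidden = activation(hidden)
    return 3.0 * hidden + 0.5


linear_outputs = [two_layer(x) for x in (0.0, 1.0, 2.0)]
relu_outputs = [two_layer(x, relu) for x in (0.0, 1.0, 2.0)]

print(linear_outputs)  # [-2.5, 3.5, 9.5]
print(relu_outputs)    # [0.5, 3.5, 9.5]
print(sigmoid(0.0))    # 0.5
```

Without an activation the outputs rise by the same amount at every step, exactly like the single line 6x - 2.5. With ReLU the first step is smaller than the second, so the network is no longer a straight line.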
