How to identify dog breeds using CNN?

Feb 16, 2021 · 6 min read

Technology allows us to automate typical human activities, from turning off the bedroom lights to recognizing images the way a person would. What about teaching a computer to tell dogs from humans? Wouldn’t it be awesome to teach a computer how to identify a poodle?

Photo illustration by Slate. Photo by Holly Allen.

Convolutional Neural Networks (CNNs) are Artificial Intelligence algorithms based on multi-layer neural networks that learn relevant features from images and can perform tasks such as object classification, detection, and segmentation.¹

In this project, we used a Convolutional Neural Network to identify the dog breed shown in an input image.

Dataset information

Dog Dataset

The dog dataset is composed of:

133 total dog categories and 8351 total dog images.
6680 training dog images and 836 test dog images.
835 validation dog images.
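As a quick sanity check, the three splits should add up to the stated total. The short sketch below uses only the counts quoted above; the variable names are ours, chosen for illustration.

```python
# Image counts quoted in the list above (variable names are illustrative).
n_train, n_valid, n_test = 6680, 835, 836
n_total = n_train + n_valid + n_test

# Share of the whole dataset that falls into each split, in percent.
shares = {
    "train": round(100 * n_train / n_total, 1),
    "valid": round(100 * n_valid / n_total, 1),
    "test": round(100 * n_test / n_total, 1),
}
print(n_total, shares)
```

In other words, the data follows a roughly 80/10/10 train/validation/test split.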

To load the files, we wrote a load_dataset function that uses the load_files function from the scikit-learn library to return the image paths and their one-hot encoded labels.

from sklearn.datasets import load_files
from keras.utils import np_utils
import numpy as np
def load_dataset(path):
    data = load_files(path)
    dog_files = np.array(data['filenames'])
    dog_targets = np_utils.to_categorical(np.array(data['target']), 133)
    return dog_files, dog_targets
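For readers unfamiliar with to_categorical, it turns each integer class label into a one-hot vector, here of length 133. The following is a minimal pure-Python sketch of that idea, not the Keras implementation; one_hot is a hypothetical helper name and the labels are made up.

```python
def one_hot(labels, num_classes):
    """Return one one-hot row per integer label (illustrative helper)."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows

# Three made-up breed indices, encoded for the 133 breeds in the dataset.
encoded = one_hot([0, 56, 132], 133)
print(len(encoded), len(encoded[0]), encoded[1].index(1.0))
```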

Human Dataset

The human dataset is composed of 13233 total human images. In the code cell below, we import a dataset of human images, where the file paths are stored in the numpy array human_files.

import random
from glob import glob
import numpy as np

# load filenames in shuffled human dataset
human_files = np.array(glob("../../../data/lfw/*/*"))
random.shuffle(human_files)

Dog detector

The dog prediction is based on two functions: ResNet50_predict_labels and dog_detector.

Both functions use ResNet-50, a convolutional neural network that is 50 layers deep. The model is loaded with weights that have been pre-trained on ImageNet.

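from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions

# download ResNet-50 with weights pre-trained on ImageNet
ResNet50_model = ResNet50(weights='imagenet')

def ResNet50_predict_labels(img_path):
    # returns the index of the most likely ImageNet class for the image located at img_path
    # path_to_tensor is a project helper (not shown) that loads the image as a tensor
    img = preprocess_input(path_to_tensor(img_path))
    return np.argmax(ResNet50_model.predict(img))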

In the ImageNet class index, the categories corresponding to dogs appear in an uninterrupted sequence, at dictionary keys 151–268. To check whether the pre-trained ResNet-50 model predicts that an image contains a dog, we therefore only need to check whether the ResNet50_predict_labels function above returns a value between 151 and 268 (inclusive). We use this idea in the dog_detector function below, which returns True if a dog is detected in an image and False otherwise.

def dog_detector(img_path):
    # ImageNet class indices 151 to 268 (inclusive) are dog breeds
    prediction = ResNet50_predict_labels(img_path)
    return 151 <= prediction <= 268
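The range check can be tried in isolation, without loading the model. The sketch below applies the same inclusive bounds to a few class indices; is_dog_index is a hypothetical helper, not part of the original notebook.

```python
# Inclusive range of ImageNet class indices that correspond to dog breeds,
# as stated in the text above.
FIRST_DOG_INDEX, LAST_DOG_INDEX = 151, 268

def is_dog_index(class_index):
    """True when an ImageNet class index falls in the dog range."""
    return FIRST_DOG_INDEX <= class_index <= LAST_DOG_INDEX

# Both boundaries count as dogs; their immediate neighbours do not.
for idx in (150, 151, 268, 269):
    print(idx, is_dog_index(idx))

# Number of ImageNet classes covered by the range.
n_dog_classes = LAST_DOG_INDEX - FIRST_DOG_INDEX + 1
```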

When we ran dog_detector on the sample of human images in human_files_short, it detected a dog in 0% of them, which is a good sign that the detector does not mistake humans for dogs.
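A percentage like this can be computed by running a detector over a list of image paths and counting the hits. Below is a minimal sketch; detection_rate and the stand-in detector are assumptions for illustration, and in the project the detector argument would be dog_detector or face_detector.

```python
def detection_rate(detector, img_paths):
    """Percentage of paths for which detector(path) is truthy."""
    if not img_paths:
        raise ValueError("img_paths must not be empty")
    hits = sum(1 for path in img_paths if detector(path))
    return 100.0 * hits / len(img_paths)

# Stand-in detector so the sketch runs without any model or image files:
# it "detects" a dog whenever the file name starts with "dog".
def fake_dog_detector(path):
    return path.startswith("dog")

sample_paths = ["dog_01.jpg", "human_01.jpg", "human_02.jpg", "dog_02.jpg"]
rate = detection_rate(fake_dog_detector, sample_paths)
print(f"{rate:.0f}% of the sample images triggered the detector")
```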

Face detector

We use OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images. Before using any of the face detectors, it is standard procedure to convert the images to grayscale. The detectMultiScale function executes the classifier stored in face_cascade and takes the grayscale image as a parameter.

The face_detector function takes a string-valued file path to an image as input and returns True if a human face is detected in an image and False otherwise.

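import cv2

def face_detector(img_path):
    # face_cascade is assumed to be a cv2.CascadeClassifier loaded beforehand
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to grayscale before detection
    faces = face_cascade.detectMultiScale(gray)   # one bounding box per detected face
    return len(faces) > 0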

When the faces found by detectMultiScale are outlined on the input image, the result looks like this:

While testing the human face detector, 100% of the human images were correctly detected as containing a human face. Unfortunately, 11% of the dog images were also detected as containing a human face.

From these results, we concluded that this algorithmic choice requires us to tell users that we accept human images only when they provide a clear view of a face. An alternative would be to train a CNN on a wider variety of images, so that the app could handle more types of input.

Breed prediction

The breed prediction was made using 3 types of models:

1- CNN

2- VGG16

3- Inception V3

1. CNN model

Initially, we trained a simple CNN model with the following architecture:

from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

model = Sequential()

# 1) Define architecture
model.add(Conv2D(input_shape=(224, 224, 3), filters=16, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(GlobalAveragePooling2D())
model.add(Dense(133, activation='softmax'))

model.summary()

The first layer typically learns low-level features such as edges or lines. It has 16 filters and a kernel size of 2. Its neurons use the ReLU activation function, which adds the non-linearity the network needs to learn complex patterns. The images are all 224 x 224 with 3 channels, so the input_shape is (224, 224, 3).

After a pooling layer, the second convolutional layer, with 32 filters, identifies more complex features such as shapes.
After another pooling layer, the third convolutional layer, with 64 filters, identifies high-level features.
The MaxPooling layer after each convolutional layer reduces the height and width of the representation by roughly 50%.
The GlobalAveragePooling layer reduces the height and width to one, and its output is fed to the last dense layer.
The Dense layer with 133 nodes and a softmax function classifies the image into one of the 133 dog breeds.
The metric used to evaluate the model is accuracy, which measures how often the model's predictions match the true labels.
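To make the layer-by-layer description concrete, the output shapes and parameter counts can be traced by hand. The sketch below assumes the Keras defaults for the layers shown (stride 1 and 'valid' padding for Conv2D, stride equal to pool_size for MaxPooling2D); the helper names are illustrative.

```python
def conv_out(size, kernel):
    # 'valid' convolution with stride 1 trims kernel - 1 pixels.
    return size - kernel + 1

def pool_out(size, pool):
    # Non-overlapping max pooling; a leftover row or column is dropped.
    return size // pool

def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block plus one bias per filter.
    return (kernel * kernel * in_channels + 1) * filters

size, channels, total_params = 224, 3, 0
for filters in (16, 32, 64):
    total_params += conv_params(2, channels, filters)
    size = conv_out(size, 2)
    print(f"conv  -> {size} x {size} x {filters}")
    size = pool_out(size, 2)
    print(f"pool  -> {size} x {size} x {filters}")
    channels = filters

# Global average pooling keeps one value per channel (64 here);
# the dense layer then needs 64 weights plus a bias for each of 133 breeds.
total_params += (channels + 1) * 133
print("trainable parameters:", total_params)
```

Under these assumptions the spatial size goes 224, 223, 111, 110, 55, 54, 27, so each pooling step roughly halves the height and width, and the whole network has fewer than 20,000 trainable parameters.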

We trained it for 10 epochs and obtained an accuracy of 3.3493%. Training for more epochs would likely have improved this result.
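To put that number in context, it helps to compare it with uniform random guessing over 133 breeds. The sketch below assumes the accuracy was measured on the 836 test images; accuracy_percent is an illustrative helper and the toy labels are made up.

```python
def accuracy_percent(predicted, actual):
    """Percentage of positions where the predicted label equals the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

# Toy labels: three of the four predictions are right.
toy_accuracy = accuracy_percent([3, 7, 7, 1], [3, 7, 2, 1])

# Random guessing among 133 equally likely breeds.
chance_level = 100.0 / 133

# 28 correct answers out of 836 images reproduces the reported figure.
reported = round(100.0 * 28 / 836, 4)
print(toy_accuracy, round(chance_level, 2), reported)
```

Seen this way, the small CNN is weak in absolute terms but still several times better than guessing a breed at random.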

2. VGG16 model

To reduce training time without sacrificing accuracy, we used a pretrained VGG16 model, globally pooled the output and passed it through the last dense layer to predict the class.

The model uses the pre-trained VGG-16 model as a fixed feature extractor, where the last convolutional output of VGG-16 is fed as input to our model. We only add a global average pooling layer and a fully connected layer, where the latter contains one node for each dog category and is equipped with a softmax.

VGG16_model = Sequential()
VGG16_model.add(GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]))
VGG16_model.add(Dense(133, activation='softmax'))
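What this small head computes can be sketched without Keras. The example below is an illustration in plain Python with made-up numbers and tiny dimensions, not the project's code: it averages each channel of a feature map over height and width, then applies a dense softmax layer.

```python
import math

def global_average_pool(feature_map):
    """Average an H x W x C nested list over H and W, giving C values."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]

def dense_softmax(inputs, weights, biases):
    """weights[j] holds the input weights of output node j."""
    logits = [sum(w * x for w, x in zip(node_w, inputs)) + b
              for node_w, b in zip(weights, biases)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2 x 2 feature map with 2 channels, and 3 output classes.
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [7.0, 6.0]]]
pooled = global_average_pool(features)
probs = dense_softmax(pooled, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
print(pooled, probs)
```

In the real model the same two steps run on much larger inputs: VGG-16's last convolutional block outputs 512 channels, so the dense layer has 512 × 133 + 133 = 68,229 trainable parameters.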

We trained on 6680 samples and validated on 835 samples. After 20 epochs, the model reached 39.1148% accuracy on the test set.

3. Inception V3 model

In the third approach, I used the Inception V3 model coupled with two fully connected layers. I pooled the output of Inception V3 and added a dropout layer between the dense layers to reduce overfitting.

inceptionv3_model = Sequential()
inceptionv3_model.add(GlobalAveragePooling2D(input_shape=train_inceptionv3.shape[1:]))
inceptionv3_model.add(Dense(256, activation='relu'))
inceptionv3_model.add(Dropout(0.2))
inceptionv3_model.add(Dense(133, activation='softmax'))
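Dropout(0.2) randomly zeroes about 20% of the 256 activations on each training step and is inactive at prediction time. A rough pure-Python sketch of the common "inverted dropout" formulation follows; it is meant to illustrate the idea rather than reproduce the Keras internals, and the function name and seed are ours.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    if not training:
        return list(activations)  # no-op at prediction time
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0] * 256    # stand-in for the 256 dense-layer outputs

train_out = dropout(hidden, 0.2, training=True, rng=rng)
test_out = dropout(hidden, 0.2, training=False, rng=rng)

dropped = train_out.count(0.0)
print(f"dropped {dropped} of {len(hidden)} units during training")
```

Rescaling the surviving activations by 1 / (1 - rate) keeps their expected sum unchanged, which is why nothing needs to be adjusted at prediction time.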

This model reached an accuracy of 83.4928% and worked well with our algorithm. Here are some results on a few test images: