How to build a Gesture Recognition system using deep learning models?

May 1, 2021

In this post, I explain how to build a deep learning model for gesture recognition. Such a model can be used in a smart television: a webcam mounted on the TV continuously monitors the user's gestures.

Each gesture corresponds to a specific command:

Thumbs up: Increase the volume
Thumbs down: Decrease the volume
Left swipe: ‘Jump’ backward 10 seconds
Right swipe: ‘Jump’ forward 10 seconds
Stop: Pause the movie
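
As a rough illustration of how the classifier's output could drive the TV, the sketch below maps the most probable class to a command. The class indices and the command names are hypothetical; the real indices come from the label column of the CSV files.

```python
# Hypothetical mapping from class index to (gesture, TV command).
# The index order is an assumption; check the label column of train.csv.
GESTURE_COMMANDS = {
    0: ("Thumbs up", "volume_up"),
    1: ("Thumbs down", "volume_down"),
    2: ("Left swipe", "jump_backward_10s"),
    3: ("Right swipe", "jump_forward_10s"),
    4: ("Stop", "pause"),
}


def command_for(probabilities):
    """Return the command for the most probable class in a softmax output."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return GESTURE_COMMANDS[best][1]


print(command_for([0.05, 0.05, 0.1, 0.1, 0.7]))  # pause
```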

I demonstrate two approaches:

Convolutions + RNN: A Conv2D network extracts a feature vector from each image, and the sequence of these feature vectors is then fed to an RNN-based network. The final output is a regular softmax layer, since this is a classification problem.
3D Convolutional Network, or Conv3D: 3D convolutions are a natural extension of 2D convolutions. In a 2D convolution you move the filter in two directions (x and y); in a 3D convolution you move it in three directions (x, y, and z). Here, the input to the 3D convolution is a video, that is, a sequence of 30 RGB images.
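
To make the "two directions versus three directions" idea concrete, here is a small sketch that counts how many positions a filter can occupy along each axis when it slides with stride 1 and no padding. The helper name is made up for this illustration.

```python
def valid_output_shape(input_shape, kernel_shape):
    """Positions a filter can occupy along each axis (stride 1, no padding)."""
    return tuple(n - k + 1 for n, k in zip(input_shape, kernel_shape))


# 2D: a 3x3 filter slides over height and width of one frame
print(valid_output_shape((128, 128), (3, 3)))          # (126, 126)
# 3D: a 3x3x3 filter also slides along the frame (time) axis
print(valid_output_shape((30, 128, 128), (3, 3, 3)))   # (28, 126, 126)
```

The models later in this post use padding='same', in which case the output keeps the input size instead of shrinking.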

Importing the required libraries

import numpy as np
import os
from scipy.misc import imread # scipy.misc.imread was removed in SciPy 1.2, so this needs an older SciPy
import datetime
import math
import cv2 as cv
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

We set the random seed so that the results don’t vary drastically.

np.random.seed(30)
import random as rn
rn.seed(30)
from keras import backend as K
import tensorflow as tf
tf.set_random_seed(30) # TensorFlow 1.x API; in TensorFlow 2.x the equivalent is tf.random.set_seed(30)

Using TensorFlow backend.
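
To see what seeding does, here is a minimal sketch using only Python's built-in random module (the same module seeded above as rn). The helper draw is made up for this illustration.

```python
import random


def draw(seed, n=5):
    """Seed the generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


# The same seed reproduces exactly the same sequence
print(draw(30) == draw(30))  # True
```

Seeding fixes things like shuffling order and weight initialisation; some GPU operations may still introduce small run-to-run differences.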

In this block, we read the folder names for training and validation, and we set batch_size. Choose the batch size so that the GPU is used at full capacity: keep increasing it until the machine throws an error, then go back to the largest value that worked.

train_doc = np.random.permutation(open('/mnt/disks/user/project/PROJECT/Project_data/train.csv').readlines())
val_doc = np.random.permutation(open('/mnt/disks/user/project/PROJECT/Project_data/val.csv').readlines())
batch_size = 8 #experiment with the batch size

Generator

This is one of the most important parts of the code. The overall structure of the generator is given below. In the generator, you preprocess the images, since the dataset contains images of 2 different dimensions, and you assemble batches of video frames. Experiment with img_idx, y, z and the normalization to get high accuracy.

img_idx = [i for i in range(0,30)] #create a list of image numbers you want to use for a particular video

def generator(source_path, folder_list, batch_size):
    print( 'Source path = ', source_path, '; batch size =', batch_size)
     
    x = len(img_idx) #x is the number of images you use for each video
    y = 128 #(y,z) is the final size of the input images and 3 is the number of channels RGB
    z = 128 #(y,z) is the final size of the input images and 3 is the number of channels RGB
    while True:
        t = np.random.permutation(folder_list)
        num_batches = len(folder_list) // batch_size # calculate the number of full batches
        for batch in range(num_batches): # we iterate over the number of batches
            batch_data = np.zeros((batch_size,x,y,z,3)) # x is the number of images you use for each video, (y,z) is the final size of the input images and 3 is the number of channels RGB
            batch_labels = np.zeros((batch_size,5)) # batch_labels is the one hot representation of the output
            for folder in range(batch_size): # iterate over the batch_size
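                imgs = sorted(os.listdir(source_path+'/'+ t[folder + (batch*batch_size)].split(';')[0])) # list the frames in the folder; sorted because os.listdir returns them in arbitrary order (assumes the filenames sort chronologically)
                for idx,item in enumerate(img_idx): # iterate over the frames/images of a folder to read them in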
                    image = imread(source_path+'/'+ t[folder + (batch*batch_size)].strip().split(';')[0]+'/'+imgs[item]).astype(np.float32)
                    
                    #crop the images and resize them. Note that the images are of 2 different shape 
                    #and the conv3D will throw error if the inputs in a batch have different shapes
                    # Cropping non symmetric frames
                    if image.shape[0] != image.shape[1]:
                        image=image[0:120,20:140]
                        
                    # Resizing the image
                    image = cv.resize(image, (y, z), interpolation=cv.INTER_AREA)
                    
                    # Normalise each channel with its own 5th and 95th percentiles: (value - p5) / (p95 - p5)
                    for c in range(3):
                        p5 = np.percentile(image[:,:,c],5)
                        p95 = np.percentile(image[:,:,c],95)
                        batch_data[folder,idx,:,:,c] = (image[:,:,c] - p5) / (p95 - p5)
                    
                batch_labels[folder, int(t[folder + (batch*batch_size)].strip().split(';')[2])] = 1
            yield batch_data, batch_labels #you yield the batch_data and the batch_labels, remember what does yield do
        
        # Handle the videos left over after the full batches (skipped if there are none)
        total_folder_processed = num_batches * batch_size # videos already yielded in full batches
        remaining_batch_size = len(folder_list) - total_folder_processed # videos still to process
        if remaining_batch_size > 0:
            batch_data = np.zeros((remaining_batch_size,x,y,z,3)) # x is the number of images you use for each video, (y,z) is the final size of the input images and 3 is the number of channels RGB
            batch_labels = np.zeros((remaining_batch_size,5)) # batch_labels is the one hot representation of the output

            for folder in range(remaining_batch_size): # iterate over the remaining videos
                row = t[folder + total_folder_processed].strip().split(';') # leftover rows start right after the full batches
                imgs = sorted(os.listdir(source_path+'/'+ row[0])) # sorted so the frames are read in filename order
                for idx,item in enumerate(img_idx): # iterate over the frames/images of a folder to read them in
                    image = imread(source_path+'/'+ row[0]+'/'+imgs[item]).astype(np.float32)

                    # Cropping non symmetric frames: the images come in 2 different shapes
                    # and Conv3D throws an error if the inputs in a batch have different shapes
                    if image.shape[0] != image.shape[1]:
                        image=image[0:120,20:140]

                    # Resizing the image
                    image = cv.resize(image, (y, z), interpolation=cv.INTER_AREA)

                    # Normalise each channel with its own 5th and 95th percentiles: (value - p5) / (p95 - p5)
                    for c in range(3):
                        p5 = np.percentile(image[:,:,c],5)
                        p95 = np.percentile(image[:,:,c],95)
                        batch_data[folder,idx,:,:,c] = (image[:,:,c] - p5) / (p95 - p5)

                batch_labels[folder, int(row[2])] = 1
            yield batch_data, batch_labels # yield the final, smaller batch
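
The batching pattern of the generator can be tried out without any images. The sketch below is a simplified stand-in, not the generator itself: it yields lists of indices instead of image arrays, and the function name batch_indices is made up for this illustration.

```python
import random


def batch_indices(num_items, batch_size, seed=None):
    """Yield lists of item indices forever: full batches, then the leftover batch."""
    rng = random.Random(seed)
    while True:
        order = list(range(num_items))
        rng.shuffle(order)  # reshuffle at the start of every epoch
        num_batches = num_items // batch_size
        for batch in range(num_batches):
            start = batch * batch_size
            yield order[start:start + batch_size]
        processed = num_batches * batch_size
        if processed < num_items:  # leftover items that do not fill a batch
            yield order[processed:]


gen = batch_indices(num_items=663, batch_size=8, seed=30)
epoch = [next(gen) for _ in range(83)]  # 82 full batches + 1 leftover batch
print(len(epoch[0]), len(epoch[-1]))  # 8 7
```

Because of yield, the function pauses after each batch and resumes where it left off on the next call, so only one batch is held in memory at a time.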

Note here that a video is represented above in the generator as (number of images, height, width, number of channels). Take this into consideration while creating the model architecture.
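
The generator rescales each channel using its 5th and 95th percentiles. Here is a small pure-Python sketch of that idea on a flat list of pixel values; the helper names are made up, and the percentile helper assumes the linear interpolation that np.percentile uses by default.

```python
def percentile(values, q):
    """Linearly interpolated percentile of a list of numbers."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def normalise(values):
    """Map the 5th percentile to 0 and the 95th percentile to 1."""
    p5, p95 = percentile(values, 5), percentile(values, 95)
    return [(v - p5) / (p95 - p5) for v in values]


pixels = list(range(101))  # toy "channel" with values 0..100
scaled = normalise(pixels)
print(scaled[5], scaled[50], scaled[95])  # 0.0 0.5 1.0
```

Note the parentheses around (v - p5): without them only p5 is divided, and the values are shifted rather than rescaled. Values below the 5th or above the 95th percentile end up slightly outside the range 0 to 1.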

curr_dt_time = datetime.datetime.now()
train_path = '/mnt/disks/user/project/PROJECT/Project_data/train'
val_path = '/mnt/disks/user/project/PROJECT/Project_data/val'
num_train_sequences = len(train_doc)
print('# training sequences =', num_train_sequences)
num_val_sequences = len(val_doc)
print('# validation sequences =', num_val_sequences)
num_epochs = 30 # choose the number of epochs
print ('# epochs =', num_epochs)

# training sequences = 663
# validation sequences = 100
# epochs = 30

Model

Here we build the models using the layers that Keras provides. For a 3D convolution model, use Conv3D and MaxPooling3D rather than Conv2D and MaxPooling2D. For a Conv2D + RNN model, wrap the per-frame layers in TimeDistributed. In both cases, the last layer is a softmax. We have designed the networks to give good accuracy with as few parameters as possible, so that the model can fit in the memory of the webcam.

In this implementation, we build the following model architectures:

CNN + 3D
CNN + GRU
CNN + LSTM

from keras.models import Sequential, Model
from keras.layers import Dense, GRU, Flatten, TimeDistributed, BatchNormalization, Activation, Dropout, LSTM, Input, MaxPool3D, ZeroPadding3D
from keras.layers.convolutional import  Conv3D,Conv2D, MaxPooling3D,MaxPooling2D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras import optimizers
from keras.optimizers import Adam

Defining the class for the above architectures

class CNNModelGenerator(object):
    """Builds, compiles, trains and plots the models used in the experiments"""


    @classmethod
    def cnn_3d(self,input_shape, no_classes):
        # Define model
        model = Sequential()

        model.add(Conv3D(8, kernel_size=(3,3,3), input_shape=input_shape, padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(0.25))

        model.add(MaxPooling3D(pool_size=(2,2,2)))

        model.add(Conv3D(16, kernel_size=(3,3,3), padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(0.25))

        model.add(MaxPooling3D(pool_size=(2,2,2)))

        model.add(Conv3D(32, kernel_size=(1,3,3), padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(0.25))

        model.add(MaxPooling3D(pool_size=(2,2,2)))

        model.add(Conv3D(64, kernel_size=(1,3,3), padding='same'))
        model.add(Activation('relu'))
        model.add(Dropout(0.25))

        model.add(MaxPooling3D(pool_size=(1,2,2)))

        #Flatten Layers
        model.add(Flatten())

        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.5))

        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))

        #softmax layer
        model.add(Dense(no_classes, activation='softmax'))
        return model
    
   
    
   
    
    
    @classmethod
    def cnn_gru(self,input_shape, no_classes):
        model = Sequential()

        # layer 1
        # 8 convolution filters, applied to every frame of the video;
        # each frame has (128, 128, 3) shape
        model.add(
            TimeDistributed(
                Conv2D(8, (3,3), 
                    padding='same', strides=(2,2), activation='relu'),
                input_shape = input_shape
            )
        )

        # layer 2
        # 16 convolution filters, applied to every frame
        model.add(
            TimeDistributed(
                Conv2D(16, (3,3), 
                    padding='same', strides=(2,2), activation='relu')
            )
        )
        model.add(
            TimeDistributed(
                MaxPooling2D((2,2), strides=(2,2))
            )
        )

        # layer 3
        # 32 convolution filters, applied to every frame
        model.add(
            TimeDistributed(
                Conv2D(32, (3,3), 
                    padding='same', strides=(2,2), activation='relu')
            )
        )
        model.add(
            TimeDistributed(
                MaxPooling2D((2,2), strides=(2,2))
            )
        )

        # layer 4
        # 64 convolution filters, applied to every frame
        model.add(
            TimeDistributed(
                Conv2D(64, (3,3), 
                    padding='same', strides=(2,2), activation='relu')
            )
        )
        model.add(
            TimeDistributed(
                MaxPooling2D((2,2), strides=(2,2))
            )
        )


        model.add(TimeDistributed(BatchNormalization()))
        model.add(Dropout(0.25))

        model.add(TimeDistributed(Flatten()))

        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.25))
        model.add(Dense(64, activation='relu'))
        model.add(Dropout(0.25))

        ## using GRU as the RNN model along with softmax as our last layer.
        model.add(GRU(128, return_sequences=False))
        model.add(Dense(no_classes, activation='softmax')) # using Softmax as last layer
        return model
    
   
    
    @classmethod
    def cnn_lstm(self,input_shape, no_classes):
        model = Sequential()

        # layer 1
        # 8 convolution filters, applied to every frame of the video;
        # each frame has (128, 128, 3) shape
        model.add(
            TimeDistributed(
                Conv2D(8, (3,3), 
                    padding='same', strides=(2,2), activation='relu'),
                input_shape = input_shape
            )
        )

        # layer 2
        # 16 convolution filters, applied to every frame
        model.add(
            TimeDistributed(
                Conv2D(16, (3,3), 
                    padding='same', strides=(2,2), activation='relu')
            )
        )
        model.add(
            TimeDistributed(
                MaxPooling2D((2,2), strides=(2,2))
            )
        )

        # layer 3
        # 32 convolution filters, applied to every frame
        model.add(
            TimeDistributed(
                Conv2D(32, (3,3), 
                    padding='same', strides=(2,2), activation='relu')
            )
        )
        model.add(
            TimeDistributed(
                MaxPooling2D((2,2), strides=(2,2))
            )
        )

        # layer 4
        # 64 convolution filters, applied to every frame
        model.add(
            TimeDistributed(
                Conv2D(64, (3,3), 
                    padding='same', strides=(2,2), activation='relu')
            )
        )
        model.add(
            TimeDistributed(
                MaxPooling2D((2,2), strides=(2,2))
            )
        )


        model.add(TimeDistributed(BatchNormalization()))
        model.add(Dropout(0.5))

        model.add(TimeDistributed(Flatten()))

        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(64, activation='relu'))
        model.add(Dropout(0.5))

        ## using LSTM as the RNN model along with softmax as our last layer.
        model.add(LSTM(128, return_sequences=False))
        model.add(Dense(no_classes, activation='softmax')) # using Softmax as last layer
        return model
    
    
    
   
   
    
    @classmethod
    def model_summary(self,model,optimiser):
        """Compile the model and print its summary"""
        model.compile(optimizer=optimiser, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
        return model.summary()
    
    @classmethod
    def train_model(self,model,folder_name, train_generator, steps_per_epoch, num_epochs,val_generator,validation_steps):
        """Train the model, saving a checkpoint after every epoch, and return the training history"""
        model_name = 'model_init_'+ folder_name + '_' + str(curr_dt_time).replace(' ','').replace(':','_') + '/'
    
        if not os.path.exists(model_name):
            os.mkdir(model_name)

        filepath = model_name + 'model-3d-relu-{epoch:05d}-{loss:.5f}-{categorical_accuracy:.5f}-{val_loss:.5f}-{val_categorical_accuracy:.5f}.h5'

        checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)

        LR = ReduceLROnPlateau(monitor='val_loss', factor=0.01, patience=5, cooldown=4, verbose=1,mode='auto',epsilon=0.0001) # reduce the learning rate when val_loss stops improving
        callbacks_list = [checkpoint, LR]
        history = model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, epochs=num_epochs, verbose=1, 
                            callbacks=callbacks_list, validation_data=val_generator, 
                            validation_steps=validation_steps, class_weight=None, workers= -1, initial_epoch=0)
        return history
        
    
    @classmethod
    def plot_accuracy(self,history):
        # summarize history for accuracy
        plt.plot(history.history['categorical_accuracy']) # training accuracy per epoch
        plt.plot(history.history['val_categorical_accuracy'])
        plt.title('model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left')
        plt.show()
    
    @classmethod
    def plot_loss(self,history):
        # summarize history for loss
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left')
        plt.show()
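
The filepath template in train_model uses Python format fields, which Keras fills in from the epoch number and the logged metrics when it saves a checkpoint. The sketch below shows the resulting file name; the metric values are made up for illustration.

```python
filepath = ('model-3d-relu-{epoch:05d}-{loss:.5f}-{categorical_accuracy:.5f}'
            '-{val_loss:.5f}-{val_categorical_accuracy:.5f}.h5')

# Made-up metric values, standing in for the logs of one epoch
logs = {
    'loss': 1.2345678,
    'categorical_accuracy': 0.5,
    'val_loss': 1.5,
    'val_categorical_accuracy': 0.4,
}

name = filepath.format(epoch=3, **logs)
print(name)  # model-3d-relu-00003-1.23457-0.50000-1.50000-0.40000.h5
```

Putting the metrics in the file name makes it easy to pick the best checkpoint later just by listing the folder.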

We now invoke the generator function to create the training and validation generators, and compute how many batches make up one epoch:

train_generator = generator(train_path, train_doc, batch_size)
val_generator = generator(val_path, val_doc, batch_size)
if (num_train_sequences%batch_size) == 0:
    steps_per_epoch = int(num_train_sequences/batch_size)
else:
    steps_per_epoch = (num_train_sequences//batch_size) + 1

if (num_val_sequences%batch_size) == 0:
    validation_steps = int(num_val_sequences/batch_size)
else:
    validation_steps = (num_val_sequences//batch_size) + 1
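
The two if/else blocks above amount to a ceiling division. A compact equivalent, with a hypothetical helper name, looks like this:

```python
import math


def steps_for(num_sequences, batch_size):
    """Number of batches needed to see every sequence once."""
    return math.ceil(num_sequences / batch_size)


# With the dataset sizes printed earlier and batch_size = 8
print(steps_for(663, 8), steps_for(100, 8))  # 83 13
```

The extra step in each case is the smaller leftover batch that the generator yields after the full batches.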

Building CNN + 3D model

model_class = CNNModelGenerator()
input_shape = (len(img_idx),128,128,3)
no_classes = 5
optimiser = Adam(0.001) #write your optimizer
model = model_class.cnn_3d(input_shape,no_classes)
print(model_class.model_summary(model,optimiser))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv3d_1 (Conv3D)            (None, 30, 128, 128, 8)   656       
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 128, 128, 8)   32        
_________________________________________________________________
activation_1 (Activation)    (None, 30, 128, 128, 8)   0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 30, 128, 128, 8)   0         
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 15, 64, 64, 8)     0         
_________________________________________________________________
conv3d_2 (Conv3D)            (None, 15, 64, 64, 16)    3472      
_________________________________________________________________
batch_normalization_2 (Batch (None, 15, 64, 64, 16)    64        
_________________________________________________________________
activation_2 (Activation)    (None, 15, 64, 64, 16)    0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 15, 64, 64, 16)    0         
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 7, 32, 32, 16)     0         
_________________________________________________________________
conv3d_3 (Conv3D)            (None, 7, 32, 32, 32)     4640      
_________________________________________________________________
batch_normalization_3 (Batch (None, 7, 32, 32, 32)     128       
_________________________________________________________________
activation_3 (Activation)    (None, 7, 32, 32, 32)     0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 7, 32, 32, 32)     0         
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 3, 16, 16, 32)     0         
_________________________________________________________________
conv3d_4 (Conv3D)            (None, 3, 16, 16, 64)     18496     
_________________________________________________________________
activation_4 (Activation)    (None, 3, 16, 16, 64)     0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 3, 16, 16, 64)     0         
_________________________________________________________________
max_pooling3d_4 (MaxPooling3 (None, 3, 8, 8, 64)       0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 12288)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3145984   
_________________________________________________________________
dropout_5 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_6 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 645       
=================================================================
Total params: 3,207,013
Trainable params: 3,206,901
Non-trainable params: 112
_________________________________________________________________
None
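
The parameter counts in the summary can be checked by hand. The sketch below recomputes them with the usual formulas (weights plus one bias per filter or unit, and four values per channel for batch normalisation); the helper names are made up for this illustration.

```python
def conv_params(kernel, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    kernel_volume = 1
    for size in kernel:
        kernel_volume *= size
    return (kernel_volume * in_channels + 1) * filters


def dense_params(inputs, units):
    return (inputs + 1) * units


def batchnorm_params(channels):
    # gamma and beta are trainable; moving mean and variance are not
    return 4 * channels


layers = [
    conv_params((3, 3, 3), 3, 8),     # conv3d_1: 656
    batchnorm_params(8),              # 32
    conv_params((3, 3, 3), 8, 16),    # conv3d_2: 3472
    batchnorm_params(16),             # 64
    conv_params((1, 3, 3), 16, 32),   # conv3d_3: 4640
    batchnorm_params(32),             # 128
    conv_params((1, 3, 3), 32, 64),   # conv3d_4: 18496
    dense_params(3 * 8 * 8 * 64, 256),  # flatten gives 12288 inputs
    dense_params(256, 128),
    dense_params(128, 5),
]
print(sum(layers))  # 3207013
```

Almost all of the parameters sit in the first dense layer after Flatten, so reducing the feature map size before flattening is the most effective way to shrink this model.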

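Plotting the accuracy metrics against the number of epochs

# 'cnn_3d' is a placeholder folder name; the original training call is not shown in this post
history = model_class.train_model(model, 'cnn_3d', train_generator, steps_per_epoch, num_epochs, val_generator, validation_steps)
model_class.plot_accuracy(history)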

Plotting the loss against the number of epochs

model_class.plot_loss(history)

Building CNN + GRU model

model_class = CNNModelGenerator()
input_shape = (len(img_idx),128,128,3)
no_classes = 5
optimiser = Adam(0.001) #write your optimizer
model = model_class.cnn_gru(input_shape,no_classes)
print(model_class.model_summary(model,optimiser))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_1 (TimeDist (None, 30, 64, 64, 8)     224       
_________________________________________________________________
time_distributed_2 (TimeDist (None, 30, 32, 32, 16)    1168      
_________________________________________________________________
time_distributed_3 (TimeDist (None, 30, 16, 16, 16)    0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, 30, 8, 8, 32)      4640      
_________________________________________________________________
time_distributed_5 (TimeDist (None, 30, 4, 4, 32)      0         
_________________________________________________________________
time_distributed_6 (TimeDist (None, 30, 2, 2, 64)      18496     
_________________________________________________________________
time_distributed_7 (TimeDist (None, 30, 1, 1, 64)      0         
_________________________________________________________________
time_distributed_8 (TimeDist (None, 30, 1, 1, 64)      256       
_________________________________________________________________
dropout_18 (Dropout)         (None, 30, 1, 1, 64)      0         
_________________________________________________________________
time_distributed_9 (TimeDist (None, 30, 64)            0         
_________________________________________________________________
dense_10 (Dense)             (None, 30, 128)           8320      
_________________________________________________________________
dropout_19 (Dropout)         (None, 30, 128)           0         
_________________________________________________________________
dense_11 (Dense)             (None, 30, 64)            8256      
_________________________________________________________________
dropout_20 (Dropout)         (None, 30, 64)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               74112     
_________________________________________________________________
dense_12 (Dense)             (None, 5)                 645       
=================================================================
Total params: 116,117
Trainable params: 115,989
Non-trainable params: 128
_________________________________________________________________
None
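
The spatial sizes in this summary shrink quickly because every convolution uses a stride of 2 and most are followed by 2x2 pooling. The following sketch traces the frame size through the stack, using the standard output-size formulas; the helper names are made up for this illustration.

```python
import math


def same_conv(size, stride):
    """Output size of a convolution with padding='same'."""
    return math.ceil(size / stride)


def max_pool(size, pool=2, stride=2):
    """Output size of max pooling with the default 'valid' padding."""
    return (size - pool) // stride + 1


# Layer order inside the TimeDistributed stack
steps = ["conv", "conv", "pool", "conv", "pool", "conv", "pool"]
size = 128
trace = [size]
for step in steps:
    size = same_conv(size, stride=2) if step == "conv" else max_pool(size)
    trace.append(size)
print(trace)  # [128, 64, 32, 16, 8, 4, 2, 1]
```

Each frame is therefore reduced to a 1x1 map with 64 channels, which is why the recurrent layer receives a sequence of 30 small feature vectors.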

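Plotting the accuracy metrics against the number of epochs

# 'cnn_gru' is a placeholder folder name; the original training call is not shown in this post
history = model_class.train_model(model, 'cnn_gru', train_generator, steps_per_epoch, num_epochs, val_generator, validation_steps)
model_class.plot_accuracy(history)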

model_class.plot_loss(history)

Building CNN + LSTM model

model_class = CNNModelGenerator()
input_shape = (len(img_idx),128,128,3)
no_classes = 5
optimiser = Adam(0.001) #write your optimizer
model = model_class.cnn_lstm(input_shape,no_classes)
print(model_class.model_summary(model,optimiser))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_1 (TimeDist (None, 30, 64, 64, 8)     224       
_________________________________________________________________
time_distributed_2 (TimeDist (None, 30, 32, 32, 16)    1168      
_________________________________________________________________
time_distributed_3 (TimeDist (None, 30, 16, 16, 16)    0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, 30, 8, 8, 32)      4640      
_________________________________________________________________
time_distributed_5 (TimeDist (None, 30, 4, 4, 32)      0         
_________________________________________________________________
time_distributed_6 (TimeDist (None, 30, 2, 2, 64)      18496     
_________________________________________________________________
time_distributed_7 (TimeDist (None, 30, 1, 1, 64)      0         
_________________________________________________________________
time_distributed_8 (TimeDist (None, 30, 1, 1, 64)      256       
_________________________________________________________________
dropout_7 (Dropout)          (None, 30, 1, 1, 64)      0         
_________________________________________________________________
time_distributed_9 (TimeDist (None, 30, 64)            0         
_________________________________________________________________
dense_4 (Dense)              (None, 30, 128)           8320      
_________________________________________________________________
dropout_8 (Dropout)          (None, 30, 128)           0         
_________________________________________________________________
dense_5 (Dense)              (None, 30, 64)            8256      
_________________________________________________________________
dropout_9 (Dropout)          (None, 30, 64)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               98816     
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 645       
=================================================================
Total params: 140,821
Trainable params: 140,693
Non-trainable params: 128
_________________________________________________________________
None
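
The only difference between the GRU and LSTM summaries is the recurrent layer. As a rough check, the sketch below recomputes both counts, assuming the classic formulation with one bias vector per gate: a GRU has 3 gate-like weight sets and an LSTM has 4. The helper name is made up for this illustration.

```python
def recurrent_params(units, input_dim, gates):
    """Per gate: input weights, recurrent weights and one bias vector."""
    return gates * (units * (input_dim + units) + units)


gru = recurrent_params(units=128, input_dim=64, gates=3)
lstm = recurrent_params(units=128, input_dim=64, gates=4)
print(gru, lstm, lstm - gru)  # 74112 98816 24704
```

The difference of 24,704 is exactly the gap between the two totals (140,821 versus 116,117). Implementations that keep two bias vectors per gate will report slightly larger numbers.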

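Plotting the accuracy metrics against the number of epochs

# 'cnn_lstm' is a placeholder folder name; the original training call is not shown in this post
history = model_class.train_model(model, 'cnn_lstm', train_generator, steps_per_epoch, num_epochs, val_generator, validation_steps)
model_class.plot_accuracy(history)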

model_class.plot_loss(history)

Note that the performance of these models can be improved further through hyperparameter tuning and by trying other architectures.

Written by Manish Poddar
