How to build a Gesture Recognition system using deep learning models?
In this blog post, I explain how to build a deep learning model for gesture recognition that can be used in a smart television. The gestures are continuously monitored by a webcam mounted on the TV.
Each gesture corresponds to a specific command:
- Thumbs up: Increase the volume
- Thumbs down: Decrease the volume
- Left swipe: ‘Jump’ backward 10 seconds
- Right swipe: ‘Jump’ forward 10 seconds
- Stop: Pause the movie
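As a minimal sketch, the mapping above can be written as a lookup from the predicted class index to a command. The index order and the command strings below are assumptions for illustration only; the real order depends on how the labels are encoded in the dataset.

```python
# Hypothetical class-index order; the dataset's actual label encoding may differ.
GESTURE_COMMANDS = {
    0: "volume_up",      # thumbs up
    1: "volume_down",    # thumbs down
    2: "seek_back_10s",  # left swipe
    3: "seek_fwd_10s",   # right swipe
    4: "pause",          # stop
}

def command_for(probabilities):
    """Return the command for the most probable class in a softmax output."""
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return GESTURE_COMMANDS[best_class]

print(command_for([0.05, 0.05, 0.1, 0.1, 0.7]))  # pause
```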
Here, I demonstrate two approaches:
- Convolutions + RNN: The Conv2D network extracts a feature vector for each image, and the sequence of these feature vectors is then fed to an RNN-based network. The output of the RNN goes to a regular softmax layer (this is a classification problem).
- 3D Convolutional Network, or Conv3D: 3D convolutions are a natural extension of the 2D convolutions you are already familiar with. Just as in a 2D convolution you move the filter in two directions (x and y), in a 3D convolution you move it in three directions (x, y, and z). In this case, the input to a 3D convolution is a video (which is a sequence of 30 RGB images).
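To make the "three directions" idea concrete, here is a naive NumPy sketch of a single-channel 3D convolution (strictly a cross-correlation, which is what deep learning libraries compute). It is an illustration only, not how Keras implements Conv3D; the function name and the toy sizes are made up for the example.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation without padding."""
    d, h, w = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    # Slide the kernel along frames (t) and along both image axes (r, c).
    for t in range(out.shape[0]):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                patch = volume[t:t + kd, r:r + kh, c:c + kw]
                out[t, r, c] = np.sum(patch * kernel)
    return out

clip = np.ones((5, 6, 6))      # toy "video": 5 frames of 6x6 pixels
kernel = np.ones((3, 3, 3))    # one 3x3x3 filter
features = conv3d_valid(clip, kernel)
print(features.shape)          # (3, 4, 4)
```

Each output value here is the sum over a 3x3x3 block of the clip, so the filter sees several neighbouring frames at once.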
Importing the required libraries
import numpy as np
import os
from scipy.misc import imread, imresize
import datetime
import math
import cv2 as cv
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
We set the random seed so that the results don’t vary drastically.
np.random.seed(30)
import random as rn
rn.seed(30)
from keras import backend as K
import tensorflow as tf
tf.set_random_seed(30)
Using TensorFlow backend.
In this block, we read the folder names for training and validation. We also set the batch_size here. Note that the batch size should be chosen so that the GPU is used at full capacity: keep increasing it until the machine throws an error, then use the largest value that still runs.
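To see why fixing the seed matters, here is a small standalone sketch using only Python's random module: two shuffles with the same seed give the same order. The helper name shuffled is made up for this illustration.

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy

first = shuffled(range(10), seed=30)
second = shuffled(range(10), seed=30)
print(first == second)  # True
```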
train_doc = np.random.permutation(open('/mnt/disks/user/project/PROJECT/Project_data/train.csv').readlines())
val_doc = np.random.permutation(open('/mnt/disks/user/project/PROJECT/Project_data/val.csv').readlines())
batch_size = 8 #experiment with the batch size
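Each row of train.csv and val.csv is later split on ';' by the generator, which reads field 0 as the folder name and field 2 as the integer class label. The sketch below shows that parsing on its own. The sample row and the meaning of the middle field (a gesture name) are assumptions for illustration.

```python
def parse_row(line, num_classes=5):
    """Split one 'folder;gesture;label' row and build a one-hot label."""
    fields = line.strip().split(';')
    folder, label = fields[0], int(fields[2])
    one_hot = [0] * num_classes
    one_hot[label] = 1
    return folder, one_hot

# Hypothetical row, not taken from the real dataset.
folder, one_hot = parse_row("clip_0001;Stop;4\n")
print(folder, one_hot)  # clip_0001 [0, 0, 0, 0, 1]
```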
Generator
This is one of the most important parts of the code. The overall structure of the generator has been given. In the generator, you preprocess the images, since they come in 2 different dimensions, and create a batch of video frames. You have to experiment with img_idx, y, z and normalization to get high accuracy.
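One normalization option, and the one this post's generator is built around, is to scale each colour channel by its 5th and 95th percentiles. A minimal sketch on a random channel (the function name and the random test data are assumptions for the example):

```python
import numpy as np

def percentile_normalise(channel, low=5, high=95):
    """Scale a 2-D channel so its low/high percentiles map to 0 and 1."""
    p_low = np.percentile(channel, low)
    p_high = np.percentile(channel, high)
    # Parentheses matter: subtract first, then divide by the range.
    return (channel - p_low) / (p_high - p_low)

rng = np.random.default_rng(30)
channel = rng.uniform(0, 255, size=(128, 128))
scaled = percentile_normalise(channel)
print(np.percentile(scaled, 5), np.percentile(scaled, 95))  # close to 0.0 and 1.0
```

Values below the 5th percentile become slightly negative and values above the 95th slightly exceed 1, which makes the scaling less sensitive to a few very dark or very bright pixels than plain min-max scaling.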
img_idx = [i for i in range(0,30)] #create a list of image numbers you want to use for a particular video

def generator(source_path, folder_list, batch_size):
    print( 'Source path = ', source_path, '; batch size =', batch_size)
    x = len(img_idx) #x is the number of images you use for each video
    y = 128 #(y,z) is the final size of the input images and 3 is the number of channels RGB
    z = 128
    while True:
        t = np.random.permutation(folder_list)
        num_batches = math.floor(len(folder_list)//batch_size) # calculate the number of full batches
        for batch in range(num_batches): # we iterate over the number of batches
            batch_data = np.zeros((batch_size,x,y,z,3)) # x is the number of images you use for each video, (y,z) is the final size of the input images and 3 is the number of channels RGB
            batch_labels = np.zeros((batch_size,5)) # batch_labels is the one hot representation of the output
            for folder in range(batch_size): # iterate over the batch_size
                row = t[folder + (batch*batch_size)].strip().split(';') # field 0 is the folder name, field 2 is the label
                imgs = sorted(os.listdir(source_path+'/'+ row[0])) # read all the images in the folder; sorted because os.listdir gives no ordering guarantee (assumes file names sort in frame order)
                for idx,item in enumerate(img_idx): # Iterate over the frames/images of a folder to read them in
                    image = imread(source_path+'/'+ row[0]+'/'+imgs[item]).astype(np.float32)
                    #crop the images and resize them. Note that the images are of 2 different shape
                    #and the conv3D will throw error if the inputs in a batch have different shapes
                    # Cropping non symmetric frames
                    if image.shape[0] != image.shape[1]:
                        image=image[0:120,20:140]
                    # Resizing the image
                    image = cv.resize(image, (y, z), interpolation=cv.INTER_AREA)
                    for c in range(3): # normalise each channel with its own 5th and 95th percentiles and feed in the image
                        p5 = np.percentile(image[:,:,c],5)
                        p95 = np.percentile(image[:,:,c],95)
                        batch_data[folder,idx,:,:,c] = (image[:,:,c] - p5) / (p95 - p5)
                batch_labels[folder, int(row[2])] = 1
            yield batch_data, batch_labels # hand one full batch to Keras and pause here until the next one is requested
        # Code for the remaining data points which are left after the full batches
        total_folder_processed = num_batches * batch_size # number of folders already processed
        remaining_batch_size = len(folder_list) - total_folder_processed # number of folders still to process
        if remaining_batch_size != 0: # skip the extra batch when the data divides evenly
            batch_data = np.zeros((remaining_batch_size,x,y,z,3))
            batch_labels = np.zeros((remaining_batch_size,5))
            for folder in range(remaining_batch_size): # iterate over the remaining folders
                row = t[folder + total_folder_processed].strip().split(';') # start right after the last full batch
                imgs = sorted(os.listdir(source_path+'/'+ row[0]))
                for idx,item in enumerate(img_idx): # Iterate over the frames/images of a folder to read them in
                    image = imread(source_path+'/'+ row[0]+'/'+imgs[item]).astype(np.float32)
                    # Cropping non symmetric frames
                    if image.shape[0] != image.shape[1]:
                        image=image[0:120,20:140]
                    # Resizing the image
                    image = cv.resize(image, (y, z), interpolation=cv.INTER_AREA)
                    for c in range(3): # same per-channel normalisation as above
                        p5 = np.percentile(image[:,:,c],5)
                        p95 = np.percentile(image[:,:,c],95)
                        batch_data[folder,idx,:,:,c] = (image[:,:,c] - p5) / (p95 - p5)
                batch_labels[folder, int(row[2])] = 1
            yield batch_data, batch_labels
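The batching logic of the generator (full batches first, then one shorter batch for whatever is left) can be checked in isolation. The sketch below only computes index ranges and loads no images; the helper name batch_slices is made up for this illustration.

```python
def batch_slices(num_items, batch_size):
    """Return (start, stop) index pairs covering all items; the last may be short."""
    slices = []
    for start in range(0, num_items, batch_size):
        slices.append((start, min(start + batch_size, num_items)))
    return slices

slices = batch_slices(100, 8)
print(len(slices), slices[-1])  # 13 (96, 100)
```

With 100 validation folders and a batch size of 8 this gives 12 full batches plus one batch of 4, and the last batch starts at index 96, right after the folders that were already used.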
Note here that a video is represented above in the generator as (number of images, height, width, number of channels). Take this into consideration while creating the model architecture.
curr_dt_time = datetime.datetime.now()
train_path = '/mnt/disks/user/project/PROJECT/Project_data/train'
val_path = '/mnt/disks/user/project/PROJECT/Project_data/val'
num_train_sequences = len(train_doc)
print('# training sequences =', num_train_sequences)
num_val_sequences = len(val_doc)
print('# validation sequences =', num_val_sequences)
num_epochs = 30 # choose the number of epochs
print ('# epochs =', num_epochs)
# training sequences = 663
# validation sequences = 100
# epochs = 30
Model
Here we make the model using different functionalities that Keras provides. Remember we use Conv3D and MaxPooling3D, not Conv2D and MaxPooling2D, for a 3D convolution model. We would want to use TimeDistributed while building a Conv2D + RNN model. Also, remember that the last layer is the softmax. We have designed the network so that the model gives good accuracy with as few parameters as possible, so that it can fit in the memory of the webcam.
In this implementation, we have built the following model architectures:
- CNN + 3D
- CNN + GRU
- CNN + LSTM
from keras.models import Sequential, Model
from keras.layers import Dense, GRU, Flatten, TimeDistributed, BatchNormalization, Activation, Dropout, LSTM, Input, MaxPool3D, ZeroPadding3D
from keras.layers.convolutional import Conv3D,Conv2D, MaxPooling3D,MaxPooling2D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras import optimizers
from keras.optimizers import Adam
Defining the class for the above architectures
class CNNModelGenerator(object):
    """Class to perform all the required experiments"""

    @classmethod
    def cnn_3d(cls, input_shape, no_classes):
        # Define model
        model = Sequential()
        model.add(Conv3D(8, kernel_size=(3,3,3), input_shape=input_shape, padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(0.25))
        model.add(MaxPooling3D(pool_size=(2,2,2)))
        model.add(Conv3D(16, kernel_size=(3,3,3), padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(0.25))
        model.add(MaxPooling3D(pool_size=(2,2,2)))
        model.add(Conv3D(32, kernel_size=(1,3,3), padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(0.25))
        model.add(MaxPooling3D(pool_size=(2,2,2)))
        model.add(Conv3D(64, kernel_size=(1,3,3), padding='same'))
        model.add(Activation('relu'))
        model.add(Dropout(0.25))
        model.add(MaxPooling3D(pool_size=(1,2,2)))
        #Flatten Layers
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))
        #softmax layer
        model.add(Dense(no_classes, activation='softmax'))
        return model
    @classmethod
    def cnn_gru(cls, input_shape, no_classes):
        model = Sequential()
        # layer 1: 8 filters applied to every frame, each of shape (128, 128, 3)
        model.add(TimeDistributed(Conv2D(8, (3,3), padding='same', strides=(2,2), activation='relu'), input_shape=input_shape))
        # layer 2: 16 filters applied to every frame
        model.add(TimeDistributed(Conv2D(16, (3,3), padding='same', strides=(2,2), activation='relu')))
        model.add(TimeDistributed(MaxPooling2D((2,2), strides=(2,2))))
        # layer 3: 32 filters applied to every frame
        model.add(TimeDistributed(Conv2D(32, (3,3), padding='same', strides=(2,2), activation='relu')))
        model.add(TimeDistributed(MaxPooling2D((2,2), strides=(2,2))))
        # layer 4: 64 filters applied to every frame
        model.add(TimeDistributed(Conv2D(64, (3,3), padding='same', strides=(2,2), activation='relu')))
        model.add(TimeDistributed(MaxPooling2D((2,2), strides=(2,2))))
        model.add(TimeDistributed(BatchNormalization()))
        model.add(Dropout(0.25))
        model.add(TimeDistributed(Flatten()))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.25))
        model.add(Dense(64, activation='relu'))
        model.add(Dropout(0.25))
        ## using GRU as the RNN model along with softmax as our last layer.
        model.add(GRU(128, return_sequences=False))
        model.add(Dense(no_classes, activation='softmax')) # using Softmax as last layer
        return model
    @classmethod
    def cnn_lstm(cls, input_shape, no_classes):
        model = Sequential()
        # layer 1: 8 filters applied to every frame, each of shape (128, 128, 3)
        model.add(TimeDistributed(Conv2D(8, (3,3), padding='same', strides=(2,2), activation='relu'), input_shape=input_shape))
        # layer 2: 16 filters applied to every frame
        model.add(TimeDistributed(Conv2D(16, (3,3), padding='same', strides=(2,2), activation='relu')))
        model.add(TimeDistributed(MaxPooling2D((2,2), strides=(2,2))))
        # layer 3: 32 filters applied to every frame
        model.add(TimeDistributed(Conv2D(32, (3,3), padding='same', strides=(2,2), activation='relu')))
        model.add(TimeDistributed(MaxPooling2D((2,2), strides=(2,2))))
        # layer 4: 64 filters applied to every frame
        model.add(TimeDistributed(Conv2D(64, (3,3), padding='same', strides=(2,2), activation='relu')))
        model.add(TimeDistributed(MaxPooling2D((2,2), strides=(2,2))))
        model.add(TimeDistributed(BatchNormalization()))
        model.add(Dropout(0.5))
        model.add(TimeDistributed(Flatten()))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(64, activation='relu'))
        model.add(Dropout(0.5))
        ## using LSTM as the RNN model along with softmax as our last layer.
        model.add(LSTM(128, return_sequences=False))
        model.add(Dense(no_classes, activation='softmax')) # using Softmax as last layer
        return model
    @classmethod
    def model_summary(cls, model, optimiser):
        """Compile the model and print its summary"""
        model.compile(optimizer=optimiser, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
        return model.summary()

    @classmethod
    def train_model(cls, model, folder_name, train_generator, steps_per_epoch, num_epochs, val_generator, validation_steps):
        """Train the model, save a checkpoint after every epoch and return the training history"""
        model_name = 'model_init_'+ folder_name + '_' + str(curr_dt_time).replace(' ','').replace(':','_') + '/'
        if not os.path.exists(model_name):
            os.mkdir(model_name)
        filepath = model_name + 'model-3d-relu-{epoch:05d}-{loss:.5f}-{categorical_accuracy:.5f}-{val_loss:.5f}-{val_categorical_accuracy:.5f}.h5'
        checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)
        LR = ReduceLROnPlateau(monitor='val_loss', factor=0.01, patience=5, cooldown=4, verbose=1,mode='auto',epsilon=0.0001) # reduce the learning rate when val_loss stops improving
        callbacks_list = [checkpoint, LR]
        history = model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, epochs=num_epochs, verbose=1,
                                      callbacks=callbacks_list, validation_data=val_generator,
                                      validation_steps=validation_steps, class_weight=None, workers= -1, initial_epoch=0)
        return history
    @classmethod
    def plot_accuracy(cls, history):
        # summarize history for accuracy
        plt.plot(history.history['categorical_accuracy'])
        plt.plot(history.history['val_categorical_accuracy'])
        plt.title('model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left')
        plt.show()

    @classmethod
    def plot_loss(cls, history):
        # summarize history for loss
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left')
        plt.show()
We will now invoke the generator function to select the images from the required folders
train_generator = generator(train_path, train_doc, batch_size)
val_generator = generator(val_path, val_doc, batch_size)
if (num_train_sequences%batch_size) == 0:
    steps_per_epoch = int(num_train_sequences/batch_size)
else:
    steps_per_epoch = (num_train_sequences//batch_size) + 1
if (num_val_sequences%batch_size) == 0:
    validation_steps = int(num_val_sequences/batch_size)
else:
    validation_steps = (num_val_sequences//batch_size) + 1
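The two if/else blocks above are a ceiling division: one extra step is needed whenever the data does not divide evenly into batches. A compact sketch of the same calculation, using the sequence counts printed earlier (the helper name steps_for is made up here):

```python
import math

def steps_for(num_sequences, batch_size):
    """Number of generator steps needed to see every sequence once."""
    return math.ceil(num_sequences / batch_size)

print(steps_for(663, 8), steps_for(100, 8))  # 83 13
```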
Building CNN + 3D model
model_class = CNNModelGenerator()
input_shape = (len(img_idx),128,128,3)
no_classes = 5
optimiser = Adam(0.001) #write your optimizer
model = model_class.cnn_3d(input_shape,no_classes)
print(model_class.model_summary(model,optimiser))
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv3d_1 (Conv3D) (None, 30, 128, 128, 8) 656
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 128, 128, 8) 32
_________________________________________________________________
activation_1 (Activation) (None, 30, 128, 128, 8) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 30, 128, 128, 8) 0
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 15, 64, 64, 8) 0
_________________________________________________________________
conv3d_2 (Conv3D) (None, 15, 64, 64, 16) 3472
_________________________________________________________________
batch_normalization_2 (Batch (None, 15, 64, 64, 16) 64
_________________________________________________________________
activation_2 (Activation) (None, 15, 64, 64, 16) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 15, 64, 64, 16) 0
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 7, 32, 32, 16) 0
_________________________________________________________________
conv3d_3 (Conv3D) (None, 7, 32, 32, 32) 4640
_________________________________________________________________
batch_normalization_3 (Batch (None, 7, 32, 32, 32) 128
_________________________________________________________________
activation_3 (Activation) (None, 7, 32, 32, 32) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 7, 32, 32, 32) 0
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 3, 16, 16, 32) 0
_________________________________________________________________
conv3d_4 (Conv3D) (None, 3, 16, 16, 64) 18496
_________________________________________________________________
activation_4 (Activation) (None, 3, 16, 16, 64) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 3, 16, 16, 64) 0
_________________________________________________________________
max_pooling3d_4 (MaxPooling3 (None, 3, 8, 8, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 12288) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 3145984
_________________________________________________________________
dropout_5 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 128) 32896
_________________________________________________________________
dropout_6 (Dropout) (None, 128) 0
_________________________________________________________________
dense_3 (Dense) (None, 5) 645
=================================================================
Total params: 3,207,013
Trainable params: 3,206,901
Non-trainable params: 112
_________________________________________________________________
None
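The parameter counts in this summary can be reproduced by hand, which is a useful check when trimming a model to fit a small device. The sketch below assumes the standard counting rule (one weight per kernel element per input channel per filter, plus one bias per filter); the helper names are made up for the example.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a convolution layer."""
    kernel_volume = 1
    for k in kernel:
        kernel_volume *= k
    return kernel_volume * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

print(conv_params((3, 3, 3), 3, 8))       # 656
print(conv_params((3, 3, 3), 8, 16))      # 3472
print(conv_params((1, 3, 3), 16, 32))     # 4640
print(conv_params((1, 3, 3), 32, 64))     # 18496
print(dense_params(3 * 8 * 8 * 64, 256))  # 3145984
```

Almost all of the 3.2 million parameters sit in the first Dense layer after Flatten, so that layer is the natural place to look when shrinking this model.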
Plotting the accuracy against the number of epochs (history is the object returned by train_model)
model_class.plot_accuracy(history)
Plotting the loss against the number of epochs
model_class.plot_loss(history)
Building CNN + GRU model
model_class = CNNModelGenerator()
input_shape = (len(img_idx),128,128,3)
no_classes = 5
optimiser = Adam(0.001) #write your optimizer
model = model_class.cnn_gru(input_shape,no_classes)
print(model_class.model_summary(model,optimiser))
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
time_distributed_1 (TimeDist (None, 30, 64, 64, 8) 224
_________________________________________________________________
time_distributed_2 (TimeDist (None, 30, 32, 32, 16) 1168
_________________________________________________________________
time_distributed_3 (TimeDist (None, 30, 16, 16, 16) 0
_________________________________________________________________
time_distributed_4 (TimeDist (None, 30, 8, 8, 32) 4640
_________________________________________________________________
time_distributed_5 (TimeDist (None, 30, 4, 4, 32) 0
_________________________________________________________________
time_distributed_6 (TimeDist (None, 30, 2, 2, 64) 18496
_________________________________________________________________
time_distributed_7 (TimeDist (None, 30, 1, 1, 64) 0
_________________________________________________________________
time_distributed_8 (TimeDist (None, 30, 1, 1, 64) 256
_________________________________________________________________
dropout_18 (Dropout) (None, 30, 1, 1, 64) 0
_________________________________________________________________
time_distributed_9 (TimeDist (None, 30, 64) 0
_________________________________________________________________
dense_10 (Dense) (None, 30, 128) 8320
_________________________________________________________________
dropout_19 (Dropout) (None, 30, 128) 0
_________________________________________________________________
dense_11 (Dense) (None, 30, 64) 8256
_________________________________________________________________
dropout_20 (Dropout) (None, 30, 64) 0
_________________________________________________________________
gru_1 (GRU) (None, 128) 74112
_________________________________________________________________
dense_12 (Dense) (None, 5) 645
=================================================================
Total params: 116,117
Trainable params: 115,989
Non-trainable params: 128
_________________________________________________________________
None
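The spatial sizes in this summary shrink from 128 to 1 because every Conv2D uses a stride of 2 and three of them are followed by a 2x2 pooling layer. A small sketch that traces the size through the stack, assuming the usual output-size rules for 'same' convolutions and unpadded pooling:

```python
import math

def conv_same(size, stride):
    """Output size of a 'same'-padded convolution."""
    return math.ceil(size / stride)

def pool_valid(size, pool, stride):
    """Output size of a pooling layer without padding."""
    return (size - pool) // stride + 1

size = 128
trace = [size]
# Layer order taken from the CNN + GRU model above.
for layer in ["conv", "conv", "pool", "conv", "pool", "conv", "pool"]:
    size = conv_same(size, 2) if layer == "conv" else pool_valid(size, 2, 2)
    trace.append(size)
print(trace)  # [128, 64, 32, 16, 8, 4, 2, 1]
```

By the time the frames reach the recurrent layer, each one has been reduced to a 64-value feature vector, which is why this model is so much smaller than the Conv3D one.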
Plotting the accuracy and loss against the number of epochs
model_class.plot_accuracy(history)
model_class.plot_loss(history)
Building CNN + LSTM model
model_class = CNNModelGenerator()
input_shape = (len(img_idx),128,128,3)
no_classes = 5
optimiser = Adam(0.001) #write your optimizer
model = model_class.cnn_lstm(input_shape,no_classes)
print(model_class.model_summary(model,optimiser))
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
time_distributed_1 (TimeDist (None, 30, 64, 64, 8) 224
_________________________________________________________________
time_distributed_2 (TimeDist (None, 30, 32, 32, 16) 1168
_________________________________________________________________
time_distributed_3 (TimeDist (None, 30, 16, 16, 16) 0
_________________________________________________________________
time_distributed_4 (TimeDist (None, 30, 8, 8, 32) 4640
_________________________________________________________________
time_distributed_5 (TimeDist (None, 30, 4, 4, 32) 0
_________________________________________________________________
time_distributed_6 (TimeDist (None, 30, 2, 2, 64) 18496
_________________________________________________________________
time_distributed_7 (TimeDist (None, 30, 1, 1, 64) 0
_________________________________________________________________
time_distributed_8 (TimeDist (None, 30, 1, 1, 64) 256
_________________________________________________________________
dropout_7 (Dropout) (None, 30, 1, 1, 64) 0
_________________________________________________________________
time_distributed_9 (TimeDist (None, 30, 64) 0
_________________________________________________________________
dense_4 (Dense) (None, 30, 128) 8320
_________________________________________________________________
dropout_8 (Dropout) (None, 30, 128) 0
_________________________________________________________________
dense_5 (Dense) (None, 30, 64) 8256
_________________________________________________________________
dropout_9 (Dropout) (None, 30, 64) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 128) 98816
_________________________________________________________________
dense_6 (Dense) (None, 5) 645
=================================================================
Total params: 140,821
Trainable params: 140,693
Non-trainable params: 128
_________________________________________________________________
None
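The only difference between the GRU and LSTM summaries is the recurrent layer: 74,112 parameters against 98,816. Both numbers follow from the same formula with 3 gates for the GRU and 4 for the LSTM. The sketch assumes one bias vector per gate, which matches the counts printed above; some newer Keras versions default to a GRU variant with an extra set of biases, which this formula does not cover.

```python
def rnn_params(units, input_dim, gates):
    """Parameters of a recurrent layer with one bias vector per gate."""
    # Each gate has input weights, recurrent weights and a bias.
    return gates * (units * (units + input_dim) + units)

print(rnn_params(128, 64, gates=3))  # 74112 (GRU)
print(rnn_params(128, 64, gates=4))  # 98816 (LSTM)
```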
Plotting the accuracy and loss against the number of epochs
model_class.plot_accuracy(history)
model_class.plot_loss(history)
Note that the performance of these models can be improved further by hyperparameter tuning and by trying other architectures.