Medical application based Arabic sign language recognition employing leap motion and kinect sensors

Abstract This paper presents a novel Sign Language Recognition (SLR) framework that exploits the best features extracted from both the Microsoft Kinect and Leap Motion devices for robust gesture recognition. The hand shape with its internal details is segmented using both the RGB and depth modalities provided by the Kinect sensor. A spatio-temporal descriptor based on 3D gradients (HOG3D) and PCA are utilized for feature extraction from the segmented hand shape. CCA is employed for classification of the Kinect based features, where the best signs, with the highest scores, are selected. DTW-KNN is applied to the Leap Motion based descriptors corresponding to the best signs to obtain the final decision. The framework components are validated by comparison with state-of-the-art solutions. Accuracies of 85.17% and 92.02% are reported for the Kinect and Leap Motion sensors, respectively, while the accuracy is boosted to 93.90% when both devices are exploited.
Keywords Arabic Sign Language · HOG3D · CCA · Leap Motion Sensor · Kinect Sensor
1 Introduction
One of the major trends in computer vision research is hand gesture recognition. Hand gestures can represent general gestures used for applications such as human-machine interaction, or sign language gestures. Sign Language Recognition (SLR)
Marwa Elpeltagy
ECE Department, Egypt-Japan University of Science and
Technology, New Borg El-Arab City, Alexandria 21934, Egypt
E-mail: [email protected]
Moataz Abdelwahab
ECE Department, Egypt-Japan University of Science and
Technology, New Borg El-Arab City, Alexandria 21934, Egypt
systems reduce the communication barrier between hearing-impaired and hearing people. There is a strong demand for systems that can interpret what hearing-impaired people need to convey to society, to ease communication with them. Hence, several research efforts have been made to facilitate automatic sign language recognition. With the emergence of 3D depth sensors, sign language recognition systems have shifted from 2D camera-based techniques to sensor-based 3D environments. The introduction of low-cost depth sensors such as the Leap Motion and Kinect has allowed the mass market and researchers to exploit the 3D information of the body-part movements that occur within the field of view of the sensors to recognize the performed gesture. These sensors are efficient at capturing a 3D representation of the gestures in real time without restrictions on the user or on environment conditions such as lighting variations and cluttered backgrounds. The Kinect sensor captures an RGB image and a depth image in addition to a full-body skeleton, whereas the Leap Motion captures precise finger and hand movements within a smaller field of view, limited approximately to one cubic centimeter, which allows recognizing the small details associated with the pose of the fingers, such as positions and orientations.
Signs can be classified as static or dynamic gestures. Static gestures contain a fixed hand shape without motion, while dynamic gestures involve meaningful motion in addition to hand shapes. The majority of research work is directed at static gesture recognition, while fewer efforts concentrate on dynamic gesture recognition. We focus our work on dynamic SLR using both the Kinect and Leap Motion sensors. Dynamic hand gesture recognition is considered a problem of sequential modeling and classification. In the literature,
there are several sequence representation and classification algorithms, such as COV3dJ [1], HMM [2], and DTW [3]. The authors in [1] proved that COV3dJ
outperforms HMM for human action recognition. The authors in [3] experimentally demonstrated the success of DTW in 3D gesture recognition.
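As an illustration of how DTW compares two variable-length gesture sequences, a minimal Python sketch is given below. The function name and the toy one-dimensional "gesture" sequences are our own illustrative assumptions, not from the cited works:

```python
import math

def dtw_distance(seq_a, seq_b):
    """Classic O(n*m) DTW between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = minimal accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                 cost[i][j - 1],       # deletion
                                 cost[i - 1][j - 1])   # match
    return cost[n][m]

# Two toy sequences tracing the same trajectory at different speeds:
a = [(0.0,), (1.0,), (2.0,), (3.0,)]
b = [(0.0,), (1.0,), (1.0,), (2.0,), (3.0,)]
print(dtw_distance(a, b))  # 0.0: the warping absorbs the repeated frame
```

This time-warping invariance is what makes DTW attractive for gestures, since different signers (or repetitions) perform the same sign at different speeds.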
Most research work has employed a single sensor, either the Leap Motion or the Kinect, which is inefficient especially if one finger/hand occludes another finger/hand, which degrades the recognition rate. Since the Leap Motion and Kinect provide different modalities and different features, it seems more efficient to exploit both devices together to boost the performance. SLR systems combining the two sensors were proposed in [4,5,6,7]. The authors in [4,5] used fingertip position and direction features extracted from the depth modality of the Kinect sensor using the Candescent NUI library [8]. One of the drawbacks of this library is that it fails to detect the fingertip positions and directions when all fingers are bent onto the palm region. Therefore, all gestures that do not include extended fingers cannot be recognized using these features. Another hybrid SLR system was proposed in [6,7], where the feature vector includes fingertip distances from the hand center, fingertip elevations, and adjacent fingertip distances. The disadvantage of these features is that they need to be normalized by dividing the values by the distance between the hand center and the middle fingertip, and this value needs to be recomputed each time a new signer uses the system, as people differ in hand size.
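To make this hand-size dependence concrete, the following Python sketch shows the kind of per-signer scale normalization described above; the function name and the numeric values are illustrative assumptions, not taken from the cited works:

```python
def normalize_features(distances, palm_to_middle_tip):
    """Scale-normalize fingertip distances by a signer-specific hand size,
    approximated by the palm-center-to-middle-fingertip distance.
    This reference length must be re-measured for every new signer."""
    if palm_to_middle_tip <= 0:
        raise ValueError("reference length must be positive")
    return [d / palm_to_middle_tip for d in distances]

# A signer with a larger hand produces larger raw distances, but the
# normalized features coincide when the gesture is the same:
small = normalize_features([4.0, 5.0, 3.0], palm_to_middle_tip=10.0)
large = normalize_features([8.0, 10.0, 6.0], palm_to_middle_tip=20.0)
print(small == large)  # True
```

The drawback noted above is precisely the dependence on `palm_to_middle_tip`: it is a calibration value that must be recomputed for each user.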
In this paper, we propose to use both sensors and exploit the best corresponding features and classifiers to enhance the recognition performance compared to single-sensor methods. A set of feature descriptors that avoids the previous drawbacks is introduced for both the Leap Motion and Kinect sensors. The proposed Leap Motion based feature vector consists of bone directions, hand orientation, and angles between fingers.
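As an illustration of the angles-between-fingers component, the angle between two finger direction vectors can be obtained from their dot product; the direction values below are illustrative, not measured data:

```python
import math

def angle_between(u, v):
    """Angle in radians between two 3-D direction vectors via the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    # Clamp to [-1, 1] to guard against floating-point round-off
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

index_dir = (1.0, 0.0, 0.0)    # illustrative unit direction of one finger
middle_dir = (0.0, 1.0, 0.0)   # illustrative unit direction of its neighbor
print(round(math.degrees(angle_between(index_dir, middle_dir)), 6))  # 90.0
```

A feature built from such angles is inherently scale-free, which is one way to avoid the per-signer normalization drawback noted earlier.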
We exploit both the RGB and depth modalities provided by the Kinect sensor to extract the complete hand shape features with their internal articulations. First, the hand region is segmented from the depth image using a distance threshold and used as a mask applied to the RGB image. The spatio-temporal descriptor based on 3D gradients (HOG3D) implemented in [9], together with the PCA algorithm, is applied to the segmented hand shape for feature extraction, and a CCA classifier is used to select the best candidate signs. DTW-KNN is then applied to the Leap Motion based descriptors corresponding to these candidates to obtain the final decision. Arabic sign language has not received much attention. One of the most necessary applications is the medical application, where the situation becomes very difficult when doctors in hospitals cannot understand what hearing-impaired patients suffer from. Therefore, a medical application based Arabic sign language dataset was recorded simultaneously from both the Leap Motion and Kinect sensors. The recorded dataset consists of 33 signs and contains all the modalities from the Kinect sensor in addition to the Leap Motion data. The experimental results show that the Leap Motion based descriptor achieves higher accuracy compared to the state-of-the-art descriptors [5]. Furthermore, the DTW-KNN solution component is compared to the state-of-the-art component in [1] and is found to outperform the Cov3dJ based algorithm for recognition using the Leap Motion features. We experimentally demonstrate that fusing the RGBD based hand shape features extracted from the Kinect with the features extracted from the Leap Motion sensor is more robust than single-sensor-based features, achieving an accuracy of 93.90% and avoiding the drawbacks of previous work. The paper is organized as follows: Section 2 presents the related work. Section 3 explains the dataset acquisition. The proposed technique is introduced in Section 4. Section 5 presents the experimental results. Section 6 concludes the work.
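As a rough illustration of the depth-threshold segmentation step described above, the following Python sketch keeps only the RGB pixels whose depth lies within a hand-distance band. The threshold values and the tiny synthetic frame are illustrative assumptions, not the paper's actual parameters:

```python
import numpy as np

def segment_hand(rgb, depth, near_mm=400, far_mm=900):
    """Keep the RGB pixels whose depth falls inside a hand-distance band;
    everything outside the band (background, body) is zeroed out."""
    mask = (depth >= near_mm) & (depth <= far_mm)  # boolean hand mask from depth
    segmented = np.zeros_like(rgb)
    segmented[mask] = rgb[mask]                    # apply the mask to the RGB image
    return segmented, mask

# Tiny synthetic 2x2 frame: only the top-left pixel lies inside the band.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
depth = np.array([[500, 1200],
                  [2000, 100]], dtype=np.uint16)
seg, mask = segment_hand(rgb, depth)
print(int(mask.sum()))  # 1
```

The resulting binary mask, derived from depth alone, transfers the segmentation to the registered RGB image so that the internal hand details survive for the subsequent HOG3D description.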
2 Related Work
The Kinect sensor has been used for the development of gesture recognition systems. For example, SIFT (Scale Invariant Feature Transform), SURF (Speeded-Up Robust Features), and VFH (Viewpoint Feature Histogram) were used in [10] with a nearest neighbor classifier for the recognition of 140 static Indian gestures performed by 18 signers. Accuracies of 49.07%, 55.52%, and 57.19% were recorded by the authors using SIFT, SURF, and VFH, respectively. Gabor, Local Binary Pattern (LBP), and HOG based features were used in [11] to provide useful information about the hand configurations beside the skeletal features. These features are classified by multiple Extreme Learning Machines at the frame level. Then the classifier outputs are modeled at the sequence level and fused together to provide the final decisions. The framework was evaluated on the ChaLearn 2013 dataset, where an accuracy of 85% was achieved on the validation set. In [12], multiple features were extracted from different information modalities, including depth image sequences, body skeleton joints, facial landmark points, hand shapes, and facial expressions, for ASL. In particular, Depth Motion Maps with Histograms of Oriented Gradients (DMM-HOG) are used for feature extraction from the color and depth images, and a histogram is used for representing Binary Facial Expression (BFE) features and Binary Hand Shape (BHS) features. Then, a linear SVM is used for classification, achieving an average recognition rate of 36.07% over 27 lexical items.
In contrast to the Kinect sensor, the Leap Motion device has also been used by researchers for the development of various gesture recognition systems. The Leap Motion sensor was employed in [13] for Australian Sign Language (AuSL) symbol recognition. The sensor accurately tracks hand and finger movements, and an Artificial Neural Network (ANN) was used for recognition. However, the system failed to recognize gestures when the hand position obstructed the sensor's view. A Leap Motion based gesture recognition system for Arabic Sign Language (ArSL) is proposed in [14]. The finger positions and the distances between the fingers in each frame are the features that are fed directly into a Multi-Layer Perceptron (MLP) neural network for recognition. The system achieves a recognition rate of 88% on 50 dynamic sign gestures. In [7], the Hidden Conditional Neural Field (HCNF) classifier was employed to recognize dynamic hand gestures. Two datasets were recorded using the LMC, namely the LeapMotion-Gesture3D dataset and the Handicraft-Gesture dataset. Two kinds of features were used, namely single-finger features (fingertip distances, angles, and elevations) and double-finger features (adjacent fingertip-distances and adjacent fingertip-angles). The recognition rate is 89.5% for the LeapMotion-Gesture3D dataset, which consists of 12 gestures, and 95.0% for the Handicraft-Gesture dataset, which consists of 10 gestures. The Leap Motion sensor has also been employed in [15] for American static sign language recognition of the 26 letters of the English alphabet. New features called average distance, average spread, and average tri-spread were derived from the sensory data. A recognition rate of 79.83% is achieved using a support vector machine classifier. Recently, some researchers have proposed hybrid systems for developing an efficient SLR system by combining the input data from more than one sensor.
A joint approach for gesture recognition was proposed in [6] by combining the Leap Motion and Kinect sensors. The authors computed fingertip angle, elevation, and distance based features from the Leap Motion data, while the Kinect feature is based on the curvature and the correlation of the hand region. A multi-class SVM was used for recognition, achieving an accuracy of 91.28% on 10 ASL static gestures. Another approach is introduced in [16], where the Leap Motion data is used beside the Kinect data to aid feature extraction. The Kinect feature relies on the convex hull, the contour of the hand shape, and the distances of the hand samples from the centroid. The proposed features are fed to two different classifiers, one based on multi-class SVMs and one based on Random Forests, achieving an accuracy of 96.5% on 10 signs. In [4], the authors combined the features extracted using the Kinect and Leap Motion sensors to describe gestures representing various words of Indian Sign Language (ISL). These features depend on fingertip and palm positions and directions. A dataset consisting of 50 dynamic signs was recorded, where 28 signs were performed by a single hand and 22 of the signs were performed using both hands. A Hidden Markov Model (HMM) and a Bidirectional Long Short-Term Memory Neural Network (BLSTM-NN) are combined to boost the accuracy, where accuracies of 97.85% and 94.55% have been recorded for single-handed and double-handed signs, respectively. A CHMM was used in [5] to fuse the features extracted from both sensors to improve the performance, achieving a recognition rate of 90.80% over 25 single-handed dynamic signs.
3 Dataset acquisition
It is a great challenge to build a recognition system able to recognize the whole ArSL dictionary with high performance, so it is convenient to collect datasets for a specific application. One of the most important and necessary applications is the medical application. Therefore, we have collected our sign language dataset from the words used in the medical application. The dictionary in [17] classifies Arabic sign language words according to the application they are used in. We recorded 33 dynamic sign words from the medical application words in this dictionary. These signs are depicted in Fig. 1, where the movement directions are represented by arrows. Some movements are represented by yellow arrows, and others are complex and represented by yellow and red arrows, where the yellow arrows represent the first movement and the red arrows represent the second movement. The movements in the words cancer and rays are directed to the front and are difficult to draw. Some gestures are explained by more than one frame. Eighteen of these words use a single hand and 15 words use both hands. All sign gestures are dynamic signs performed by 10 different signers; three of them belong to a school for the hearing impaired and seven of them are hearing signers. Each sign word is repeated three times by each signer. The capturing setup is shown in Fig. 2(a). The signer sits on a chair, and the Leap Motion is placed below the signer's hand to capture the horizontal hand information. All gestures are performed above the Leap Motion sensor and the desk to ease the hand shape segmentation process. The Kinect sensor is placed in front of the signer to properly acquire the depth and skeleton information. This position is also useful for acquiring