Word-Level Sign Language Recognition from Videos




Hosain, Al Amin




Sign language is the primary form of communication among Deaf and Hard of Hearing (DHH) individuals. Because they do not communicate through speech, voice-controlled assistants such as Apple Siri or Amazon Alexa are not readily available to DHH individuals. An automated sign language recognizer can serve as an interface between a DHH individual and voice-controlled digital devices. Recognizing word-level sign gestures is the first step of an automated sign language recognition system. These gestures are characterized by fast, highly articulated motion of the upper body, including arm movements with complex hand shapes. The primary challenge of a word-level sign language recognizer (WLSLR) is to capture the hand shapes and their motion components. Additional challenges arise from the resolution of the available video, differences in gesture speed, and large variations in gesture performing style across individual subjects. In this dissertation, we study different methods with the goal of improving video-based WLSLR systems. Towards this goal, we introduced a multi-modal American Sign Language (ASL) dataset, GMU-ASL51. This publicly available dataset features multiple modalities and 13,107 word-level ASL sign videos. We implemented machine learning methods using video input alone and a fusion of videos and body pose data. Word-level sign videos typically have a varying number of frames, roughly ranging from 10 to 200, depending on the source and type of the sign videos. To utilize the frame-wise representation of hand shapes, we implemented Recurrent Neural Network (RNN) models using per-frame hand-shape features extracted from a pre-trained Convolutional Neural Network (CNN). To further improve hand-shape representation, we proposed a hand-shape annotation method. This method can quickly annotate hand-shape images and simultaneously train a CNN model. We later used this model as a hand-shape feature extractor for the downstream sign recognition task.
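The RNN-over-CNN-features pipeline described above can be illustrated with a minimal sketch: a simple Elman-style RNN (a stand-in for the actual recurrent architecture used in the dissertation) consumes one hand-shape feature vector per frame and classifies the sign from the final hidden state. All dimensions, weights, and the toy video are assumptions for illustration, not values from the dissertation.

```python
import numpy as np

def rnn_classify(frame_feats, Wxh, Whh, Why, bh, by):
    """Run a simple Elman RNN over per-frame features and
    classify from the final hidden state (illustrative sketch)."""
    h = np.zeros(Whh.shape[0])
    for x in frame_feats:            # one CNN feature vector per video frame
        h = np.tanh(Wxh @ x + Whh @ h + bh)
    logits = Why @ h + by            # one score per sign class
    return int(logits.argmax())

# Toy dimensions (assumptions, not the dissertation's actual sizes):
rng = np.random.default_rng(0)
D, H, C = 16, 8, 5                   # feature dim, hidden dim, #classes
Wxh = rng.normal(size=(H, D))
Whh = rng.normal(size=(H, H))
Why = rng.normal(size=(C, H))
bh, by = np.zeros(H), np.zeros(C)

video = rng.normal(size=(37, D))     # 37 frames of hand-shape features
pred = rnn_classify(video, Wxh, Whh, Why, bh, by)
```

Because the loop runs until the frame list is exhausted, the same weights handle any sequence length, which matches the 10-to-200-frame variability noted above.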
Most of the information in sign language is conveyed through hand-arm movements. To prioritize hand-arm related features, we proposed a pose-guided feature localizing method that operates on the 3D feature maps of a 3D CNN model. This method tracks the location of the hands in feature-map space and extracts representative hand features from a sign video. To further leverage the idea of hand representation, we developed a graph-based hand modeling approach. This formulation treats the hands as graphs and models the finger structure using a Graph Convolutional Network (GCN). When combined with existing models in an ensemble, the graph-based modeling yielded additional recognition gains. In an effort to build an interface between DHH individuals and voice assistants, this dissertation presents different building blocks of a video-based WLSLR, ranging from developing a multi-modal dataset to improving state-of-the-art video classification models. We demonstrate the roles of hand shapes and pose data in several contexts of sign video modeling. We anticipate that the data and the insights that emerged from this work will help advance research towards an automated sign language interpreter.
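The graph-based hand modeling idea can be sketched with one standard GCN layer (Kipf-Welling style propagation) over a toy hand graph. The 5-node skeleton, coordinates, and weights below are purely illustrative assumptions; a real hand skeleton typically has around 21 joints.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer:
    X' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Tiny 5-node hand graph: a wrist node connected to four fingertip nodes
# (an assumption for illustration only).
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1.0
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                 # 3-D joint coordinates per node
W = rng.normal(size=(3, 4))                 # learnable projection
H_out = gcn_layer(X, A, W)                  # updated node features, shape (5, 4)
```

Each layer mixes a joint's features with those of its skeletal neighbors, so stacking a few layers lets fingertip nodes exchange information through the wrist, capturing finger configuration.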



Computer science, CNN, Human Pose Data, Neural Networks, RNN, Sign Language, Video Modeling