With most of us holed up in our homes and coordinating work over video calls due to the COVID-19 pandemic, you have probably become well-acquainted with a variety of video conferencing software. A great feature of these video calling apps is that they automatically switch the video feed to whoever is talking, in real time. That switching is triggered by audio, however, so it doesn’t work for sign language users, who can end up feeling left out of the conversation.
Google researchers have decided to fix this accessibility issue by building a real-time sign language detection engine. It detects when a person on a video call starts communicating in sign language and puts the spotlight on them by making them the active speaker.
This model was presented by Google researchers at ECCV 2020. The research paper, titled “Real-Time Sign Language Detection using Human Pose Estimation”, describes how a ‘plug and play’ detection engine was created for video conferencing apps. Efficiency and low latency were crucial, and the new model is said to handle both well. I mean, what good will a delayed and choppy video feed do?
Here’s a quick look at what the sign language engine sees in real-time:
Now, if you are wondering how this sign language detection engine works, Google has explained it in detail. First, the video passes through PoseNet, which estimates key points of the body such as the eyes, nose, and shoulders. This reduces each frame to a stick figure of the person, and the figure’s movements are then fed to a model trained on the German Sign Language corpus, which decides whether the person is signing.
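To make the stick-figure idea a bit more concrete, here is a minimal Python sketch of how movement between two frames of key points could be turned into a single number. It is only an illustration under stated assumptions: the landmark names, the normalization by shoulder width, the frame rate, and especially the fixed threshold are hypothetical stand-ins, not the researchers’ code. The real engine uses a trained model where this sketch uses a simple cutoff.

```python
import math
from typing import Dict, Tuple

# Hypothetical layout: one frame maps a landmark name to (x, y) pixel coordinates.
Frame = Dict[str, Tuple[float, float]]


def shoulder_width(frame: Frame) -> float:
    """Distance between the shoulders, used to make motion scale-independent."""
    (lx, ly), (rx, ry) = frame["left_shoulder"], frame["right_shoulder"]
    return math.hypot(lx - rx, ly - ry)


def motion_score(prev: Frame, curr: Frame, fps: float = 30.0) -> float:
    """Mean landmark displacement per second, in units of shoulder width."""
    scale = shoulder_width(curr)
    shared = prev.keys() & curr.keys()
    if scale == 0 or not shared:
        return 0.0
    total = 0.0
    for name in shared:
        (px, py), (cx, cy) = prev[name], curr[name]
        total += math.hypot(cx - px, cy - py)
    return (total / len(shared)) / scale * fps


def is_signing(score: float, threshold: float = 0.5) -> bool:
    """Assumed stand-in for the trained classifier: a plain threshold."""
    return score > threshold


# Example: shoulders stay put, one wrist moves 10 pixels between frames.
prev = {"left_shoulder": (0.0, 0.0), "right_shoulder": (100.0, 0.0), "right_wrist": (50.0, 50.0)}
curr = {"left_shoulder": (0.0, 0.0), "right_shoulder": (100.0, 0.0), "right_wrist": (50.0, 60.0)}
print(is_signing(motion_score(prev, curr)))
```

Dividing by shoulder width means a person sitting close to the camera and one sitting far away produce comparable scores for the same gesture.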
This is how the researchers detect that the person has started or stopped signing. But how is that person assigned the active speaker role when there is essentially no audio? That was one of the biggest hurdles, and Google overcame it by building a web demo that transmits a high-frequency 20 kHz audio signal to whichever video conferencing app you connect it to. This fools the video conferencing app into thinking that the person using sign language is speaking, and thus makes them the active speaker.
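The audio trick can be sketched in a few lines of Python. This is a rough illustration, not Google’s demo code: the 48 kHz sample rate, the amplitude, the 20 ms chunk size, and the function names are all assumptions made for the example. The only detail taken from the article is the 20 kHz tone, which needs a sample rate above 40 kHz to be represented at all.

```python
import math
from typing import List

SAMPLE_RATE = 48_000  # assumed; must be more than twice the tone frequency
TONE_HZ = 20_000      # the 20 kHz signal mentioned in the article


def tone_samples(duration_s: float, amplitude: float = 0.2) -> List[float]:
    """Generate a 20 kHz sine wave as floats in the range [-amplitude, amplitude]."""
    n = round(SAMPLE_RATE * duration_s)
    step = 2 * math.pi * TONE_HZ / SAMPLE_RATE
    return [amplitude * math.sin(step * i) for i in range(n)]


def audio_for_chunk(signing: bool, duration_s: float = 0.02) -> List[float]:
    """Emit the tone while the user is signing, silence otherwise."""
    if signing:
        return tone_samples(duration_s)
    return [0.0] * round(SAMPLE_RATE * duration_s)


chunk = audio_for_chunk(signing=True)
print(len(chunk), "samples in one 20 ms chunk")
```

In a real setup these samples would be routed into the conferencing app as microphone input, so that the app registers “speech” only while the detector says the person is signing.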
Google researchers have already managed to achieve 80% accuracy in predicting when a person starts signing, and they say the model can be optimized to reach over 90% accuracy, which is impressive. This sign detection engine is just a demo (and a research paper) for now, but it may not be long until one of the popular video conferencing apps, be it Meet or Zoom, adopts it to make calls more accessible for people who communicate in sign language.