Teaching Alexa to understand sign language: controlling a voice assistant without speaking


Alexa, Siri, Xiaodu... a dazzling variety of voice assistants is on offer, but these devices all target users who can hear and speak, overlooking people with hearing or speech impairments. The author of this project noticed that gap and trained Amazon's voice assistant Alexa to recognize American Sign Language. The project spread quickly on social media after its release. This blog post covers the underlying technology of the project and how to build the system using TensorFlow.js.

One night a few months ago, as I lay in bed, a thought flashed through my mind: "If speech is the future of computing interfaces, what about those who cannot hear or speak?" I don't know exactly what triggered the thought. I can hear, I can speak, there are no deaf or mute people around me, and I don't even own a voice assistant. Perhaps it was the endless stream of articles about voice assistants, perhaps it was the companies competing to get you to choose their voice products, or perhaps it was seeing them so often on friends' desks. Since the question refused to leave my mind, I knew I needed to think it through.

That idea eventually became the seed of this project. It is a proof of concept in which I use an Amazon Echo to respond to sign language, more precisely American Sign Language (ASL), since there are many sign languages, just as there are many spoken languages.

While I could have simply published the code, I chose to release a video of the system in action first, because I think many machine learning projects lack a visual element, which makes them hard to grasp and use. I also hoped this approach would shift attention away from the technical elements of the project and toward the human element: not the underlying technology itself, but what that technology can offer us as people.

Now that the video is out, this blog post covers the underlying technology of the project and how to build the system with TensorFlow.js (http://js.tensorflow.org/). You can also try the live demo. I put it together so that you can train it with your own set of words and gestures, and it is up to you whether you place an Echo nearby to respond to the requests.

Early research

I had known for a long time what the major pieces of this experiment would be. I knew I would need:

1. A neural network to interpret signs (i.e., convert video of a gesture to text)

2. A text-to-speech system to speak the interpreted sign to Alexa

3. A speech-to-text system to transcribe Alexa's response for the user

4. A device to run the system on (laptop/tablet) and an Echo to interact with

5. An interface that ties all of this together

I probably spent the most time deciding which neural network architecture was best suited to the task. I considered the following options:

1) Since signs have both a visual and a temporal aspect, my intuition was to combine a CNN with an RNN, where the output of the last convolutional layer (before classification) is fed into the RNN as a sequence. I later learned that the technical term for this is a Long-term Recurrent Convolutional Network (LRCN).

2) Use a 3D convolutional network, where the convolutions are applied in three dimensions: the first two are the image dimensions and the third is time. However, these networks need a lot of memory, and I wanted to train on a seven-year-old MacBook Pro.

3) Rather than training a CNN on every frame of the video stream, train it only on optical flow representations. Optical flow captures the pattern of apparent motion between two consecutive frames. My thinking was that this would encode the motion itself and lead to a more general sign language model.

4) Use a two-stream CNN, where the spatial stream takes a single frame (RGB) and the temporal stream takes the optical flow representation.

In further research I found papers that used at least some of these methods for video activity recognition (most commonly on the UCF101 dataset). However, I soon realized that I could not pull it off: not only was my computing power limited, so was my ability to learn and implement these papers from scratch. After months of on-and-off research, with the project frequently shelved for other reasons, I still had no output to show.

The approach I eventually used was completely different.

Using Tensorflow.js

The TensorFlow.js (https://js.tensorflow.org/) team has released a number of fun browser-based experiments to familiarize people with machine learning concepts and encourage them to use these pieces as building blocks in their own projects. For those unfamiliar with it, TensorFlow.js is an open source library that lets you define, train, and run machine learning models entirely in the browser, using JavaScript. Two demos in particular were good starting points: the Pac-Man webcam controller and Teachable Machine.
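As a minimal illustration of what "define, train, and run in the browser" means, here is a tiny sketch of my own (not code from the project) that fits a one-variable linear model with TensorFlow.js:

```ts
import * as tf from '@tensorflow/tfjs';

async function run(): Promise<void> {
  // A one-neuron model that learns y = 2x - 1 from a handful of points.
  const model = tf.sequential();
  model.add(tf.layers.dense({units: 1, inputShape: [1]}));
  model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

  const xs = tf.tensor2d([-1, 0, 1, 2, 3, 4], [6, 1]);
  const ys = tf.tensor2d([-3, -1, 1, 3, 5, 7], [6, 1]);

  await model.fit(xs, ys, {epochs: 200});

  // Runs entirely in the browser; no server round-trip is involved.
  (model.predict(tf.tensor2d([10], [1, 1])) as tf.Tensor).print();
}

run();
```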

Although both take input images from the webcam and make predictions against the data they were trained on, they work quite differently internally:

1) Pac-Man webcam controller: this uses a convolutional neural network (CNN). An image from the webcam is passed through a series of convolutional and pooling layers that extract the main features of the image and predict its label based on the examples it was trained on. Because training a CNN from scratch is expensive, it does transfer learning on top of a pre-trained model called MobileNet, which was trained on the 1,000 ImageNet classes but optimized to run in browsers and mobile applications.

2) Teachable Machine: this uses a k-nearest-neighbors (KNN) approach. It is very simple and technically performs no "learning" at all. It takes an input image from the webcam and classifies it by finding, via a similarity function or distance metric, the labeled training example closest to the input. Before being fed to the KNN, however, the image is first passed through a small neural network called SqueezeNet, and the output of its second-to-last layer is fed into the KNN, which you then train on your own classes. The advantage is that instead of giving the KNN raw pixel values from the webcam, we give it the high-level abstractions SqueezeNet has already learned, which yields a much better classifier. A rough sketch of this feature-extractor-plus-KNN pattern follows below.
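The sketch below illustrates that pattern with the packaged TensorFlow.js models. The project itself builds on the teachable-machine boilerplate with SqueezeNet; here I substitute the published `@tensorflow-models/mobilenet` and `@tensorflow-models/knn-classifier` packages purely for illustration, and the `webcam` element id is an assumption of mine:

```ts
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';
import * as knnClassifier from '@tensorflow-models/knn-classifier';

async function main(): Promise<void> {
  const extractor = await mobilenet.load();   // stand-in for SqueezeNet
  const classifier = knnClassifier.create();
  const video = document.getElementById('webcam') as HTMLVideoElement; // assumed element
  const webcam = await tf.data.webcam(video);

  // Training: while a capture button is held, keep adding frames under one label.
  async function addExample(label: string): Promise<void> {
    const frame = await webcam.capture();
    const activation = extractor.infer(frame, true); // penultimate-layer embedding
    classifier.addExample(activation, label);
    frame.dispose();
  }

  // Prediction: classify the current frame against the stored examples.
  async function predict(): Promise<void> {
    const frame = await webcam.capture();
    const activation = extractor.infer(frame, true);
    const result = await classifier.predictClass(activation);
    console.log(result.label, result.confidences[result.label]);
    frame.dispose();
    activation.dispose();
  }

  // In the real UI these would be wired to capture buttons and a prediction loop.
  await addExample('weather');
  setInterval(predict, 500);
}

main();
```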

Now, you may be wondering how the temporal nature of these gestures is handled. Both systems take input images frame by frame and make a prediction with no regard for the preceding frames. Isn't that necessary to really understand a sign? While learning ASL from online resources for this project, I noticed that the hand poses and positions at the start and end of a sign differ quite a lot from sign to sign. While the motion in between is necessary for communication between humans, the start and end poses alone are enough for a machine to tell signs apart. Therefore, contrary to the conventional wisdom, I stopped worrying about the motion of a sign and focused only on its start and end.

The decision to use TensorFlow.js proved helpful in other ways as well:

1. I could prototype without writing any code. Simply by running the original demos in the browser, I could start early prototyping: train them on the gestures I planned to use and see how the system performed, even if the output was just Pac-Man moving around the screen.

2. I could run the models directly in the browser with TensorFlow.js. This is huge in terms of portability, speed of development, and ease of interaction with web interfaces. The models also run entirely in the browser, without having to send any data to a server.

3. Because everything runs in the browser, I could easily hook it up to the speech-to-text and text-to-speech APIs that modern browsers support and that I needed to use.

4. It made testing, training, and tweaking fast, which is often a challenge in machine learning.

5. Since I had no sign language dataset, my training examples were essentially me performing the signs over and over, so collecting training data with the webcam was very convenient.

After testing both thoroughly and finding that they performed comparably on my tests, I decided to use Teachable Machine as the base for my system because:

1. On smaller datasets, a KNN can actually perform faster/better than a CNN. KNNs become memory-hungry and slow when trained on large datasets, but I knew my dataset would be small, so that was not a problem.

2. Since a KNN does not really learn from examples, it generalizes poorly: a model trained on a dataset created entirely by one person will not transfer well to another person. This was not a problem for me either, since both the training and test data would be my own repeated gestures.

3. The team has open-sourced a nice boilerplate for the project (https://github.com/googlecreativelab/teachable-machine-boilerplate), which served as a useful starting point.

Working principle

The following is a high-level view of the system's workflow:

1. When the site loads in the browser, the first step is to provide training examples. This means using the webcam to capture yourself performing each sign repeatedly. It is fairly quick: holding down a capture button records frames continuously until you release it, labeling every captured image with the chosen sign. The system I trained uses 14 words which, in various combinations, let me create a range of requests for Alexa.

2. Once training is complete, the system enters prediction mode. It now takes images from the webcam, runs them through the classifier, and finds the closest match among the labeled examples provided in the previous step.

3. If a prediction clears a confidence threshold, the label is appended to the left side of the screen.

4. I then use the Web Speech API for speech synthesis to speak the detected label aloud.

5. If the spoken word is "Alexa", it wakes the nearby Echo, which begins listening for a query. It is worth noting that I created an arbitrary sign (raising my right fist) for the word Alexa, since no sign for it exists in ASL and finger-spelling A-L-E-X-A over and over makes for a poor user experience.

6. Once the entire signed phrase is complete, I use the Web Speech API again, this time for speech recognition, to transcribe the Echo's response; the Echo answers the query with no idea it came from another machine. The transcribed response is displayed on the right side of the screen for the user to read (the speech wiring is sketched after this list).

7. Signing the wake word again clears the screen and restarts the process for a new query.
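Steps 4 and 6 rely on the two halves of the browser's Web Speech API: speech synthesis to speak the detected label to the Echo, and speech recognition to transcribe the Echo's reply. Here is a minimal sketch of that wiring; it is my own illustration rather than the project's code, and the `response` element id is an assumption:

```ts
// Step 4: speak a detected label aloud so the nearby Echo can hear it.
function speak(label: string): void {
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(label));
}

// Step 6: transcribe the Echo's spoken reply and show it to the user.
function transcribeEchoResponse(onText: (text: string) => void): void {
  const Recognition =
    (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
  const recognition = new Recognition();
  recognition.lang = 'en-US';
  recognition.interimResults = false;
  recognition.onresult = (event: any) => onText(event.results[0][0].transcript);
  recognition.start();
}

// Example: speak a signed phrase, then listen for the Echo's answer.
speak('Alexa, what is the weather');
transcribeEchoResponse(text => {
  // Assumed placeholder element for the right-hand side of the screen.
  document.getElementById('response')!.textContent = text;
});
```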

Although the system works relatively well, it does rely on a few tricks to get the desired results and improve accuracy, such as the following (a sketch of how several of them fit into the prediction loop follows this list):

1. Making sure no signs are detected until the wake word Alexa has been signed.

2. Adding a catch-all class to the training set that I labeled "other" for idle states (an empty background, me standing idle with my arms at my sides, and so on). This prevents words from being falsely detected.

3. Setting a high confidence threshold before accepting an output, to reduce prediction errors.

4. Reducing the prediction rate. Instead of predicting at the maximum frame rate, throttling the number of predictions per second helps cut down on false predictions.

5. Ensuring that words already detected in the current phrase are not considered for prediction again.

6. Since sign language usually omits articles and prepositions and relies on context to convey the same meaning, I trained the model with certain words that already include the appropriate article or preposition, e.g. "the weather", "the list", and so on.
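The sketch below, which is my own illustration and not the project's code, shows how tricks 1 through 5 might fit into a single prediction loop: gating on the wake word, ignoring the catch-all "other" class, applying a confidence threshold, throttling the prediction rate, and skipping words already in the phrase. `classifyFrame()` and `speak()` are hypothetical stand-ins for the classifier and speech-synthesis helpers from the earlier sketches:

```ts
// Hypothetical helpers wrapping the earlier sketches.
declare function classifyFrame(): Promise<{label: string; confidence: number}>;
declare function speak(word: string): void;

const THRESHOLD = 0.9;             // trick 3: high confidence threshold
const PREDICTIONS_PER_SECOND = 4;  // trick 4: throttle the prediction rate

let awake = false;                 // trick 1: ignore signs until the wake word
let phrase: string[] = [];

async function tick(): Promise<void> {
  const {label, confidence} = await classifyFrame();

  const accept =
    confidence > THRESHOLD &&
    label !== 'other' &&           // trick 2: catch-all class for idle states
    !phrase.includes(label);       // trick 5: skip words already in the phrase

  if (accept) {
    if (label === 'alexa') {
      awake = true;                // wake word: start a new phrase
      phrase = ['alexa'];
      speak('Alexa');
    } else if (awake) {
      phrase.push(label);
      speak(label);
    }
  }
  setTimeout(tick, 1000 / PREDICTIONS_PER_SECOND);
}

tick();
```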

Another challenge was detecting accurately when the user has finished signing their query, which is essential for correct transcription. If transcription is triggered too early (before the user finishes signing), the system starts transcribing its own synthesized speech; if it is triggered too late, it may miss part of Alexa's response. To overcome this I implemented two separate techniques, each with its own pros and cons:

1. The first option is to tag certain words as terminal words when adding them during training. A terminal word is one the user signs at the end of a phrase. For example, if the query is "Alexa, what's the weather?", then tagging "weather" as a terminal word lets transcription be triggered correctly as soon as that word is detected. While this works, it means the user must remember to tag words as terminal during training, and it assumes the word only ever appears at the end of a query. Rephrasing the query as "Alexa, what's the weather in New York?" would therefore cause problems. This is the approach used in the demo.

2. The second option is to have the user sign a stop word to tell the system explicitly that they have finished their query. The system can trigger transcription as soon as it recognizes this stop word, so the user signs wake word > query > stop word. The risk with this approach is that the user may simply forget to sign the stop word, in which case transcription is never triggered at all. I implemented this method in a separate GitHub branch (https://github.com/shekit/alexa-sign-language-translator/tree/stopword), where the wake word Alexa doubles as the stop word, i.e. "Alexa, what's the weather in New York (Alexa)?". A sketch of this flow follows below.
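Here is a rough sketch, again my own rather than the project's code, of the wake word > query > stop word flow described in the second option, where the wake word also serves as the stop word:

```ts
type Phase = 'waiting' | 'listening';

const WAKE_WORD = 'alexa';   // doubles as the stop word in this variant
let phase: Phase = 'waiting';
let query: string[] = [];

// Hypothetical hand-off to the speech-recognition helper from the earlier sketch.
declare function startTranscribingEchoResponse(): void;

// Called whenever the classifier emits a confident label.
function onSignDetected(label: string): void {
  if (phase === 'waiting') {
    if (label === WAKE_WORD) {
      phase = 'listening';              // wake word signed: start collecting the query
      query = [];
    }
  } else if (label === WAKE_WORD) {
    phase = 'waiting';                  // stop word signed: the query is complete
    startTranscribingEchoResponse();    // now transcribe the Echo's reply
  } else if (!query.includes(label)) {
    query.push(label);                  // accumulate the signed query
  }
}
```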

Of course, the whole problem would go away if there were an accurate way to distinguish speech coming from the internal source (the laptop) from speech coming from the external source (the nearby Echo), but that is a separate challenge altogether.

Looking further ahead, I think there are many other ways to approach this problem, and they might be good starting points if you want to build a more robust and general model for your own project:

1. TensorFlow.js has also released PoseNet, and using it could be an interesting approach. From the machine's point of view, tracking the positions of the wrists, elbows, and shoulders in the frame should be enough to make a prediction for most words; finger positions tend to matter mainly when you are finger-spelling something. (A sketch of feeding PoseNet keypoints into a classifier follows this list.)

2. Using a CNN-based approach, as in the Pac-Man example, might improve accuracy and make the model more robust to translation. It would also help the model generalize better across different people, and it would make it possible to save a trained model or load a pre-trained Keras model, so the system would not have to be retrained every time the browser is restarted.

3. Some combination of CNN + RNN or PoseNet + RNN that takes the temporal dimension into account might improve accuracy.

4. Using the newer, reusable KNN classifier (https://github.com/tensorflow/tfjs-models/tree/master/knn-classifier) that now ships with TensorFlow.js.
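As a sketch of the PoseNet idea in point 1, here is my own illustration, under the assumption that wrist, elbow, and shoulder coordinates carry most of the signal: the keypoints replace image embeddings as the feature vector fed into the KNN classifier (the `webcam` element id and helper names are mine, not the project's):

```ts
import * as tf from '@tensorflow/tfjs';
import * as posenet from '@tensorflow-models/posenet';
import * as knnClassifier from '@tensorflow-models/knn-classifier';

// Keypoints assumed to carry most of the signal for whole-arm signs.
const JOINTS = ['leftWrist', 'rightWrist', 'leftElbow', 'rightElbow',
                'leftShoulder', 'rightShoulder'];

async function main(): Promise<void> {
  const net = await posenet.load();
  const classifier = knnClassifier.create();
  const video = document.getElementById('webcam') as HTMLVideoElement; // assumed element

  // Turn one frame into a small feature vector of joint coordinates.
  async function poseFeatures(): Promise<tf.Tensor1D> {
    const pose = await net.estimateSinglePose(video);
    const coords = pose.keypoints
      .filter(k => JOINTS.includes(k.part))
      .flatMap(k => [k.position.x, k.position.y]);
    return tf.tensor1d(coords);
  }

  // Training: call with a label while performing the sign.
  async function addExample(label: string): Promise<void> {
    classifier.addExample(await poseFeatures(), label);
  }

  // Prediction: classify the current pose against stored examples.
  async function predict(): Promise<void> {
    const result = await classifier.predictClass(await poseFeatures());
    console.log(result.label, result.confidences[result.label]);
  }

  // As before, these would be wired to capture buttons and a prediction loop.
  await addExample('weather');
  setInterval(predict, 500);
}

main();
```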

Since the project was first released, it has been widely shared on social media and covered in the press, and Amazon has even shipped an accessibility feature (Tap to Alexa) on the Echo Show for people who may be unable to speak. I have no evidence that my project influenced them to build that feature (the timing is quite a coincidence), but it would be wonderful if it did. I hope that future Echo Shows, and other camera-and-screen-based voice assistants, build this kind of capability in directly. To me that would be the ultimate use case for this prototype, and it could open these devices up to millions of new people.

Reducing the complexity of the network and settling on a simple architecture for the prototype definitely helped me implement this project quickly. My goal was never to solve the full sign-language-to-text problem. Rather, it was to start a conversation about inclusive design, to present machine learning in an approachable way, and to inspire people to explore this problem space, and I hope this project does just that.

Original address: https://medium.com/tensorflow/getting-alexa-to-respond-to-sign-language-using-your-webcam-and-tensorflow-js-735ccc1e6d3f?linkid=55302800
