In science fiction films we often see robots sharing the stage with human beings, communicating with people freely, and sometimes even appearing smarter than humans. People naturally wonder how such man-made machines are built: can we really create such a robot now?
Joking aside, I am not going to answer that question here. Looked at from another angle, though, communicating with a robot can be made quite simple: we use voice to interact with the machine. At the core of this kind of human-robot communication is speech recognition, that is, the robot must first "understand" what you say. So this article will talk about realizing human-computer interaction through speech.
Let's start with a simpler example, the Windows Speech Recognition program:
The main purpose of Windows Speech Recognition is to command your computer by voice, achieving human-computer interaction without the keyboard and mouse. Through voice control you can start programs, switch between windows, use menus, click buttons, and so on. The feature is limited to a set of common operations and instructions within the Windows system, and the entire voice operation is carried out in coordination with what is shown on the monitor.
For example, suppose you want to use voice to open a program through the Start menu. When you say "start", the system overlays a "show numbers" partition on the screen (the numbers are translucent, so you can still see which program or folder sits under each one). If you want to open the "Downloads" folder, you just say its number, "10", and the program opens the "Downloads" folder for you. There are two reasons for this design. First, if you want to open a program you installed yourself, the Windows voice library may not contain its name, which leads to poor or even failed recognition. Second, displaying numbers and recognizing spoken numbers makes the response to instructions more efficient. This coordination between speech and the on-screen display greatly improves the efficiency and accuracy of using Windows by voice.
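To make the idea concrete, here is a minimal sketch of the "show numbers" mechanism: instead of trying to recognize arbitrary program names, the system tags each visible target with a number and only has to recognize digits. The function names and menu items below are hypothetical illustrations, not Windows APIs.

```python
# Sketch: label on-screen targets with numbers, then resolve a spoken digit
# back to the target the user meant. Names here are illustrative only.

def assign_numbers(targets):
    """Give every clickable target on screen a small number label."""
    return {str(i + 1): name for i, name in enumerate(targets)}

def resolve_spoken_number(spoken, numbered_targets):
    """Map a recognized digit back to the target it labels."""
    return numbered_targets.get(spoken.strip())

start_menu_items = ["Documents", "Pictures", "Music", "Downloads"]
numbered = assign_numbers(start_menu_items)      # {"1": "Documents", ...}
print(resolve_spoken_number("4", numbered))      # -> "Downloads"
```

Recognizing one of a dozen digits is far more reliable than recognizing an open-ended program name, which is exactly the trade-off the design makes.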
Similarly, if you want to operate on desktop shortcuts or files by voice, the system provides a so-called "mouse grid" feature that divides the desktop into numbered regions, again using voice plus visual cues to improve operational efficiency and recognition accuracy:
In the current Windows Speech Recognition program, in addition to dictation (entering text and symbols by voice), there are 16 common commands, 9 common control commands, 31 text-editing commands, 15 window commands, 5 commands for clicking anywhere on the screen, and a few sets of keyboard commands. The user's voice interaction revolves around these prepared commands, designed to improve the efficiency of using the computer and to free the hands from the mouse and keyboard as much as possible.
With a similar intention, we can also see speech recognition applied in today's mainstream mobile devices:
Now let's take one step further and think about it. What if we are not facing a computer or a mobile phone, but a robot, an anthropomorphic, humanoid robot? Comparing it with the examples above, you can easily see how it differs from ordinary electronic devices: it probably does not have the kind of screen we are used to, so the more efficient interaction that comes from combining voice commands with visual aids on a screen is limited for the robot. Facing such a robot, you will immediately wonder: is it listening to me? Does it understand what I am saying? What can it understand? What should I say? A whole string of questions comes up at once.
In fact, with our existing technology and conditions, especially for a mass-market consumer robot, achieving the kind of free communication between people and robots shown in films is almost impossible. When we build a product there are of course functional positioning, market demand, and many other considerations. What I want to discuss here is a robot that can provide users with various kinds of information and carry out simple, voice-driven "chat": how should it handle voice interaction? I will use Qrobot as the example, a robot that, as far as possible, interacts with people directly without relying on a computer screen and provides all kinds of information.
Man was created by God, and the robot is created by man; under present knowledge and technical conditions, a robot can do nothing until we have given it certain abilities. Below are a few points about what needs to be done so that people can interact with a robot.
First, give the robot a "brain", the raw material of thought: knowledge and language libraries. If all of these "raw materials" are piled up in one heap, a robot like Qrobot, which provides a wide range of information and communication functions, may become confused and fail to understand when you ask it for something. (A robot cannot judge the proper meaning of a word in the current context based on the relationship between the dialogue and its context.) So we first classify the robot's voice knowledge base, separating different types of content and domain-specific words and phrases, to improve the efficiency and accuracy of the robot's service. If the user wants information or functions in a particular area, the robot's "thinking" must first enter the corresponding language library. For example, when you want the robot to tell you about "music", you need to let the robot enter the music-related library; once it is there, it will treat what you say as music-related content or instructions.
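The sketch below illustrates this partitioning idea under some simplifying assumptions: the robot only interprets an utterance against the library it has currently "entered", and falls back to ordinary dialogue handling otherwise. The library names, phrases, and class are invented for illustration; they are not Qrobot's actual configuration.

```python
# Sketch: a "brain" that only matches utterances against the currently active
# voice library partition. Contents are illustrative placeholders.

VOICE_LIBRARIES = {
    "music": {"play": "PLAY_SONG", "next": "NEXT_TRACK", "stop": "STOP_MUSIC"},
    "weather": {"today": "WEATHER_TODAY", "tomorrow": "WEATHER_TOMORROW"},
}

class RobotBrain:
    def __init__(self):
        self.active_library = None          # None = integrated / ordinary mode

    def enter_library(self, name):
        """Switch the robot's 'thinking' into one domain library."""
        self.active_library = name if name in VOICE_LIBRARIES else None

    def interpret(self, utterance):
        """Match the utterance only against the active library's phrases."""
        if self.active_library is None:
            return "HANDLE_AS_ORDINARY_DIALOGUE"
        library = VOICE_LIBRARIES[self.active_library]
        for phrase, command in library.items():
            if phrase in utterance.lower():
                return command
        return "NOT_UNDERSTOOD_IN_THIS_LIBRARY"

brain = RobotBrain()
brain.enter_library("music")
print(brain.interpret("please play something"))   # -> "PLAY_SONG"
```

Keeping the match set small per partition is what buys the accuracy gain described above.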
Compare this with Siri on Apple's recent iPhone 4S. According to published analysis, Siri works as a centralized voice analysis center: it listens to the user's voice and extracts keywords to understand the user's intent (of course, the user still has to know what the iPhone can do for him). It may then confirm with you before triggering the corresponding functions and services. In the end it provides suggestions and services drawn from the whole iPhone system, whether local apps or the cloud (web APIs), integrating information and functionality. This approach makes the product look smarter and easier to use.
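As a hedged sketch of that centralized style (not Apple's actual implementation), a single analysis step can extract keywords from the whole utterance, guess an intent, and confirm before routing the request. The intents and keyword lists below are made up for illustration.

```python
# Sketch: centralized keyword extraction -> intent guess -> confirmation.
# Intents and keywords are invented examples.

INTENT_KEYWORDS = {
    "SET_ALARM": ["alarm", "wake", "remind"],
    "CHECK_WEATHER": ["weather", "rain", "temperature"],
    "SEND_MESSAGE": ["message", "text", "tell"],
}

def guess_intent(utterance):
    words = utterance.lower().split()
    scores = {
        intent: sum(1 for kw in keywords if kw in words)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def handle(utterance):
    intent = guess_intent(utterance)
    if intent is None:
        return "Sorry, I didn't catch that."
    # In a real assistant this would be a spoken confirmation step.
    return f"Do you want me to {intent.replace('_', ' ').lower()}?"

print(handle("will it rain tomorrow"))   # -> "Do you want me to check weather?"
```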
Of course, in addition to the division into specialized libraries, the robot also needs "ordinary" thinking, that is, the ability to recognize the various instructions and everyday conversation that fall outside the specialized language libraries (the integrated mode mentioned above). Otherwise it would be all "machine" and no "person".
Second, for Qrobot, switching between library partitions, and switching from a voice library partition back to "integrated mode", can happen not only through voice commands but also without any voice intervention at all, which raises the questions of how listening is triggered and how long the listening window stays open.
As you can see from the image above, the Windows Speech Recognition program uses a floating controller to switch whether the machine listens to your instructions or not. You can use voice to put the program into the off state, but once it is off you cannot use voice to command it to start listening again; at that point you have to go back to the mouse.
The iPhone's voice control works by touching the screen to start Siri and enter a voice mode, in which the user can operate the phone and use services by voice. Once you quit Siri, the phone no longer understands anything you say.
Likewise, you would not want the Qrobot to listen to you all the time; and when you do need it to provide a specific kind of information, how do you get it into the right voice library quickly so that it can answer efficiently and accurately? The robot cannot be operated with a mouse, so we design touch response areas and corresponding gestures for it:
1. Use the touch response area to control whether the robot listens to instructions or not.
2. Switch the robot's voice library through a combination of the touch response area and a voice instruction,
or use specific trigger words and phrases to make the robot enter or switch voice libraries, so that information can be obtained efficiently and accurately (these likewise divide into two types of instructions); a small sketch of this switching logic follows below.
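The sketch assumes a simple controller: a touch on the response area toggles listening, and a recognized trigger phrase moves the robot into a specific voice library. The class, trigger phrases, and return values are hypothetical, meant only to show the two switching paths in points 1 and 2.

```python
# Sketch: touch toggles listening; trigger phrases (or touch + voice) switch
# the active voice library. All names are illustrative.

TRIGGER_PHRASES = {"let's talk about music": "music",
                   "what's the weather": "weather"}

class ListeningController:
    def __init__(self):
        self.listening = False
        self.library = None                 # None = integrated mode

    def on_touch(self):
        """Touching the response area toggles whether the robot listens."""
        self.listening = not self.listening

    def on_utterance(self, text):
        if not self.listening:
            return "IGNORED"                # robot is not listening right now
        for phrase, library in TRIGGER_PHRASES.items():
            if phrase in text.lower():
                self.library = library      # switch into that voice library
                return f"ENTERED_{library.upper()}_LIBRARY"
        return "PASS_TO_CURRENT_LIBRARY"

ctrl = ListeningController()
ctrl.on_touch()                                     # start listening
print(ctrl.on_utterance("Let's talk about music"))  # -> "ENTERED_MUSIC_LIBRARY"
```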
In addition, the way the robot listens for the user's instructions differs from case to case. In a "conversation" state, for example, the robot needs continuous speech recognition, which is required both by the conversational context and by the underlying voice technology. For operational commands or one-off information requests, and while the robot itself is speaking, continuous recognition is not used; instead we set a suitable voice-monitoring window, outside of which the robot does not try to recognize speech, so that it neither mishears nor is triggered by mistake.
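A minimal sketch of these two listening modes, assuming a simple timed window for one-off commands and an always-on mode for conversation (the window length is an arbitrary illustrative value):

```python
# Sketch: continuous recognition during "conversation" vs. a short monitoring
# window for one-off commands, after which recognition stops.
import time

class ListeningWindow:
    def __init__(self, continuous=False, window_seconds=5.0):
        self.continuous = continuous
        self.window_seconds = window_seconds
        self.opened_at = time.monotonic()

    def should_recognize(self):
        """Continuous mode always listens; timed mode only inside the window."""
        if self.continuous:
            return True
        return (time.monotonic() - self.opened_at) < self.window_seconds

conversation = ListeningWindow(continuous=True)
one_off_command = ListeningWindow(continuous=False, window_seconds=5.0)
print(conversation.should_recognize())     # True, keeps recognizing
print(one_off_command.should_recognize())  # True only within the 5 s window
```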
Third, the same topic can be expressed in many different ways, and there is never only one correct answer to a question. So the next task is to let the robot understand as many different ways of expressing a request as possible, and to let the robot respond to the same request or question in a different way, or even a different mood, each time (which makes the robot appear more intelligent and more human).
As shown above, because of the flexibility and richness of language, a great deal of input and output configuration has to be done in the voice library, which exists in two parts: local (the robot's built-in storage) and cloud.
On the input side, each instruction needs to be prepared and configured in the library with its many possible phrasings and key words, so that whatever expression the user chooses, the robot can accurately determine the intent behind it and provide correct feedback and service.
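Here is a rough sketch of such an input-side configuration, assuming one intent is backed by many phrasings plus fallback keywords so different users' wording resolves to the same instruction. The entries are invented examples, not a real Qrobot library.

```python
# Sketch: many phrasings and keywords configured per intent; matching tries
# exact phrasings first, then falls back to keywords. Entries are placeholders.

INTENT_CONFIG = {
    "PLAY_MUSIC": {
        "phrasings": ["play a song", "put on some music", "i want to hear music"],
        "keywords": ["song", "music", "listen"],
    },
    "TELL_WEATHER": {
        "phrasings": ["what's the weather", "will it rain"],
        "keywords": ["weather", "rain", "sunny"],
    },
}

def match_intent(utterance):
    text = utterance.lower()
    for intent, config in INTENT_CONFIG.items():
        if any(p in text for p in config["phrasings"]):
            return intent                       # exact phrasing hit
        if any(k in text for k in config["keywords"]):
            return intent                       # fall back to keyword hit
    return None

print(match_intent("could you put on some music"))  # -> "PLAY_MUSIC"
print(match_intent("is it going to rain today"))    # -> "TELL_WEATHER"
```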
On the output side, once the robot has understood an instruction and processed it in its "brain", it feeds the result back to the user. As mentioned above, the designer cannot prepare only one answer. To give users accurate information while also reflecting the robot's "human" side, a large reserve of phrasings, selection algorithms, and key-word configuration is needed, so that each output is both appropriate and varied.
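A small sketch of the output side under the same assumptions: several prepared answers for one result, with the robot avoiding the answer it gave last time so replies feel less mechanical. The response texts are placeholders.

```python
# Sketch: pick a varied reply for an intent, avoiding immediate repetition.
import random

RESPONSES = {
    "PLAY_MUSIC": ["Okay, here comes a song for you.",
                   "Sure, let me put something on.",
                   "Music time! Hope you like this one."],
}

_last_reply = {}

def reply(intent):
    options = RESPONSES[intent]
    previous = _last_reply.get(intent)
    candidates = [r for r in options if r != previous] or options
    choice = random.choice(candidates)
    _last_reply[intent] = choice
    return choice

print(reply("PLAY_MUSIC"))
print(reply("PLAY_MUSIC"))   # guaranteed to differ from the first reply
```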
Because there are not many voice-interaction products we can access and use in daily life, and the technology is still not fully satisfactory, the text above only touches the surface of speech-recognition-based interaction and products from a few basic angles. The value of voice interaction to users may show itself in the following situations:
1. The user has visual impairments or defects
2. The user's hands are busy
3. The user's eyes are occupied with something else
4. When a flexible response is required
5. Occasions where keyboard, mouse, and other input forms are not convenient to use
However, voice interaction also has its shortcomings. Compared with interaction by touch, voice interaction increases the user's cognitive burden; it is easily disturbed by external noise; and speech recognition becomes unstable as the user or the environment changes, and so on. I have only borrowed some of the voice-interaction issues involved in the Qrobot project and organized and discussed them in concise, plain language; students interested in this field are welcome to offer guidance and discussion.
(This article is from the Tencent CDC Blog; when reprinting, please indicate the source.)