In the recent “Double Eleven” shopping carnival, sales of the “Tmall Genie” smart speaker broke through one million units in just nine hours. The price war Alibaba set off is enough to show the importance of the smart speaker market. On November 14th, at GASA University Sixiang Class II, Professor Wang Gang, chief scientist of the Alibaba A.I. Laboratory, explained the “Tmall Genie” product and Alibaba’s breakthroughs in human-computer interaction for the participants. He also had in-depth exchanges with the students on commercial monetization, integration with the Alibaba ecosystem, user experience, large-scale commercial voice interaction, and competition and cooperation.
Voice is becoming an important interaction mode for artificial intelligence
"Tmall Genie" is a smart speaker launched by the Alibaba Group A.I. Labsoratory in August this year. It is also the first hardware product we are exploring for the next generation of human-computer interaction. It has the voice series technology we developed, which can recognize the user's voice and help people to convert to text, as well as the technology of natural language understanding, which can give the user's intention to the corresponding structured information, and then provide the corresponding information to the user. service.
"Tmall Genie" Although it is a very small product, it may be similar to our small mobile phone, but its function is very rich, because its background combines a lot of services, basically including all aspects of life. Get up in the morning for you to control lights, curtains, organize the day's arrangements, broadcast the news, help you take a taxi when you go out, book a meeting room, order takeaways, check flight information, quick inquiries in the hotel, go to your fitness instructor, etc. Help us greatly simplify the acquisition of services and content in our lives. Of course, the most important thing is that as a speaker, it doesn't need to be banned, so we also have the most complete copyright on the Internet. We have reached a cooperation with Tencent. Basically, we can get all the genuine songs, so this is also ours. Some unique advantages. With such a product, a lot of content such as apps in our mobile phone can be copied.
Both are ways of interacting, so why does voice interaction have an advantage over a mobile app? We can make a comparison with listening to a song through a mobile app: the steps are to take out the phone and unlock it, find the app, type in the title of the song, and then tap play. This process may take a minute or more, while with the Tmall Genie it may take only five seconds, so the gain in speed and efficiency is very obvious.
Semantic Understanding Platform (AliGenie Open Platform)
This is the architecture diagram of our voice interaction system. After the voice signal from the device is received by our gateway, a series of recognition and understanding steps begins. On the main link, we first use speech recognition to turn the sound into text; then our semantic understanding platform interprets the user's sentence based on the developer's configuration, the domain design, and the corpus and dictionary data. If the intent requires a third-party service, that service is invoked through a service-proxy mechanism. After that, according to the configuration of each skill, the corresponding reply is assembled and then read out to the user through speech synthesis.
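As a rough illustration of this main link, here is a minimal sketch in Python. All the component names (asr, nlu, service_proxy, tts) and the skill-configuration fields are hypothetical stand-ins, not actual AliGenie APIs.

```python
# A minimal sketch of the voice-interaction pipeline described above, with
# hypothetical component names; this is not AliGenie's real interface.

def handle_utterance(audio_bytes, asr, nlu, service_proxy, tts, skills):
    """Run one round trip: audio in, synthesized reply out."""
    text = asr.transcribe(audio_bytes)           # speech recognition: sound -> text
    result = nlu.parse(text)                     # semantic understanding: text -> intent + slots
    skill = skills[result["skill"]]              # per-skill configuration from the open platform
    if result["intent"] in skill["third_party_intents"]:
        # invoke the accessed third-party service through the service proxy
        payload = service_proxy.call(skill["endpoint"], result["intent"], result["slots"])
    else:
        payload = skill["local_handler"](result)
    reply_text = skill["reply_template"].format(**payload)  # wrap reply per skill config
    return tts.synthesize(reply_text)            # speech synthesis: text -> audio
```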
It is worth mentioning that the "Tmall Genie" also supports voiceprint recognition, which can identify different users by their voices, helping to ensure security and privacy. This technology plays an important role in both payment and user identification. At the same time, we provide a complete end-to-cloud solution, so our natural language interaction system will not be limited to the "Tmall Genie" in the future but will be integrated into all kinds of terminal devices, including cars, headphones, and more.
Solving the problem of natural semantic understanding with deep learning
Natural language human-computer interaction is an interaction medium. We need developers from all walks of life to voice-enable their services so that users can reach them more easily. Every interaction revolution brings a major upgrade to the service industry, so today I will also briefly introduce our AliGenie open platform, which helps developers from every industry come in and build their own dialogue robots. Our original motivation for natural language understanding was to let people from all walks of life voice-enable their services. Therefore, beyond the "Tmall Genie" terminal product, we hope to make our technology available to third-party partners, including voice wake-up, speech recognition, voiceprint recognition, semantic understanding, speech synthesis, and so on.
Here let me zoom in on the structure of the natural language semantic understanding system. We decompose natural language understanding into two tasks: intent recognition and slot filling. In the semantic understanding part, the definitions of intents and slots come from the developer's configuration on the AliGenie open platform. We also use a knowledge graph and user profiles accumulated over a long time to help us understand the user's meaning. The result of this understanding is passed to the dialogue engine, where the developer-defined dialogue strategy and the calling strategy for third-party services are preloaded. The dialogue engine will sometimes proactively ask a clarifying question; for instance, given an instruction to turn off the lights, it may ask, "Do you want to turn off the bedroom lights now?" It will also call third-party services to generate the result of the dialogue. Eventually a natural language response is produced and returned to the user.
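To make the dialogue engine's behavior concrete, here is a toy sketch (not Alibaba's code) of the clarifying-question logic: when a required slot is missing from the NLU result, the engine asks back instead of calling the service. The REQUIRED_SLOTS configuration and intent names are invented for illustration.

```python
# Toy dialogue-engine step: ask a clarifying question if a required slot
# is missing; otherwise call the backing service and reply.

REQUIRED_SLOTS = {"turn_off_light": ["room"]}   # hypothetical skill configuration

def next_action(nlu_result, call_service):
    intent, slots = nlu_result["intent"], nlu_result["slots"]
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    if missing:
        # proactively ask the user to fill in the missing slot
        return {"type": "ask", "text": f"Which {missing[0]} did you mean?"}
    return {"type": "reply", "text": call_service(intent, slots)}

# "Turn off the lights" parses with no room slot, so the engine asks back.
print(next_action({"intent": "turn_off_light", "slots": {}},
                  lambda i, s: "Done."))
```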
In the entire voice interaction system, I think the most difficult part is natural language understanding, because language is a type of data created by people. The machine is attempting something humans are naturally good at, so users' expectations of the machine's ability to understand are very high. This is why many so-called artificial intelligence systems strike users as "artificial stupidity."
Natural language understanding is hard because people's language is highly diverse and ambiguous. For example, when asking about the weather, many users may ask, "Do I need to wash the car tomorrow?" or "What pants should I wear tomorrow?", and other rather unexpected questions. Yet to give users a good experience we want to identify these accurately, which is very challenging. Beyond the diversity just mentioned, there is ambiguity. For example, if a user asks how much an apple costs, different users mean different things: a technology enthusiast may be asking about the electronic device, while someone else may be asking about the fruit. To resolve such ambiguity, we also bring context information into the model in order to make better decisions.
Defining natural language understanding and its challenges
So here we define the scene of completing a task through natural language understanding, which is the core issue of today's course: how to let the machine help us accomplish a certain task. The first step is to define what is being understood and how it is represented. Here we use a frame-based semantic representation, structuring the meaning of a sentence with a skill, an intent, and slot values. In this way, we abstract natural language understanding in this scenario into a machine learning problem: mapping from a natural language sentence to these three structured elements.
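A minimal way to write down this frame representation, with illustrative field values, might look like the following.

```python
# A sentence is mapped to (skill, intent, slot values); the example strings
# are illustrative, not drawn from the production schema.

from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    skill: str                                   # which skill handles the request
    intent: str                                  # what the user wants to do
    slots: dict = field(default_factory=dict)    # structured arguments

# "Play Qi Li Xiang by Jay Chou" might become:
frame = SemanticFrame(skill="music", intent="play_song",
                      slots={"artist": "Jay Chou", "song": "Qi Li Xiang"})
print(frame)
```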
Once the problem is clearly defined, let's see how to solve it. There are many interesting challenges. For example, natural language admits countless sentences and words, yet the training corpus is generally very limited while users' expectations are very high; this poses a very big challenge for both the model and the open platform. Another big challenge is that most developers are not NLP or machine learning experts. Enabling them to collaborate with NLP experts and configure a skill's understanding without knowing natural language technology or machine learning techniques is also very challenging.
Let me give a brief summary here, because one characteristic of our natural language understanding technology is its extensive use of deep learning. Deep learning now dominates in all walks of life because it can discover from data which features carry the most essential and important information, and it has greatly advanced natural language understanding in the last two to three years. For many people, the application of deep learning to NLP has mostly meant translation and other seq2seq problems, so today I want to introduce the more important work from the perspective of natural language understanding.
We use a lot of deep learning technology in the AliGenie system. The more traditional approach is the CNN, a convolutional neural network that uses convolution to find the most effective local information. The other is the LSTM: because a CNN focuses more on local patterns, while natural language must be processed in context, where no word is independent, we need a method like the LSTM that can incorporate context. Compared with a CNN, which in shallow feature extraction captures only n-gram information, an LSTM can better extract long- and short-range relationships in a sentence and better encode the dependencies in a text sequence, and the attention mechanism now in use makes these correlations even more direct.
One simple extension of ours is to deeply integrate the CNN and the LSTM, training an end-to-end system that expresses both local signals and contextual information. This optimization is included in our processing, as sketched below.
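A minimal PyTorch sketch of this CNN+LSTM combination: a 1-D convolution extracts local n-gram-like features, a BiLSTM then encodes them with context, and the pooled states classify the intent. The layer sizes are illustrative, not the production configuration.

```python
# CNN + LSTM text classifier: convolution for local features, LSTM for context.

import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, num_intents, emb=128, conv_ch=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        # 1-D convolution over the token dimension: local (n-gram-like) features
        self.conv = nn.Conv1d(emb, conv_ch, kernel_size=3, padding=1)
        # bidirectional LSTM over the convolved sequence: long-range context
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_intents)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                     # (batch, seq_len, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, conv_ch, seq_len)
        h, _ = self.lstm(x.transpose(1, 2))           # (batch, seq_len, 2*hidden)
        return self.out(h.mean(dim=1))                # pool over time, then classify

logits = ConvLSTMClassifier(vocab_size=10000, num_intents=20)(
    torch.randint(0, 10000, (2, 12)))                 # two sentences of 12 tokens
print(logits.shape)                                   # torch.Size([2, 20])
```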
In addition, we have also built adaptive neural networks. Current neural networks have a shortcoming: after training, they process inputs with different needs using the same fixed structure to extract information, and this mode is not optimal. Because each input sentence has its own characteristics, we devised an adaptive neural network whose key point is that it can adjust its structure according to the characteristics of the input signal to find the most useful and important information.
The attention-based joint model of intent recognition and slot filling is what we currently use in the AliGenie system. For the intent recognition part, we use the convRNN network described above. The outputs of the BiLSTM, its hidden states h, are then used for the sequence labeling task, that is, slot filling.
As I mentioned earlier, an attention-based classification method is used here for sequence labeling. Because the solution space of sequence labeling is very large, a CRF layer is added on top, and the globally most probable labeling sequence is computed by Viterbi decoding. Here we use the power of the knowledge graph to encode the possible knowledge in a sentence: for example, Jay Chou may be a singer, an actor, or a composer in our knowledge base. This kind of knowledge helps us better understand the semantics of the sentence, and it is optimized jointly with the semantic features learned by the neural network.
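The following simplified PyTorch sketch captures the joint structure under stated assumptions: a shared BiLSTM, attention pooling over its hidden states h for intent recognition, and a per-token head for slot filling. The convRNN features and the CRF layer with Viterbi decoding described above are omitted for brevity.

```python
# Simplified joint model: one encoder, two heads (intent + slots).

import torch
import torch.nn as nn

class JointIntentSlot(nn.Module):
    def __init__(self, vocab, n_intents, n_slots, emb=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)             # attention scores per token
        self.intent_head = nn.Linear(2 * hidden, n_intents)
        self.slot_head = nn.Linear(2 * hidden, n_slots)  # a CRF would sit on top

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))             # h: (batch, seq_len, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)           # attention over the sequence
        sent = (a * h).sum(dim=1)                        # weighted sentence vector
        return self.intent_head(sent), self.slot_head(h)

intent_logits, slot_logits = JointIntentSlot(10000, 20, 30)(
    torch.randint(0, 10000, (2, 12)))
print(intent_logits.shape, slot_logits.shape)            # (2, 20) and (2, 12, 30)
```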
For example, when a sentence contains keywords like "play" and "song", the neural network will judge that "Jay Chou" is much more likely to be a singer. Knowledge encoding is also useful when sentence structures coincide: "get me a huangmen chicken rice" and "play me a Qi Li Xiang" have the same sentence pattern, but because the knowledge content differs (a dish versus a Jay Chou song), we can infer different intents. In our experiments, the improvement from this component was very obvious. Based on this model, our intent recognition and slot filling are highly accurate. We also tested the effectiveness of our approach on public datasets, achieving the best results in the industry on both text classification and question-answer matching datasets.
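One common way to realize this kind of knowledge encoding, sketched here with a made-up type inventory and lookup table, is to give each token a vector of indicator features for the knowledge-base types it may match and concatenate that with its word embedding before the encoder.

```python
# Knowledge-graph type features per token; KB_TYPES and the lookup are
# invented for illustration, not the production knowledge base.

import torch

KB_TYPES = ["singer", "actor", "composer", "dish", "song"]

def kb_features(tokens, kb_lookup):
    """tokens: list of words; kb_lookup: word -> set of KB types."""
    feats = torch.zeros(len(tokens), len(KB_TYPES))
    for i, tok in enumerate(tokens):
        for t in kb_lookup.get(tok, set()):
            feats[i, KB_TYPES.index(t)] = 1.0    # indicator for each possible type
    return feats                                  # (seq_len, num_types)

lookup = {"qilixiang": {"song"}, "huangmen-chicken-rice": {"dish"}}
print(kb_features(["play", "qilixiang"], lookup))
```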
Wonderful Q&A not to be missed
Q: Will iFLYTEK (Keda Xunfei) be reduced to a pure technology company?
A: Everyone has now formed a consensus that for the development of AI, talent and algorithms are very important, and another key factor is data; without data, momentum and development are limited. Tmall controls the terminal and has an advantage in data; based on large sales, it can build its own ecosystem and form a cycle. You just mentioned iFLYTEK, which has always been strong in core technology. Although a larger amount of data brings improvement, the gains gradually converge; in the later stage the data increment is not as powerful as in the early stage. The curve looks roughly like that, so I don't think technology will be weakened in the later period. It can only be said that we will need ever more revolutionary technologies, and we also hope that companies like iFLYTEK can make greater innovations at the technology frontier and lead the way.
Q: I have two questions. One is about the commercialization of the "Tmall Genie": are you profitable? The other: when we went to study in the United States, a professor said that about 50% of smart speakers are actually used at home simply as speakers, so the gap between smart and non-smart is not that big. For the "Tmall Genie", what proportion of users interact with it every day as a smart device, and how many just use it as a speaker?
A: The first question is actually beyond my scope, because I am a person who does algorithms and research, so commercial questions such as how to make money in the future are not my biggest concern. But I feel that if it is an artificial intelligence product, then in the early days we may not need to think too much about that; let it develop freely and don't load too much commercialization onto it, and in the future it may bring us many surprises, as Alibaba's cloud business did. We will try our best to make the product well, make the technology work well, and reach more users. This is our biggest KPI now, and we hope to achieve it.
On your second question: first of all, I think that even listening to music reflects intelligence. As in the example I just gave, listening to music through a mobile app requires a long interaction with the phone, while with the smart speaker it takes only five seconds; that is itself a display of intelligence. It takes human-machine interaction a step forward, making it more natural and more convenient, and once users adapt to this kind of interaction, it can be extended to other areas. For example, when cooking in the kitchen, if I want to buy a seasoning, I used to have to wash my hands and open the mobile app; now I can just call out. So I think it can be extended: once users get used to it, once we cultivate them to this way of interacting, they will carry it into other domains and do more operations.
From our data, listening to music is users' biggest demand, since it is after all a speaker, but users also use many other services, including setting alarm clocks, listening to news, asking about schedules, and even ordering takeout. Users are actually still exploring the services we offer; in many cases they may not know a service exists, so we also push notifications from time to time to let them know new features are coming online this week. Users find this quite exciting; it feels like an unknown world with new surprises every week.
Q: We are also looking at this space. Our view is this: in the long run it is actually a smart butler. For example, if my wife has a birthday today, it will automatically remind me, "Don't forget, this is a big deal," and based on my wife's habits and what she already has, it will automatically recommend what she lacks, fully connected with Alibaba's ecosystem. That is our long-term imagination. In the near term it is a kind of speaker; at this stage it is an appetizer before the big meal, a way to verify a small closed loop, not necessarily directly tied to the business model. But on the technology question of whether iFLYTEK will remain a technology company, I very much agree with Dr. Wang Gang that there is still too much technology left to solve. For example, we just said that CNN+LSTM works very well, but here is a big problem: how do we let the computer remember my wife's birthday and remind me some days in advance, saying, "Your wife's birthday is coming; last time you bought perfume, so maybe not perfume this time; there is a new lipstick, do you want to buy one?" This kind of thing has to be linked with common sense. Can CNN plus LSTM solve this problem? I can't see it.
A: Right, it can't.
Q (following up): So there are too many problems to deal with, and this thing's commercial monetization is tightly bound to the application scene. But there is one thing about monetization I am curious about: simple things can be bought by voice, not just listening to songs, while things that are too complicated may not be purchasable. You said I can buy a box of washing powder, and it can be bought immediately. This immediately raises a question: how is it connected with Alibaba's Wangwang? I would like to know about this.
A: Voice interaction is a new mode that can improve the efficiency of interaction in many cases and is also friendlier to certain groups, such as the elderly; that is the experience side of this interaction. But the services it ultimately provides may still be the same ones. The point you just raised is particularly good: in the future, natural language understanding must take more world knowledge and experiential knowledge into account, and so we must build a knowledge graph to handle the ambiguity of language. When people talk to each other, they often omit many words, because each side assumes the other has the corresponding knowledge. We hope the speaker can have similar abilities; natural language understanding at this link will increase its efficiency.
Q: Can you tell us about the user experience of the current product, and the future directions for improvement?
A: We have many metrics for this system. First is the achievement rate: since we position it as an intelligent assistant that helps users complete tasks, for example when I say I want to listen to a song, it should play the corresponding music, we need to check whether the result returned by the speaker actually accomplishes what the user wanted; that is the achievement rate. The other is performance: how long it takes to return a result, which is also an important indicator. Of course there are other indicators as well: for music, how long the user listened and whether playback was cut off in the middle are things we watch. Judging from current operations, our achievement rate has steadily increased. After the first version came out, testing and evaluation showed the achievement rate was not particularly ideal, so we went through several iterations of the technology, and now our achievement rate is good. However, we are still optimizing performance: the processing chain of this product is very long, we think there is room for speed optimization, and we hope to complete the user's request within one second.
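For concreteness, a back-of-the-envelope sketch (not Alibaba's actual tooling) of the two metrics mentioned, achievement rate and response latency, could look like this.

```python
# Toy computation of achievement rate and a latency percentile from logs.

def achievement_rate(interactions):
    done = sum(1 for it in interactions if it["task_completed"])
    return done / len(interactions)

def p95_latency(interactions):
    times = sorted(it["latency_s"] for it in interactions)
    return times[int(0.95 * (len(times) - 1))]   # nearest-rank 95th percentile

log = [{"task_completed": True, "latency_s": 0.8},
       {"task_completed": False, "latency_s": 2.1},
       {"task_completed": True, "latency_s": 1.2}]
print(achievement_rate(log), p95_latency(log))
```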
Q: There are now two strands in artificial intelligence: one is image recognition and the AI around it, and the other is voice and text. How do you see these two different dimensions and the relationship between them? In the future, as they evolve and gradually realize perception and understanding, will they converge into one intelligence or develop along two completely parallel directions? If they can be brought together, where is the point of convergence, and what is its core capability?
A: Jack Ma (Ma Yun) once said that human intelligence is different from machine intelligence. I think that in terms of application scenarios, image AI and voice/text AI can run in parallel: voice can be used in smart speakers, while vision can be used in autonomous driving, face recognition, security, and so on. From the product perspective, they are more of a parallel relationship. For people, our brain processes different signals in different regions, but there are always ways for different regions to communicate. Deep neural networks, as a method requiring little hand-coded knowledge, can automatically mine which features matter from the data, so the method applies to both images and speech, because both are problems of signal processing and signal analysis; the methods can be quite general even though the inputs and outputs differ.
Q: In my own understanding, the computer mouse was one mode of operation, and the mobile phone touch screen is another; voice control of devices is hoped to be the new mode of interaction in the future. I personally think it took about ten years from Apple's finger touch to maturity. How long will it take for voice interaction to reach true large scale? The second question: the smart home entrance is now much discussed; if managing the home by voice really becomes a reality, how long might it take to reach large-scale use?
A: Large-scale use of voice interaction, I think, has already happened in the United States. The Echo, for example, sells about 10 million units per year, which may mean tens of millions of households behind it, so the coverage in the United States is already very large. In China, the first issue is the maturity of the technology: we hope voice interaction can become more reliable and stable. I am very optimistic; I think within a year to a year and a half we can reach a request-fulfillment rate above 95%.
Q: What do you think about the development of artificial intelligence in hardware versus software? AI hardware, such as the "Tmall Genie", takes the form of a voice device; how do you see the direction of software and hardware?
A: I think they are increasingly integrated behind the scenes; the endpoint may just be a product form. For example, with our speaker, there is a piece of hardware in the user's home, but the artificial intelligence is actually processed in the cloud, where there are more computing resources. So I think these are just two different forms of presentation. Some AI, such as the automatic retouching in Meitu XiuXiu, appears in the form of software; our current "Tmall Genie" takes the form of hardware, but I think the underlying intelligence is similar; there is not much difference.
Q: Let me ask one technical question and one non-technical question. On the technical side, the application scenario is what excites me; what do you think is Alibaba's advantage at the technical level compared with other platforms?
A: This is a very interesting question. As I just said, it may be very difficult to judge whose technology is better, and the technology is constantly evolving; our system is updated often, and I believe other companies are constantly updating theirs, so it is hard for me to evaluate. For us, things like the knowledge graph can be combined with our strategic assets to compensate for the shortcomings of natural language understanding; this is a direction we are very interested in. I am not clear about other companies, whether they will do this or how much difference there will be once they do; I can't comment now, because technology development often requires trying again and again, and how much improvement an experiment will bring cannot be judged in advance.
Q (following up): The second, non-technical question: if I am a product company and want to make a speaker, I would really like to cooperate with Alibaba, but the product you made gives me some concerns. The product sells very well, the results are your own, you price it very cheaply, and all the data goes to you. How, then, can I cooperate with you?
A: We haven't cooperated with speaker companies so far. We work more with content providers, with skill developers, and with other IoT device vendors. So my suggestion might be that everyone do what is needed: in the future, don't make more speakers; make content and make skills. That is very welcome.
Q: There are other speakers in the country now. Ximalaya also has its own content, and its technology is provided by Cheetah; I am using it too. But the music copyrights you just mentioned in the "Tmall Genie" come from Tencent, and Tencent is also developing AI and doing speech recognition. What is the competitive situation there? I'd like to hear your thoughts.
A: Alibaba Music has its own copyrights and Tencent Music has its own, so we made a copyright exchange that benefits both sides; it is not that we simply took Tencent's copyrights. As for the question you raise: we are at a very early stage of smart speaker voice interaction, and perhaps everyone wants to make this thing bigger first. As for how the final landscape will be divided, I think it is too early to comment. In any case, I think everyone will put the user first: a user who buys this speaker hopes for more complete content, so we should cooperate first to get this done, and talk about the rest later.
Q: Your technology must keep iterating. Is there any possibility that Alibaba will cooperate with others in the future? The speaker is an entrance because its scene is wide enough, useful throughout the home and the office, but some settings have special living scenes. In the kitchen, for example, there are many places that could carry this technology; it could be embedded into appliances. Say I make refrigerators: the technology could connect with the fridge for shopping, and I don't necessarily listen to music there, but recipes could be built in. Or say I have a mirror in the bathroom: many things could be connected behind the mirror, such as telling me through face recognition what my body needs for better health. Because a lot of data of this kind exists, would you be willing to cooperate with these specialized vertical fields, export your technology into more scenes, and develop it together?
A: Actually, we released AliGenie at the Yunqi Computing Conference. Compared with the "Tmall Genie" device, it is an open platform that includes the ASR and NLP capabilities I mentioned just now, along with the more than one hundred services integrated behind them. We hope to export such capabilities and cooperate with third-party manufacturers, such as smart home manufacturers; we are very happy to do so and in fact have many ongoing discussions. We also hope to deliver into different vertical industries; for example, China Southern Airlines has worked with us to customize speakers for their VIP lounges. To summarize, our mentality is very open and the theme of cooperation is very clear: we genuinely hope to join with all capable partners to make this bigger together.
Q: The scene in the United States is very different from that in China: their homes are relatively large, and housewives spend a lot of time at home, while in China many families don't cook much and spend less time at home. Looking at future market development, have you fully considered the differences between this product's market and the US market? Do you predict that China's market will develop with the same success as the US acceptance of the Echo? Have you considered these market differences?
A: When we built this, we first believed that many user habits are cultivated. As I just said, voice interaction can genuinely improve convenience, a convenience many users are not yet aware of. This kind of interaction is related to the carrier, and we can also export our capabilities to other devices, such as home products and mirrors. I think the revolutionary improvement that voice interaction brings is in demand in both China and the US; it may just be that different markets need different optimizations to make it acceptable to users in a better way. That is how I see it.
Secondly, although everyone says there are differences between China and the United States, we sold one million units in nine hours. I think this is factual evidence that Chinese families do have this need; it is just that some people are more pessimistic. In any case, I am optimistic: we believe this market can take off and achieve great success like the Echo in the United States. We firmly believe this.
Q: There are actually two camps in the market. One group thinks the smart speaker will become the next home entrance, a big step forward. The other group thinks it is a pseudo-proposition: it doesn't matter whether there is a smart speaker, since you could put the same capability into other things, even a receiver mounted on the wall, or just inside the phone. What it really has to do is three things. One is controlling your smart home. The second is connecting services: for example, calling a Didi ride or a delivery just by voice, without opening the mobile app. The third is providing good content: whether I am listening to music, listening to audiobooks, or watching TV, I can get content through it. Whoever truly integrates these three into a new APP Store will be the master of the next entrance. But now everyone is making speakers. I myself bought a bunch of these devices, starting with the Echo, and after all this time I really can't find much use for them; at least for me, I still end up using the mobile phone. Do you think it must take the form of a speaker, or will it be a new voice interaction platform?
A: I believe it will be a more general platform for voice interaction. Ultimately it should be ubiquitous, with no obvious sense of presence: when you want it, it comes; when you don't need it, you don't see it. I think that may be the ideal state. But in the early days we also need a case to prove its usability, and the speaker was selected as the carrier; whether in the US or in China, we can already see the value of voice interaction through it. In the future, whether it's the Amazon Echo or our AliGenie, our vision is to make voice ubiquitous, not limited to a form like a speaker, but integrated in a more flexible and free way into all the connected electronic devices in our lives. That is our belief.
Q: I did speech recognition for more than a decade. In the United States, speech recognition was adopted very early because drivers legally cannot take their hands off the steering wheel or hold a mobile phone, so the law effectively pushed voice recognition forward in that era and educated consumers about it. In China, many cars also have Bluetooth and voice recognition, but why hasn't it taken off? The question I want to ask is about people and machines: emotionally, Chinese people seem conservative, subtle, and shy, reluctant to interact with a machine. The Echo sells more than 10 million a year in the United States; what are its sales in China, or what do you foresee for, say, 2019 or next year? On our current mobile phones, Siri is widely available and Huawei's phones have voice recognition, but how many people really interact with the machine, asking tomorrow morning, "What time is it, how is the weather?" Are Chinese cultural and usage habits a cause of the lag in acceptance and promotion of speech recognition? Someone also told me that men and women use voice recognition differently. When robots like SoftBank's were placed in banks, some full-body and some half-body, the bank ran a statistic showing most of the people interacting with the machine were women, because they are more patient, while men felt shy, or thought it silly to talk to a machine. Have you studied this? It may actually determine whether our products can ultimately succeed. What do you think about this matter?
A: A very interesting question. We didn't study the personality of the audience. My feeling is that few people used voice interaction before mainly because the technology was not mature enough. In the car, for example, noise makes recognition inaccurate; after failing twice, the user feels frustrated and decides the mobile phone is easier. My understanding is that technical immaturity caused users to lose patience. In recent years, thanks to deep learning and the use of big data, the quality of speech recognition and language understanding has been getting better and better, so we are slowly approaching the critical point: if eight or nine sentences out of ten get a proper answer, users will accept it. So I think it is a technical problem.
Q: As for multimodal interaction, what has Alibaba done? Please share it with us.
A: We have worked on facial expressions and we have also worked on gestures; we have done research and development on such technologies. But as to whether such an interaction mode will actually ship in products, that is not yet certain.
Q: The home environment is very noisy; for example, with children around and family members chatting, will there be misjudgments, and how is that solved now?
A: This situation does exist. In our current products, when the noise is relatively loud, it causes recognition errors. Going forward we may build models of the background noise, use the position of the sound source, and so on, hoping to suppress these interfering factors as much as possible. This is what we are working on now, and we hope to solve this problem next year.
Q: I especially want to know where the concept came from when you first started doing this product; why did you decide you had to do this?
A: We have been studying this form of interaction for a long time. We feel that every group needs a more convenient interaction: regardless of nationality, gender, or age, people need a more convenient, more natural, and easier way to interact, so we firmly believe voice interaction is worth doing. We also believe that many things are such that only after we have built them does the user discover, "I need this." If there is resistance before users adopt it, we press on because of that belief.
Q: I visited iFLYTEK for research about seven or eight years ago. At that time they were already doing voice recognition, and they have been doing it for a very long time. In my impression, Alibaba certainly hasn't been doing speech for as long as they have. What is the difference in your philosophies, and in the directions of these two technologies?
A: iFLYTEK is a very respectable high-tech company. We may not have been doing speech as long as they have, but Alibaba's accumulation is still very rich: we have long had many speech recognition needs in the cloud, so we also have a team for this, and it is not small; across Alibaba quite a lot of people have been persistently working on speech recognition. As for direction: deep learning is now the most promising direction, and I think both iFLYTEK and we are working on deep learning. Deep learning is a very big box with many optimization points, possibly new neural network designs or new optimization methods. Each company will optimize according to its own ideas, and we can't know in advance whether we will converge to the same point or end up with a big difference.
Q: Can voiceprints now distinguish every person's voice as unique?
A: The number of distinguishable voiceprints is relatively small. For about six people, say a family environment, we can do it, but a voiceprint does not carry information as rich as a face, so there may be problems once there are more people; it runs into trouble at larger scales.
Q: I use a "Little Fish at Home" (Xiaoyu Zaijia) device, and my experience is that they are also iterating. Before coming here I visited Sinovation Ventures, and they are now splitting into two lines, one for the family and one for the enterprise. I want to ask whether there is a similar split on the "Tmall Genie" side.
A: We also have vertical applications with industry. One example is hotels: we have extensive cooperation with Wanda Hotels, placing a Tmall Genie in each hotel room to control the smart devices in the room or, for example, to call for room service. We also provide special features with China Southern Airlines, such as in their VIP lounges. So we attach great importance to deep integration with different vertical products; we are doing this as well.
Q: How is the security of a product like the Tmall Genie handled? For example, how can users at home trust it, so that when I don't want it to hear me, it really isn't listening?
A: The Tmall Genie has a wake-up word, which is "Tmall Genie" itself. Only after the wake-up word is called does it open the microphone and start processing sound. Without the wake-up word, no audio is captured or processed at all; if you need the Tmall Genie, you have to speak to it first.
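A toy sketch of that wake-word gate: audio frames are discarded until a keyword-spotting model fires, and only afterwards are frames passed downstream. Both detect_wake_word and process are hypothetical stand-ins, not real device APIs.

```python
# Wake-word gating: pre-wake audio is never forwarded anywhere.

def gated_stream(frames, detect_wake_word, process):
    awake = False
    for frame in frames:
        if not awake:
            awake = detect_wake_word(frame)   # listen only for the wake word
            continue                          # pre-wake audio is discarded
        process(frame)                        # mic is "open": send for recognition
```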
Q (following up): How do you make users believe that? After all, when it is on, it is monitoring my voice.
A: This is a very interesting question. We can only explain it to users, and we have corresponding terms explaining our data protection. Our internal protection of user data is very strict: only audio after the user says "Tmall Genie" is handled, and even those of us building the algorithms find it basically impossible to access the raw audio; the data we work with is processed data. So our data protection is very strict.