Cloud Acoustic's Liang: An Intelligent Voice Cloud for the Mobile Internet

Last Update:2015-03-19 Source: Internet

Author: User

The fifth China Cloud Computing Conference was held June 5-7, 2013, at the Beijing National Convention Center. Taking an international perspective, the conference offered insight into global cloud computing trends and, starting from applications, explored focus topics such as cloud computing and big data, cloud computing and the mobile Internet, cloud security, and industry applications of cloud computing. The conference also set up a cloud computing services exhibition area to exchange the latest international research results, showcase the achievements of China's cloud computing pilot cities, share development experience, and promote global cooperation on cloud computing innovation.

Liang, founder and CEO of Beijing Cloud Acoustic

The following is a transcript of the speech:

Liang: Thank you. I am very pleased to have the opportunity at the Cloud Computing Conference to share a concrete application of cloud computing: the voice cloud, how it is used, and how our Cloud Acoustic platform is opened up to developers. My talk covers several parts: the background of the mobile Internet boom that motivated our voice platform, breakthroughs in speech recognition technology, the Cloud Acoustic open voice cloud platform, Internet application cases, and an application development guide.

The mobile Internet boom has three main characteristics:

First, more bandwidth at lower cost. Mobile communication technology has developed rapidly, from the analog era to 2G, 3G, and even 4G and Wi-Fi. Bandwidth keeps widening and costs keep falling, so communication between terminals and the cloud platform is high quality and increasingly cheap.

Second, intelligent mobile terminals. Ten years ago the machine we used to go online was a PC, which is inconvenient to carry. Today there are many smart devices, the simplest being the smartphone. There are also TVs, in-car devices, wearables such as Google Glass, and even some toys that can communicate through spoken language.

Third, cloud computing platforms and virtualization technology drive productivity. Cloud platforms, virtualization, and advances in CPUs and GPUs make the platform far more productive.

With these three conditions, a very small mobile terminal can connect over mobile networks to a powerful cloud computing platform and get very good online interactive services. This is the hardware trend of the mobile Internet.

Under these conditions we are seeing an even bigger boom. First, mobile terminals: shipments are forecast to reach 390 million in 2013. The user base is also large: by the end of 2012 there were more than 400 million mobile Internet users. In mobile voice search, 10% of Baidu's searches last year came from voice search, and Google's figure is more than 25%. Mobile products also place more emphasis on interaction, and a good product manager now commands a salary no lower than an architect's.

This chart shows the number of Internet users and the penetration rate from 2005 to 2012. The roughly 100 million Internet users of 2005 have grown to nearly 600 million, almost six times as many. Internet penetration has risen from 8.5% to 42.1%. A large share of this is access via mobile phone, which grew from 50 million users in 2007 to more than 400 million in 2012. In the past only 1 in 4 people went online through a mobile phone; now 3 in 4 do.

The world's mainstream speech recognition systems are built from the following five components:

1. Feature extraction. The captured sound signal is transformed into a feature sequence. This step has to deal with environmental noise and with channel effects, where the channel is whether the voice was collected through a microphone, a mobile phone, or a landline telephone. Third, it has to remove speaker factors: for example, if I have a certain accent, that factor should be removed.

2. Statistical acoustic model. We need speech from enough people to model pronunciation. For example, when people pronounce the sound "ah", different people's voices differ, and the model captures how those sounds are distributed. The most recent improvement in this area is deep neural network learning; the model used to be built with Gaussian mixture models, whose modeling capability is weaker. If pronunciation is recognized accurately, the rest works like a pinyin input method. In fact the biggest interference comes from this front layer, because different people have different accents, backgrounds, and channels. Once the speech has been converted into a phonetic string, the problem is the same as ordinary pinyin input.

3. Pronunciation dictionary. The pronunciation dictionary maps pronunciations to words, and building it takes great care. The Chinese vocabulary is very large: there are about 70,000 Chinese characters, and the most commonly used vocabulary runs to more than 20,000 words. There are also domain-specific dictionaries, since the vocabulary of the food domain and the map domain is not the same. There are hot-word lists, which matter a great deal on the Internet: at some point a new word appears that nobody had heard before, or an existing word takes on a new meaning. There is also a personalized lexicon, such as each person's address book.

4. Statistical language model. Different word strings occur with different frequencies, and the language model is a statistical analysis of the probability of word strings. The larger we make this model, the more likely the search is to find the right word string.

5. Recognition decoder. It is essentially a search engine: given a feature sequence, it quickly finds the best-matching sentence.
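The statistical language model in item 4 can be illustrated with a toy bigram model. This is only a minimal sketch: the three-sentence corpus, the add-alpha smoothing, and the function names are assumptions made for the example, not the model used on the platform described in the talk.

```python
from collections import Counter

def train_bigram(sentences):
    """Count word histories and word pairs over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])            # every word that acts as a history
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Probability of a word string under an add-alpha smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

# Hypothetical training data, chosen only for illustration.
corpus = ["check the weather", "check the stock", "watch the news"]
unigrams, bigrams = train_bigram(corpus)
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

likely = sentence_probability("check the weather", unigrams, bigrams, len(vocab))
unlikely = sentence_probability("weather the check", unigrams, bigrams, len(vocab))
print(likely > unlikely)
```

A decoder can use such scores to prefer a word string that was seen often over an acoustically similar one that was not.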

Speech recognition looks like artificial intelligence and seems almost magical. We often use the example of a magician: the magician works with all kinds of tricks and props, and the result seems inconceivable, but it all rests on solid basic skills. Speech recognition is at heart a guessing problem: when I see a speech feature signal, I guess which sentence you meant to say. If I get ten guesses right, you will feel that the system is very accurate.

The most recent technological breakthroughs are due to the maturity of the statistical speech recognition framework: we can use more and more data to make the system better and better, which hand-written rules simply cannot match. The academic community has made a lot of progress over the last 10 years. In the context of big data, knowing which of these techniques are truly effective and integrating them into one accurate system depends on the team's strength and depth of understanding.

Focus on DNN (deep neural network) modeling

The item in red is DNN (deep neural network) modeling, which started in 2009, although it has been applied since 2006.

There was a breakthrough in the technology itself, but more importantly, computing capacity and the ability to model massive data have become very powerful, and that is what makes it practical.

How should the key indicators of speech recognition technology be evaluated? Two points are very important. First, accuracy. If recognition is inaccurate, it is not worth using. How should accuracy be measured? If I say a sentence of 100 words, how many are recognized correctly? We also want to reduce wrong words, extra words, and missing words. In the industry, a practical system needs about 90% accuracy. That might have been achieved in the lab ten years ago, but 90% accuracy under real-world conditions is hard. Second, the real-time factor: how much processing time is needed for each second of speech? An online service needs a real-time factor below 1; at 1, an online service is very difficult, and today the faster the better.
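Both indicators can be computed in a few lines. The sketch below assumes the common edit-distance definition of accuracy, where wrong words, extra words, and missing words each count as one error; the talk does not state the exact formula used, and the sample sentences are made up.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, insertions and deletions between two word lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cur.append(min(
                prev[j] + 1,                           # missing word (deletion)
                cur[j - 1] + 1,                        # extra word (insertion)
                prev[j - 1] + (ref_word != hyp_word),  # match or wrong word
            ))
        prev = cur
    return prev[-1]

def word_accuracy(reference, hypothesis):
    """Accuracy = 1 - errors / number of reference words."""
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)

def real_time_factor(processing_seconds, audio_seconds):
    """Processing time spent per second of speech; below 1 means faster than real time."""
    return processing_seconds / audio_seconds

reference = "check the weather in beijing".split()
hypothesis = "check weather in beijing now".split()  # one missing word, one extra word
print(word_accuracy(reference, hypothesis))
print(real_time_factor(processing_seconds=2.0, audio_seconds=10.0))
```

Note that with this definition extra words can push accuracy below what a simple count of correctly recognized words would suggest.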

What is the hardest part? Speech toolkits are open source and quite mature, so building a recognition system is not very difficult. The difficulty is whether, when the system is used at large scale, the parameters of the whole system can be jointly optimized to achieve very good performance. Here is a rough illustration that is not mathematically rigorous. I mentioned five links in total. If each link achieves 99% accuracy, the overall system can reach about 95%. If each link achieves only 95%, the overall accuracy can reach only about 77%. So the biggest difficulty is getting the most out of every link.
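The arithmetic behind these figures is a product of per-link accuracies. The sketch below assumes, as the speaker's non-rigorous illustration implies, that the five links fail independently and that each has the same accuracy.

```python
def pipeline_accuracy(stage_accuracy, stages=5):
    """Overall accuracy when every stage must succeed and failures are independent."""
    return stage_accuracy ** stages

for acc in (0.99, 0.95):
    print(f"{acc:.0%} per link -> {pipeline_accuracy(acc):.1%} overall")
# 99% per link -> 95.1% overall
# 95% per link -> 77.4% overall
```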

Speech recognition keeps getting faster, and when the response is fast the experience is very good. For large-scale deployment, speed also reduces cost: every time the speed doubles, the number of machines can be halved. Uses of speech recognition can be divided into voice control and voice input. Voice control is like the LeTV solution, where channels can be switched by voice. Voice input works like a cloud input method. There are also voice queries, question answering, and dialogue, which require semantic understanding and data services.

Cloud Acoustic opens a free SDK: develop a voice recognition app in less than 5 minutes

This is our mission and service. At a time when market demand is exploding and voice technology is breaking through, we hope to provide accurate, real-time, professional, and complete intelligent voice services. Our service philosophy is professional, innovative, open, and win-win. We hope to use our expertise to build a platform that serves developers, so that together we can share in the era of mobile voice.

Our voice cloud has grown quickly. We launched a closed test on September 29 last year, inviting teams and developers from the industry to try it. On November 21 the Sogou voice assistant was released. In December last year and April this year we made two significant performance improvements, including building deep neural network models. On May 15 this year we announced that the platform is completely open and permanently free for developers. Anyone who registers on our website can get the SDK, whatever the application or business model, with no restrictions from us. As long as the service meets everyone's needs, we will continue to provide it for free.

Our platform's first capability is speech recognition, which converts sound into text. The second is semantic understanding: once we have the text string, how do we know the user's real intention, such as whether he wants to check the weather, watch TV, check stocks, or shop? The third is the knowledge graph, which links knowledge together through something like a database and a graph, and connects it to semantic understanding to satisfy the user's intention.
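To make the semantic understanding step concrete, here is a deliberately simple keyword-based intent router. It is a sketch only: the intent names follow the examples in the talk, but the keyword lists and the `detect_intent` function are hypothetical, and a production system would use far richer models than keyword matching.

```python
import re

# Hypothetical keyword lists for the intents mentioned in the talk.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "temperature"},
    "tv": {"tv", "channel", "episode"},
    "stocks": {"stock", "stocks", "shares"},
    "shopping": {"buy", "order", "shopping"},
}

def detect_intent(text):
    """Return the intent whose keywords overlap most with the recognized text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {intent: len(words & keywords)
              for intent, keywords in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(detect_intent("Will it rain in Beijing tomorrow?"))
print(detect_intent("Switch the TV to channel five"))
```

The recognized intent would then be passed to a data service or knowledge graph lookup for that domain.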

The platform supports several kinds of customers. Application developers can build many kinds of applications, such as voice control and queries, as well as medical, education, and movie lookup apps, WeChat road conditions, Chumenwenwen ("go out and ask"), and so on. Intelligent customer service is for enterprises: a company's data can be linked into our knowledge graph, and users can ask the enterprise, through the customer service platform, about development plans, prices, orders, and so on. Then there are advertisers: an enterprise that wants to do marketing will push some advertisements, and advertisers can reach all kinds of terminal users through the applications on the platform. You can register and download the SDK on our official website.

We dare to build this platform because we have more than 10 years of accumulated technology, and our platform is at the leading level in the industry. Speed is the fastest: each second of speech needs only 0.2 seconds of computation, and because the audio is encoded and transmitted as a stream, it is hard to notice any recognition delay. By contrast, if you record a whole voice message first, as on WeChat, and only then send it and wait for the result, the difference is very noticeable. Our service platform has now run without a fault for more than six months; it is very stable and can be scaled out as needed. Its capacity already exceeds 20 million requests per day. Engine updates and system iterations are done online on our platform, so users do not need to update anything and directly experience the improvement.

Here is how online performance has developed. Last September our platform achieved 85% accuracy. By the end of 2012, through many tests and online optimizations, we had raised accuracy to more than 90%. Over the last 4 months, engine optimization and online data iteration have pushed accuracy above 93%. The next version should reach 95%.

Our recognition real-time factor started at 0.55 and reached 0.45 by the end of last year. That improvement looks small, but DNN computation is many times more expensive than the traditional approach, so improving system performance while increasing computational complexity is very big progress. Progress over the last 3 months has been even greater, more than doubling the speed. All this runs on very ordinary servers and does not require powerful computing resources.
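The link between real-time factor and hardware cost can be sketched with simple arithmetic. The model below is an assumption for illustration: one CPU core handles one stream at a time at full utilization, and the load of 1,000 concurrent streams is hypothetical. Only the real-time factors 0.55, 0.45, and 0.2 come from the talk.

```python
import math

def speedup(old_rtf, new_rtf):
    """How many times faster the engine became when the real-time factor dropped."""
    return old_rtf / new_rtf

def cores_needed(concurrent_streams, rtf):
    """Each second of audio costs `rtf` core-seconds, so N live streams need about N * rtf cores."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(concurrent_streams * rtf, 9))

streams = 1000  # hypothetical number of simultaneous speakers
for rtf in (0.55, 0.45, 0.2):
    print(f"RTF {rtf}: about {cores_needed(streams, rtf)} cores")
print(f"0.55 -> 0.45 is a {speedup(0.55, 0.45):.2f}x speed-up")
```

Under this model, halving the real-time factor halves the number of cores, which matches the speaker's remark that doubling the speed lets the machine count shrink by half.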

This shows the growth of test developers on our developer platform. Last year we invited 5 developers to test. Without any promotion, just through word of mouth among users, and with the Sogou voice assistant helping to promote us, we now have over 400 developers on our platform. Our customers include Sogou voice assistant, LeTV cloud TV, Xiao-i Robot, Tintin nets, Touch Treasure, and Pa.

Let me introduce some typical application cases. Using our platform keeps the logical structure very simple: developers only need to focus on the smart terminal app. We provide an SDK that is embedded in the app and communicates with the cloud platform. The cloud platform includes load balancing, a database of user data, acoustic models, language models, and so on. The user simply speaks through the client, which greatly simplifies speech recognition work.

This is the Sogou voice assistant, whose release we supported on November 21 last year. The voice assistant team came to us in early November, and it took only 2 weeks to get it released smoothly. Sogou voice assistant uses only our speech recognition function: it sends the audio to our server and we send back the recognition result. Semantic understanding and search are all done by Sogou, because they have a very powerful search tool and a strong semantic understanding team and search platform.

This is the voice assistant that Cloud Acoustic built itself. In data services the gap with Sogou is very large, so what matters for us is services in vertical domains; open-domain services we provide through platforms such as Baidu and Sogou. Examples are asking about the weather, movies, and TV shows; we now serve more than 30 domains.

The second case is an app we developed ourselves. It is very simple: it turns what we say into text, and tapping confirm sends the text to WeChat. We built it at the beginning of this year just to let users experience how fast and accurate Cloud Acoustic recognition is. In the week we released it, it ranked first on the App Store free tools list. We can see this input performance in the contact method.

The third case is the LeTV Super TV, which had its global launch on May 7 at the MasterCard Center. This is our voice assistant solution on the LeTV Super TV.

How can developers use the SDK? It is actually very simple: a voice recognition app can be built in 3 to 4 minutes. First, download our SDK from the website: register an account, activate it by email, then apply for an app key, after which the corresponding version of the SDK can be downloaded. Both Android and iOS versions are available for download. Taking Android development as an example, the first step is to import the SDK. The second is to configure some permissions in the manifest.

The code is very simple and fits on one PPT slide. With it you can build a simple speech input application. Create the recognizer dialog, passing in the application's app key, then call a show function to pop up the dialog. The SDK works as a stream: while I am speaking, the device records and recognizes at the same time. The API index has five main functions, including bringing in the SDK, setting the recognition object, setting the callback object, and displaying the recognition dialog.
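The streaming, callback-driven pattern described here can be sketched as follows. Everything in this block is a hypothetical stand-in written for illustration: `FakeStreamingRecognizer`, its methods, and the placeholder app key are not the real SDK API (the actual SDK targets Android and iOS), and no network call or recognition takes place.

```python
class FakeStreamingRecognizer:
    """Hypothetical stand-in for a cloud recognizer client; not the real SDK API."""

    def __init__(self, app_key):
        self.app_key = app_key  # a real client would authenticate with this key
        self.listener = None
        self._chunks = []

    def set_listener(self, listener):
        """Register the callback object, called as listener(text, is_final)."""
        self.listener = listener

    def send_audio(self, chunk):
        """Accept one chunk while recording continues (a real client would upload it)."""
        self._chunks.append(chunk)
        if self.listener:
            self.listener(f"<partial after {len(self._chunks)} chunks>", False)

    def stop(self):
        """End of speech: a real service would deliver the final text here."""
        total = sum(len(c) for c in self._chunks)
        if self.listener:
            self.listener(f"<final result for {total} bytes>", True)

results = []
recognizer = FakeStreamingRecognizer(app_key="YOUR_APP_KEY")  # placeholder key
recognizer.set_listener(lambda text, is_final: results.append((text, is_final)))
for chunk in (b"\x00" * 320, b"\x00" * 320, b"\x00" * 160):  # simulated audio chunks
    recognizer.send_audio(chunk)
recognizer.stop()
print(results[-1])
```

Because audio is sent chunk by chunk, most of the work is already done when the user stops speaking, which is why streaming feels faster than uploading a finished recording.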

Thank you. That concludes my talk.

(Responsible editor: The good of the Legacy)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
