DSP selection strategy for mobile phone speech recognition applications

Source: Internet
Author: User

As DSP technology advances, DSPs with higher computing power, lower power consumption, and smaller size have emerged, making it possible to embed more accurate and more complex automatic speech recognition (ASR) functions in 3G mobile phones. Basic ASR applications currently fall into three categories: 1. speech-to-text conversion (voice input); 2. speaker recognition; 3. voice command control (voice control).


These three categories cover most of the ASR capabilities that 3G requires. Typical examples of speech-to-text conversion are voice dialing and email dictation. Speaker recognition uses the speaker's voice to grant secure access to personal data held in memory, meeting the needs of credit card ordering, banking services, and other highly confidential applications. Voice command control includes a voice interface to websites whose content is marked up in Voice Extensible Markup Language (VXML), and it supports services such as financial services and directory assistance. VXML is currently used to specify the voice markup of website content.


Two Methods of Speech Recognition


ASR application designs for 3G mobile phones fall into two types: terminal-centric and client/server-centric. Figure 1 shows the terminal-centric design: the 3G mobile phone (the terminal) performs the entire speech recognition process and sends out the recognition result. In the client/server design shown in Figure 2, the terminal performs only front-end feature extraction and then sends the feature parameters to a central server over an error-protected data channel; the central server completes the recognition. With a client/server design, the 3G phone should use the data channel rather than the voice channel to send speech to the server, because the low-bit-rate speech coding used on the voice channel seriously degrades recognition performance.


ASR systems differ mainly in vocabulary. A simple network appliance may need only a 16-word dictionary to implement the required speech recognition function, while 3G mobile phones need a larger, specialized dictionary. The vocabulary can be speaker-dependent (the recognizer is trained to become familiar with the user's voice) or speaker-independent (the recognizer can recognize anyone's voice). The computing load on the DSP grows with the size of the vocabulary and the amount of training data.


As an example, consider a typical application that uses hidden Markov models (HMMs) to recognize 100 speaker-independent commands. Assume each HMM is a left-to-right model without skips, with 6 states and 5 Gaussian mixture components per state with diagonal covariance, 39 features (13 Mel-frequency cepstral coefficients, or MFCCs, plus their first- and second-order differences), and 16-bit precision. The HMM acoustic model then occupies 100 × 6 × 5 × (39 + 2) × 2 = 246,000 bytes, or about 240 KB.
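The storage estimate can be checked with a few lines of Python, using the six states, five mixtures, and 39 features listed above. This is only a sketch: the function name `hmm_model_bytes` is hypothetical, and what the two extra values per mixture (the "+ 2" term) actually hold is an assumption, since the text does not say.

```python
def hmm_model_bytes(commands, states, mixtures, features,
                    extra_per_mixture=2, bytes_per_value=2):
    """Storage for a set of command HMMs, following the estimate in the text.

    extra_per_mixture is the "+ 2" term; its meaning is an assumption here
    (for example a mixture weight and a normalization constant).
    bytes_per_value=2 corresponds to 16-bit precision.
    """
    values_per_mixture = features + extra_per_mixture
    return commands * states * mixtures * values_per_mixture * bytes_per_value


size = hmm_model_bytes(commands=100, states=6, mixtures=5, features=39)
print(size, "bytes =", round(size / 1024), "KB")  # 246000 bytes = 240 KB
```

Changing any one argument shows how quickly the model grows: doubling the number of commands or mixtures doubles the storage.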


To perform operations such as differencing (pre-emphasis) of the input speech samples, windowing, MFCC extraction, probability calculation, and Viterbi search in real time, a DSP typically needs on the order of tens of millions of multiply-accumulate operations per second (MMAC). Continuous speech recognition, with thousands of triphone acoustic models and multiple grammar models, requires more storage and a faster DSP.
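The Viterbi search over a left-to-right model without skips is compact enough to sketch. The version below is illustrative only: it works in the log domain, takes the per-frame emission log-likelihoods as precomputed input (in a real recognizer these would come from evaluating the Gaussian mixtures on the MFCC features), and every name in it is hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_left_to_right(log_emit, log_stay, log_next):
    """Best-path log score through a left-to-right HMM without skips.

    log_emit[t][s]: log-likelihood of frame t in state s.
    log_stay[s]:    log probability of remaining in state s.
    log_next[s]:    log probability of moving from state s to state s + 1
                    (the entry for the last state is unused).
    The path must start in state 0 and end in the last state.
    """
    if not log_emit:
        raise ValueError("need at least one frame")
    n_states = len(log_stay)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    for t in range(1, len(log_emit)):
        new_score = [NEG_INF] * n_states
        for s in range(n_states):
            stay = score[s] + log_stay[s]
            move = score[s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            new_score[s] = max(stay, move) + log_emit[t][s]
        score = new_score
    return score[-1]


# Toy example: two states, three frames, uninformative emissions.
log_half = math.log(0.5)
frames = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
best = viterbi_left_to_right(frames, [log_half, log_half], [log_half, log_half])
```

A command recognizer would run this once per command model and pick the command with the highest score. The inner loop is one add-compare-select per state per frame, which is part of why the workload scales with vocabulary size.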


The success of an ASR system in a mobile phone therefore depends largely on the capabilities and design of the DSP. A third-generation system already needs a higher-performance DSP than a second-generation system, and adding ASR raises the requirements further. Architecturally, the DSP must offer high processing speed, low power consumption, and high code density.


A high-speed DSP is the key


Because the system must sample and process speech in real time, a speech recognition system needs substantial computing capability. The figures and assumptions below apply to the terminal-centric design. If 20% of the DSP's computing resources are allocated to a 10 MMAC speech recognition system, a 50 MMAC DSP is required; this also leaves enough headroom for the other DSP tasks a 3G mobile phone needs, such as a soft modem. If a slower DSP is used, for example a 25 MMAC DSP, either the number of commands in the vocabulary must be halved or the HMM parameters must be reduced, which lowers overall system performance.


The speed of the DSP determines the complexity and performance of the speech recognition system. For example, if a basic speaker-independent continuous speech recognition system needs 100 MMAC, and 50% of the DSP's computing resources are reserved for the other DSP tasks of a 3G mobile phone, the DSP must deliver 200 MMAC.
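Both sizing examples follow the same rule: divide the recognizer's workload by the share of the DSP it is allowed to use. A minimal sketch, with hypothetical function names and the share expressed in percent:

```python
def required_dsp_mmac(asr_mmac, asr_share_percent):
    """DSP capacity needed when ASR may use only asr_share_percent of it."""
    if not 0 < asr_share_percent <= 100:
        raise ValueError("asr_share_percent must be in (0, 100]")
    return asr_mmac * 100 / asr_share_percent


def asr_budget_mmac(dsp_mmac, asr_share_percent):
    """MMAC left for ASR on a given DSP at a given share."""
    return dsp_mmac * asr_share_percent / 100


print(required_dsp_mmac(10, 20))    # command recognizer at 20%: 50.0
print(required_dsp_mmac(100, 50))   # continuous recognizer at 50%: 200.0
print(asr_budget_mmac(25, 20))      # ASR budget on a 25 MMAC DSP: 5.0
```

The last line shows why the slower DSP forces a smaller vocabulary or fewer HMM parameters: at the same 20% share it offers only half of the 10 MMAC the recognizer needs.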


Cost, performance, and efficiency trade-offs


The faster the DSP, the easier it is to use modern HMM techniques such as channel matching and domain matching, so in theory a faster DSP gives better ASR performance. Clock speed is not the whole story, however: parallel processing plays an important role in raising ASR throughput. For example, a 200 MHz DSP with four arithmetic logic units (ALUs) has higher throughput than a DSP with a single ALU running at a somewhat higher clock rate. Depending on the application, two to three single-ALU DSPs deliver performance similar to one four-ALU DSP, but using several single-ALU DSPs instead of one four-ALU DSP raises the cost of the phone. The trade-off between a marketable product and performance must therefore be weighed carefully.


In short, when comparing a single-ALU DSP at a higher clock rate with a lower-clocked DSP that has four ALUs, design engineers should keep the real goal in view, which is effective computing throughput; a DSP with multiple ALUs may well be the best solution.
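A rough way to compare such parts is to multiply clock rate by ALU count. The sketch below assumes each ALU completes one multiply-accumulate per cycle and adds a `utilization` factor for how well the code keeps the ALUs busy; both the function name and these modeling assumptions are illustrative, not taken from any datasheet.

```python
def peak_mmac(clock_mhz, alus, utilization=1.0):
    """Approximate MAC throughput in MMAC.

    Assumes one MAC per ALU per clock cycle; utilization (0..1) models
    how much of the parallel hardware the code actually keeps busy.
    """
    return clock_mhz * alus * utilization


four_alu_peak = peak_mmac(200, 4)                    # 800.0
four_alu_half = peak_mmac(200, 4, utilization=0.5)   # 400.0
single_alu_clock_to_match = four_alu_peak / 1        # 800.0 MHz
```

Under these assumptions a single-ALU DSP would need an 800 MHz clock to match the peak of the 200 MHz four-ALU part. The remark that two to three single-ALU DSPs perform like one four-ALU DSP corresponds to a utilization of roughly 0.5 to 0.75 in this model.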


Performance and power consumption


Top-performing DSPs use parallel architectures to get the most performance headroom. One well-known, well-balanced parallel architecture, the StarCore SC140, uses instruction-level parallelism. It has four parallel ALUs and an enhanced very long instruction word (VLIW) model called the variable-length execution set (VLES). VLES supports efficient instruction scheduling, execution, and packing in memory. It feeds the front end through an instruction queue and controls the back end through the dispatcher. As a result, VLES processing generally consumes power only when computation is required.


In the parallel VLES architecture, instructions that can execute together are grouped so that no-operation instructions (NOPs) are avoided. Because fewer clock cycles are needed, processing time is also reduced. By comparison, in a conventional VLIW architecture every execution slot must be filled in order. Thus, when an eight-slot execution set carries only one useful instruction, the remaining seven slots must be filled with placeholders (NOPs).


Because the VLES architecture needs no NOPs, the complexity of a VLES design shifts from the hardware or the programmer to the compiler. Because every cycle is filled with useful work, each cycle is more efficient, which also improves power and memory efficiency.
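The code-density effect of NOP padding can be illustrated with a toy model. It counts instruction slots only; it is not the SC140 encoding, and the per-cycle group sizes are made up for the example.

```python
def fixed_width_slots(group_sizes, width):
    """Slots stored when every execution set is padded to `width` with NOPs."""
    if any(n < 1 or n > width for n in group_sizes):
        raise ValueError("each group must hold between 1 and width instructions")
    return len(group_sizes) * width


def variable_length_slots(group_sizes):
    """Slots stored when each execution set holds only its real instructions."""
    return sum(group_sizes)


groups = [1, 4, 2, 1]   # hypothetical: useful instructions issued in each cycle
padded = fixed_width_slots(groups, width=8)   # 32
packed = variable_length_slots(groups)        # 8
nops = padded - packed                        # 24
```

In this example three quarters of the fixed-width program memory holds NOPs. A variable-length scheme stores only the eight useful instructions, which is the memory and fetch-power saving the text describes.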


Power Management


Because an ASR system must process voice data continuously, the DSP becomes a major power consumer, so efficient power use is crucial to the product's success in the market.


In a high-performance DSP, choosing a 16-bit instruction set over a 32-bit one increases code density and thereby reduces memory, power, and size requirements, partly because shorter 16-bit instructions reduce the number of registers and data lines needed. Code density matters because ASR data is large: the stored vocabulary may reach 2.5 MB, including the acoustic HMM state model for 1024 clustered triphone states (five mixtures and 39 parameters each), about 60 KB of codes for 10,000 three-state triphones, the triphone state-transition probability matrix, and a bigram language model. If the DSP has high code density, more of a fixed amount of memory is left for the ASR system's data, so a better acoustic and language model can be used.


On-chip and off-chip Storage


Effective use of on-chip and off-chip memory is another important consideration for DSPs used in ASR systems. Because an ASR system needs a large amount of storage for vocabulary and pattern recognition data, a flexible memory architecture is particularly important. For example, a DSP with a unified memory address space lets design engineers balance program and data storage, trading the complexity of the algorithms against the size of the acoustic and language models for the best performance.


For example, if the on-chip system memory is smaller than the total of about 240 KB needed by the 100-command recognition model, a two-pass recognition method makes more effective use of the fast on-chip memory.


In the first pass (the coarse recognition stage), only 13 of the 39 parameters are used, so the model size is 80 KB and it fits in on-chip memory. This stage narrows the original 100 commands to a smaller candidate set, for example 33 commands, with a reliability as high as 99.9%.


In the second pass (the fine recognition stage), all 39 parameters are used for the 33 candidate commands; the model is again about 80 KB and fits in on-chip memory. The two-pass method introduces some latency, but only about 10 ms, which the speaker generally does not notice.
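The two 80 KB figures can be reproduced by scaling the 240 KB full-model size computed earlier for 100 commands and 39 features. The sketch assumes the model size scales linearly with the number of commands and the number of features, which ignores any fixed per-mixture overhead; the names are hypothetical.

```python
FULL_MODEL_KB = 240      # 100 commands, 39 features (figure from the text)
FULL_COMMANDS = 100
FULL_FEATURES = 39


def scaled_model_kb(commands, features):
    """Model size in KB, scaled linearly from the full model (an approximation)."""
    return FULL_MODEL_KB * commands * features / (FULL_COMMANDS * FULL_FEATURES)


first_pass_kb = scaled_model_kb(commands=100, features=13)   # 80.0
second_pass_kb = scaled_model_kb(commands=33, features=39)   # about 79.2
print(first_pass_kb, round(second_pass_kb, 1))
```

Each pass therefore needs roughly one third of the full model in fast memory at a time, at the cost of running the search twice.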


A unified memory address space supports a large vocabulary or command set, as well as large HMM models or neural network coefficients, and simplifies real-time tasks. For example, given 100 KB of memory for the ASR program and data, the design engineer can trade algorithm complexity against the size of the vocabulary or command set. If the program occupies 50 KB, only 50 KB is left for data. If some recognition accuracy can be given up and the program code reduced to 20 KB, the command set can use 80 KB, which increases vocabulary capacity.


In an ASR system, highly parallel processing, high code density, and effective memory use also leave the DSP free to handle tasks other than speech recognition. In most cases, the design engineer can allocate part of the computing resources to speech recognition and use the rest for the other tasks required in the channel processing system.


Requirements beyond the DSP core


After the optimal DSP is selected, building a high-performance ASR system-on-chip (SoC) requires some additional functions, such as a cache for fast instruction and data access and a real-time operating system (RTOS), so that the ASR system can truly run in real time. A multitasking RTOS lets the system run several applications at once, for example two-channel speech recognition, which can greatly improve system performance.


Design engineers working on complex SoC applications (such as channel processing systems) benefit when the DSP and SoC come with efficient high-level language compilers that allow programming in C or C++. Enhanced on-chip emulation and debugging can further shorten design time. Besides real-time performance and a simpler design flow, power management is also important for the components and systems in 3G mobile phone applications. When designing an SoC, choosing a core with adjustable power features pays off. For example, while the user is speaking, the DSP needs to run at full speed (for example, 300 MHz); when ASR is not in use, the SoC's power management circuit can step the clock down (for example, to 100 MHz) to reduce leakage and power consumption.


Because an ASR system's computing requirements vary widely with its recognition features, such as isolated-word or continuous speech recognition, vocabulary size, and speaker independence, the complexity of the channel processing system that supports ASR varies widely as well.


An SoC is well suited to building infrastructure equipment, so it is ideal for client/server-centric designs; such a device is more powerful than a terminal-centric handset design currently needs, so it is less suitable there. However, as ASR systems mature and 3G mobile phones support increasingly complex applications and ASR systems, such powerful SoCs can also be applied successfully on the handset side.


Using multiple DSP cores on an SoC makes it easier for the system to perform other tasks alongside speech recognition. For example, one of three cores can be dedicated to multi-channel server ASR while the other two handle tasks such as voice channel and Internet data processing. In the future, if the phone keypad disappears, ASR will become the only interface between the user and the phone, and this function will account for most of the operating time.


Multiple DSP cores also provide the computing power for demanding ASR tasks, such as continuous speech recognition for email dictation, secure transactions, and "password plus speaker verification" in VXML. Combining multiple DSPs with a unified memory on one large chip can greatly shorten speaker-independent training, because in statistical ASR the computing load during training is much heavier than during recognition.




Although 3G mobile phones are expected to win the market, some questions remain about their functions and design. What is certain is that these systems require a high-performance signal processing platform to handle multimedia tasks, and as ASR systems become more widespread, 3G mobile phones will need a multitasking, multi-DSP SoC as a solution.
