Python speech recognition

Last Update:2018-04-09 Source: Internet

Author: User


Produced by | Yue Yun Intelligent (public number id:aibbtcom)


The success of Amazon's Alexa suggests that some level of voice support will soon be a basic expectation for everyday technology. A Python program that integrates speech recognition gains a degree of interactivity and accessibility that few other technologies can match. Best of all, adding speech recognition to a Python program is easy. In this guide, you will learn:

How speech recognition works;
Which speech recognition packages are available on PyPI;
How to install and use the SpeechRecognition package, a full-featured and easy-to-use Python speech recognition library.

▌ An overview of how speech recognition works

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems could recognize only a single speaker and had vocabularies of only about a dozen words. Modern speech recognition systems have come a long way: they can recognize multiple speakers and have large vocabularies covering many languages.

The first component of speech recognition is, of course, speech. A microphone converts the physical sound into an electrical signal, and an analog-to-digital converter then turns that signal into digital data. Once digitized, several models can be used to transcribe the audio into text.

Most modern speech recognition systems rely on hidden Markov models (HMMs). The approach rests on the assumption that a speech signal, viewed on a very short time scale (such as 10 milliseconds), can be reasonably approximated as a stationary process, that is, a process whose statistical properties do not change over time.
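To make the "very short time scale" idea concrete, the following minimal sketch cuts a digitized signal into consecutive 10-millisecond frames, which is the kind of unit an HMM-based recognizer would then analyze. The function name split_into_frames and the synthetic 440 Hz tone are illustrative assumptions; they are not part of any speech recognition library.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a sequence of samples into consecutive frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    # Drop the incomplete tail so that every frame has the same length.
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


# One second of a synthetic 440 Hz tone sampled at 16 kHz (illustration only).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = split_into_frames(tone, sample_rate)
print(len(frames), len(frames[0]))  # 100 160
```

One second of audio at 16 kHz yields 100 frames of 160 samples each. Real systems typically use overlapping, windowed frames; this sketch keeps the frames disjoint for simplicity.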

Many modern speech recognition systems apply neural networks before HMM recognition to simplify the speech signal through feature transformation and dimensionality reduction. Voice activity detectors (VADs) are also used to reduce an audio signal to only the portions that are likely to contain speech.
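As a rough illustration of what a voice activity detector does, the sketch below flags frames whose mean energy exceeds a fixed threshold. This is a deliberately naive, energy-based approach chosen for brevity; production VADs are considerably more sophisticated, and the function names and threshold value here are assumptions made for the example.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def detect_speech_frames(samples, frame_len, threshold):
    """Return one boolean per full frame: True when the frame's energy exceeds threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy(frame) > threshold)
    return flags


# Synthetic signal: silence, a loud burst, then silence again.
signal = [0.0] * 160 + [0.5, -0.5] * 80 + [0.0] * 160
print(detect_speech_frames(signal, frame_len=160, threshold=0.01))  # [False, True, False]
```

Only the frames flagged True would be passed on for recognition, which reduces the amount of audio the later stages have to process.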

Fortunately for Python users, several speech recognition services are available online through an API, and most of them also offer a Python SDK.

▌ Choosing a Python speech recognition package

There are a handful of ready-made speech recognition packages on PyPI. These include:

• apiai

• google-cloud-speech

• pocketsphinx

• SpeechRecognition

• watson-developer-cloud

• wit

Some of these packages, such as wit and apiai, offer built-in features that go beyond basic speech recognition, such as natural language processing for identifying a speaker's intent. Others, such as google-cloud-speech, focus solely on speech-to-text conversion.

One package stands out for its ease of use: SpeechRecognition.

Recognizing speech requires audio input, and SpeechRecognition makes retrieving that input straightforward. Instead of writing scripts from scratch to access the microphone and process audio files, you can be up and running in just a few minutes.

The SpeechRecognition library acts as a wrapper for several popular speech APIs and is therefore very flexible. One of them, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library, so you can use it without signing up for anything. This flexibility and ease of use make SpeechRecognition an excellent choice for a Python project.

▌ Installing SpeechRecognition

SpeechRecognition is compatible with Python 2.6, 2.7 and 3.3+, but requires some additional installation steps for Python 2. This tutorial assumes Python 3.3+ throughout.

You can install SpeechRecognition from a terminal with pip:

$ pip install SpeechRecognition

After the installation is complete, open an interpreter session and enter the following to verify the installation:

>>> import speech_recognition as sr
>>> sr.__version__
'3.8.1'

Note: Do not close this session; you will use it in the next few steps.

If you only need to work with existing audio files, SpeechRecognition works out of the box, although specific use cases have some dependencies. In particular, the PyAudio package is required for capturing microphone input.

▌ Recognizer Class

The core of SpeechRecognition is the Recognizer class.

The primary purpose of a Recognizer instance is, of course, to recognize speech. Each instance has seven methods for recognizing speech from an audio source, each backed by a different API:

recognize_bing(): Microsoft Bing Speech

recognize_google(): Google Web Speech API

recognize_google_cloud(): Google Cloud Speech (requires installation of the google-cloud-speech package)

recognize_houndify(): Houndify by SoundHound

recognize_ibm(): IBM Speech to Text

recognize_sphinx(): CMU Sphinx (requires installing PocketSphinx)

recognize_wit(): Wit.ai

Of the seven, only recognize_sphinx() works offline, using the CMU Sphinx engine. The other six all require an Internet connection.

SpeechRecognition comes with the Google Web Speech API's default API key, which you can use directly. The other six APIs require authentication using either an API key or a username/password combination, so this article uses the Web Speech API.
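Since only one of the seven methods works offline, one possible pattern is to try the online API first and fall back to the offline engine when the service cannot be reached. The sketch below is a minimal illustration of that idea, not a recommendation from the library: recognize_with_fallback is a hypothetical helper, and the stand-in RequestError class exists only so the code can be read and exercised without the package installed. It assumes that SpeechRecognition raises RequestError when an API is unreachable and that pocketsphinx is installed for the fallback.

```python
try:
    from speech_recognition import RequestError
except ImportError:
    # Stand-in so the sketch can be exercised without the package installed.
    class RequestError(Exception):
        pass


def recognize_with_fallback(recognizer, audio):
    """Try the online Google Web Speech API first, then the offline CMU Sphinx engine."""
    try:
        return recognizer.recognize_google(audio)
    except RequestError:
        # The API could not be reached; Sphinx needs no connection,
        # but it does need the pocketsphinx package.
        return recognizer.recognize_sphinx(audio)
```

With a real Recognizer instance r and an AudioData instance audio, the call would be recognize_with_fallback(r, audio).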

Now for some practice. Create a Recognizer instance in the interpreter session and call its recognize_google() method.

>>> r = sr.Recognizer()
>>> r.recognize_google()

The following appears on screen:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: recognize_google() missing 1 required positional argument: 'audio_data'

You may have guessed that this would happen: how could anything be recognized from no input at all?

All seven recognize_*() methods of the Recognizer class require an audio_data argument, and in each case audio_data must be an instance of SpeechRecognition's AudioData class.

There are two ways to create an AudioData instance: from an audio file or from audio recorded by a microphone. Audio files are easier to get started with, so we begin there.

▌ Working with audio files

First, download an audio file (https://github.com/realpython/python-speech-recognition/tree/master/audio_files) and save it to the directory in which your Python interpreter session is running.

The AudioFile class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file's contents.

Supported file types

SpeechRecognition currently supports the following file types:

WAV: must be in PCM/LPCM format

AIFF

AIFF-C

FLAC: must be native FLAC format; OGG-FLAC is not supported

If you are working on x86-based Linux, macOS or Windows, you should be able to work with FLAC files without any extra setup. On other platforms, you need to install a FLAC encoder and make sure you have access to the flac command-line tool.
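Because WAV files must be in PCM format, it can be handy to check a file before handing it to AudioFile. The sketch below uses only the standard library's wave module, which itself reads uncompressed PCM WAV files. The helper is_pcm_wav is hypothetical and is only an approximation of the requirement: it is not the validation that SpeechRecognition performs internally.

```python
import os
import tempfile
import wave


def is_pcm_wav(path):
    """Return True if the stdlib wave module can open the file as uncompressed PCM WAV."""
    try:
        with wave.open(path, "rb") as wav_file:
            return wav_file.getcomptype() == "NONE"
    except (wave.Error, EOFError):
        return False


with tempfile.TemporaryDirectory() as tmp:
    # A tiny valid file: 160 frames of 16-bit mono silence at 16 kHz.
    good = os.path.join(tmp, "silence.wav")
    with wave.open(good, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 160)

    # A file with a .wav extension that does not contain WAV data.
    bad = os.path.join(tmp, "notes.wav")
    with open(bad, "wb") as out:
        out.write(b"this is not audio data")

    results = (is_pcm_wav(good), is_pcm_wav(bad))

print(results)  # (True, False)
```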

Using record() to get data from a file

In the interpreter session, type the following to process the contents of the "harvard.wav" file:

>>> harvard = sr.AudioFile('harvard.wav')
>>> with harvard as source:
...     audio = r.record(source)
...

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. The record() method then records the data from the entire file into an AudioData instance, which you can confirm by checking the type of audio:

>>> type(audio)
<class 'speech_recognition.AudioData'>

You can now call recognize_google() to try to recognize the speech in the audio.

>>> r.recognize_google(audio)
'the stale smell of old beer lingers it takes heat to bring out the odor a cold dip restores health and zest a salt pickle taste fine with ham tacos al Pastore are my favorite a zestful food is the hot cross bun'

That completes the transcription of your first audio file.
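The interpreter steps above can be collected into a small reusable function. The sketch below is one way to do it; transcribe_file and its sr_module parameter are hypothetical names introduced for illustration. The optional parameter lets a stand-in module be supplied for testing, and by default the real speech_recognition package is imported.

```python
def transcribe_file(path, sr_module=None):
    """Transcribe a whole audio file with the Google Web Speech API."""
    if sr_module is None:
        # Imported lazily so a stand-in module can be supplied instead.
        import speech_recognition as sr_module
    recognizer = sr_module.Recognizer()
    with sr_module.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file
    return recognizer.recognize_google(audio)
```

With the package installed and "harvard.wav" in the working directory, transcribe_file('harvard.wav') would perform the same steps as the session above. It needs an Internet connection because recognize_google() calls an online API.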

Capturing audio clips with offsets and durations

What if you want to capture only part of the speech in a file? The record() method accepts a duration keyword argument that stops the recording after a specified number of seconds.

For example, the following captures only the speech in the first four seconds of the file:

>>> with harvard as source:
...     audio = r.record(source, duration=4)
...
>>> r.recognize_google(audio)
'the stale smell of old beer lingers'

When you call record() inside a with block, the file stream moves forward. This means that if you record for four seconds and then record for four seconds again, the second call returns the four seconds of audio that follow the first four.

>>> with harvard as source:
...     audio1 = r.record(source, duration=4)
...     audio2 = r.record(source, duration=4)
...
>>> r.recognize_google(audio1)
'the stale smell of old beer lingers'
>>> r.recognize_google(audio2)
'it takes heat to bring out the odor a cold dip'

In addition to specifying a recording duration, you can give record() a starting point with the offset keyword argument, whose value is the number of seconds from the beginning of the file to skip before recording starts. For example, to capture only the second phrase in the file, use an offset of four seconds and a duration of three seconds.

>>> with harvard as source:
...     audio = r.record(source, offset=4, duration=3)
...
>>> r.recognize_google(audio)
'it takes heat to bring out the odor'

The offset and duration keyword arguments are useful for segmenting an audio file if you already know the structure of the speech in the file. Used carelessly, however, they can lead to poor transcriptions.

>>> with harvard as source:
...     audio = r.record(source, offset=4.7, duration=2.8)
...
>>> r.recognize_google(audio)
'Mesquite to bring out the odor Aiko'

Because recording starts at 4.7 seconds, the "it t" at the start of the phrase "it takes heat to bring out the odor" is missed. The API receives only "akes heat" as input, which it matches to "Mesquite".

Similarly, at the end of the recording only the "a co" of the phrase "a cold dip restores health and zest" is captured, and the API incorrectly matches it to "Aiko".

Noise is another major culprit in poor transcription accuracy. The examples above work well because the audio file is reasonably clean. In the real world, you cannot expect noise-free audio unless you have the chance to process the files beforehand.
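If you want to walk through a long file in fixed-size pieces, the offset and duration values can be computed up front. The following sketch shows one way to do that; chunk_windows is a hypothetical helper, not part of SpeechRecognition.

```python
def chunk_windows(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that cover total_seconds in consecutive chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0
    while offset < total_seconds:
        # The last chunk may be shorter than the others.
        duration = min(chunk_seconds, total_seconds - offset)
        windows.append((offset, duration))
        offset += chunk_seconds
    return windows


print(chunk_windows(10, 4))  # [(0, 4), (4, 4), (8, 2)]
```

Each pair could then be passed to r.record(source, offset=offset, duration=duration) inside a fresh with block. Keep in mind that fixed windows can cut words in half, which leads to exactly the kind of mistranscription shown in the "Mesquite" example.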

The effect of noise on speech recognition

Noise is a fact of life. All audio recordings contain some degree of noise, and unhandled noise can wreck the accuracy of speech recognition applications.

To see how noise affects speech recognition, download the "jackhammer.wav" file (https://github.com/realpython/python-speech-recognition/tree/master/audio_files) and make sure it is saved to the working directory of your interpreter session. In this file, the phrase "the stale smell of old beer lingers" is spoken with a loud jackhammer in the background.

What happens when you try to transcribe this file?

>>> jackhammer = sr.AudioFile('jackhammer.wav')
>>> with jackhammer as source:
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell of old gear vendors'

So how do you deal with this? One thing you can try is the adjust_for_ambient_noise() method of the Recognizer class.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'still smell of old beer vendors'

That is somewhat closer to the actual phrase, but it is still not accurate, and the "the" at the beginning of the phrase is missing. Why?

By default, adjust_for_ambient_noise() reads the first second of the file stream and calibrates the recognizer to the noise level of the audio. That first second of the file is therefore consumed before record() is called to capture the data.

You can adjust the time frame that adjust_for_ambient_noise() analyzes with the duration keyword argument. It takes a value in seconds and defaults to 1. Try reducing it to 0.5.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source, duration=0.5)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell like old Beer Mongers'

The "the" at the beginning of the phrase is back, but new problems have appeared. Sometimes the signal is simply too noisy for the effects of the noise to be removed.

If you run into these issues frequently, you may need to pre-process the audio. This can be done with audio editing software or with a Python package (such as SciPy) that can apply filters to the file. When working with noisy files, it can also help to look at the actual API response. Most APIs return a JSON string containing many possible transcriptions, but recognize_google() returns only the most likely transcription unless you force it to give you the full response.

You can get the full response by setting the show_all keyword argument of recognize_google() to True.

>>> r.recognize_google(audio, show_all=True)
{'alternative': [
  {'transcript': 'the snail smell like old Beer Mongers'},
  {'transcript': 'the still smell of the old beer vendors'},
  {'transcript': 'the snail smell as old beer vendors'},
  {'transcript': 'the stale smell of the old beer vendors'},
  {'transcript': 'the snail smell as old beermongers'},
  {'transcript': 'destihl smell of old beer vendors'},
  {'transcript': 'the still smell like old beer vendors'},
  {'transcript': 'Bastille smell of old beer vendors'},
  {'transcript': 'the still smell like old beermongers'},
  {'transcript': 'the still smell of old beer venders'},
  {'transcript': 'the still smelling old beer vendors'},
  {'transcript': 'musty smell of old beer vendors'},
  {'transcript': 'the still smell of old beer vendor'}
], 'final': True}

As you can see, recognize_google() returns a dictionary whose 'alternative' key points to a list of possible transcripts. The structure of this response varies from API to API and is mainly useful for debugging.
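When inspecting a full response like this, it is often convenient to pull out just the transcript strings. The sketch below assumes the response has the shape shown above (a dictionary with an 'alternative' list) and returns an empty list for anything else. extract_transcripts is a hypothetical helper, and the sample response is abbreviated.

```python
def extract_transcripts(response):
    """Return the transcript strings from a show_all=True response, in API order."""
    if not isinstance(response, dict):
        # Defensive: treat anything that is not a dictionary as "no transcripts".
        return []
    alternatives = response.get("alternative", [])
    return [alt["transcript"] for alt in alternatives if "transcript" in alt]


# Abbreviated sample in the shape of the response shown above.
sample = {
    "alternative": [
        {"transcript": "the snail smell like old Beer Mongers"},
        {"transcript": "musty smell of old beer vendors"},
    ],
    "final": True,
}
print(extract_transcripts(sample))
# ['the snail smell like old Beer Mongers', 'musty smell of old beer vendors']
```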

▌ Working with microphones

To access your microphone with SpeechRecognition, you must install the PyAudio package. Close the current interpreter session and do the following:

Installing PyAudio

The process for installing PyAudio varies depending on your operating system.

Debian Linux

If you are using Debian-based Linux (such as Ubuntu), you can install PyAudio with apt:

$ sudo apt-get install python-pyaudio python3-pyaudio

You may still need to run pip install pyaudio after the installation is complete, especially if you are working in a virtual environment.

macOS

macOS users first need to install PortAudio with Homebrew and then install PyAudio with pip.

$ brew install portaudio
$ pip install pyaudio

Windows

Windows users can install PyAudio directly with pip.

$ pip install pyaudio

Testing the installation

After installing PyAudio, you can test the installation from the console.

$ python -m speech_recognition

Make sure that your default microphone is turned on and unmuted. If the installation worked, you should see something like the following:

A moment of silence, please...
Set minimum energy threshold to 600.4452854381937
Say something!

Speak into the microphone and see how well SpeechRecognition transcribes your speech.

The Microphone class

Open another interpreter session and create an instance of the Recognizer class.

>>> import speech_recognition as sr
>>> r = sr.Recognizer()

Instead of using an audio file as the source, you will now use the default system microphone. You can access it by creating an instance of the Microphone class.

>>> mic = sr.Microphone()

If your system has no default microphone (as on a Raspberry Pi, for example), or if you want to use a microphone other than the default, you need to specify which one to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class.

>>> sr.Microphone.list_microphone_names()
['HDA Intel PCH: ALC272 Analog (hw:0,0)',
 'HDA Intel PCH: HDMI 0 (hw:0,3)',
 'sysdefault',
 'front',
 'surround40',
 'surround51',
 'surround71',
 'hdmi',
 'pulse',
 'dmix',
 'default']

Note: Your output may be different from the previous example.

list_microphone_names() returns a list of device names, and the device index of a microphone is the position of its name in that list. In the output above, for example, the microphone named "front" has index 3, so you would create a Microphone instance for it like this:

>>> # This is just an example; do not run
>>> mic = sr.Microphone(device_index=3)

For most projects, though, you will probably want to use the default system microphone.
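Rather than counting positions by hand, you could look up the index by name. The sketch below works on a plain list of names such as the one returned by list_microphone_names(); find_device_index is a hypothetical helper, and the list is copied from the example output above.

```python
def find_device_index(names, wanted):
    """Return the index of the first device whose name contains wanted, or None."""
    wanted = wanted.lower()
    for index, name in enumerate(names):
        if name is not None and wanted in name.lower():
            return index
    return None


# Device names copied from the example output above; yours will differ.
names = [
    'HDA Intel PCH: ALC272 Analog (hw:0,0)',
    'HDA Intel PCH: HDMI 0 (hw:0,3)',
    'sysdefault', 'front', 'surround40', 'surround51',
    'surround71', 'hdmi', 'pulse', 'dmix', 'default',
]
print(find_device_index(names, 'front'))  # 3
```

The result could then be passed as sr.Microphone(device_index=...). Because the match is a case-insensitive substring search, a short query may match an earlier device than the one intended, so it is worth using a distinctive part of the name.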

Using listen() to capture microphone input

Once the microphone instance is ready, the reader can capture some input.

Just like the AudioFile class, Microphone is a context manager. You can capture input from the microphone with the listen() method of the Recognizer class inside a with block. This method takes an audio source as its first argument and records input from the source until silence is detected.

>>> with mic as source:
...     audio = r.listen(source)
...

Once you execute the with block, try saying "hello" into your microphone. Wait a moment for the interpreter prompt to display again. Once the ">>>" prompt returns, you are ready to recognize the speech.

>>> r.recognize_google(audio)
'hello'

If the prompt never returns, your microphone is most likely picking up too much ambient noise. You can interrupt the process with Ctrl + C to get the prompt back.

To handle ambient noise, use the adjust_for_ambient_noise() method of the Recognizer class, just as you did with the noisy audio file. Since microphone input is far less predictable than an audio file, it is a good idea to do this every time you listen for microphone input.

>>> with mic as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.listen(source)
...

Wait a moment after running the above code and try saying "hello" in the microphone. Again, you must wait for the interpreter prompt to return before attempting to recognize the voice.

Keep in mind that adjust_for_ambient_noise() analyzes one second of audio from the source by default. If that is too long for your application, you can shorten it with the duration keyword argument.

The SpeechRecognition documentation recommends a duration of no less than 0.5 seconds. In some cases, you may find that durations longer than the default of one second produce better results. The minimum value you need depends on the microphone's ambient environment, which is often unknown during development. In my experience, the default duration of one second is adequate for most applications.
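One way to keep calibration and listening together is a small wrapper that always calibrates before it listens. This is a sketch only: listen_once is a hypothetical helper, and rejecting durations below 0.5 seconds is a design choice made for this example based on the recommendation above, not behavior of the library itself.

```python
def listen_once(recognizer, source, calibrate_seconds=1.0):
    """Calibrate for ambient noise, then capture one phrase from the source."""
    if calibrate_seconds < 0.5:
        # Design choice for this sketch: enforce the recommended minimum.
        raise ValueError("calibrate_seconds should be at least 0.5")
    recognizer.adjust_for_ambient_noise(source, duration=calibrate_seconds)
    return recognizer.listen(source)
```

In the interpreter session above, this could be used as audio = listen_once(r, source) inside a "with mic as source:" block.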

Handling unrecognizable speech

Try typing the previous code example into the interpreter and making some unintelligible noises into the microphone. You should get something like this in response:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/real_python/speech_recognition_primer/venv/lib/python3.5/site-packages/speech_recognition/__init__.py", line 858, in recognize_google
    if not isinstance(actual_result, dict) or len(actual_result.get("alternative", [])) == 0: raise UnknownValueError()
speech_recognition.UnknownValueError

Audio that the API cannot match to text raises an UnknownValueError exception, so you should wrap API calls in try and except blocks to handle this case. The API works very hard to transcribe any vocal sound: a short grunt may be recognized as "how", and coughs, hand claps and tongue clicks may either be transcribed as text or raise an exception.
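A minimal sketch of such a try and except block is shown below. recognize_or_none is a hypothetical helper, and the stand-in exception class exists only so the code can be read and exercised without the package installed. Returning None for unintelligible audio is one possible convention; your application may prefer to re-prompt the user instead.

```python
try:
    from speech_recognition import UnknownValueError
except ImportError:
    # Stand-in so the sketch can be exercised without the package installed.
    class UnknownValueError(Exception):
        pass


def recognize_or_none(recognizer, audio):
    """Return the transcription, or None when the speech was unintelligible."""
    try:
        return recognizer.recognize_google(audio)
    except UnknownValueError:
        return None
```

With the session above, recognize_or_none(r, audio) would return the recognized text, or None instead of a traceback when the API cannot make sense of the audio.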

▌ Conclusion

Throughout this tutorial we have been recognizing speech in English, which is the default language for each recognize_*() method of the SpeechRecognition package. However, it is entirely possible, and easy, to recognize speech in other languages. To do so, set the language keyword argument of the recognize_*() method to a string that corresponds to the desired language.
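As a brief illustration, the sketch below passes a language tag through to recognize_google(). The helper name transcribe_in_language and the choice of "fr-FR" (French) are assumptions made for the example; which tags are accepted depends on the API you use, so check its documentation.

```python
def transcribe_in_language(recognizer, audio, language="fr-FR"):
    """Recognize speech in the given language using the Google Web Speech API."""
    return recognizer.recognize_google(audio, language=language)
```

With a real Recognizer instance r and an AudioData instance audio, calling transcribe_in_language(r, audio) would be equivalent to r.recognize_google(audio, language="fr-FR").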

Original link: http://www.aibbt.com/a/28552.html


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
