1. Introduction
Jasper is an open source voice-controlled assistant that runs on the Raspberry Pi.
Jasper works as follows: the device passively monitors the microphone; when it hears the wake-up keyword, it enters active listening mode; the voice command it then receives is converted to text by speech recognition; the text is parsed and its semantic content processed; and the result is turned into speech and played back to the user.
The technologies involved are sound recording and playback, speech recognition (ASR/STT), semantic understanding (NLU/NLP), and speech synthesis (TTS).
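The passive/active listening cycle can be sketched in a few lines of Python. This is only an illustration of the control flow, not Jasper's actual code: the utterances are plain strings standing in for speech recognition output, and `run_assistant`, `echo_handler`, and the default wake word are names assumed for this example.

```python
from typing import Callable, Iterable, List


def run_assistant(
    utterances: Iterable[str],
    handle: Callable[[str], str],
    wake_word: str = "jasper",
) -> List[str]:
    """Simulate the passive/active listening loop on already-transcribed text.

    Each item in `utterances` stands in for the output of speech recognition.
    The returned list stands in for what would be sent to speech synthesis.
    """
    responses = []
    active = False  # False: passive monitoring, True: active listening
    for text in utterances:
        if not active:
            # Passive mode: ignore everything except the wake-up keyword.
            if wake_word in text.lower():
                active = True
            continue
        # Active mode: treat the utterance as a command, then go passive again.
        responses.append(handle(text))
        active = False
    return responses


def echo_handler(command: str) -> str:
    """Hypothetical command handler; a real one would parse the intent."""
    return "You said: " + command


if __name__ == "__main__":
    heard = ["turn on the light", "Jasper", "what time is it", "hello"]
    for reply in run_assistant(heard, echo_handler):
        print(reply)
```

Only the utterance that follows the wake word is handled; everything heard in passive mode is dropped.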
2. Audio System
2.1 Hardware
The audio hardware is the sound card, which uses a DAC (digital-to-analog converter) for audio output and an ADC (analog-to-digital converter) for audio input.
The following commands show the sound card devices under Linux:
$ lspci | grep -i audio
00:05.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev.)
$ cat /proc/asound/cards
 0 [Intel ]: HDA-Intel - HDA Intel
HDA Intel at 0xf0804000 irq
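The same information can be read from a program. The sketch below parses the first line of each entry in /proc/asound/cards, assuming the usual "index [id]: driver - name" layout; the function name `parse_cards` and the dictionary keys are labels chosen for this example.

```python
import os
import re
from typing import Dict, List

# Matches lines such as " 0 [Intel ]: HDA-Intel - HDA Intel".
CARD_LINE = re.compile(r"^\s*(\d+)\s+\[([^\]]*)\]:\s*(\S+)\s+-\s+(.*)$")


def parse_cards(text: str) -> List[Dict[str, object]]:
    """Return one dict per sound card found in /proc/asound/cards text."""
    cards = []
    for line in text.splitlines():
        match = CARD_LINE.match(line)
        if match is None:
            continue  # Skip the indented description line of each card.
        index, card_id, driver, name = match.groups()
        cards.append(
            {
                "index": int(index),
                "id": card_id.strip(),
                "driver": driver,
                "name": name.strip(),
            }
        )
    return cards


if __name__ == "__main__":
    path = "/proc/asound/cards"
    if os.path.exists(path):
        with open(path) as handle:
            for card in parse_cards(handle.read()):
                print(card)
    else:
        print("No ALSA card list found at", path)
```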
2.2 Software
The Linux audio system is structured as follows.
On desktop Linux systems, the audio system typically contains a driver layer, a service layer (sound servers), and an application layer.
On embedded systems, the audio system consists of only the driver layer and the application layer.
Linux has two audio driver systems, OSS and ALSA.
2.2.1 OSS
OSS (Open Sound System) is a unified audio architecture for Unix-like and POSIX-compatible systems; applications written against the OSS API are easy to port between them.
The OSS API mainly provides the following device file interfaces:
/dev/mixer: mixing and control
/dev/dsp: audio input and output
Most Linux systems today use ALSA and no longer ship OSS drivers, so OSS is not covered in detail here.
For more information, see <OSS -- Introduction to the cross-platform audio interface>.
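To show what "audio output through a device file" means in practice, here is a minimal sketch that generates a tone and writes it to /dev/dsp if that file exists. It assumes the device is left in the traditional OSS default format (8 kHz, 8-bit unsigned, mono); on a system without OSS or an OSS emulation layer it only prints a message. The function name `sine_u8` is made up for this example.

```python
import math
import os


def sine_u8(freq_hz: float, seconds: float, rate: int = 8000) -> bytes:
    """Generate a sine tone as 8-bit unsigned mono PCM samples."""
    count = int(rate * seconds)
    return bytes(
        # 128 is silence for unsigned 8-bit audio; swing +/-127 around it.
        128 + int(round(127 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(count)
    )


if __name__ == "__main__":
    tone = sine_u8(440.0, 1.0)
    if os.path.exists("/dev/dsp"):
        # Writing raw samples to the device file plays them.
        with open("/dev/dsp", "wb") as dsp:
            dsp.write(tone)
    else:
        print("/dev/dsp not present; generated", len(tone), "bytes of audio")
```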
2.2.2 ALSA
ALSA (Advanced Linux Sound Architecture) is the successor to OSS and has become the mainstream audio architecture on Linux.
ALSA includes drivers, libraries, and utilities.
alsa-driver: the driver part, integrated into the kernel, mostly in the form of modules
The driver can be divided into the following three layers
-Hardware control layer (bottom): handles access to the hardware; this is the main part of a sound card driver that the manufacturer has to implement
-Core layer (middle): provides predefined audio device components (such as PCM, AC97, sequencer, and control); users can also define their own device components
-Sound card object description layer: an abstract description of the sound card hardware, from which the kernel learns the card's functions, device components, and operating methods
The driver provides the following abstract interfaces to user space
/proc/asound: information interface
/dev/snd/controlCX: control interface
/dev/snd/mixerCXDX: mixer interface
/dev/snd/pcmCXDX: PCM interface
/dev/snd/midiCXDX: MIDI interface
/dev/snd/seq: sequencer interface
/dev/snd/timer: timer interface
alsa-lib: the user-space library, which wraps the abstract interfaces provided by the driver and exposes an API to applications through libasound.so
alsa-utils: a set of utilities built on alsa-lib, such as aplay for playing audio and arecord for recording
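In the /dev/snd node names, the X after C is the card number and the X after D is the device number. The sketch below decodes such names; it also accepts the trailing p (playback) or c (capture) that PCM nodes normally carry, which the list above does not show. The function name `parse_node` and the returned keys are assumptions made for this example.

```python
import os
import re
from typing import Dict, Optional

# Examples: controlC0, pcmC0D0p, pcmC0D0c, midiC1D0
NODE = re.compile(r"^(control|mixer|pcm|midi)C(\d+)(?:D(\d+))?([pc])?$")
DIRECTIONS = {"p": "playback", "c": "capture", None: None}


def parse_node(name: str) -> Optional[Dict[str, object]]:
    """Decode an ALSA device node name, or return None if it has no card number."""
    match = NODE.match(name)
    if match is None:
        return None  # e.g. "seq" and "timer" are not tied to a card
    kind, card, device, direction = match.groups()
    return {
        "kind": kind,
        "card": int(card),
        "device": None if device is None else int(device),
        "direction": DIRECTIONS[direction],
    }


if __name__ == "__main__":
    if os.path.isdir("/dev/snd"):
        for entry in sorted(os.listdir("/dev/snd")):
            print(entry, "->", parse_node(entry))
    else:
        print("/dev/snd not found")
```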
2.2.3 Service Layer
The sound server sits between ALSA and the applications. An application calls the sound server's API to play a sound and sends its audio data to the server; the sound server mixes the playback requests from multiple applications and sends the result to the underlying sound card driver (ALSA/OSS), which drives the sound card to play the mixed data.
ESD (Enlightened Sound Daemon, or ESound): the sound server of the GNOME desktop environment, since replaced by PulseAudio.
aRts (analog Real time synthesizer): the sound server of the KDE desktop environment, since replaced by Phonon.
PulseAudio: a new-generation sound server, intended to provide better sound, and currently the default sound server for GNOME desktops.
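The mixing step that a sound server performs can be illustrated with a small sketch. It assumes every stream is a list of signed 16-bit samples at the same sample rate, adds them sample by sample, and clips the sum to the 16-bit range. Real sound servers also resample and apply per-stream volume, which is left out here; the function name `mix` is chosen for this example.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix(streams: Sequence[Sequence[int]]) -> List[int]:
    """Mix several streams of signed 16-bit samples into one stream."""
    length = max((len(stream) for stream in streams), default=0)
    mixed = []
    for i in range(length):
        # Streams that have already ended contribute silence.
        total = sum(stream[i] for stream in streams if i < len(stream))
        # Clip so the result still fits in a signed 16-bit sample.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


if __name__ == "__main__":
    music = [1000, 2000, 30000]
    beep = [500, -2000, 10000, 7]
    print(mix([music, beep]))
```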
2.2.4 Other software
Besides sound servers, there are some common middle-layer frameworks and sound libraries
JACK Audio Connection Kit: a professional sound server that provides real-time, low-latency connections for audio and MIDI data between applications.
GStreamer: an open source multimedia framework used to build streaming media applications.
Phonon: a cross-platform multimedia framework for Qt.
PortAudio: a cross-platform open source audio I/O library; on Linux it requires alsa-lib.
3. Speech recognition
Speech recognition technology is also known as automatic speech recognition (ASR) or speech to text (STT).
4. Speech synthesis
TTS (text to speech) is available today as both open source and proprietary solutions.
Open source solutions include
-eSpeak/eSpeak NG, Ekho, Festival/Festvox/Flite
Proprietary solutions include
-Google TTS, Amazon Polly, NeoSpeech TTS, and more
-Baidu TTS, Ali TTS, iFlytek TTS, etc.
4.1 eSpeak
eSpeak supports Chinese output, but the result is far from flattering, for example with a test sentence such as:
"I am Chinese, I love China"
4.2 Ekho
Ekho is a free, open source, multilingual text-to-speech program.
It supports Cantonese, Mandarin, and other languages.
4.3 Festival
Festival: a general-purpose multilingual speech synthesis system developed by CSTR at the University of Edinburgh
Festvox: software developed by CMU for building synthetic voices
Flite (Festival-lite): a lightweight speech synthesis system based on Festival, developed by CMU
Simply put, Festival and Flite perform the TTS conversion and output speech using voices built with Festvox.
Festival
# yum install festival
$ echo "Hello, you are using festival" | festival --tts
$ festival --tts myfile
Flite
# yum install flite
$ flite "Hello, you are using Flite" A.wav
$ aplay A.wav
Festival sounds slightly better than eSpeak, but does not support Chinese.
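An assistant would normally call these tools from its own code rather than from a shell. The following is a minimal sketch of doing that in Python with `subprocess`, assuming the `festival` and `flite` binaries are on the PATH; it uses Flite's explicit `-t` (text) and `-o` (output file) options instead of positional arguments. The helper names are made up for this example.

```python
import shutil
import subprocess
from typing import List


def flite_command(text: str, wav_path: str) -> List[str]:
    """Build the argument list for synthesizing `text` into a WAV file."""
    return ["flite", "-t", text, "-o", wav_path]


def speak_with_festival(text: str) -> bool:
    """Speak `text` through Festival; return False if it is not installed."""
    if shutil.which("festival") is None:
        return False
    # Equivalent to: echo "text" | festival --tts
    subprocess.run(["festival", "--tts"], input=text, text=True, check=True)
    return True


def synthesize_with_flite(text: str, wav_path: str) -> bool:
    """Write `text` as speech to `wav_path`; return False if Flite is missing."""
    if shutil.which("flite") is None:
        return False
    subprocess.run(flite_command(text, wav_path), check=True)
    return True


if __name__ == "__main__":
    if not synthesize_with_flite("Hello, you are using Flite", "A.wav"):
        print("flite is not installed")
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.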