Let Java speak: implementing a speech engine in Java
2005-11-07 10:04:09
Category: Java technology
What are the benefits of adding speech capabilities to your application? Roughly speaking, it is fun, and it suits all kinds of entertaining applications, such as games. From a more serious point of view, it is also a matter of application usability. Note that I am thinking not only of the inherent shortcomings of visual interfaces, but also of situations where it is inconvenient, or even illegal, to take your eyes off what you are doing. For example, with a voice-enabled browser, you could listen to your favorite websites while going for a walk or driving to work.
For now, a mail reader may be a more practical application of speech technology, and with the help of the JavaMail API it is entirely feasible. The mail reader could check your inbox regularly and then get your attention by speaking a prompt such as "You have new mail". Similarly, consider a voice-enabled reminder connected to a calendar application: it would prompt you with "Don't forget your meeting with the boss in a few minutes!".
Maybe these ideas appeal to you, or maybe you have better ideas of your own; either way, let's move on. First I will show you how to use the speech engine provided with this article, so that if you find its implementation details too complex, you can simply use it and ignore them.
I. Trying out the speech engine
To use this speech engine, you must include the javatalk.jar file provided with this article in your CLASSPATH, and then run the com.lotontech.speech.Talker class from the command line (or call it from a Java program). From the command line, the command is:
java com.lotontech.speech.Talker "h|e|l|oo"
If called from a Java program, the code is:
com.lotontech.speech.Talker talker=new com.lotontech.speech.Talker();
talker.sayPhoneWord("h|e|l|oo");
You may be puzzled by the "h|e|l|oo" string passed on the command line (or to the sayPhoneWord() method). Let me explain.
The speech engine works by joining together short sound samples, each of which is a minimal unit of spoken (English) language. These sound samples are called phonemes (allophones). Each phoneme corresponds to one, two, or three letters. As the "hello" example above shows, the pronunciation of some letter combinations is obvious, while that of others is not:
h: the pronunciation is obvious
e: the pronunciation is obvious
l: the pronunciation is obvious, but note that the double "l" has been reduced to a single "l"
oo: pronounced as in "hello", not as in "bot" or "too"
The following is a list of valid phonemes:
a: as in cat
b: as in cab
c: as in cat
d: as in dot
e: as in bet
f: as in frog
g: as in frog
h: as in hog
i: as in pig
j: as in jig
k: as in keg
l: as in leg
m: as in met
n: as in begin
o: as in not
p: as in pot
r: as in rot
s: as in sat
t: as in sat
u: as in put
v: as in [...]
w: as in wet
y: as in yet
z: as in zoo
aa: as in fake
ay: as in hay
ee: as in bee
ii: as in high
oo: as in go
bb: variation of b with different emphasis
dd: variation of d with different emphasis
ggg: variation of g with different emphasis
hh: variation of h with different emphasis
ll: variation of l with different emphasis
nn: variation of n with different emphasis
rr: variation of r with different emphasis
tt: variation of t with different emphasis
yy: variation of y with different emphasis
ar: as in car
aer: as in care
ch: as in which
ck: as in check
ear: as in beer
er: as in later
err: as in later (longer sound)
ng: as in feeding
or: as in law
ou: as in zoo
ouu: as in zoo (longer sound)
ow: as in cow
oy: as in boy
sh: as in shut
th: as in thing
dth: as in this
uh: variation of u
wh: as in where
en: as in Asian
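Because a mistyped phoneme has no matching sound sample, it can help to check a phoneme string against the list above before speaking it. The following is a minimal sketch of such a check; the class PhonemeCheckDemo and its isValid() method are hypothetical and are not part of javatalk.jar. Ignoring letter case is an assumption made for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of javatalk.jar: checks that every
// "|"-separated token of a phoneme string appears in the phoneme list.
class PhonemeCheckDemo {

    // The phoneme names from the list above.
    static final Set<String> VALID = new HashSet<>(Arrays.asList(
        "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
        "n", "o", "p", "r", "s", "t", "u", "v", "w", "y", "z",
        "aa", "ay", "ee", "ii", "oo",
        "bb", "dd", "ggg", "hh", "ll", "nn", "rr", "tt", "yy",
        "ar", "aer", "ch", "ck", "ear", "er", "err", "ng", "or",
        "ou", "ouu", "ow", "oy", "sh", "th", "dth", "uh", "wh", "en"));

    // Returns true only if every token is a listed phoneme.
    // Case is ignored (an assumption of this sketch).
    static boolean isValid(String phoneWord) {
        // The -1 limit keeps empty tokens, so "h||e" or "" is rejected.
        for (String phoneme : phoneWord.split("\\|", -1)) {
            if (!VALID.contains(phoneme.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("h|e|l|oo")); // true
        System.out.println(isValid("h|e|l|xx")); // false
    }
}
```

A check like this could run before the phoneme string is handed to the speech engine, so that typing mistakes show up as an error rather than as a missing sound.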
When people speak, the pitch of the voice rises and falls over the course of a sentence. These changes in intonation make speech sound more natural and more expressive, and they allow questions to be distinguished from statements. Consider the following two sentences:
It is fake. -- f|aa|k
Is it fake? -- f|AA|k
As you may have guessed, the way to raise the intonation is to use uppercase letters.
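As a small illustration of the uppercase convention, the sketch below raises the intonation of one phoneme in a phoneme string. The class IntonationDemo and its raise() method are hypothetical helpers written for this example; they are not part of the speech engine.

```java
import java.util.Locale;

// Hypothetical helper, not part of javatalk.jar: raises the intonation of
// one phoneme in a "|"-separated phoneme string by uppercasing it.
class IntonationDemo {

    // Returns a copy of phoneWord with the phoneme at the given
    // zero-based index converted to uppercase.
    static String raise(String phoneWord, int index) {
        String[] phonemes = phoneWord.split("\\|");
        phonemes[index] = phonemes[index].toUpperCase(Locale.ROOT);
        return String.join("|", phonemes);
    }

    public static void main(String[] args) {
        System.out.println(raise("f|aa|k", 1)); // prints f|AA|k
    }
}
```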
That is all you need to know to use the software. If you are interested in how it is implemented behind the scenes, read on.
II. Implementing the speech engine
The implementation of the speech engine consists of just one class with four methods. It uses the Java Sound API included in J2SE 1.3. I will not introduce this API comprehensively here, but you can learn how to use it by example. The Java Sound API is not particularly complex, and the comments in the code will tell you what you need to know.
The following is the basic definition of the Talker class:
package com.lotontech.speech;
import javax.sound.sampled.*;
import java.io.*;
import java.util.*;
import java.net.*;
public class Talker
{
    private SourceDataLine line=null;
}
If you run Talker from the command line, the following main() method is the entry point. It takes the first command-line argument and passes it to the sayPhoneWord() method:
/*
* Reads out the phoneme string specified on the command line
*/
public static void main(String args[])
{
    Talker player=new Talker();
    if (args.length>0) player.sayPhoneWord(args[0]);
    System.exit(0);
}
The sayPhoneWord() method can be called either through the main() method above or directly from a Java program. It looks more complex than it really is. In fact, it simply iterates over the phonemes of the word (separated by "|" in the input string) and plays them one after another through a sound output channel. To make the speech sound more natural, I merge the end of each sound sample with the beginning of the next one:
/*
* Reads out the specified phoneme string
*/
public void sayPhoneWord(String word)
{
    // Placeholder byte array for the previous sound
    byte[] previousSound=null;
    // Split the input string into separate phonemes
    StringTokenizer st=new StringTokenizer(word, "|", false);
    while (st.hasMoreTokens())
    {
        // Construct the file name for this phoneme
        String thisPhoneFile=st.nextToken();
        thisPhoneFile="/allophones/"+thisPhoneFile+".au";
        // Read the data from the sound file
        byte[] thisSound=getSound(thisPhoneFile);
        if (previousSound!=null)
        {
            // Merge the previous phoneme with the current one, if possible
            int mergeCount=0;
            if (previousSound.length>=500 && thisSound.length>=500)
                mergeCount=500;
            for (int i=0; i<mergeCount; i++)
            {
                previousSound[previousSound.length-mergeCount+i]
                    =(byte)((previousSound[previousSound.length
                    -mergeCount+i]+thisSound[i])/2);
            }
            // Play the previous phoneme
            playSound(previousSound);
            // Keep the truncated current phoneme as the new previous phoneme
            byte[] newSound=new byte[thisSound.length-mergeCount];
            for (int ii=0; ii<newSound.length; ii++)
                newSound[ii]=thisSound[ii+mergeCount];
            previousSound=newSound;
        }
        else
            previousSound=thisSound;
    }
    // Play the last phoneme and drain the sound channel
    playSound(previousSound);
    drain();
}
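The merge step can be tried in isolation, without any audio hardware. The sketch below applies the same averaging to two small byte arrays, using an overlap of 2 bytes in place of the 500 used by the speech engine. The class MergeDemo and its two methods are illustrative names that do not appear in the Talker class.

```java
import java.util.Arrays;

// A standalone sketch of the overlap-and-average merge. The class and
// method names are illustrative and do not appear in the Talker class.
class MergeDemo {

    // Averages the last `count` bytes of `previous` with the first `count`
    // bytes of `next`, writing the result into the tail of `previous`.
    static void mergeTail(byte[] previous, byte[] next, int count) {
        for (int i = 0; i < count; i++) {
            int p = previous.length - count + i;
            previous[p] = (byte) ((previous[p] + next[i]) / 2);
        }
    }

    // Returns `next` without its first `count` bytes, which have
    // already been merged into the previous sound.
    static byte[] dropHead(byte[] next, int count) {
        return Arrays.copyOfRange(next, count, next.length);
    }

    public static void main(String[] args) {
        byte[] previous = {10, 20, 30, 40};
        byte[] next = {0, 100, 50, 60};
        mergeTail(previous, next, 2);
        System.out.println(Arrays.toString(previous));          // [10, 20, 15, 70]
        System.out.println(Arrays.toString(dropHead(next, 2))); // [50, 60]
    }
}
```

Note that the averaging works byte by byte, as in the speech engine's own loop, rather than on whole audio samples, so it is best read as a simple smoothing of the join and not as a precise audio crossfade.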
At the end of sayPhoneWord(), you can see that it calls playSound() to output a single sound sample (that is, a phoneme) and then calls drain() to flush the sound channel. Here is the code for playSound():
/*
* This method plays a sound sample
*/
private void playSound(byte[] data)
{
    if (data.length>0) line.write(data, 0, data.length);
}
Here is the code for drain():
/*
* This method flushes the sound channel
*/
private void drain()
{
    if (line!=null) line.drain();
    // The pause length is missing from the source; 100 ms is an assumption
    try {Thread.sleep(100);} catch (Exception e) {}
}
Looking back at sayPhoneWord(), there is one more method we have not yet examined: getSound().
The getSound() method reads a pre-recorded sound sample from an .au file as byte data. For the details of reading the data, converting the audio format, initializing the sound output line (SourceDataLine), and building the byte data, see the comments in the following code:
/*
* This method reads a phoneme from a file
* and converts it to a byte array
*/
private byte[] getSound(String fileName)
{
    try
    {
        URL url=Talker.class.getResource(fileName);
        AudioInputStream stream = AudioSystem.getAudioInputStream(url);
        AudioFormat format = stream.getFormat();
        // Convert an ALAW/ULAW sound to PCM for playback
        if ((format.getEncoding() == AudioFormat.Encoding.ULAW) ||
            (format.getEncoding() == AudioFormat.Encoding.ALAW))
        {
            AudioFormat tmpFormat = new AudioFormat(
                AudioFormat.Encoding.PCM_SIGNED,
                format.getSampleRate(), format.getSampleSizeInBits() * 2,
                format.getChannels(), format.getFrameSize() * 2,
                format.getFrameRate(), true);
            stream = AudioSystem.getAudioInputStream(tmpFormat, stream);
            format = tmpFormat;
        }
        DataLine.Info info = new DataLine.Info(
            Clip.class, format,
            ((int) stream.getFrameLength() * format.getFrameSize()));
        if (line==null)
        {
            // The output line is not yet instantiated.
            // Can we find a suitable kind of output line?
            DataLine.Info outInfo = new DataLine.Info(SourceDataLine.class,
                format);
            if (!AudioSystem.isLineSupported(outInfo))
            {
                System.out.println("No output line matching " + outInfo + " is supported");
                throw new Exception("No output line matching " + outInfo + " is supported");
            }
            // Open the output line
            line = (SourceDataLine) AudioSystem.getLine(outInfo);
            line.open(format, 50000);
            line.start();
        }
        int frameSizeInBytes = format.getFrameSize();
        int bufferLengthInFrames = line.getBufferSize()/8;
        int bufferLengthInBytes = bufferLengthInFrames * frameSizeInBytes;
        byte[] data=new byte[bufferLengthInBytes];
        // Read the byte data and count the bytes read
        int numBytesRead = 0;
        if ((numBytesRead = stream.read(data)) != -1)
        {
            int numBytesRemaining = numBytesRead;
        }
        // Trim the byte data to the right size
        byte[] newData=new byte[numBytesRead];
        for (int i=0; i<numBytesRead; i++)
            newData[i]=data[i];
        return newData;
    }
    catch (Exception e)
    {
        return new byte[0];
    }
}
That is all the code: a speech synthesizer in about 150 lines, comments included.
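The format conversion inside getSound() is mostly arithmetic on the audio format, and that part can be examined without a sound file or an audio device. The sketch below builds the PCM format that would be requested for a u-law source. The class FormatDemo, its toPcm() method, and the 8 kHz mono source format are assumptions chosen for illustration.

```java
import javax.sound.sampled.AudioFormat;

// A sketch of the format arithmetic only; no audio device or file is used.
// The 8 kHz mono u-law source format below is an assumption.
class FormatDemo {

    // Builds the signed PCM format requested for a u-law or A-law source:
    // the sample size and the frame size are both doubled.
    static AudioFormat toPcm(AudioFormat source) {
        return new AudioFormat(
            AudioFormat.Encoding.PCM_SIGNED,
            source.getSampleRate(),
            source.getSampleSizeInBits() * 2,
            source.getChannels(),
            source.getFrameSize() * 2,
            source.getFrameRate(),
            true); // big-endian byte order
    }

    public static void main(String[] args) {
        AudioFormat ulaw = new AudioFormat(
            AudioFormat.Encoding.ULAW, 8000f, 8, 1, 1, 8000f, false);
        AudioFormat pcm = toPcm(ulaw);
        System.out.println(pcm.getSampleSizeInBits()); // 16
        System.out.println(pcm.getFrameSize());        // 2
    }
}
```

The doubling reflects that each 8-bit compressed sample expands to a 16-bit linear sample, which is why the byte arrays handled by the speech engine are twice the size of the encoded data.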
III. Text-to-speech conversion
Specifying the words to be spoken as phonemes is rather cumbersome. If you want to build an application that reads text aloud (such as a web page or an email), you will want to supply the original text directly.
After looking into this problem, I have included an experimental text-to-speech conversion class in the zip file provided with this article. When you run this class, it displays the results of its analysis. The class can be run from the command line as follows:
java com.lotontech.speech.Converter "hello there"
The output looks like this:
hello -- h|e|l|oo
there -- dth|aer
If you run the following command:
java com.lotontech.speech.Converter "I like to read JavaWorld"
The output is:
i -- ii
like -- l|ii|k
to -- t|ouu
read -- r|ee|a|d
java -- j|a|v|a
world -- w|err|l|d
How does this conversion class work? My approach is actually fairly simple: the conversion applies a set of text replacement rules in a fixed order. For example, for the words "ant", "want", "wanted", "unwanted", and "unique", the replacement rules we want to apply might be:
Replace "*unique*" with "|y|ou|n|ee|k|"
Replace "*want*" with "|w|o|n|t|"
Replace "*a*" with "|a|"
Replace "*e*" with "|e|"
Replace "*d*" with "|d|"
Replace "*n*" with "|n|"
Replace "*u*" with "|u|"
Replace "*t*" with "|t|"
For "unwanted", the sequence of outputs is:
unwanted
un[|w|o|n|t|]ed (rule 2)
[|u|][|n|][|w|o|n|t|][|e|][|d|] (rules 4, 5, 6, 7)
u|n|w|o|n|t|e|d (after removing the extra characters)
You can see that words containing the letters "want" are pronounced differently from words containing the letters "ant". You can also see that the special-case rule for the complete word "unique" takes precedence over the other rules, so "unique" is read as "y|ou..." rather than "u|n...".
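The ordered-replacement idea can be sketched in a few lines. The code below is not the Converter class from the zip file; it is a minimal illustration that assumes each rule is a plain substring match, that rules are tried in the listed order, and that text already converted by an earlier rule is never matched again. The class name RuleConverterDemo and its methods are hypothetical.

```java
import java.util.Locale;

// A minimal sketch of ordered replacement rules. This is NOT the Converter
// class from the article's zip file; all names here are hypothetical.
class RuleConverterDemo {

    // Ordered rules: {letters to match, phonemes to emit}.
    // Only the eight example rules from the text are included.
    static final String[][] RULES = {
        {"unique", "y|ou|n|ee|k"},
        {"want", "w|o|n|t"},
        {"a", "a"}, {"e", "e"}, {"d", "d"},
        {"n", "n"}, {"u", "u"}, {"t", "t"},
    };

    // Applies one rule to the unconverted parts of `work`. Converted
    // groups are wrapped in [ ] so that later rules skip over them.
    static String applyRule(String work, String letters, String phonemes) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < work.length()) {
            if (work.charAt(i) == '[') {
                // Copy an already-converted group unchanged.
                int end = work.indexOf(']', i);
                out.append(work, i, end + 1);
                i = end + 1;
            } else if (work.startsWith(letters, i)) {
                out.append('[').append(phonemes).append(']');
                i += letters.length();
            } else {
                out.append(work.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Converts a word made of letters covered by RULES to a phoneme string.
    static String convert(String word) {
        String work = word.toLowerCase(Locale.ROOT);
        for (String[] rule : RULES) {
            work = applyRule(work, rule[0], rule[1]);
        }
        // "][" is a boundary between groups; the outer brackets are dropped.
        return work.replace("][", "|").replace("[", "").replace("]", "");
    }

    public static void main(String[] args) {
        System.out.println(convert("unwanted")); // u|n|w|o|n|t|e|d
        System.out.println(convert("unique"));   // y|ou|n|ee|k
        System.out.println(convert("ant"));      // a|n|t
    }
}
```

Because the rules are tried in order, placing the longer, more specific patterns first is what lets "unique" and "want" win over the single-letter rules.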
Conclusion: This article provides a ready-to-run speech engine that you can use in your own Java 1.3 applications. If you study the code carefully, it also serves as a practical tutorial on playing audio clips with the Java Sound API. To make it truly useful, you should look into text-to-speech technology, because that is the real foundation for the text-reading applications I mentioned earlier. To make the approach described here work well, you would need to build a large base of replacement rules and carefully tune the order in which they are applied. I hope you have more perseverance than I do!