Published on ONJava.com (http://www.onjava.com/)


The Java Speech API, Part 1

by Mandar S. Chitnis and Lakshmi Ananthamurthy
08/06/2003

The idea of machines that speak and understand human speech has long been a fascination of application users and application builders. With advances in speech technology, this concept has now become a reality. Research projects have evolved and refined speech technology, making it feasible to develop applications that use speech technology to enhance the user's experience. There are two main speech technology concepts -- speech synthesis and speech recognition.

Speech synthesis is the process of generating human speech from written text in a specific language. Speech recognition is the process of converting human speech into words or commands; the converted text can then be used or interpreted in different ways.

Over the course of two articles, we will explore the use of the Java Speech API to write applications that have speech synthesis and speech recognition capabilities. In addition, we will look at the application areas where we can effectively use speech technology.

Speech Technology Support in Java

A speech-enabled application does not directly interact with the audio hardware of the machine on which it runs. Instead, there is a common application, termed the Speech Engine, which provides speech capability and mediates between the audio hardware and the speech-enabled application, as shown in Figure 1.

Figure 1. Speech engine

Each vendor's speech engine exposes speech capabilities in a vendor-specific way. To let speech applications use this functionality, vendors design speech engines that expose their services through a commonly defined and agreed-upon application programming interface.

Java Speech API

This is where the Java Speech API (JSAPI) steps into the picture. The Java Speech API brings to the table all of the platform- and vendor-independent features commonly associated with any Java API. The Java Speech API enables speech applications to interact with speech engines in a common, standardized, and implementation-independent manner. Speech engines from different vendors can be accessed using the Java Speech API, as long as they are JSAPI-compliant.

With JSAPI, speech applications can use speech engine functionality such as selecting a specific language or a voice, as well as any required audio resources. JSAPI provides an API for both speech synthesis and speech recognition.

Figure 2. The Java Speech API stack

Figure 2 shows the Java Speech API stack. At the bottom of the stack, the speech engine interacts with the audio hardware. On top of it sits the Java Speech API, which provides a standard and consistent way to access the speech synthesis and speech recognition functionality of the speech engine. Java applications that need to incorporate speech functionality use the Java Speech API to access the speech engine.

Several speech engines, both commercial and open source, are JSAPI-compliant. Among open source engines, the Festival speech synthesis system is one of the popular ones that expose their services through JSAPI. Many commercial speech engines also support JSAPI; you can find a comprehensive list of them on the Java Speech API web site.

Java Speech API: Important Classes and Interfaces

The different classes and interfaces that form the JSAPI are grouped into three packages:

- javax.speech: classes and interfaces for a generic speech engine
- javax.speech.synthesis: classes and interfaces for speech synthesis
- javax.speech.recognition: classes and interfaces for speech recognition

Before we proceed with writing an application that uses JSAPI, let's explore a few important classes and interfaces in each of these packages.

Figure 3. JSAPI speech engine interfaces and classes

Central

The Central class acts as a factory that all JSAPI applications use. It provides static methods for locating and creating speech synthesis and speech recognition engines.

Engine

The Engine interface encapsulates the generic operations that a JSAPI-compliant speech engine should provide for speech applications. Primarily, speech applications can use methods to perform actions such as retrieving the properties and state of the speech engine and allocating and deallocating resources for a speech engine. In addition, the Engine interface exposes mechanisms to pause and resume the audio stream generated or processed by the speech engine. The Engine interface is subclassed by the Synthesizer and Recognizer interfaces, which define additional speech synthesis and speech recognition functionality.
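The Engine interface represents its state as a long bitmask (ALLOCATED, DEALLOCATED, and, within ALLOCATED, substates such as PAUSED and RESUMED), which applications query with testEngineState() or block on with waitEngineState(). The bitmask idea can be sketched in plain Java; note that the constant values below are illustrative, not the actual javax.speech.Engine values:

```java
public class EngineStateSketch {
    // Illustrative state bits, in the style of javax.speech.Engine
    // (the real constants have different values).
    public static final long DEALLOCATED = 1L << 0;
    public static final long ALLOCATED   = 1L << 1;
    public static final long PAUSED      = 1L << 2;  // substate of ALLOCATED
    public static final long RESUMED     = 1L << 3;  // substate of ALLOCATED

    private long state = DEALLOCATED;

    // Analogue of Engine.testEngineState(): true if every bit in
    // 'required' is set in the current state.
    public boolean testEngineState(long required) {
        return (state & required) == required;
    }

    public void allocate() { state = ALLOCATED | RESUMED; }
    public void pause()    { state = (state & ~RESUMED) | PAUSED; }
    public void resume()   { state = (state & ~PAUSED) | RESUMED; }

    public static void main(String[] args) {
        EngineStateSketch e = new EngineStateSketch();
        e.allocate();
        e.pause();
        System.out.println(e.testEngineState(ALLOCATED | PAUSED)); // true
        System.out.println(e.testEngineState(RESUMED));            // false
    }
}
```

Because each substate is an independent bit, a single test can check a combination such as "allocated and paused" in one call, which is how JSAPI applications typically guard their speech operations.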

The JSAPI has been modeled on the event-handling model of AWT components. Hence, events generated by the speech engine can be identified and handled as required. There are two ways to handle speech engine events: through the EngineListener interface or the EngineAdapter class.
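This is the same listener/adapter split that AWT uses: the listener interface declares one method per event, and the adapter class supplies empty implementations so that an application can override only the events it cares about. The pattern can be sketched with a made-up two-event interface (the real EngineListener declares more methods than this):

```java
// Illustrative two-event listener interface; the real
// javax.speech.EngineListener declares more event methods.
interface MiniEngineListener {
    void engineAllocated(String engineName);
    void enginePaused(String engineName);
}

// Adapter with do-nothing bodies, in the style of EngineAdapter:
// subclasses override only the events they care about.
class MiniEngineAdapter implements MiniEngineListener {
    public void engineAllocated(String engineName) { }
    public void enginePaused(String engineName) { }
}

public class AdapterDemo {
    public static void main(String[] args) {
        // Override a single event instead of implementing the whole interface.
        MiniEngineListener listener = new MiniEngineAdapter() {
            public void engineAllocated(String engineName) {
                System.out.println(engineName + " allocated");
            }
        };
        listener.engineAllocated("FreeTTS"); // prints "FreeTTS allocated"
        listener.enginePaused("FreeTTS");    // inherited no-op
    }
}
```

Extending the adapter keeps application code short, at the cost of losing the compile-time check that every event method is handled.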

Next, let's examine some of the important classes and interfaces of the javax.speech.synthesis package. These will be used quite frequently in speech applications.

Figure 4. JSAPI speech synthesis interfaces and classes

Synthesizer

The Synthesizer interface encapsulates the operations that a JSAPI-compliant speech synthesis engine should provide for speech applications. Primarily, speech applications can perform actions such as producing speech output (given text input) or stopping speech-synthesis processing. Other related operations are inherited from the Engine interface. The Synthesizer interface provides different sources of text input, ranging from a plain String, to a URL, to a special-purpose markup language called Java Speech Markup Language (JSML, discussed in the next article).

SynthesizerProperties

The operations in the SynthesizerProperties interface are used to define runtime properties for the Synthesizer object, including the voice, volume, and pitch for speech synthesis by the Synthesizer object.

Voice

The Voice class represents the voice that the Synthesizer object uses to play the speech output. The Voice class also provides methods to obtain metadata information for the voice used for speech synthesis by the Synthesizer object. This metadata includes the name, age, and gender of the voice being used.

As with the Engine interface, events generated during speech synthesis can be identified and handled either by implementing the methods of the SpeakableListener interface or by extending the SpeakableAdapter class.

We will explore the classes and interfaces of the javax.speech.recognition package in the next article.

"Can You Hear Me Now?" Asks the Duke

In order to understand the JSAPI better, let's write a simple application that uses the JSAPI to provide speech synthesis capability. We will build a simple text editor using the Java Swing API set and add speech capability to the editor using JSAPI to enable the application to speak the contents of a file. What speech synthesis capabilities do we add to the speech-enabled text editor? As we saw in the previous section, the speech synthesis engine provides different features, such as producing speech output from text, pausing or resuming the speech output, or ending the speech output generation. We can add the following capabilities to the VoicePad editor:

- Speak the contents of the file opened in the editor (or the currently selected text)
- Pause the speech output
- Resume the paused speech output
- Cancel a speak operation that is in progress

The user can invoke any of these speech capabilities by clicking the relevant menu item (“Play”, “Pause”, “Resume”, or “Stop”) on the Speech menu.

To build the speech-enabled text editor, first we will define the user interface (UI) elements that we will need. We can use the text area element as the text editor for our application. For navigation and user interaction, we will define a menu bar with menus and menu options. Since our application functionality consists of two parts -- text editing and speech -- we will define two sets of menus for the VoicePad application:

- A File menu with New, Open, Save, and Exit menu items for the text editing functionality
- A Speech menu with Play, Pause, Resume, and Stop menu items for the speech functionality

Figure 5. Class diagram of the VoicePad application

Now that we know what is required, let's put together the Java Swing pieces for the application. The primary class of the application is the VoicePad class that extends from the JFrame class. As shown in the class diagram in Figure 5, the VoicePad class will contain all of the methods required for both text editing and speech functionality.

The constructor for the VoicePad application is responsible for initializing the application elements. The constructor invokes the init() method, which performs the initialization of the user interface elements and the speech engine. The JTextArea UI element will be the text editor for our application.

// constructor
public VoicePad()
{
    super("Novusware - VoicePad");
    setSize(800, 600);

    // initialize the application settings
    init();

    // set up the file selection
    fileChooser = new JFileChooser();
    fileChooser.setCurrentDirectory(new File("."));

    WindowListener wndCloser = new WindowAdapter()
    {
        public void windowClosing(WindowEvent e)
        {
            closeSpeechSynthesisEngine();
            System.exit(0);
        }
    };
    addWindowListener(wndCloser);
}

// initialization
private void init()
{
    textArea = new JTextArea();
    ps      = new JScrollPane(textArea);
    this.getContentPane().add(ps, BorderLayout.CENTER);

    textArea.append("Voice-enabled editor.");

    initSpeechSynthesisEngine();

    this.setJMenuBar(getVoicePadMenuBar());

    System.out.println("NOVUSWARE : VoicePad application initialized.");
}

Next, the initSpeechSynthesisEngine() method is called from the init() method. The initSpeechSynthesisEngine() method initializes the FreeTTS speech synthesis engine using the Central class and selects the voice to be used for the speech output. A successful initialization returns a reference to the speech synthesis engine as a Synthesizer object. This reference will be used throughout the life of the VoicePad application to provide speech synthesis functionality.

private void initSpeechSynthesisEngine()
{
    String message = "";
    String synthesizerName = System.getProperty("synthesizerName",
        "Unlimited domain FreeTTS Speech Synthesizer from Sun Labs");

    // Create a new SynthesizerModeDesc that will match the FreeTTS synthesizer.
    SynthesizerModeDesc desc = new SynthesizerModeDesc
                                    (synthesizerName,
                                    null,
                                    Locale.US,
                                    Boolean.FALSE,
                                    null);
    // obtain the Speech Synthesizer instance
    try
    {
        synthesizer = Central.createSynthesizer(desc);

        if (synthesizer == null)
        {
            message = "Make sure that there is a \"speech.properties\" file " +
                        "at either of these locations: \n" +
                        "user.home : " +
                        System.getProperty("user.home") + "\n" +
                        "java.home/lib: " + System.getProperty("java.home") +
                        File.separator + "lib\n";

            System.out.println("NOVUSWARE : ERROR! Synthesizer not found!");
            System.out.println(message);

            throw new Exception("Synthesizer not found!");
        }

        System.out.println("NOVUSWARE : Speech synthesizer obtained.");

        voiceName = System.getProperty("voiceName", VOICE_SELECTED);
        voice     = new Voice(voiceName, Voice.GENDER_DONT_CARE,
                        Voice.AGE_DONT_CARE, null);

        System.out.println("NOVUSWARE : Voice " + VOICE_SELECTED
                         + " selected.");
        synthesizer.allocate();
        synthesizer.resume();
        synthesizer.getSynthesizerProperties().setVoice(voice);
    }
    catch(Exception e)
    {
        e.printStackTrace();
        System.out.println("NOVUSWARE : ERROR!" + e);
        closeSpeechSynthesisEngine();
    }

    System.out.println("NOVUSWARE : Speech engine initialized.");
}
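One method that the listing above calls but the article does not show is closeSpeechSynthesisEngine(). A minimal sketch of what such a cleanup method might look like, based on the standard JSAPI Engine lifecycle (this body is an assumption, not the authors' original code):

```java
// Hypothetical cleanup method -- not listed in the article.
// cancelAll() and deallocate() are standard javax.speech calls:
// the first drops any queued speech output, the second releases
// the engine's resources.
private void closeSpeechSynthesisEngine()
{
    try
    {
        if (synthesizer != null)
        {
            synthesizer.cancelAll();
            synthesizer.deallocate();
            synthesizer = null;
        }
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }

    System.out.println("NOVUSWARE : Speech engine closed.");
}
```

Deallocating the engine before System.exit() matters because some engines hold native audio resources that are not released automatically.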

The different UI items of the application are initialized from the init() method of the VoicePad class. We will wrap the code that creates the menu UI items and their action listeners in simple getXXX() methods. First, the getVoicePadMenuBar() method creates the menu bar for the VoicePad application, to which the two sets of user navigation menus will be attached. The next step is to create the file and speech menus using getFileMenu() and getSpeechMenu(), respectively.

// menu bar creation
private JMenuBar getVoicePadMenuBar()
{
    if (menuBar == null)
    {
        menuBar = new JMenuBar();

        menuBar.add(getFileMenu());
        menuBar.add(getSpeechMenu());

        System.out.println("NOVUSWARE : Menubar added.");
    }

    return menuBar;
}

// file menu and file menu items begin
private JMenu getFileMenu()
{
    if (fileMenu == null)
    {
        // create menu
        fileMenu = new JMenu("File");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    System.out.println("NOVUSWARE : " 
                                     + "File menu action performed.");
                }
            };
        fileMenu.setMnemonic('f');
        fileMenu.addActionListener(myActionListener);

        fileMenu.add(getNewMenuItem());
        fileMenu.add(getOpenMenuItem());
        fileMenu.add(getSaveMenuItem());
        fileMenu.add(getExitMenuItem());

        System.out.println("NOVUSWARE : File menu created.");
    }

    return fileMenu;
}

// speech menu and speech menu items begin
private JMenu getSpeechMenu()
{
    if (speechMenu == null)
    {
        // create menu
        speechMenu = new JMenu("Speech");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    System.out.println("NOVUSWARE : "
                                     + "Speech menu action performed.");
                }
            };
        speechMenu.setMnemonic('s');
        speechMenu.addActionListener(myActionListener);

        speechMenu.add(getPlayMenuItem());
        speechMenu.add(getPauseMenuItem());
        speechMenu.add(getResumeMenuItem());
        speechMenu.add(getStopMenuItem());

        System.out.println("NOVUSWARE : Speech menu created.");
    }
    return speechMenu;
}

Menu items for the file menu to enable user actions (creating a new file, opening an existing file, saving the contents of an edited file, and closing a file) will be created using the getNewMenuItem(), getOpenMenuItem(), getSaveMenuItem(), and getExitMenuItem() methods. Since this is a Swing application, the application will respond to different user-navigation events. We will define action listeners for each of the menu items that generate these user-navigation events in the VoicePad application. The text editing functionality will be handled by the action listeners of the menu items of the file menu.

private JMenuItem getNewMenuItem()
{
    if (newMenuItem == null)
    {
        newMenuItem = new JMenuItem("New");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    textArea.setText("");
                    System.out.println("NOVUSWARE : "
                                     + "New menu item action performed.");
                }
            };
        newMenuItem.setMnemonic('n');
        newMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : New menu item created.");
    }

    return newMenuItem;
}

private JMenuItem getOpenMenuItem()
{
    if (openMenuItem == null)
    {
        openMenuItem = new JMenuItem("Open");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    VoicePad.this.repaint();
                    if (fileChooser.showOpenDialog(VoicePad.this) ==
                        JFileChooser.APPROVE_OPTION)
                    {
                        File fSelected = fileChooser.getSelectedFile();
                        try
                        {
                            FileReader in = new FileReader(fSelected);
                            textArea.read(in, null);
                            in.close();
                        }
                        catch(IOException ioe)
                        {
                            ioe.printStackTrace();
                        }
                    }

                    System.out.println("NOVUSWARE : "
                                     + "Open menu item action performed.");
                }
            };
        openMenuItem.setMnemonic('o');
        openMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Open menu item created.");
    }

    return openMenuItem;
}

private JMenuItem getSaveMenuItem()
{
    if (saveMenuItem == null)
    {
        saveMenuItem = new JMenuItem("Save");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    VoicePad.this.repaint();
                    if (fileChooser.showSaveDialog(VoicePad.this) ==
                        JFileChooser.APPROVE_OPTION)
                    {
                        File fSelected = fileChooser.getSelectedFile();
                        try
                        {
                            FileWriter out = new FileWriter(fSelected);
                            VoicePad.this.textArea.write(out);
                            out.close();
                        }
                        catch(IOException ioe)
                        {
                            ioe.printStackTrace();
                        }
                    }

                    System.out.println("NOVUSWARE : "
                                     + "Save menu item action performed.");
                }
            };
        saveMenuItem.setMnemonic('s');
        saveMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Save menu item created.");
    }

    return saveMenuItem;
}

private JMenuItem getExitMenuItem()
{
    if (exitMenuItem == null)
    {
        exitMenuItem = new JMenuItem("Exit");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    closeSpeechSynthesisEngine();
                    System.out.println("NOVUSWARE : "
                                     + "Exit menu item action performed.");
                    System.exit(0);
                }
            };
        exitMenuItem.setMnemonic('x');
        exitMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Exit menu item created.");
    }

    return exitMenuItem;
}

The menu items for the speech menu to enable user actions for invoking speech capabilities (speaking the contents of the text editor, pausing and resuming the speech synthesis operations, and canceling a speak operation in progress) will be created using the getPlayMenuItem(), getPauseMenuItem(), getResumeMenuItem(), and getStopMenuItem() methods. The action listeners for the menu items of the speech menu will contain the JSAPI-specific code for speech synthesis actions.

Let's see the action listeners for each of the menu items of the speech menu. The action listener for the “Play” menu item contains the call to the speakPlainText() method of the Synthesizer object. This invokes the speech synthesis capability provided by the FreeTTS speech synthesis engine. The contents of the text file opened in the text editor will be read out by the speech synthesis engine.

private JMenuItem getPlayMenuItem()
{
    if (playMenuItem == null)
    {
        playMenuItem = new JMenuItem("Play");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    // retrieve the text to be played
                    final String textToPlay =
                        (textArea.getSelectedText() != null)
                            ? textArea.getSelectedText()
                            : textArea.getText();

                    // speak on a background thread: waitEngineState()
                    // blocks until the queue is empty, and blocking the
                    // Swing event dispatch thread here would leave the
                    // Pause, Resume, and Stop menu items unresponsive
                    // while speech is playing
                    new Thread(new Runnable()
                    {
                        public void run()
                        {
                            try
                            {
                                // play the text
                                synthesizer.speakPlainText(textToPlay, null);

                                // wait till speaking is done
                                synthesizer.waitEngineState(
                                    Synthesizer.QUEUE_EMPTY);

                                System.out.println("NOVUSWARE : "
                                    + "Play menu item action performed.");
                            }
                            catch(Exception e)
                            {
                                e.printStackTrace();
                                System.out.println("NOVUSWARE : ERROR! "
                                    + "Play menu item action." + e);
                            }
                        }
                    }).start();
                }
            };
        playMenuItem.setMnemonic('p');
        playMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Play menu item created.");
    }
    return playMenuItem;
}

The action listener for the “Pause” menu item contains the call to the pause() method, inherited from the Engine interface. This makes the FreeTTS speech synthesis engine pause its speech output.

private JMenuItem getPauseMenuItem()
{
    if (pauseMenuItem == null)
    {
        pauseMenuItem = new JMenuItem("Pause");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    try
                    {
                        // pause the speech synthesizer
                        synthesizer.pause();
                        System.out.println("NOVUSWARE : "
                                         + "Pause menu item action performed.");
                    }
                    catch(Exception e)
                    {
                        e.printStackTrace();
                        System.out.println("NOVUSWARE : ERROR! "
                                         + "Pause menu item action." + e);
                    }
                }
            };
        pauseMenuItem.setMnemonic('a');
        pauseMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Pause menu item created.");
    }
    return pauseMenuItem;
}

The action listener for the “Resume” menu item contains the call to the resume() method of the Synthesizer object, inherited from the Engine interface. This makes the FreeTTS speech synthesis engine continue speech synthesis processing from the position in the text where it was paused.

private JMenuItem getResumeMenuItem()
{
    if (resumeMenuItem == null)
    {
        resumeMenuItem = new JMenuItem("Resume");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    try
                    {
                        // resume the speech synthesizer
                        synthesizer.resume();
                    }
                    catch(Exception e)
                    {
                        e.printStackTrace();
                        System.out.println("NOVUSWARE : ERROR! "
                                         + "Resume menu item action." + e);
                    }

                    System.out.println("NOVUSWARE : "
                                     + "Resume menu item action performed.");
                }
            };
        resumeMenuItem.setMnemonic('r');
        resumeMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Resume menu item created.");
    }
    return resumeMenuItem;
}

The action listener for the “Stop” menu item contains the call to the cancel() method of the Synthesizer object. This halts any speech synthesis processing currently underway by the FreeTTS speech synthesis engine.

private JMenuItem getStopMenuItem()
{
    if (stopMenuItem == null)
    {
        stopMenuItem = new JMenuItem("Stop");
        myActionListener =
            new ActionListener()
            {
                public void actionPerformed(ActionEvent ae)
                {
                    try
                    {
                        synthesizer.cancel();

                        System.out.println("NOVUSWARE : "
                                         + "Stop menu item action performed.");
                    }
                    catch(Exception e)
                    {
                        e.printStackTrace();
                        System.out.println("NOVUSWARE : ERROR! "
                                         + "Stop menu item action." + e);
                    }
                }
            };
        stopMenuItem.setMnemonic('t');
        stopMenuItem.addActionListener(myActionListener);

        System.out.println("NOVUSWARE : Stop menu item created.");
    }
    return stopMenuItem;
}

Finally, the VoicePad class contains the main() method, which will instantiate an object of the VoicePad class and invoke the setVisible(true) method to display the text editor application.

// execute voicepad application
public static void main(String argv[])
{
    VoicePad voicePad = new VoicePad();
    voicePad.setVisible(true);
}

The entire source code of the VoicePad application can be viewed here.

Compiling and Running the VoicePad Application

We need to have a few software components installed and properly configured before we compile and execute the VoicePad application.

FreeTTS Speech Synthesis System

We will use the FreeTTS 1.1.2 JSAPI-compliant speech synthesis engine for our VoicePad application. The FreeTTS engine can be downloaded from FreeTTS on SourceForge.net.

Follow the instructions to install and configure the speech synthesis engine. Verify that the sample “HelloWorld” program runs with the FreeTTS engine before proceeding to compile the VoicePad application.

If you encounter any problems (normally related to either CLASSPATH or FreeTTS-speech-engine configuration), please refer to the FreeTTS manual and troubleshooting guide. Once the test applications work fine with the FreeTTS speech engine, you can proceed to compile and run the VoicePad application.

If you wish to use another speech synthesis engine, you need to install and configure it and ensure that it runs properly before using it in the VoicePad application. You will also need to change the environment-setting script so that the necessary library files are used.

JDK 1.4

Download and install JDK 1.4. JDK 1.4 is preferred because of advantages such as improved I/O. The FreeTTS speech synthesis system also lists JDK 1.4 as a requirement.

Edit the setEnv.bat script to define your environment-specific JAVA_HOME and SPEECH_SYNTHESIS_HOME directories. Executing this script on the console will set up the PATH, CLASSPATH, and other required environment settings. Run the command:

$ java -version

to verify that the JDK 1.4 is being used.
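The article does not list setEnv.bat itself. On a Unix-style shell (matching the $ prompts used here), the equivalent settings might look like the sketch below; the directory paths, the SPEECH_SYNTHESIS_HOME variable, and the jar names are assumptions based on a typical FreeTTS 1.1.2 installation, so adjust them to match yours:

```shell
# Hypothetical environment setup -- adjust the paths to your installation.
export JAVA_HOME=/usr/java/j2sdk1.4.2
export SPEECH_SYNTHESIS_HOME=/opt/freetts-1.1.2

export PATH="$JAVA_HOME/bin:$PATH"

# jsapi.jar (built during FreeTTS installation) and the FreeTTS engine
# jars must all be on the classpath, along with the current directory.
export CLASSPATH=".:$SPEECH_SYNTHESIS_HOME/lib/jsapi.jar:$SPEECH_SYNTHESIS_HOME/lib/freetts.jar"
```

Remember that the synthesizer also needs a speech.properties file in user.home or java.home/lib, as the error message in initSpeechSynthesisEngine() indicates.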

Compile the VoicePad application using the JDK 1.4 compiler using the command:

$ javac -d . VoicePad.java

If you encounter any PATH or CLASSPATH problems, check the setEnv.bat script for incorrect settings. After the source code compiles without any problems, run the VoicePad text editor using the command:

$ java com.novusware.speech.example.VoicePad

Figure 6. The VoicePad application screen

You should see the application screen come up, as shown in Figure 6. Now we can verify that both text editing and speech capabilities work properly.

First, try the menu options of the File menu. Open an existing text file, or create a new file and save the changes to it. Take a look at the console from which the VoicePad application was executed. You should see a trace of the methods that were executed, which gives an idea of the processing sequence of the VoicePad application.

Now you are ready to hear the application speak to you. Open a file in the text editor. Select the “Play” option from the “Speech” menu. The VoicePad application will process the contents of the file opened in the editor and speak the contents using the FreeTTS speech synthesis engine. Next, select the “Pause” and “Resume” menu options to simulate pausing and resuming the speech output.

This concludes our initial exploration of the Java Speech API for speech synthesis.

Summary

The Java Speech API provides a simple and elegant way to integrate speech capability into Java applications. The VoicePad speech-enabled text editor gave an idea of how easy it is to integrate speech synthesis capability into a Java application. In the next article, we will cover the speech recognition support in the Java Speech API and discuss the application areas for integrating speech capability using the Java Speech API. The next article will also cover the support APIs -- the Java Speech Markup Language (JSML) and the Java Speech Grammar Format (JSGF).

Mandar S. Chitnis is a co-founder of Novusware inc.

Lakshmi Ananthamurthy is a co-founder of Novusware inc.



Copyright © 2009 O'Reilly Media, Inc.