Roomba, I command thee! The author demonstrates voice control of a domestic robot vacuum using an Arduino and Raspberry Pi.

It’s an old saying that computers aren’t really smart because they only do what we tell them to do. But now you can actually tell your Raspberry Pi to do something smart: take control of connected devices in your home using only your voice. It’s not hard, either, thanks to some open source software. Just add an Arduino with an infrared (IR) LED and you can tell your Roomba what to do, too.

Shall We Say a Game?

This project is made possible by years of research by scores of scientists, engineers, and linguists around the world working to enable real-time voice recognition that can run on modest hardware — the sort of advances that have brought us Siri on Apple devices and the voice recognition capabilities built into Google’s Android.

Specifically, I’ll show you how to use an open source speech recognition toolkit from the Speech Group at Carnegie Mellon University called PocketSphinx, designed for use in embedded systems. This amazing library allows us to delegate the heavy lifting of turning sound into text, so we can focus on implementing higher-level logic to turn the text into meaningful actions, like commanding the lighting systems in your home and even your Roomba vacuum cleaner. It also allows us, as Makers, to get under the hood and experiment with aspects of a speech recognition engine that are usually reserved for those implementing the engines or studying them in academia.

Project Steps

PREPARING THE PI

For about $35, the Raspberry Pi, a credit-card-sized computer, has enough power to translate your spoken commands into computer commands for home automation and robotics.

Before you can use the Raspberry Pi for speech recognition, you need to make sure you’re able to record sound from your USB microphone. The Raspbian distribution is configured to use the Advanced Linux Sound Architecture (ALSA) for sound. Despite its name, this framework is quite mature and will work with many software packages and sound cards without alteration or configuration difficulties. For the specific case of the Pi, there are a few tweaks we need to make to ensure that the USB card is preferred over the built-in audio.

First plug in your USB headset and power on the Pi. Once everything has finished loading, you can check that your USB audio is detected and ready to use by running the aplay -L command. This should display the name of your card such as in our example: Logitech H570e Stereo, USB Audio. If your USB sound card appears in this list then you can move on to making it the default for the system. To do this, use your favorite text editor to edit the alsa-base.conf file like so:

sudo nano /etc/modprobe.d/alsa-base.conf

You need to change the line options snd-usb-audio index=-2 to options snd-usb-audio index=0 and add a line below it with options snd_bcm2835 index=1. Once done, save the file and sudo reboot the Pi to use the new configuration.
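
After the edit, the relevant lines of alsa-base.conf should read:

options snd-usb-audio index=0
options snd_bcm2835 index=1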

To test your changes, use the arecord -vv --duration=7 -fdat ~/test.wav command to record a short 7-second piece of audio from the microphone. Try playing it back with aplay ~/test.wav and you should hear what you recorded earlier through your USB headphones. If not, try playing back a prerecorded sound such as aplay /usr/share/sounds/alsa/Front_Center.wav to determine if the issue lies with your microphone or speakers. (The internet will be a great help when troubleshooting these issues.)

If you heard your recording, great! You’re ready to move on to the software setup.

COMPILING THE PREREQUISITE SOFTWARE

As we are standing on the shoulders of software giants, there are a few packages to install and several pieces of software to compile before your Pi is ready to host the voice control software.

First, go get the packages required for SphinxBase by executing:

sudo apt-get install libasound2-dev autoconf libtool bison \
swig python-dev python-pyaudio

You’ll also need to install some Python libraries for use with our demo application. To do this, install the Python pip tool and then use it to pull in the libraries:

curl -O https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install gevent grequests

TIP: If your connection to the Pi is a bit flaky and prone to disconnects, you can save yourself some heartache by running these commands in a screen session. To do so, run the following before continuing.

sudo apt-get install screen
screen -DR sphinx

If at any stage you get disconnected from your Pi (and it hasn’t restarted) you can run screen -DR sphinx again to reconnect and continue where you left off.

OBTAINING THE SPHINX TOOLS

Now you can go about getting the SphinxBase package, which is used by PocketSphinx as well as other software in the CMU Sphinx family.

To obtain SphinxBase execute the following commands:

git clone git://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
git checkout 3b34d87
./autogen.sh
make

(At this stage you may want to go make coffee …)

sudo make install
cd ..

You’re ready to move on to PocketSphinx. To obtain PocketSphinx, execute the following commands:

git clone git://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
git checkout 4e4e607
./autogen.sh
make

(Time for a second cup of coffee …)

sudo make install
cd ..

To update the system with your new libraries, run sudo ldconfig.

TESTING THE SPEECH RECOGNITION

Now that you have the building blocks of your speech recognition in place, you’ll want to test that it actually works before continuing.

Now you can run a test of PocketSphinx using pocketsphinx_continuous -inmic yes.

You should see something like the following, which indicates the system is ready for you to start speaking:

Listening...
Input overrun, read calls are too rare (non-fatal)

You can safely ignore the warning. Go ahead and speak!

When you’re finished, you should see some technical information along with PocketSphinx’s best guess as to what you said, and then another READY prompt letting you know it’s ready for more input.

INFO: ngram_search.c(874): bestpath 0.10 CPU 0.071 xRT
INFO: ngram_search.c(877): bestpath 0.11 wall 0.078 xRT
what
READY....

At this point, speech recognition is up and running. You’re ready to move onto the real fun of creating your custom voice control application!

CONTROL ALL THE THINGS

For our demo application, I’ve programmed the system to control three separate targets: Philips Hue lighting, Insteon lighting, and an iRobot Roomba robot vacuum cleaner. With the first two, you’ll communicate via a network-connected bridge or hub. For the third, you’ll communicate with an Arduino over a USB serial connection, and the Arduino will translate your commands into infrared (IR) signals that emulate a Roomba remote control.

If you just want to dive into the demo application and try it out, you can use the following commands to retrieve the Python source code and run it on the Raspberry Pi:

git clone https://github.com/bynds/makevoicedemo
cd makevoicedemo
python main.py

At this stage you should have a prompt on the screen telling you that your Pi is ready for Input. Try saying one of the commands — “Turn on the kitchen light” or “Turn off the bedroom light” — and watch the words as they appear on the screen. As we haven’t yet set up the configuration.json file, the kitchen light should still be off.

USING POCKETSPHINX

There are several modes that you can configure for PocketSphinx. For example, it can be asked to listen for a specific keyword (it will attempt to ignore everything it hears except the keyword), or it can be asked to use a grammar that you specify (it will try to fit everything it hears into the confines of the grammar). We are using the grammar mode in our example, with a grammar designed to capture all the commands we’ll be using. The grammar file is specified in JSGF (JSpeech Grammar Format), which has a powerful yet straightforward syntax for specifying the speech it expects to hear in terms of simple rules.

In addition to the grammar file, you’re going to need three more files in order to use PocketSphinx in our application: a dictionary file which will define words in terms of how they sound, a language model file which contains statistics about the words and their order, and an acoustic model which is used to determine how audio correlates with the sounds in words. The grammar file, dictionary, and language model will all be generated specifically for our project, while the acoustic model will be a generic model for U.S. English.

GENERATING THE DICTIONARY

In order to generate our dictionary, we will be making use of lmtool, a web-based tool hosted by CMU specifically for quickly generating these files. The input to lmtool is a corpus file containing all or most of the sentences you would like to be able to recognize. In our simple use case, we have the following sentences in our corpus:

turn on the kitchen light
turn off the kitchen light
turn on the bedroom light
turn off the bedroom light
turn on the roomba
turn off the roomba
roomba clean
roomba go home

You can type these into a text editor and save the file as corpus.txt, or you can download a ready-made version from the GitHub repository.

Now that you have your corpus file, head over to lmtool. Click the Browse button, which will bring up a dialog box that lets you select the corpus file you just created.

Then click the Compile Knowledge Base button. You’ll be taken to a page with links to download the result. You can either download the compressed .tgz file, which contains all the files generated, or simply download the .dic file labeled Pronunciation Dictionary. Copy this file to the same makevoicedemo directory that was created on the Pi earlier. You can rename the file with the command mv *.dic dictionary.dic to make it easier to work with.

While you’re at it, download the prebuilt acoustic model from the CMU Sphinx SourceForge page. Once you’ve moved it to the makevoicedemo directory, extract it with:

tar -xvf cmusphinx-en-us-ptm-5.2.tar.gz

CREATING THE GRAMMAR FILE

As I mentioned earlier, everything that PocketSphinx hears, it will try to fit into the words of the grammar. Check out how the JSGF format is described in the W3C note. The file starts with a declaration of the format, followed by a declaration of the grammar name. We simply called ours “commands.”

We have chosen to use three main rules: an action, an object, and a command. For each rule, you define “tokens,” which are what you expect the user to say. For example, the two tokens for our <action> rule are TURN ON and TURN OFF. We therefore represent the rule as:


<action> = TURN ON |
           TURN OFF ;

Similarly, we define the <object> rule as:

<object> = KITCHEN LIGHT |
           BEDROOM LIGHT |
           ROOMBA ;

Finally, to demonstrate that we can nest rules or create them with explicit tokens, we define a command as:

public <command> = <action> THE <object> |
                   ROOMBA CLEAN |
                   ROOMBA GO HOME ;

Notice the public keyword in front of the <command>. This allows us to use the <command> rule by importing it into other grammar files in the future.
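
Putting those pieces together, the complete grammar.jsgf file should look something like this:

#JSGF V1.0;

grammar commands;

<action> = TURN ON |
           TURN OFF ;

<object> = KITCHEN LIGHT |
           BEDROOM LIGHT |
           ROOMBA ;

public <command> = <action> THE <object> |
                   ROOMBA CLEAN |
                   ROOMBA GO HOME ;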

INITIALIZING THE DECODER

We are using Python as our programming language because it is easy to read, powerful, and thanks to the foresight of the PocketSphinx developers, it’s also very easy to use with PocketSphinx.

The main workhorse when recognizing speech with PocketSphinx is the decoder. In order to use the decoder we must first set a config for the decoder to use.

from pocketsphinx import *

hmm = 'cmusphinx-5prealpha-en-us-ptm-2.0/'
dic = 'dictionary.dic'
grammar = 'grammar.jsgf'

config = Decoder.default_config()
config.set_string('-hmm', hmm)
config.set_string('-dict', dic)
config.set_string('-jsgf', grammar)

Once this is done, initializing a decoder is as simple as decoder = Decoder(config).

For the example application, we’re using the pyAudio library to get the user’s speech from the microphone for processing by PocketSphinx. The specifics of this library are less important for our purposes (investigating speech recognition) and we will therefore simply take it for granted that pyAudio works as advertised.

The specifics of obtaining the decoder’s text output are a bit involved, but the basic process can be distilled into the following steps.

# Start an 'utterance'
decoder.start_utt()
# Process a soundbite
decoder.process_raw(soundBite, False, False)
# End the utterance when the user finishes speaking
decoder.end_utt()
# Retrieve the hypothesis (for what was said)
hypothesis = decoder.hyp()
# Get the text of the hypothesis
bestGuess = hypothesis.hypstr
# Print out what was said
print 'I just heard you say: "{}"'.format(bestGuess)

Those interested in learning more about the gritty details of this process should turn their attention to the pocketSphinxListener.py code from the example project.
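
For a self-contained illustration of how those steps fit together with pyAudio, here is a minimal sketch. It reuses the configuration values shown earlier, assumes a 16kHz mono stream from the default (USB) microphone, and simply records for about three seconds rather than detecting when you stop speaking, which the full example project handles more gracefully:

import pyaudio
from pocketsphinx import *

# Build the decoder configuration exactly as shown above
config = Decoder.default_config()
config.set_string('-hmm', 'cmusphinx-5prealpha-en-us-ptm-2.0/')
config.set_string('-dict', 'dictionary.dic')
config.set_string('-jsgf', 'grammar.jsgf')
decoder = Decoder(config)

# Open the microphone: 16-bit, mono, 16kHz is what the acoustic model expects
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=1024)

print 'Say a command...'
decoder.start_utt()
for i in range(47):  # roughly 3 seconds of 1,024-sample chunks at 16kHz
    soundBite = stream.read(1024)
    decoder.process_raw(soundBite, False, False)
decoder.end_utt()

hypothesis = decoder.hyp()
if hypothesis is not None:
    print 'I just heard you say: "{}"'.format(hypothesis.hypstr)

stream.stop_stream()
stream.close()
pa.terminate()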

There are a lot of different configuration parameters that you can experiment with, and as previously mentioned, other modes of recognition to try. For instance, investigate the -allphone_ci PocketSphinx configuration option and its impact on decoding accuracy. Or try keyword spotting for activating a light. Or try a statistical language model, like the one that was generated when we were using the lmtool earlier, instead of a grammar file. As a practitioner you can experiment almost endlessly to explore the fringes of what’s possible. One thing you’ll quickly notice is that PocketSphinx is an actively developed research system and this will sometimes mean you need to rewrite your application to match the new APIs and function names.
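
As a starting point for that experimentation, here is a sketch of how those alternatives might be set on the config object created earlier (before constructing the Decoder). The file names keyphrase.list and languagemodel.lm are hypothetical placeholders, and the decoder expects only one search mode at a time, so choose a grammar, a keyword list, or a statistical language model rather than combining them:

# Investigate context-independent phone decoding and its effect on accuracy
config.set_boolean('-allphone_ci', True)

# Keyword spotting: listen only for trigger phrases listed one per line in a
# file, each followed by a detection threshold, for example:
#   turn on the kitchen light /1e-20/
config.set_string('-kws', 'keyphrase.list')

# Or swap the JSGF grammar for the statistical language model (.lm file)
# that lmtool generated alongside the dictionary
config.set_string('-lm', 'languagemodel.lm')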

Now that we’ve covered what’s required to turn speech into text let’s do something interesting with it! In the next section we will look at some rudimentary communications with the Insteon and Philips Hue networked lights.

LET THERE BE GET /LIGHTS HTTP/1.1

Over the years, countless systems have been designed, built, and deployed to turn the humble light bulb on and off. The Insteon and Philips Hue systems both have this capability, and much more besides. Both speak over wireless protocols, with Insteon having the added advantage of also communicating over a house’s power lines. Communicating directly with the bulbs in either system would make for some epic hacks; for the time being, however, we’ve set our sights a little lower and settled for communicating through a middleman.

An Insteon “hub” for home automation
A Philips Hue “bridge” for home automation

Both systems come with a networked “hub” or “bridge” that performs the work required to allow devices on the network, such as smartphones with the respective apps installed, to send commands to the lights.

It turns out both of these systems also have HTTP-based APIs that we can use for our example voice-controlled system. And both companies have developer programs you can join to take full advantage of the APIs:
» Philips Hue Developer Program
» Insteon Developer Program

For those who like to earn their knowledge a little more “guerrilla style,” there are plenty of resources online that explain the basics of communicating with the hubs from both of these manufacturers. Many tinkerers with published web articles on the subject learned their secrets through the age-old skills of careful inspection, network analysis, and reverse engineering. These skills are invaluable to a Maker, and both systems offer plenty of opportunities to practice them.

The example project has a minimal set of Python commands that can be used to communicate with both the older Insteon 2242-222 Hub and the current Philips Hue bridge to get you started.
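
To give a flavor of what those HTTP calls look like, here’s a minimal sketch that toggles a Philips Hue bulb through the bridge’s REST API using the requests library (pulled in automatically as a dependency of grequests earlier). The bridge address, API username, and light number are placeholders you’d replace with your own values; the Insteon hub uses different endpoints, which the example project’s code illustrates.

import requests

BRIDGE_IP = '192.168.1.10'   # placeholder: your Hue bridge's address
USERNAME = 'newdeveloper'    # placeholder: an API username registered with the bridge
LIGHT_ID = 1                 # placeholder: the numeric ID of the bulb to control

def set_hue_light(on):
    # The bridge accepts a JSON body PUT to the light's state endpoint
    url = 'http://{}/api/{}/lights/{}/state'.format(BRIDGE_IP, USERNAME, LIGHT_ID)
    response = requests.put(url, json={'on': on})
    print response.json()

set_hue_light(True)    # turn the bulb on
set_hue_light(False)   # and off again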

I COMMAND THEE ROBOT

Roomba, prepare to do my bidding!

To round out our menagerie of colorful devices we have also commandeered an Arduino Leonardo, which we have connected to an infrared LED and programmed to send commands to our iRobot Roomba robot vacuum cleaner.

The Arduino connects to the Raspberry Pi using a USB cable which both supplies power and allows for serial communications. We’re using the IRremote library to do the heavy lifting of blinking the IR LED with the appropriate precise timing.

The DIY infrared LED circuit connected to the Arduino. You could use a much smaller perfboard.
Schematic for the simple IR LED circuit. NOTE: On the Leonardo, the IRremote library expects the output pin for the transistor/LED to be Pin 13. This varies for other Arduino boards.

In order for the library to have something to communicate with, we need to connect some IR transmission circuitry to the appropriate pins on our Arduino. The hardware boils down to a transistor, an IR LED, and a resistor or two, which can be assembled on a breadboard or stripboard and connected to the Arduino. For our example build we tested both the SparkFun Max Power IR LED Kit (since discontinued) and a minimalist IR transmitter setup of a 330-ohm resistor, a 2N3904 NPN transistor, and a through-hole IR LED connected to the Arduino via three male headers.

We’re using the IRremote library to compose and send raw IR signals that emulate the Roomba’s remote control. These signals are a series of pulses whose timing encodes binary values. When prepared for transmission by the library, they look something like the following arrays from our example sketch:

// Clean
const unsigned int clean[15] = {3000,1000,1000,3000,1000,3000,1000,3000,3000,1000,1000,3000,1000,3000,1000};
// Power Button
const unsigned int power[15] = {3000,1000,1000,3000,1000,3000,1000,3000,3000,1000,1000,3000,3000,1000,1000};
// Dock
const unsigned int dock[15] = {3000,1000,1000,3000,1000,3000,1000,3000,3000,1000,3000,1000,3000,1000,3000};

In order for the Roomba to receive and understand these signals, I found it best to send them four times, which we do with the following sendRoombaCommand function.


void sendRoombaCommand(unsigned int* command){
  for (int i = 0; i < 4; i++){
    irsend.sendRaw(command, 15, 38);
    delay(50);
  }
}

By including the IRremote library in your Arduino code, you can compose and send raw infrared (IR) signals to command the Roomba.

Once the IR hardware has been connected to the appropriate pins and the sketch compiled and uploaded through the Arduino IDE, you’re able to send commands over the serial connection that will be translated into something the Roomba can understand. Binary truly is becoming the universal language!
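
How the Python side triggers those IR commands depends on the simple serial protocol you define in your sketch. As a purely hypothetical illustration, assuming you’ve installed the pySerial library (sudo pip install pyserial) and written your sketch to read a single character such as 'c' for clean (the example sketch’s own protocol may differ), sending a command from the Pi could look like this:

import time
import serial

# The Leonardo usually appears as /dev/ttyACM0 on the Pi; adjust if yours differs
arduino = serial.Serial('/dev/ttyACM0', 9600, timeout=1)
time.sleep(2)          # give the serial connection a moment to settle
arduino.write('c')     # hypothetical single-character command for "clean"
arduino.close()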

ALL TOGETHER NOW

So there you have it, a complete system to control Insteon lights, Philips Hue lights, and an iRobot Roomba with nothing but the sound of your voice!

Raspberry Pi (in black enclosure) + Arduino + infrared LED = complete hardware for yelling orders at your lights and robots.

This wouldn’t be possible without the generous contribution of the many open source projects we have used. This example project is also released as open source so that you can refer to it when implementing your own voice control projects on the Raspberry Pi and remix parts of it for your next project. We can’t wait to see what you come up with — please tell us what you make in the comments below.

Master of all devices within earshot!