This article appeared in Make: Vol. 91. Subscribe for more maker projects and articles!

Chat and command your own embedded-AI companion bot using local LLMs.

Imagine a fully autonomous robotic companion, like Baymax from Disney’s Big Hero 6 — a friendly, huggable mechanical being that can walk, hold lifelike interactive conversations, and, when necessary, fight crime. Thanks to the advent of large language models (LLMs), we’re closer to this science fiction dream becoming a reality — at least for lifelike conversations.

In this guide I’ll introduce Digit, a companion bot that I helped create with Jorvon Moss (@Odd_jayy). It uses a small LLM running locally on an embedded computer to hold conversations without the need for an internet connection.

I’ll also walk you through the process of running a similar, lightweight LLM on a Raspberry Pi so you can begin making your own intelligent companion bot.

What Is a Large Language Model (LLM)?

A large language model is a specific type of AI that can understand and generate natural, human-like text. The most popular example of an LLM right now is OpenAI’s ChatGPT, which is used to answer questions for the curious, automatically generate social media content, create code snippets and, to the chagrin of many English teachers, write term papers. LLMs are, in essence, the next evolution of chatbots.

LLMs are based on the neural network architecture known as a transformer. Like all neural networks, transformers have a series of tunable weights for each node to help perform the mathematical calculations required to achieve their desired task. A weight in this case is just a number — think of it like a dial in a robot’s brain that can be turned to increase or decrease the importance of some piece of information. In addition to weights, transformers have other types of tunable dials, known as parameters, that help convert words and phrases into numbers as well as determine how much focus should be given to a particular piece of information.

Instead of humans manually tuning these dials, imagine if the robot could tune them itself. That is the magic of machine learning: training algorithms adjust the values of the parameters (dials) automatically based on some goal set by humans. These training algorithms are just steps that a computer can easily follow to calculate the parameters. Humans set a goal and provide training data with correct answers to the training algorithms. The AI looks at the training data and guesses an answer. The training algorithm determines how far off the AI’s result is from the correct answer and updates the parameters in the AI to make it better next time. Rinse and repeat until the AI performs at some acceptable level.
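To make this loop concrete, here's a toy sketch in Python (my own illustration, not an LLM): it tunes a single "dial" w so that w times x matches some made-up training data, nudging the dial a little after every wrong guess.

# Toy training loop: tune one "dial" (weight w) so that w * x predicts y.
# The data below is made up for illustration; it was generated with w = 3.
data = [(1, 3), (2, 6), (3, 9), (4, 12)]

w = 0.0              # start with a bad guess
learning_rate = 0.01

for step in range(1000):
    for x, y in data:
        guess = w * x                     # the model makes a prediction
        error = guess - y                 # how far off was it?
        w -= learning_rate * error * x    # nudge the dial to do better next time

print(f"Learned w = {w:.2f} (the data was generated with w = 3)")

Real LLM training works the same way in spirit, just with billions of dials and far cleverer bookkeeping.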

To give you an idea of complexity, a machine learning model that can read only the handwritten digits 0 through 9 with about 99% accuracy requires around 500,000 parameters. Comprehending and generating text are vastly more complicated. LLMs are trained on large quantities of human-supplied text, such as books, articles, and websites. The main goal of LLMs is to predict the next word in a sequence given a long string of previous words. As a result, the AI must understand the context and meaning of the text. To achieve this, LLMs are made up of massive numbers of parameters. OpenAI's GPT-4, released in March 2023, is reportedly built from eight separate models, each containing around 220 billion parameters — about 1.7 trillion total.

Why a Local LLM?

The billions of calculations needed for ChatGPT to hold a simple conversation require lots of large, internet-connected servers. If you tried to run ChatGPT on your laptop, assuming you even had enough memory to store the model, it would take hours or days to get a response! In most cases, relying on servers to do the heavy lifting is perfectly acceptable. After all, we use copious cloud services as consumers, such as video streaming, social media, file sharing, and email.

However, running an LLM locally on a personal computer might be enticing for a few reasons:

  • Maybe you require access to your AI in areas with limited internet access, such as remote islands, tropical rainforests, underwater, underground caves, and most technology conferences!
  • By running the LLM locally, you can also reduce network latency — the time it takes for packets to travel to and from servers. That said, the extra computing power of the servers often makes up for that latency on complex tasks like LLMs.
  • Additionally, you can assume greater privacy and security for your data, which includes the prompts, responses, and model itself, as it does not need to leave the privacy of your computer or local network. If you’re an AI researcher developing the next great LLM, you can better protect your intellectual property by not exposing it to the outside.
  • Personal computers and home network servers are often smaller than their corporate counterparts used to run commercial LLMs. While this might limit the size and complexity of your LLM, it often means reduced costs for such operations.
  • Finally, most commercial LLMs contain a variety of guardrails and limits to prevent misuse. If you need an LLM to operate outside of commercial limits — say, to inject your own biases to help with a particular task, such as creative writing — then a local LLM might be your only option.

Thanks to these benefits, local LLMs can be found in a variety of settings, including healthcare and financial systems that must protect user data, industrial systems in remote locations, and some autonomous vehicles that interact with the driver without the need for an internet connection. While these commercial applications are compelling, we should focus on the real reason for running a local LLM: building an adorable companion bot that we can talk to.

Introducing Digit

Jorvon Moss’s robotic designs have improved and evolved since his debut with Dexter (Make: Volume 73), but his vision remains constant: create a fully functioning companion robot that can walk and talk. In fact, he often cites Baymax as his goal for functionality. In recent years, Moss has drawn upon insects and arachnids for design inspiration. “I personally love bugs,” he says. “I think they have the coolest design in nature.”

Digit’s body consists of a white segmented exoskeleton, similar to a pill bug’s, that protects the sensitive electronics. The head holds an LED array that can express emotions through a single, animated “eye” along with a set of light-up antennae and controllable mandibles. It sits on top of a long neck that can be swept to either side thanks to a servomotor. Digit’s legs cannot move on their own but can be positioned manually.

Like other companion bots, Digit can be perched on Moss’s shoulder to catch a ride. A series of magnets on Digit’s body and feet help keep it in place.

Courtesy of Jorvon Moss

But Digit stands apart from Moss's other companion bots thanks to its advanced brain — an LLM running locally on an Nvidia Jetson Orin Nano embedded computer. Digit is capable of understanding human speech (English for now), generating a text response, and speaking that response aloud — without the need for an internet connection. To help maintain Digit's relatively small size and weight, the Jetson Orin Nano was mounted on a wooden slab along with an LCD for startup and debugging. Moss totes both the Orin Nano and the appropriate battery in a backpack. You could design your own companion bot differently to house the Orin Nano inside.

Courtesy of DigiKey

How Digit’s Brain Works

I helped Moss design and program the software system that acts as Digit's AI brain. This system consists of three main components: a service running the LLM, a service running the text-to-speech system, and a client program that interacts with these two services.

Courtesy of DigiKey

The client, called hopper-chat, controls everything. It continuously listens for human speech from a microphone and converts everything it hears using the Alpha Cephei Vosk speech-to-text (STT) library. Any phrases it hears are compared to a list of wake words/phrases, similar to how you might say “Alexa” or “Hey, Siri” to get your smart speaker to start listening. For Digit, the wake phrase is, unsurprisingly, “Hey, Digit.” Upon hearing that phrase, any new utterances are converted to text using the same Vosk system.
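To make that wake-word loop less abstract, here's a stripped-down sketch using Vosk and the sounddevice library (my simplified illustration, not the actual hopper-chat code; it assumes a USB microphone and the small English Vosk model):

import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

WAKE_PHRASE = "hey digit"

audio_q = queue.Queue()
model = Model(lang="en-us")                 # small English speech-to-text model
recognizer = KaldiRecognizer(model, 16000)  # expects 16kHz audio

def callback(indata, frames, time, status):
    audio_q.put(bytes(indata))              # hand raw audio to the main loop

with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    print("Listening for the wake phrase...")
    while True:
        if recognizer.AcceptWaveform(audio_q.get()):
            text = json.loads(recognizer.Result()).get("text", "")
            if WAKE_PHRASE in text:
                print("Wake phrase heard! Grab the next utterance here.")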

The newly generated text is then sent to the LLM service. This service is a Docker container running Ollama, an open-source tool for running LLMs. In this case, the LLM is Meta’s Llama3:8b model with 8 billion parameters. While not as complex as OpenAI’s ChatGPT-4, it still has impressive conversational skills. The service sends the response back to the hopper-chat client, which immediately forwards it to the text-to-speech (TTS) service.
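If you want to stand up a similar LLM service yourself, Ollama publishes an official Docker image. Something along these lines (a sketch; Jetson GPU passthrough flags are extra and not shown) starts a container listening on the usual port 11434 and pulls the same model:

$ docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
$ docker exec -it ollama ollama pull llama3:8b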

TTS for hopper-chat is a service running Rhasspy Piper that encapsulates the en_US-lessac-low model, a neural network trained to produce sounds when given text. In this case, the model is specifically trained to produce English words and phrases in an American dialect. The “low” suffix indicates that the model is low quality — smaller size, more robotic sounds, but faster execution. The hopper-chat program plays any sounds it receives from the TTS service through a connected speaker.
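Piper is also easy to try on its own from the command line. Assuming you have the piper binary installed and have downloaded the en_US-lessac-low voice files, something like this turns a sentence into a WAV file you can play:

$ echo "Hello, I am Digit" | piper --model en_US-lessac-low.onnx --output_file hello.wav
$ aplay hello.wav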

On Digit, the microphone is connected to a USB port on the Orin Nano and simply draped over a backpack strap. The speaker is connected via Bluetooth. Moss uses an Arduino to monitor activity in the Bluetooth speaker and move the mandibles during activity to give Digit the appearance of speaking.

Moss added several fun features to give Digit a distinct personality. First, Digit tells a random joke, often a bad pun, every minute if the wake phrase is not heard. Second, Moss experimented with various default prompts to entice the LLM to respond in particular ways. This includes making random robot noises when generating a response and adopting different personalities, from helpful to sarcastic and pessimistic.

Courtesy of Jorvon Moss

Agency: From Text to Action

The next steps for Digit involve giving it a form of self-powered locomotion, such as walking, and having the LLM perform actions based on commands. On their own, LLMs cannot perform actions. They simply generate text responses based on input. However, adjustments and add-ons can be made that allow such systems to take action. For example, ChatGPT already has several third-party plugins that can perform actions, such as fetching local weather information. The LLM recognizes the intent of the query, such as, “What’s the weather like in Denver, Colorado?” and makes the appropriate API call using the plugin. (We’ll look at some very recent developments in function calling, below.)

At the moment, Digit can identify specific phrases using its STT library, but the recorded phrase must exactly match the expected phrase. For example, you couldn’t say “What’s the weather like?” when the expected phrase is “Tell me the local weather forecast.” A well-trained LLM, however, could infer that intention. Moss and I plan to experiment with Ollama and Llama3:8b to add such intention and command recognition.

The code for hopper-chat is open source and can be found on GitHub. Follow along with us as we make Digit even more capable.

DIY Robot Makers

Science fiction is overflowing with shiny store-bought robots and androids created by mega corporations and the military. We’re more inspired by the DIY undercurrent — portrayals of solo engineers cobbling together their own intelligent and helpful companions. We’ve always believed this day would come, in part because we’ve seen it so many times on screen. —Keith Hammond

  • Dr. Tenma and Toby/Astro (Astro Boy manga, anime, and films, 1952–2014)
  • J.F. Sebastian and his toys Kaiser and Bear (Blade Runner, 1982)
  • Wallace and his Techno-Trousers (Wallace & Gromit: The Wrong Trousers, 1993)
  • Anakin Skywalker and C-3PO (Star Wars: The Phantom Menace, 1999)
  • Sheldon J. Plankton and Karen (SpongeBob SquarePants, 1999–2024)
  • Dr. Heinz Doofenshmirtz and Norm (Phineas and Ferb, 2008–2024)
  • The Scientist and 9 (9, 2009)
  • Charlie Kenton and Atom (Real Steel, 2011)
  • Tadashi Hamada and Baymax (Big Hero 6, 2014)
  • Simone Giertz and her Shitty Robots (YouTube, 2016–2018)
  • Kuiil and IG-11 (The Mandalorian, 2019)
  • Finch and Jeff (Finch, 2021)
  • Brian and Charles (Brian and Charles, 2022)

Roll Your Own Local LLM Chatbot

I will walk you through the process of running an LLM on a Raspberry Pi. I specifically chose the Raspberry Pi 5 due to its increased computational power. LLMs are notoriously complex, so earlier versions of the Pi might need several minutes to produce an answer, even from relatively small LLMs. My Pi 5 had 8GB RAM; these LLMs may not run with less.


Project Steps

1. Set up the Pi 5 with Ollama

Follow the official Raspberry Pi Getting Started guide to install the latest Raspberry Pi OS (64-bit) and configure your Raspberry Pi. You should use an SD card with at least 16 GB.
Once you have booted into your Raspberry Pi, make sure you are connected to the internet and open a terminal window. Enter the following commands to update the Pi and install Ollama:

$ sudo apt update
$ sudo apt upgrade
$ curl -fsSL https://ollama.com/install.sh | sh

Next, start the Ollama service:

$ ollama serve

You might see a message that says “Error: listen tcp 127.0.0.1:11434: bind: address already in use.” Ignore this, as it just indicates that Ollama is already running as a service in the background.
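If you'd like to confirm the service is reachable, Ollama answers plain HTTP requests on its default port, 11434; a quick check from the same terminal should return a short "Ollama is running" message:

$ curl http://localhost:11434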

2. Try out TinyLlama

Meta's Llama models are almost open source, with some caveats for commercial usage. AI researcher Zhang Peiyuan started the TinyLlama project in September 2023. TinyLlama is a truly open-source (Apache-2.0 license), highly optimized LLM with only 1.1 billion parameters. It is based on the Llama 2 architecture and can generate responses quite quickly. It's not as accurate as the newer generation of small LLMs, such as Llama3, but it will run on hobby-level hardware like our Pi 5.

Download the latest version of TinyLlama with ollama:

$ ollama pull tinyllama

Run an interactive shell to chat with TinyLlama:

$ ollama run tinyllama

You should be presented with a prompt. Try asking the AI a question or have it tell you a joke.

Courtesy of Shawn Hymel

Press Ctrl+D or enter /bye to exit the interactive shell.
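The interactive shell is just one front end. The same model is also reachable through Ollama's local HTTP API on port 11434, which is what the Python library in the next step wraps. As a quick sketch with curl (with tinyllama already pulled):

$ curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Tell me a short joke.",
  "stream": false
}'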

3. Set up the Ollama Python package

By default, Ollama runs as a background server and exposes port 11434. You can communicate with that service by making HTTP requests. To make life easier, Ollama maintains a Python library that communicates directly with that locally running service. Create a virtual environment and install the package:

$ python -m venv venv-ollama --system-site-packages
$ source venv-ollama/bin/activate
$ python -m pip install ollama==0.3.3

Open a new document:

$ nano tinyllama-client.py

Enter the following Python code:

import ollama
# Settings
prompt = "You are a helpful assistant. Tell me a joke. " \
    "Limit your response to 2 sentences or fewer."
model = "tinyllama"

# Configure the client
client = ollama.Client(host="http://0.0.0.0:11434")
# The message history is an array of prompts and responses
messages = [{
    "role": "user",
    "content": prompt
}]

# Send prompt to Ollama server and save the response
response = client.chat(
    model=model,
    messages=messages,
    stream=False
)

# Print the response
print(response["message"]["content"])

Close the file by pressing Ctrl+X, press Y when asked to save the document, and press Enter.

4. Chat with your LLM bot!

Run the Python script by entering:

$ python tinyllama-client.py

TinyLlama can take some time to generate a response, especially on a small computer like the Raspberry Pi — 30 seconds or more — but here you are, chatting locally with an AI!

Courtesy of Shawn Hymel

This should give you a sense of how to run local LLMs on a Raspberry Pi and interact with them using Python. Feel free to try different prompts, save the chat history using the append() method, and build your own voice-activated chatbot.
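For example, a minimal way to keep a multi-turn conversation going is to append each reply to the messages list before sending the next prompt (a sketch building on the same script as above):

# Remember what the model said, then ask a follow-up question
messages.append(response["message"])        # the assistant's previous reply
messages.append({
    "role": "user",
    "content": "Now explain that joke in one sentence."
})

follow_up = client.chat(model=model, messages=messages, stream=False)
print(follow_up["message"]["content"])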

Local LLM Chatbot with Function Calling

LLMs have traditionally been self-contained models that accept text input and respond with text. In the last couple of years, we've seen multimodal LLMs like GPT-4o enter the scene that can accept and respond with other forms of media, such as images and audio.

But in just the past few months, some LLMs have been granted a powerful new ability — to call arbitrary functions — which opens a huge world of possible AI actions. ChatGPT and Ollama both call this ability tools. To enable such tools, you must describe each function in a Python dictionary, spelling out its purpose and available parameters. The LLM tries to figure out what you're asking and maps that request to one of the available tools/functions. We then parse the response before calling the actual function.

Let's demonstrate this concept with a simple function that turns an LED on and off. Connect an LED with a current-limiting resistor to GPIO 17 on your Raspberry Pi 5.

Courtesy of Shawn Hymel / Fritzing

Make sure you’re in the venv-ollama virtual environment we configured earlier and install some dependencies:

$ source venv-ollama/bin/activate
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install -y libportaudio2
$ python -m pip install ollama==0.3.3 vosk==0.3.45 sounddevice==0.5.0

You’ll need to download a new LLM model and the Vosk speech-to-text (STT) model:

$ ollama pull allenporter/xlam:1b
$ python -c "from vosk import Model; Model(lang='en-us')"

As this example uses speech-to-text to convey information to the LLM, you will need a USB microphone, such as Adafruit 3367. With the microphone connected, run the following command to discover the USB microphone device number:

$ python -c "import sounddevice; print(sounddevice.query_devices())"

You should see an output such as:

  0 USB PnP Sound Device: Audio (hw:2,0), ALSA (1 in, 0 out)
  1 pulse, ALSA (32 in, 32 out)
* 2 default, ALSA (32 in, 32 out)

Note the device number of the USB microphone. In this case, my microphone is device number 0, as given by USB PnP Sound Device. Copy this code to a file named ollama-light-assistant.py on your Raspberry Pi.

You can also download this file directly with the command:

$ wget https://gist.githubusercontent.com/ShawnHymel/16f1228c92ad0eb9d5fbebbfe296ee6a/raw/6161a9cb38d3f3c4388a82e5e6c6c58a150111cc/ollama-light-assistant.py

Open the code and change the AUDIO_INPUT_INDEX value to your USB microphone device number. For example, mine would be:

AUDIO_INPUT_INDEX = 0

Run the code with:

$ python ollama-light-assistant.py

You should see the Vosk STT system boot up and then the script will say “Listening…” At that point, try asking the LLM to “turn the light on.” Because the Pi is not optimized for LLMs, the response could take 30–60 seconds. With some luck, you should see that the led_write function was called, and the LED has turned on!

Courtesy of Shawn Hymel

The xLAM model is an open-source LLM developed by the Salesforce AI Research team. It is trained and optimized to understand requests rather than necessarily provide text-based answers to questions. The allenporter version has been modified to work with Ollama tools. The 1-billion-parameter model can run on the Raspberry Pi, but as you probably noticed, it is quite slow and misinterprets requests easily.

For an LLM that better understands requests, I recommend the Llama3.1:8b model. In the command console, download the model with:

$ ollama pull llama3.1:8b

Note that the Llama 3.1:8b model is almost 5 GB. If you’re running out of space on your flash storage, you can remove previous models. For example:

$ ollama rm tinyllama

In the code, change:

MODEL = "allenporter/xlam:1b"

to:

MODEL = "llama3.1:8b"

Run the script again. You’ll notice that the model is less picky about the exact phrasing of the request, but it takes much longer to respond — up to 3 minutes on a Raspberry Pi 5 (8GB RAM).
When you are done, you can exit the virtual environment with the following command:

$ deactivate

A Closer Look at Ollama Tools

Let’s take a moment to discuss how tools work in Ollama. Feel free to open the ollama-light-assistant.py file to follow along.
First, you need to define the function you want to call. In our example, we create a simple led_write() function that accepts an led object (as created by the Raspberry Pi gpiozero library) and an integer value: 0 for off, 1 for on.

def led_write(led, value):
    """
    Turn the LED on or off.
    """
    if int(value) > 0:
        led.on()
        print("The LED is now on")
    else:
        led.off()
        print("The LED is now off")

The trick is to get the LLM to understand that calling this function is a possibility! Since the LLM does not have direct access to your code, the ollama library acts as an intermediary. If we define a set of tools, the LLM can return one of those tools as a response instead of (or in addition to) its usual text-based answer. This response comes in the form of a JSON or Python dictionary that our code can parse in order to call the related function.
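For a request like "turn the light on," the relevant part of that response looks roughly like this (abridged; the field names match the parsing code further below):

{
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "function": {
                    "name": "led_write",
                    "arguments": {"value": 1}
                }
            }
        ]
    }
}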

You must define the tools in a list of dictionary objects. As these small LLMs struggle with the concept of an “LED,” we’ll call this a “light.” In our code, we provide the following description of the led_write() function to Ollama:

TOOLS = [
    {
        'type': 'function',
        'function': {
            'name': "led_write",
            'description': "Turn the light off or on",
            'parameters': {
                'type': 'object',
                'properties': {
                    'value': {
                        'type': 'number',
                        'description': "The value to write to the light pin " \
                            "to turn it off and on. 0 for off, 1 for on.",
                    },
                },
                'required': ['value'],
            },
        }
    }
]

In the send() function, we send our query to the Ollama server running the LLM. This query is captured by the Vosk STT functions and converted to text before being added to the message history buffer msg_history.

response = client.chat(
    model=model,
    messages=msg_history.get(),
    tools=TOOLS,
    stream=False
)

When we receive the response from the LLM, we check to see if it contains an entry with the key tool_calls. If so, it means the LLM decided to use one of the defined tools! We then need to figure out which tool the LLM intended to use by cycling through all of the returned tool names. If one of those names matches led_write, which we defined in the original TOOLS list, we call the led_write() function, passing in the predefined led object and the value argument that the LLM chose.

if response['message'].get('tool_calls') is None:
    print("Tools not used.")
    return
else:
    print("Tools used. Calling:")
    for tool in response['message']['tool_calls']:
        print(tool)
        if tool['function']['name'] == "led_write":
            led_write(led, tool['function']['arguments']['value'])

The properties defined in the TOOLS dictionary give the LLM context about the function, such as its use case and the necessary arguments it needs to provide. Think of it like giving an AI agent a form to fill out. The AI will first determine which form to use based on the request (e.g. “control a light”) and then figure out how to fill in the various fields. For example, the value parameter says that the field must be a number and it should be a 0 for “off” and 1 for “on.” The LLM uses these context clues to figure out how to craft an appropriate response.

Conclusion

Robot Powers, Activate!

This example demonstrates the possibilities of using LLMs for understanding user intention and for processing requests to call arbitrary functions. Such technology is extremely powerful — we can connect AI agents to the internet to make requests, and control hardware! — but it’s still new and experimental. You will likely run into bugs, and you can expect the code interface to change. It also demonstrates the need for better-optimized models and more powerful hardware. A few boards such as the Jetson Orin Nano and accelerators like the new Hailo-10H enable low-cost local LLM execution today. I’m excited to see this tech get better!

