Hermes Voice Control from Your Phone

Talk to Hermes from your phone

You already chat with Hermes Agent from your phone by text. Now you want to talk to it directly and get spoken replies back. That is usually the right move, especially if you already use Hermes as a persistent self-hosted assistant. Typing long prompts on a small screen is slow and error-prone.

Voice mode makes Hermes practical in the moments where it matters most: while walking, commuting, or doing admin work away from your desk.

The good news is that voice mode can run with zero paid APIs. A local faster-whisper model handles transcription, and Edge TTS handles spoken output for free. This guide covers setup, provider choices, platform differences, practical command patterns, and the failure modes that usually block first-time users.

How the Pipeline Works

Three stages, no magic:

  1. Transcription (STT): Your voice message becomes text.
  2. Reasoning: Hermes processes that text exactly like a typed request.
  3. Synthesis (TTS): The response text is converted back to audio.
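
The three stages can be sketched as a small chain of functions. This is an illustration only, not Hermes's actual internals: the `VoicePipeline` class and the `fake_*` stand-in stages are hypothetical names invented here so the example runs without any audio libraries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three stages: speech-to-text, reasoning, text-to-speech."""
    transcribe: Callable[[bytes], str]   # STT: audio bytes -> text
    reason: Callable[[str], str]         # agent: request text -> reply text
    synthesize: Callable[[str], bytes]   # TTS: reply text -> audio bytes

    def handle_voice_message(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        reply = self.reason(text)  # same path a typed message would take
        return self.synthesize(reply)


# Hypothetical stand-in stages so the sketch runs on its own.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"Done: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_agent, fake_tts)
reply_audio = pipeline.handle_voice_message(b"check disk usage")
print(reply_audio)  # b'Done: check disk usage'
```

The point of the shape is that the middle stage never knows the request started as audio, which is why voice gets the same behavior as text.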

The important distinction from consumer assistants is execution depth. Hermes is not just answering trivia. It can call tools, inspect files, run code paths, and continue multi-step work from memory. In practice, that means voice can trigger real workflows such as incident triage, draft generation, and targeted debugging. If you want the broader architecture context, the AI Systems pillar explains how this voice layer fits into local agent infrastructure.

What Voice Control Is Great For

Use voice mode when keyboard precision is not required yet:

  • Operational checks while away from your laptop.
  • Idea capture for drafts, outlines, and rough specs.
  • Fast triage of alerts and errors before deeper desktop follow-up.
  • Hands-busy workflows where speaking is the only realistic input channel.

Voice Input: Pick an STT Provider

| Provider | Cost | API Key | Notes |
| --- | --- | --- | --- |
| Local faster-whisper | Free | None | On-device, ~150 MB model, 90+ languages |
| Groq Whisper | Free tier | GROQ_API_KEY | Fast cloud inference |
| OpenAI Whisper | Paid | VOICE_TOOLS_OPENAI_KEY | Highest accuracy |
| Mistral Voxtral | Paid | MISTRAL_API_KEY | Alternative cloud option |

Configuration in ~/.hermes/config.yaml:

stt:
  enabled: true
  provider: local
  local:
    model: base  # tiny, base, small, medium, large-v3

Start with local. It works immediately, handles multilingual speech, and adds no recurring cost. Move to Groq or OpenAI only if your local setup cannot meet your latency or accuracy requirements. For command-level setup and diagnostics while testing providers, keep the Hermes CLI cheat sheet nearby.
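
Before switching providers, it helps to confirm the matching API key is actually set. The sketch below maps each provider to the environment variable from the table above. Only `local` is a confirmed value for `stt.provider`; the identifiers `groq`, `openai`, and `mistral` are assumptions made for illustration, and `missing_stt_key` is a hypothetical helper, not part of Hermes.

```python
import os
from typing import Mapping, Optional

# Env var per provider, taken from the provider table. Only "local" is a
# confirmed config value; the other provider identifiers are assumed.
STT_API_KEYS = {
    "local": None,
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def missing_stt_key(provider: str,
                    env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the env var that still needs to be set, or None."""
    if provider not in STT_API_KEYS:
        raise ValueError(f"unknown STT provider: {provider!r}")
    key = STT_API_KEYS[provider]
    if key is None or env.get(key):
        return None
    return key


print(missing_stt_key("local", {}))                      # None
print(missing_stt_key("groq", {}))                       # GROQ_API_KEY
print(missing_stt_key("groq", {"GROQ_API_KEY": "abc"}))  # None
```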

Faster Whisper Model Selection

Use a simple progression:

  • tiny for very low-power devices where speed matters most.
  • base as the default balance for laptops and small servers.
  • small when accents, noisy environments, or domain terms reduce accuracy.
  • medium or large-v3 when quality is critical and hardware budget is higher.

If your transcripts are consistently wrong, increase the model size before adding more prompt complexity.
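
The progression is a simple ladder, which can be sketched as follows. The model names come from the config comment earlier in this guide; `MODEL_LADDER` and `next_model_up` are hypothetical names used only for this example.

```python
# Model sizes in increasing order of accuracy and hardware cost.
MODEL_LADDER = ["tiny", "base", "small", "medium", "large-v3"]


def next_model_up(current: str) -> str:
    """Return the next larger model, or the same one if already at the top."""
    index = MODEL_LADDER.index(current)  # raises ValueError for unknown names
    return MODEL_LADDER[min(index + 1, len(MODEL_LADDER) - 1)]


print(next_model_up("base"))      # small
print(next_model_up("large-v3"))  # large-v3
```

Moving one step at a time keeps the change measurable: if `small` fixes your transcripts, there is no reason to pay the latency cost of `medium`.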

Voice Output: TTS Providers

| Provider | Quality | Cost | Best For |
| --- | --- | --- | --- |
| Edge TTS (default) | Good | Free | Quick start, 322 voices, 74 languages |
| ElevenLabs | Excellent | Paid | Premium quality, voice cloning |
| OpenAI TTS | Good | Paid | Natural voices, 6 options |
| MiniMax TTS | Excellent | Paid | Fine-grained speed/volume/pitch control |
| NeuTTS | Good | Free (local) | Fully offline, voice cloning |

Configuration:

tts:
  provider: "edge"
  speed: 1.0

  edge:
    voice: "en-US-AriaNeural"

One critical detail is output format. Telegram voice bubbles are most reliable when audio is encoded as OGG with Opus. Hermes relies on ffmpeg for these conversions in common setups. If ffmpeg is missing, replies often show up as file attachments instead of inline voice bubbles.

Install ffmpeg early:

sudo apt install ffmpeg  # Ubuntu/Debian
brew install ffmpeg       # macOS
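
To verify the install and see roughly what an OGG/Opus conversion looks like, here is a minimal sketch. It is not the command Hermes runs internally, which this guide does not document. The helper names, the file names, and the 32k bitrate are assumptions; the ffmpeg flags themselves (`-i`, `-c:a libopus`, `-b:a`, `-y`) are standard and require an ffmpeg build that includes libopus.

```python
import shutil
from typing import List


def ffmpeg_available() -> bool:
    """True if an ffmpeg executable is on PATH."""
    return shutil.which("ffmpeg") is not None


def opus_convert_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that re-encodes src as Opus audio.

    ffmpeg picks the OGG container from the .ogg extension of dst.
    """
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", "-b:a", "32k", dst]


print("ffmpeg found:", ffmpeg_available())
print(" ".join(opus_convert_command("reply.mp3", "reply.ogg")))
```

Running the printed command by hand on a sample file is a quick way to confirm your ffmpeg build can produce the format Telegram expects.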

Platform Workflows and Practical Differences

Telegram

Telegram is the easiest place to start. Voice messages are first-class on mobile, and the interaction loop is simple: hold, speak, release, receive.

Setup:

# 1. Create a bot via @BotFather, get your token
# 2. Add to ~/.hermes/.env:
TELEGRAM_BOT_TOKEN=***
TELEGRAM_ALLOWED_USERS=your_user_id

# 3. Start the gateway
hermes gateway start

Then open the Hermes chat, tap the microphone, and speak. If STT and TTS are enabled, Hermes transcribes your request, executes it, and sends a voice reply.
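
If the bot stays silent, a missing or empty variable in `.env` is worth ruling out first. The sketch below parses simple `KEY=VALUE` lines and reports which of the two Telegram variables from the setup above are absent. `parse_env` and `missing_telegram_keys` are hypothetical helpers, and the parser is deliberately naive (no quoting or `export` handling).

```python
from typing import Dict, List

REQUIRED_TELEGRAM_KEYS = ["TELEGRAM_BOT_TOKEN", "TELEGRAM_ALLOWED_USERS"]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_telegram_keys(env: Dict[str, str]) -> List[str]:
    """List required keys that are absent or empty."""
    return [k for k in REQUIRED_TELEGRAM_KEYS if not env.get(k)]


# Example content only, not real credentials.
sample = """
# Telegram gateway
TELEGRAM_BOT_TOKEN=123:abc
TELEGRAM_ALLOWED_USERS=
"""
print(missing_telegram_keys(parse_env(sample)))  # ['TELEGRAM_ALLOWED_USERS']
```

To check your real file, pass in the text of `~/.hermes/.env`, for example via `pathlib.Path("~/.hermes/.env").expanduser().read_text()`.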

Discord

Discord supports two useful modes. Voice messages in DMs or channels behave much like they do on Telegram.

The more advanced option is live voice channels. In that flow, Hermes can participate continuously, transcribing speech and responding without explicit message bubbles.

Requirements:

  • Message Content Intent enabled in your bot settings
  • Server Members Intent enabled
  • Bot permissions: Connect and Speak

Signal

Signal works through the signal-cli daemon. Voice messages still use the same Hermes STT and TTS pipeline.

A useful pattern is running signal-cli as a linked device and using Signal Note to Self. You can leave yourself a voice note and get Hermes output in the same thread.

WhatsApp

WhatsApp follows the same gateway model. Audio messages transcribe automatically once the connector is configured.

Mobile App Permissions

Both iOS and Android need microphone access for the messaging app you’re using.

iOS: Settings → Telegram (or Discord) → Permissions → Microphone → Allow. Enable Background App Refresh for instant responses.

Android: Settings → Apps → Telegram → Permissions → Microphone → Allow. For Discord voice channels, enable overlay permission.

Pinning the Hermes bot chat to your home screen helps — one tap to start speaking.

Speaking Patterns That Work Reliably

Voice interaction has different ergonomics than typing. You cannot easily paste logs or quote long stack traces, so structure matters:

  • Be explicit. Say the action, scope, and output format in one sentence.
  • Keep one objective per message. Split multi-step jobs into short follow-ups.
  • Constrain output. Ask for numbered actions or a 3-point summary when mobile readability matters.
  • Stay short. Around 10 to 30 seconds per message usually transcribes better.
  • Use iterative turns. Correct and refine in the next voice message instead of overloading the first.

Example Prompts You Can Speak

  • “Check deployment logs for the last hour and report only critical errors.”
  • “Create a draft outline for a post about OpenTelemetry migration with five sections.”
  • “Summarize this bug in three bullets and propose the most likely root cause.”
  • “Review the config and tell me what to change for lower transcription latency.”

Common Use Cases with Concrete Outcomes

  • Operations — “Check production health and list failed services.”
    Outcome is a focused status update you can act on immediately.
  • Writing — “Turn these rough points into a publishable intro paragraph.”
    Outcome is polished text from spoken notes.
  • Debug triage — “Investigate this TypeError and suggest the first fix to test.”
    Outcome is a concrete next step before opening the IDE.
  • Research — “Find three recent sources on topic X and summarize differences.”
    Outcome is a compressed briefing for later deep work.
  • Automation — “Run the home routine and confirm device states.”
    Outcome is direct action plus confirmation.

Troubleshooting

Voice messages not transcribing: Confirm stt.enabled: true in config.yaml. Verify local dependencies are installed. Then restart with hermes gateway restart.

TTS not responding: Confirm tts.provider is set. If using a paid provider, verify the API key in .env. Validate current voice settings from the Hermes CLI status commands.

Poor transcription quality: Increase stt.local.model from base to small or medium. Reduce noise and speak in shorter segments. If needed, switch to cloud STT for better accuracy.

Voice bubbles showing as files on Telegram: Install ffmpeg and restart the gateway. This is the most common issue.
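
The checks above can be folded into one pass. This is a sketch under assumptions: `diagnose` is a hypothetical helper, and the config is passed as a plain dict mirroring the `stt` and `tts` keys shown earlier, since loading YAML would need a third-party parser.

```python
import shutil
from typing import Any, Dict, List


def diagnose(config: Dict[str, Any], ffmpeg_found: bool) -> List[str]:
    """Return one hint per problem found, in the order of the checklist."""
    hints: List[str] = []
    if not config.get("stt", {}).get("enabled"):
        hints.append("Set stt.enabled: true, then run: hermes gateway restart")
    if not config.get("tts", {}).get("provider"):
        hints.append("Set tts.provider (for example: edge)")
    if not ffmpeg_found:
        hints.append("Install ffmpeg so Telegram replies arrive as voice bubbles")
    return hints


# Example config with TTS left unconfigured.
config = {"stt": {"enabled": True, "provider": "local"}, "tts": {}}
for hint in diagnose(config, ffmpeg_found=shutil.which("ffmpeg") is not None):
    print("-", hint)
```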

The Free Stack

For cost-conscious setups, this baseline is strong:

  • STT: Local faster-whisper with no API key
  • TTS: Edge TTS with wide language coverage
  • Total cost: $0

This is a meaningful advantage over many closed assistants where voice quality and automation quickly become paid-only features.

If quality requirements increase, upgrade one layer at a time. Usually STT upgrades produce the biggest immediate gain, then TTS quality can be improved later if needed.

FAQ Topics in Practice

The four most common user questions are predictable. They also overlap with memory and profile design concerns covered in Hermes Agent Memory System and Hermes production setup patterns.

  • Whether voice commands get the same tool access as text.
  • Whether a free stack is viable for daily use.
  • Why Telegram sometimes shows attachments instead of voice bubbles.
  • Which local Whisper model should be used first.

This guide addresses each of these directly in setup, tuning, and troubleshooting sections so you can move from first run to stable daily usage quickly.

Quick Start Recap

# 1. Install voice extras
pip install "hermes-agent[all]"

# 2. Set up Telegram gateway
hermes gateway setup

# 3. Install ffmpeg (required for Telegram voice bubbles)
sudo apt install ffmpeg  # Ubuntu/Debian; on macOS: brew install ffmpeg

# 4. Send a voice message from your phone
# Hermes transcribes, processes, and responds

From there, iterate based on your real bottleneck. If latency is the issue, tune model size or cloud STT. If audio quality is the issue, tune TTS provider and voice preset. Start free, measure, then upgrade only where it actually improves your workflow.
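
One way to find the real bottleneck is to time each stage separately. The sketch below is an assumption-heavy illustration: `timed`, `slowest_stage`, and the stand-in stage functions are hypothetical, and you would swap in your own calls to the STT, agent, and TTS steps.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def timed(stage: str, timings: Dict[str, float],
          fn: Callable[..., T], *args) -> T:
    """Run fn(*args) and record its wall-clock duration under the stage name."""
    start = time.perf_counter()
    result = fn(*args)
    timings[stage] = time.perf_counter() - start
    return result


def slowest_stage(timings: Dict[str, float]) -> str:
    """Name of the stage with the largest recorded duration."""
    return max(timings, key=lambda stage: timings[stage])


# Hypothetical stand-in stages; the agent sleeps to simulate slow reasoning.
def fake_stt(audio: bytes) -> str:
    return "status report"


def fake_agent(text: str) -> str:
    time.sleep(0.05)
    return f"ok: {text}"


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")


timings: Dict[str, float] = {}
text = timed("stt", timings, fake_stt, b"...")
reply = timed("agent", timings, fake_agent, text)
audio = timed("tts", timings, fake_tts, reply)
print(slowest_stage(timings))  # agent
```

If `stt` dominates, try a smaller model or cloud STT; if `tts` dominates, try a different TTS provider; if the agent dominates, voice tuning will not help much.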
