Your Voice Agents Just Got Eyes: What ElevenLabs' Multimodal Update Means for Developers
ElevenLabs just gave its voice agents the ability to "see" images and PDFs during real-time calls. Here's how the new multimodal support and scoped conversation analysis will change how you build and debug voice apps.
TL;DR:
- Multimodal Support: Voice agents can now process images and PDFs mid-conversation using the new sendMultimodalMessage function in the JS SDK.
- Scoped Conversation Analysis: Debugging multi-agent workflows is now drastically easier, allowing you to isolate metrics for specific sub-agents rather than parsing entire call transcripts.
- Apple Ecosystem Upgrades: The new Swift SDK v3.1.2 brings ultra-low latency LiveKit WebRTC and reactive SwiftUI integration for Mac and iOS developers.
- Workflow Overrides: Developers can now restrict specific agents to distinct tool_ids and knowledge_base documents to prevent hallucinations.
If you build or use voice AI tools daily, you already know the frustration of a "blind" voice agent. Imagine a user trying to read a 16-character router serial number aloud to a support bot, or spelling out a complex foreign address. It's a massive friction point that text-to-speech (TTS) and speech-to-text (STT) alone simply cannot solve.
In a major update to its ElevenAgents platform, ElevenLabs has fundamentally changed this dynamic. By introducing Multimodal Support and Scoped Conversation Analysis, the company is aggressively pivoting from a specialized voice-cloning provider into a comprehensive "Agentic AI" powerhouse.
Here is exactly what this means for your daily workflows, your app development, and the future of voice interfaces.
Multimodal Support: The End of "Spelling It Out"
The most immediately impactful feature for end-users is the addition of multimodal message support. The JavaScript SDK (@elevenlabs/client) now includes a sendMultimodalMessage hook.
Instead of forcing users to choose between a text chat or a voice call, developers can now build hybrid interactions. During a live, real-time voice conversation, a user can upload a photo of a broken product, a screenshot of an error code, or a PDF of a receipt. The agent can "see" this visual data and respond verbally in real-time.
This is a massive leap for data extraction and CRM integration. By allowing users to augment their voice with visual context, businesses can drastically reduce call times and eliminate the hallucination risks associated with poor phonetic transcriptions of complex data.
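To make this concrete, here is a minimal sketch of sending visual context during a live call. The sendMultimodalMessage hook name comes from the SDK update above, but the exact payload shape (text plus base64-encoded attachments) is an assumption for illustration; check the @elevenlabs/client documentation for the real MultimodalMessageInput fields.

```javascript
// Build a multimodal message payload from user text and uploaded files.
// NOTE: this payload shape is a hypothetical sketch, not the documented schema.
function buildMultimodalMessage(text, files) {
  return {
    text,
    attachments: files.map((f) => ({
      name: f.name,
      mimeType: f.mimeType,
      data: f.base64, // file contents, base64-encoded
    })),
  };
}

// Example: the user snaps a photo of a router label mid-call
// instead of reading the serial number aloud.
const message = buildMultimodalMessage("Here is the serial number sticker", [
  { name: "router.jpg", mimeType: "image/jpeg", base64: "/9j/placeholder" },
]);

// In a real app you would hand this to the SDK hook named in this article:
// const { sendMultimodalMessage } = useConversationControls();
// sendMultimodalMessage(message);
```

The key design point is that the voice channel stays open the whole time: the image rides alongside the audio stream rather than forcing the user into a separate text chat.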
Scoped Conversation Analysis: Debugging the Multi-Agent Mess
As enterprises have started deploying ElevenAgents for complex tasks, they've run into a scaling problem: debugging a multi-agent workflow is a nightmare.
Previously, if you had a "Greeting Agent" that routed to a "Billing Agent" or a "Tech Support Agent," conversation analysis was applied to the entire call transcript. If an evaluation failed, pinpointing exactly which sub-agent dropped the ball required tedious manual review.
With Scoped Conversation Analysis, developers can now apply evaluation criteria and data collection items to either the full conversation or a specific agent node.
Technical Implementation
For the developers under the hood, here is how the new tools shape up:
| Feature | Technical Implementation |
|---|---|
| Analysis API | POST /v1/convai/conversations/{id}/analysis/run |
| New Schema | ScopedAnalysisResult (array containing per-agent evaluation breakdowns) |
| JS SDK Hook | useConversationControls().sendMultimodalMessage |
| Input Type | MultimodalMessageInput (exported from @elevenlabs/client) |
| Workflow Config | PromptAgentAPIModelOverrideConfig now includes tool_ids and knowledge_base |
By utilizing the new tool_ids and knowledge_base overrides, you can ensure your Billing Agent only has access to billing APIs, while your Tech Support Agent only searches your technical documentation. This sandbox approach is the most effective way to reduce hallucinations in production environments.
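The sandboxing idea might look like the following sketch. The two field names (tool_ids, knowledge_base) come from the PromptAgentAPIModelOverrideConfig change above; the surrounding structure and all IDs are hypothetical placeholders.

```javascript
// Hypothetical per-agent override config: each sub-agent is restricted to
// its own tools and knowledge base documents.
const workflowOverrides = {
  billing_agent: {
    tool_ids: ["tool_refund_api", "tool_invoice_lookup"], // billing APIs only
    knowledge_base: ["kb_billing_policies"],
  },
  tech_support_agent: {
    tool_ids: ["tool_diagnostics"],
    knowledge_base: ["kb_product_manuals"], // technical docs only
  },
};

// Sanity check worth running in CI: no two agents should share a tool,
// otherwise the sandbox boundary is leaking.
function toolsAreDisjoint(overrides) {
  const seen = new Set();
  for (const agent of Object.values(overrides)) {
    for (const id of agent.tool_ids) {
      if (seen.has(id)) return false;
      seen.add(id);
    }
  }
  return true;
}
```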
Mac and iOS Developers Get a Massive Boost
ElevenLabs has clearly prioritized the Apple ecosystem in this rollout. If you are building voice apps for Mac or iOS, the new Swift SDK v3.1.2 brings several quality-of-life improvements.
The SDK now utilizes LiveKit WebRTC for ultra-low latency audio streaming, ensuring that conversational prosody feels natural and uninterrupted. Furthermore, it features deep SwiftUI Integration. The SDK is fully reactive, meaning your iOS app's UI will automatically update its transcripts and visual states as the AI speaks, requiring zero manual state management from the developer.
ElevenLabs also added environment-specific agent connections, making it far easier for iOS devs to toggle between "Development" and "Production" versions of their agents while testing on TestFlight.
The Competitive Landscape: ElevenLabs vs. OpenAI
Industry analysts are already dubbing ElevenLabs the "audio layer" of the internet, but they face stiff competition. OpenAI's Realtime API offers a highly capable "single-brain" multimodal experience.
However, where ElevenLabs continues to win is in production-ready prosody. While pure LLM-voice models might have a slight edge in raw latency, ElevenLabs' underlying models (like the newly available Eleven v3 and Scribe v2) offer unmatched voice quality, emotional nuance, and character consistency. With the addition of "Versioning" for A/B testing live traffic and structured "Agent Test Folders" for automated testing, ElevenLabs is clearly targeting serious, enterprise-grade developers who need granular control over their voice outputs.
The Privacy Angle: Cloud vs. Local
While the ability to send images, PDFs, and real-time voice data to a cloud-based agent is incredibly powerful, it also introduces significant privacy and cost concerns. Every multimodal message sent to a cloud API consumes tokens, and transmitting sensitive documents (like invoices or personal IDs) to third-party servers is often a non-starter for healthcare, finance, and privacy-conscious users.
If you love the power of voice AI but need to keep your data strictly on your own hardware, cloud APIs aren't the only way forward.
About FreeVoice Reader
FreeVoice Reader is a privacy-first voice AI suite that runs 100% locally on your device:
- Mac App - Lightning-fast dictation, natural TTS, voice cloning, meeting transcription
- iOS App - Custom keyboard for voice typing in any app
- Android App - Floating voice overlay with custom commands
- Web App - 900+ premium TTS voices in your browser
One-time purchase. No subscriptions. Your voice never leaves your device.
Transparency Notice: This article was written by AI, reviewed by humans. We fact-check all content for accuracy and ensure it provides genuine value to our readers.