Gemini 3.5 Live Translate API Guide

What Google Launched

Gemini 3.5 Live Translate is not a chat model with a translation prompt. It is a dedicated Live API translation mode: stream 16kHz PCM speech in, choose a target language, and receive translated 24kHz audio plus optional transcripts from gemini-3.5-live-translate-preview.

Property	Value
Model ID	`gemini-3.5-live-translate-preview`
Launch	June 9, 2026
Languages	70+ supported
Input	Audio only
Output	Translated audio
Status	Public preview for developers

This guide is based on Google's launch post, the Gemini Live API docs, the official Google AI Developers thread, Google's LiveKit example, and the Gemini 3.5 Audio model card. Community reactions are intentionally left out so the API details stay tied to primary sources.

What Changed for Developers

Google's launch post positions Gemini 3.5 Live Translate as an audio model for live speech-to-speech translation. The developer-facing shift is that realtime translation is now exposed through the Gemini Live API and Google AI Studio, not only inside end-user Google products.

Area	What changed	Developer impact
Developer access	Gemini 3.5 Live Translate is available in public preview through the Gemini Live API and Google AI Studio.	Developers can prototype speech-to-speech translation without waiting for a separate product surface.
Model ID	The Live API translation model is `gemini-3.5-live-translate-preview`.	Treat it as a preview model and isolate it behind config flags before production rollout.
Interaction model	Live Translation behaves like a realtime interpreter, not a conversational Live Agent.	Do not design prompts, tools, function calls, or turn-taking flows around this mode.
Audio pipeline	Input is audio-only raw PCM at 16kHz; output is translated audio at 24kHz.	Your product needs capture, resampling, buffering, playback, and transcript handling.
Safety signal	Google says model-generated audio is watermarked with SynthID.	Apps using generated audio should disclose AI audio and preserve provenance expectations.

Official X Thread and Video

The Google AI Developers launch thread is useful because it frames the developer capabilities in product terms: multilingual input, automatic language detection, native audio processing, and robustness in noisy environments.

Our latest audio model, Gemini 3.5 Live Translate, takes real-time speech translation to the next level for developers.
- Google AI Developers (@googleaidevs), June 9, 2026

The embedded post includes Google's official launch video. The important takeaway for builders is not only that translation is faster; it is that the product surface is designed for continuous speech, where the system stays close to the speaker instead of waiting for a complete turn.

Mental Model: Live Agent vs. Live Translation

The Gemini Live API can support realtime agent interactions, but Live Translation is a narrower mode. Google's docs describe it as an interpreter pipeline. That distinction changes the whole product design.

Dimension	Live Agent	Live Translation
Role	Assistant that listens, reasons, and can act.	Interpreter pipeline for speech-to-speech translation.
Interaction	Turn-based realtime conversation.	Continuous stream processing while the speaker talks.
Tools	Can use Live API tool and agent capabilities.	Translation-only; no tools or instructions.
Inputs	Text, audio, video, or image, depending on the feature.	Audio input only, which keeps the translation path focused on latency.
Main config	Generation, speech, tools, and instructions.	`targetLanguageCode` plus `echoTargetLanguage`.

The practical implication: do not prompt Live Translate like a multilingual assistant. Build a media pipeline, not a chatbot. The API surface is about audio chunks, language codes, transcripts, and output playback.

Smallest API Shape

The docs show Python, JavaScript, and raw WebSocket options. For most web teams, the JavaScript SDK shape is the clearest starting point, but client-side apps should still use ephemeral tokens instead of exposing an API key.

import { GoogleGenAI, Modality } from "@google/genai";

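// With no explicit apiKey, the SDK reads the key from the environment.
// Run this server-side; browser clients should use ephemeral tokens instead.
const ai = new GoogleGenAI({});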

const session = await ai.live.connect({
  model: "gemini-3.5-live-translate-preview",
  config: {
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},
    outputAudioTranscription: {},
    translationConfig: {
      targetLanguageCode: "es",
      echoTargetLanguage: false,
    },
  },
  callbacks: {
    onmessage: (message) => {
      const content = message.serverContent;
      const transcript = content?.outputTranscription?.text;
      const translatedAudio = content?.modelTurn?.parts?.find((part) => part.inlineData);

      if (transcript) console.log("Translated transcript:", transcript);
      if (translatedAudio) {
        // Decode and play the translated PCM audio chunk.
      }
    },
  },
});

Field	Value	Why it matters
`model`	`gemini-3.5-live-translate-preview`	Use the preview Live Translate model.
`responseModalities`	`AUDIO`	The API returns translated audio chunks.
`inputAudioTranscription`	object	Optional input transcript stream.
`outputAudioTranscription`	object	Optional translated transcript stream.
`targetLanguageCode`	BCP-47 code	Target output language, such as `pl`, `es`, or `ja`. Defaults to English.
`echoTargetLanguage`	boolean	When true, target-language input is echoed; when false, the model stays silent for target-language speech.
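
The `onmessage` handler above leaves the decode step open. As a minimal sketch, assuming each audio part carries base64-encoded 16-bit PCM at 24kHz mono and that the samples are little-endian (the guide states endianness only for input, so treat that as an assumption), a chunk could be converted to floating-point samples like this. The names `decodePcm16Base64` and `chunkDurationMs` are hypothetical helpers, not part of the SDK.

```javascript
// Output contract described in this guide: raw 16-bit PCM, 24kHz, mono.
const OUTPUT_SAMPLE_RATE = 24000;

// Decode one base64 audio chunk into Float32 samples in the range [-1, 1).
// Assumption: samples are little-endian, mirroring the documented input format.
function decodePcm16Base64(base64Data) {
  const binary = atob(base64Data);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}

// Duration of a decoded chunk in milliseconds, useful when scheduling playback.
function chunkDurationMs(samples, sampleRate = OUTPUT_SAMPLE_RATE) {
  return (samples.length * 1000) / sampleRate;
}
```

In a browser, the resulting samples could be copied into an audio buffer created at 24kHz and scheduled back to back; how you schedule playback is up to your audio stack.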

Audio Contract: PCM In, Translated Audio Out

The Live Translate docs are explicit about the media contract. Input audio must be raw, little-endian, 16-bit PCM at 16kHz mono. Output audio is raw 16-bit PCM at 24kHz mono. Google recommends 100ms chunks for low-latency streaming.

// Browser microphone audio usually needs conversion before sending.
// Target input for Live Translate:
// - raw PCM
// - 16-bit
// - little-endian
// - mono
// - 16kHz sample rate
// - roughly 100ms chunks

session.sendRealtimeInput({
  audio: {
    // Assumes pcm16MonoChunk is a Node.js Buffer; in a browser, base64-encode the bytes another way.
    data: pcm16MonoChunk.toString("base64"),
    mimeType: "audio/pcm;rate=16000",
  },
});

That means the hard part of a real app is often not the API call. It is capture, resampling, voice activity handling, buffering, playback drift, and UI feedback when the network or microphone gets rough.
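
To make the capture side concrete, here is a rough sketch of the conversion steps, assuming your capture code already yields mono Float32 samples (for example from a Web Audio graph). The helper names are hypothetical, and the resampler is a deliberately naive linear interpolator with no anti-aliasing filter; a production app would likely use a proper resampling library.

```javascript
const INPUT_SAMPLE_RATE = 16000;
const CHUNK_MS = 100;
// 16000 samples/s * 0.1 s = 1600 samples (3200 bytes) per chunk.
const SAMPLES_PER_CHUNK = (INPUT_SAMPLE_RATE * CHUNK_MS) / 1000;

// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16(float32Samples) {
  const bytes = new Uint8Array(float32Samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, float32Samples[i]));
    const scaled = clamped < 0 ? clamped * 32768 : clamped * 32767;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  return bytes;
}

// Naive resampler: linear interpolation, no low-pass filtering.
function resampleLinear(samples, fromRate, toRate = INPUT_SAMPLE_RATE) {
  if (fromRate === toRate) return Float32Array.from(samples);
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}

// Split samples into full chunks and return the leftover tail,
// which the caller should prepend to the next captured buffer.
function splitIntoChunks(samples, chunkSize = SAMPLES_PER_CHUNK) {
  const chunks = [];
  let offset = 0;
  for (; offset + chunkSize <= samples.length; offset += chunkSize) {
    chunks.push(samples.subarray(offset, offset + chunkSize));
  }
  return { chunks, remainder: samples.subarray(offset) };
}
```

Each chunk from `splitIntoChunks` can then be passed through `floatToPcm16`, base64-encoded, and sent with the `audio/pcm;rate=16000` MIME type shown earlier.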

Client Security with Ephemeral Tokens

Google's docs recommend ephemeral tokens for client-to-server applications so browser clients do not expose the API key. For translation, the safer default is to lock translationConfig in the token constraints on the server.

Choice	Use when	Risk
Lock target language on server	Kiosk, classroom, broadcast, support room, meeting workflow.	Less flexible, but the client cannot tamper with translation settings.
Unlock target language on client	User must choose language dynamically in the browser.	Requires stricter validation, logging, and abuse controls.

A production design should keep the API key server-side, mint short-lived tokens, limit allowed models, constrain target languages where possible, and log enough metadata to debug latency without storing sensitive raw audio unnecessarily.
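
One way to picture the server-locked option is a small function that validates the requested language and builds the constraints attached to a short-lived token. This is a sketch only: the allowlist, the 30-minute lifetime, and the property names `uses`, `expireTime`, and `liveConnectConstraints` are assumptions for illustration and should be checked against the current ephemeral token documentation before use. The code does not call the SDK.

```javascript
const TRANSLATE_MODEL = "gemini-3.5-live-translate-preview";

// Hypothetical allowlist; replace with the languages your product supports.
const ALLOWED_TARGET_LANGUAGES = new Set(["es", "pl", "ja"]);

// Hypothetical token lifetime of 30 minutes.
const TOKEN_LIFETIME_MS = 30 * 60 * 1000;

// Build the constraints a server would attach when minting a token.
// Property names are assumptions, not a confirmed API shape.
function buildTokenConstraints(requestedLanguage, now = Date.now()) {
  if (!ALLOWED_TARGET_LANGUAGES.has(requestedLanguage)) {
    throw new Error(`Unsupported target language: ${requestedLanguage}`);
  }
  return {
    uses: 1,
    expireTime: new Date(now + TOKEN_LIFETIME_MS).toISOString(),
    liveConnectConstraints: {
      model: TRANSLATE_MODEL,
      config: {
        responseModalities: ["AUDIO"],
        translationConfig: {
          targetLanguageCode: requestedLanguage,
          echoTargetLanguage: false,
        },
      },
    },
  };
}
```

Because the language is validated and fixed on the server, a tampered browser client can only request a token for a language you already allow.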

Limitations You Should Design Around

The launch framing is strong, but the official docs also list practical caveats. These limitations are exactly where a polished app needs UX support.

Limitation	Official caveat	Product response
Audio only	Translation mode does not accept text input.	Keep text translation, chat, and function calling in separate flows.
Voice consistency	Voices can shift after long pauses or rapid speaker changes.	Do not promise perfect speaker identity preservation.
Language detection	Heavy accents, similar languages, and fast language switches can affect the input transcript.	Show transcript confidence and let users correct language when needed.
Background audio	Noise and music are filtered, but not every background signal is ignored.	Test real rooms, cars, crowds, and cheap microphones.
Echo artifacts	`echoTargetLanguage: true` can introduce artifacts when target-language input contains background audio.	Default to false unless your UX really needs echoing.

Reference Architecture

Google's example app shows a useful broadcast pattern with LiveKit: the organizer publishes audio, a translation bridge subscribes, one Gemini Live API session is created per target language, and attendees subscribe to the translated audio track for their chosen language.

Organizer microphone
  -> realtime room audio
  -> translation bridge per target language
  -> Gemini Live API translationConfig
  -> translated 24kHz audio
  -> attendee playback + optional transcript

The demo's most important scaling idea is session sharing. If fifty attendees choose Spanish, they should not create fifty identical Gemini sessions. A bridge can publish one Spanish translation stream that all Spanish listeners share.
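
The session-sharing idea can be sketched as a small registry keyed by target language. This is not taken from Google's example app; `TranslationBridge` and the `openSession` factory are hypothetical, and the session object is only assumed to expose an optional `close()` method.

```javascript
class TranslationBridge {
  // openSession is a hypothetical factory: (languageCode) => session or Promise<session>.
  constructor(openSession) {
    this.openSession = openSession;
    this.streams = new Map(); // languageCode -> { sessionPromise, listeners }
  }

  // Returns the shared session for a language, creating it on first use.
  async subscribe(languageCode, attendeeId) {
    let stream = this.streams.get(languageCode);
    if (!stream) {
      // Store the promise so concurrent subscribers share one pending session.
      const sessionPromise = Promise.resolve(this.openSession(languageCode));
      const created = { sessionPromise, listeners: new Set() };
      this.streams.set(languageCode, created);
      // If the session fails to open, forget it so a later subscriber can retry.
      sessionPromise.catch(() => {
        if (this.streams.get(languageCode) === created) {
          this.streams.delete(languageCode);
        }
      });
      stream = created;
    }
    stream.listeners.add(attendeeId);
    return stream.sessionPromise;
  }

  // Removes a listener and closes the session when nobody is left.
  async unsubscribe(languageCode, attendeeId) {
    const stream = this.streams.get(languageCode);
    if (!stream) return;
    stream.listeners.delete(attendeeId);
    if (stream.listeners.size === 0) {
      this.streams.delete(languageCode);
      const session = await stream.sessionPromise;
      if (typeof session.close === "function") session.close();
    }
  }

  listenerCount(languageCode) {
    return this.streams.get(languageCode)?.listeners.size ?? 0;
  }
}
```

With this shape, fifty Spanish listeners trigger a single call to `openSession("es")`, and the session is closed once the last of them leaves.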

[Image: Official Google visual showing speech translation in a video meeting.] Google says Meet will use Gemini 3.5 Live Translate in private preview before broader rollout.

Rollout Across Google Products

The launch is not only an API announcement. Google says Gemini 3.5 Live Translate is rolling out through three surfaces: public preview for developers through the Gemini Live API and AI Studio, private preview for Google Meet enterprise customers, and the Google Translate app on Android and iOS.

Surface	Status from Google	Developer takeaway
Gemini Live API	Public preview for developers.	Best place to build and test custom realtime translation flows.
Google AI Studio	Available for trying model capabilities.	Fastest way to test before wiring a media stack.
Google Meet	Private preview for selected Workspace customers, broader rollout later.	Shows the model is aimed at live meeting translation, not only offline batch dubbing.
Google Translate app	Rolling out globally on Android and iOS.	Good reference for UX expectations around headphones, listening mode, and natural voice output.

Build Checklist

If you are building with Live Translate this week, start with the media pipeline and failure modes before you polish the interface.

1. Start in Google AI Studio to test target languages.
2. Use gemini-3.5-live-translate-preview behind a feature flag.
3. Capture microphone audio and convert to 16kHz mono PCM.
4. Send roughly 100ms chunks over the Live API session.
5. Request input and output transcripts for debugging.
6. Keep API keys on the server; use ephemeral tokens for browser clients.
7. Decide whether target language is locked server-side or user-selectable.
8. Test accents, background music, overlapping speakers, long pauses, and rapid language switches.
9. Add visible latency and transcript status in the UI.
10. Disclose AI-generated translated audio, and tell users that Google says the audio carries a SynthID watermark.
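
For checklist item 2, a feature flag can be as simple as a function that resolves settings from the environment. The sketch below is one possible shape; the variable names `LIVE_TRANSLATE_ENABLED`, `LIVE_TRANSLATE_MODEL`, and `LIVE_TRANSLATE_TARGET` are hypothetical, and the English fallback mirrors the default target language described earlier.

```javascript
const DEFAULT_TRANSLATE_MODEL = "gemini-3.5-live-translate-preview";

// Resolve Live Translate settings from an environment-like object.
// All variable names here are hypothetical.
function resolveTranslateConfig(env) {
  return {
    // The feature stays off unless the flag is explicitly set to "true".
    enabled: env.LIVE_TRANSLATE_ENABLED === "true",
    // Allow swapping the preview model ID without a code change.
    model: env.LIVE_TRANSLATE_MODEL || DEFAULT_TRANSLATE_MODEL,
    targetLanguageCode: env.LIVE_TRANSLATE_TARGET || "en",
  };
}
```

Keeping the model ID in configuration means a renamed or retired preview model only requires a config change, not a redeploy of application logic.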

FAQ

What is Gemini 3.5 Live Translate?

Gemini 3.5 Live Translate is Google's audio model for near realtime speech-to-speech translation. Developers use it through the Gemini Live API with the `gemini-3.5-live-translate-preview` model.

Is Live Translate the same as a Gemini Live Agent?

No. Live Translation is an interpreter pipeline. It does not support tools, function calling, free-form instructions, text input, or general agent behavior in the translation mode.

What audio format does the API expect?

The docs specify raw little-endian 16-bit PCM audio at 16kHz mono for input and raw 16-bit PCM at 24kHz mono for the translated output. Google recommends sending input in roughly 100ms chunks.

Can a browser app call Live Translate directly?

Yes, but not with an embedded API key. Use ephemeral tokens for client-side applications, and lock the translation configuration on the server where possible so a browser client cannot tamper with model or language settings.

Should I use this for production today?

Treat it as a preview capability. It is useful for prototypes and controlled pilots, but production apps need latency testing, fallback UX, privacy review, audio quality checks, and limits around voice consistency.

Official Sources and Links

Gemini 3.5 Flash guide: The broader Gemini 3.5 developer model context.
Gemini CLI migration: How Google is moving coding workflows into Antigravity CLI.
Gemini CLI setup: Useful background for Gemini API and CLI workflows.