Gemini 3.5 Live Translate: API Guide for Realtime Voice Translation

Gemini 3.5 Live Translate is Google's preview Live API model for low-latency speech-to-speech translation across 70+ languages. This guide explains what launched, how the API mode works, what audio format it expects, where it is limited, and how to design the first production-safe prototype.

Official Google launch image for Gemini 3.5 Live Translate. The implementation details below come from the Gemini Live API docs and Google's official launch post.

What Google Launched

Gemini 3.5 Live Translate is not a chat model with a translation prompt. It is a dedicated Live API translation mode: stream 16kHz PCM speech in, choose a target language, and receive translated 24kHz audio plus optional transcripts from gemini-3.5-live-translate-preview.

- Model ID: `gemini-3.5-live-translate-preview`
- Launch: June 9, 2026
- Languages: 70+ supported
- Input: Audio only
- Output: Translated audio
- Status: Public preview for developers

This guide is based on Google's launch post, the Gemini Live API docs, the official Google AI Developers thread, Google's LiveKit example, and the Gemini 3.5 Audio model card. Community reactions are intentionally left out so the API details stay tied to primary sources.

What Changed for Developers

Google's launch post positions Gemini 3.5 Live Translate as an audio model for live speech-to-speech translation. The developer-facing shift is that realtime translation is now exposed through the Gemini Live API and Google AI Studio, not only inside end-user Google products.

| Area | What changed | Developer impact |
| --- | --- | --- |
| Developer access | Gemini 3.5 Live Translate is available in public preview through the Gemini Live API and Google AI Studio. | Developers can prototype speech-to-speech translation without waiting for a separate product surface. |
| Model ID | The Live API translation model is `gemini-3.5-live-translate-preview`. | Treat it as a preview model and isolate it behind config flags before production rollout. |
| Interaction model | Live Translation behaves like a realtime interpreter, not a conversational Live Agent. | Do not design prompts, tools, function calls, or turn-taking flows around this mode. |
| Audio pipeline | Input is audio-only raw PCM at 16kHz; output is translated audio at 24kHz. | Your product needs capture, resampling, buffering, playback, and transcript handling. |
| Safety signal | Google says model-generated audio is watermarked with SynthID. | Apps using generated audio should disclose AI audio and preserve provenance expectations. |

Official X Thread and Video

The Google AI Developers launch thread is useful because it frames the developer capabilities in product terms: multilingual input, automatic language detection, native audio processing, and robustness in noisy environments.

The thread includes Google's official launch video. The important takeaway for builders is not only that translation is faster. The product surface is designed for continuous speech, where the system stays close to the speaker instead of waiting for a complete turn.

Mental Model: Live Agent vs. Live Translation

The Gemini Live API can support realtime agent interactions, but Live Translation is a narrower mode. Google's docs describe it as an interpreter pipeline. That distinction changes the whole product design.

| Dimension | Live Agent | Live Translation |
| --- | --- | --- |
| Role | Assistant that listens, reasons, and can act. | Interpreter pipeline for speech-to-speech translation. |
| Interaction | Turn-based realtime conversation. | Continuous stream processing while the speaker talks. |
| Tools | Can use Live API tool and agent capabilities. | Translation-only; no tools or instructions. |
| Inputs | Text, audio, video, image depending on feature. | Audio input only, to keep translation latency low. |
| Main config | Generation, speech, tools, and instructions. | `targetLanguageCode` plus `echoTargetLanguage`. |

The practical implication: do not prompt Live Translate like a multilingual assistant. Build a media pipeline, not a chatbot. The API surface is about audio chunks, language codes, transcripts, and output playback.

Smallest API Shape

The docs show Python, JavaScript, and raw WebSocket options. For most web teams, the JavaScript SDK shape is the clearest starting point, but client-side apps should still use ephemeral tokens instead of exposing an API key.

import { GoogleGenAI, Modality } from "@google/genai";

const ai = new GoogleGenAI({});

const session = await ai.live.connect({
  model: "gemini-3.5-live-translate-preview",
  config: {
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},
    outputAudioTranscription: {},
    translationConfig: {
      targetLanguageCode: "es",
      echoTargetLanguage: false,
    },
  },
  callbacks: {
    onmessage: (message) => {
      const content = message.serverContent;
      const transcript = content?.outputTranscription?.text;
      const translatedAudio = content?.modelTurn?.parts?.find((part) => part.inlineData);

      if (transcript) console.log("Translated transcript:", transcript);
      if (translatedAudio) {
        // Decode and play the translated PCM audio chunk.
      }
    },
  },
});

| Field | Value | Why it matters |
| --- | --- | --- |
| `model` | `gemini-3.5-live-translate-preview` | Use the preview Live Translate model. |
| `responseModalities` | `AUDIO` | The API returns translated audio chunks. |
| `inputAudioTranscription` | object | Optional input transcript stream. |
| `outputAudioTranscription` | object | Optional translated transcript stream. |
| `targetLanguageCode` | BCP-47 code | Target output language, such as `pl`, `es`, or `ja`. Defaults to English. |
| `echoTargetLanguage` | boolean | When true, target-language input is echoed; when false, the model stays silent for target-language speech. |
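
The `onmessage` handler above leaves decoding as a comment. A minimal sketch of that step follows, assuming each audio part carries base64-encoded 16-bit little-endian PCM at 24kHz in `inlineData.data`. The helper names `decodePcm16Base64` and `chunkDurationMs` are illustrative and not part of the SDK.

```javascript
const OUTPUT_SAMPLE_RATE = 24000; // Hz, per the output format described in this guide

// Convert one base64-encoded chunk of 16-bit little-endian PCM into
// Float32 samples in the range [-1, 1), which Web Audio buffers expect.
function decodePcm16Base64(base64) {
  const binary = atob(base64); // global in browsers and in Node.js 16+
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(bytes.length >> 1); // 2 bytes per sample
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return samples;
}

// Playback duration of a decoded chunk, useful for scheduling and drift checks.
function chunkDurationMs(samples) {
  return (samples.length * 1000) / OUTPUT_SAMPLE_RATE;
}
```

In a browser, the returned samples could be copied into an `AudioBuffer` created with a 24000 Hz sample rate and scheduled back to back. How to queue chunks without gaps is left to the app.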

Audio Contract: PCM In, Translated Audio Out

The Live Translate docs are explicit about the media contract. Input audio must be raw, little-endian, 16-bit PCM at 16kHz mono. Output audio is raw 16-bit PCM at 24kHz mono. Google recommends 100ms chunks for low-latency streaming.
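
To make the recommended chunk size concrete: at 16kHz, 16-bit mono, 100ms of audio is 1,600 samples, or 3,200 bytes. The sketch below derives that number and splits a byte buffer into full chunks. The `splitIntoChunks` helper is illustrative only and assumes the bytes are already in the input format described above.

```javascript
const INPUT_SAMPLE_RATE = 16000; // Hz
const BYTES_PER_SAMPLE = 2; // 16-bit PCM
const CHUNK_MS = 100; // recommended chunk duration

// 16000 samples/s * 0.1 s * 2 bytes/sample = 3200 bytes per chunk
const CHUNK_BYTES = ((INPUT_SAMPLE_RATE * CHUNK_MS) / 1000) * BYTES_PER_SAMPLE;

// Split a Uint8Array of PCM bytes into full-size chunks. Any trailing bytes
// that do not fill a chunk are returned as `remainder` so the caller can
// prepend them to the next capture buffer.
function splitIntoChunks(pcmBytes, chunkBytes = CHUNK_BYTES) {
  const chunks = [];
  let offset = 0;
  while (offset + chunkBytes <= pcmBytes.length) {
    chunks.push(pcmBytes.subarray(offset, offset + chunkBytes));
    offset += chunkBytes;
  }
  return { chunks, remainder: pcmBytes.subarray(offset) };
}
```

Carrying the remainder forward keeps every chunk aligned to a whole number of samples, which avoids splitting a 16-bit sample across two messages.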

// Browser microphone audio usually needs conversion before sending.
// Target input for Live Translate:
// - raw PCM
// - 16-bit
// - little-endian
// - mono
// - 16kHz sample rate
// - roughly 100ms chunks

// Assumes pcm16MonoChunk is a Node.js Buffer holding one chunk in that format.
session.sendRealtimeInput({
  audio: {
    data: pcm16MonoChunk.toString("base64"),
    mimeType: "audio/pcm;rate=16000",
  },
});

That means the hard part of a real app is often not the API call. It is capture, resampling, voice activity handling, buffering, playback drift, and UI feedback when the network or microphone gets rough.
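
As one example of the resampling step, browser capture typically yields Float32 samples at 44.1kHz or 48kHz rather than 16kHz PCM. The sketch below downsamples with linear interpolation and converts to 16-bit integers. It is a simplified illustration with no anti-aliasing filter, so a production pipeline would likely use a proper resampler, and `downsampleToPcm16` is a hypothetical helper name.

```javascript
// Downsample Float32 audio in [-1, 1] to 16-bit PCM at the target rate.
// Linear interpolation only: simple, but it does not low-pass filter first.
function downsampleToPcm16(input, inputRate, targetRate = 16000) {
  if (targetRate > inputRate) {
    throw new RangeError("This helper only downsamples.");
  }
  const outLength = Math.floor((input.length * targetRate) / inputRate);
  const out = new Int16Array(outLength);

  for (let i = 0; i < outLength; i++) {
    // Fractional position of output sample i in the input signal.
    const pos = (i * inputRate) / targetRate;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    const sample = input[left] * (1 - frac) + input[right] * frac;

    // Clamp, then scale to the asymmetric int16 range [-32768, 32767].
    const clamped = Math.max(-1, Math.min(1, sample));
    out[i] = Math.round(clamped < 0 ? clamped * 32768 : clamped * 32767);
  }
  return out;
}
```

`Int16Array` uses the platform's byte order, which is little-endian on essentially all consumer hardware. Code that must be strictly portable could write samples through a `DataView` with an explicit little-endian flag instead.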

Client Security with Ephemeral Tokens

Google's docs recommend ephemeral tokens for client-to-server applications so browser clients do not expose the API key. For translation, the safer default is to lock translationConfig in the token constraints on the server.

| Choice | Use when | Risk |
| --- | --- | --- |
| Lock target language on server | Kiosk, classroom, broadcast, support room, meeting workflow. | Less flexible, but the client cannot tamper with translation settings. |
| Unlock target language on client | User must choose language dynamically in the browser. | Requires stricter validation, logging, and abuse controls. |

A production design should keep the API key server-side, mint short-lived tokens, limit allowed models, constrain target languages where possible, and log enough metadata to debug latency without storing sensitive raw audio unnecessarily.
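
If the target language is user-selectable, the server can still validate the requested code before minting a token. The following is a minimal sketch of that check, not Google's token API: the allowlist contents and the `resolveTargetLanguage` helper are hypothetical, and the token-minting call itself is deliberately left out.

```javascript
// Hypothetical allowlist: the target languages this deployment has approved.
const ALLOWED_TARGET_LANGUAGES = ["es", "pl", "ja", "pt-BR"];

// BCP-47 tags compare case-insensitively, so index them by lowercase form.
const CANONICAL_BY_LOWERCASE = new Map(
  ALLOWED_TARGET_LANGUAGES.map((code) => [code.toLowerCase(), code])
);

// Validate a client-supplied language code before it reaches translationConfig.
function resolveTargetLanguage(requested) {
  if (typeof requested !== "string" || requested.trim() === "") {
    return { ok: false, reason: "missing" };
  }
  const canonical = CANONICAL_BY_LOWERCASE.get(requested.trim().toLowerCase());
  if (!canonical) {
    return { ok: false, reason: "not-allowed" };
  }
  return { ok: true, code: canonical };
}
```

Returning the canonical form from the allowlist, rather than echoing the client's string, means only server-defined values ever reach the session config.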

Limitations You Should Design Around

The launch framing is strong, but the official docs also list practical caveats. These limitations are exactly where a polished app needs UX support.

| Limitation | Official caveat | Product response |
| --- | --- | --- |
| Audio only | Translation mode does not accept text input. | Keep text translation, chat, and function calling in separate flows. |
| Voice consistency | Voices can shift after long pauses or rapid speaker changes. | Do not promise perfect speaker identity preservation. |
| Language detection | Heavy accents, similar languages, and fast language switches can affect the input transcript. | Show transcript confidence and let users correct language when needed. |
| Background audio | Noise and music are filtered, but not every background signal is ignored. | Test real rooms, cars, crowds, and cheap microphones. |
| Echo artifacts | `echoTargetLanguage: true` can introduce artifacts when target-language input contains background audio. | Default to false unless your UX really needs echoing. |

Reference Architecture

Google's example app shows a useful broadcast pattern with LiveKit: the organizer publishes audio, a translation bridge subscribes, one Gemini Live API session is created per target language, and attendees subscribe to the translated audio track for their chosen language.

Organizer microphone
  -> realtime room audio
  -> translation bridge per target language
  -> Gemini Live API translationConfig
  -> translated 24kHz audio
  -> attendee playback + optional transcript

The demo's most important scaling idea is session sharing. If fifty attendees choose Spanish, they should not create fifty identical Gemini sessions. A bridge can publish one Spanish translation stream that all Spanish listeners share.
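
The session-sharing idea can be sketched as a small reference-counted registry keyed by target language. This is an illustration of the pattern rather than code from Google's example app. `TranslationSessionRegistry` and the `createSession` factory are hypothetical, and the factory is assumed to return a session object, or a promise of one, with a `close()` method.

```javascript
// Keeps at most one translation session per target language and
// closes it when the last listener for that language leaves.
class TranslationSessionRegistry {
  constructor(createSession) {
    this.createSession = createSession; // hypothetical factory: (languageCode) => session
    this.entries = new Map(); // languageCode -> { session, listeners }
  }

  // Called when an attendee picks a language. Reuses the existing session if any.
  acquire(languageCode) {
    let entry = this.entries.get(languageCode);
    if (!entry) {
      entry = { session: this.createSession(languageCode), listeners: 0 };
      this.entries.set(languageCode, entry);
    }
    entry.listeners += 1;
    return entry.session;
  }

  // Called when an attendee leaves or switches language.
  release(languageCode) {
    const entry = this.entries.get(languageCode);
    if (!entry) return;
    entry.listeners -= 1;
    if (entry.listeners <= 0) {
      this.entries.delete(languageCode);
      // Works whether the factory returned a session or a promise of one.
      Promise.resolve(entry.session).then((session) => session.close?.());
    }
  }

  listenerCount(languageCode) {
    return this.entries.get(languageCode)?.listeners ?? 0;
  }
}
```

Storing whatever the factory returns, including a pending promise, means that attendees who join while a session is still connecting share that same pending session instead of opening duplicates.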

Official Google visual for speech translation in meetings. Google says Meet will use 3.5 Live Translate in private preview before broader rollout.

Rollout Across Google Products

The launch is not only an API announcement. Google says Gemini 3.5 Live Translate is rolling out through three surfaces: public preview for developers through the Gemini Live API and AI Studio, private preview for Google Meet enterprise customers, and the Google Translate app on Android and iOS.

| Surface | Status from Google | Developer takeaway |
| --- | --- | --- |
| Gemini Live API | Public preview for developers. | Best place to build and test custom realtime translation flows. |
| Google AI Studio | Available for trying model capabilities. | Fastest way to test before wiring a media stack. |
| Google Meet | Private preview for selected Workspace customers, broader rollout later. | Shows the model is aimed at live meeting translation, not offline batch dubbing only. |
| Google Translate app | Rolling out globally on Android and iOS. | Good reference for UX expectations around headphones, listening mode, and natural voice output. |

Build Checklist

If you are building with Live Translate this week, start with the media pipeline and failure modes before you polish the interface.

1. Start in Google AI Studio to test target languages.
2. Use gemini-3.5-live-translate-preview behind a feature flag.
3. Capture microphone audio and convert to 16kHz mono PCM.
4. Send roughly 100ms chunks over the Live API session.
5. Request input and output transcripts for debugging.
6. Keep API keys on the server; use ephemeral tokens for browser clients.
7. Decide whether target language is locked server-side or user-selectable.
8. Test accents, background music, overlapping speakers, long pauses, and rapid language switches.
9. Add visible latency and transcript status in the UI.
10. Disclose AI-generated translated audio and preserve SynthID expectations.
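
Checklist item 2 can be sketched as a small config resolver, so the preview model ID lives in one place and can be switched off without a code change. The environment variable names `LIVE_TRANSLATE_ENABLED` and `LIVE_TRANSLATE_MODEL` are hypothetical. Only the default model ID comes from this guide.

```javascript
const DEFAULT_LIVE_TRANSLATE_MODEL = "gemini-3.5-live-translate-preview";

// Resolve the feature flag and model ID from an env-like object.
// The feature stays off unless the flag is exactly the string "true".
function resolveLiveTranslateConfig(env) {
  const enabled = env.LIVE_TRANSLATE_ENABLED === "true";
  if (!enabled) {
    return { enabled: false, model: null };
  }
  return {
    enabled: true,
    // Allow overriding the model ID when the preview is renamed or replaced.
    model: env.LIVE_TRANSLATE_MODEL || DEFAULT_LIVE_TRANSLATE_MODEL,
  };
}
```

On a Node.js server this could be called as `resolveLiveTranslateConfig(process.env)`, with the UI hiding live translation whenever `enabled` is false.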

FAQ

What is Gemini 3.5 Live Translate?

Gemini 3.5 Live Translate is Google's audio model for near realtime speech-to-speech translation. Developers use it through the Gemini Live API with the `gemini-3.5-live-translate-preview` model.

Is Live Translate the same as a Gemini Live Agent?

No. Live Translation is an interpreter pipeline. It does not support tools, function calling, free-form instructions, text input, or general agent behavior in the translation mode.

What audio format does the API expect?

The docs specify raw little-endian 16-bit PCM audio at 16kHz mono for input and raw 16-bit PCM at 24kHz mono for the translated output. Google recommends sending input in roughly 100ms chunks.

Can a browser app call Live Translate directly?

Yes, provided it authenticates with ephemeral tokens rather than an embedded API key. The docs recommend locking the translation configuration on the server so a browser client cannot tamper with model or language settings.

Should I use this for production today?

Treat it as a preview capability. It is useful for prototypes and controlled pilots, but production apps need latency testing, fallback UX, privacy review, audio quality checks, and limits around voice consistency.
