Google Vertex AI

This page owns provider-specific configuration examples for Google Vertex AI routes.

Current Runtime Boundary

The gateway uses one gcp_vertex provider type for multiple Vertex publisher families:

google/* chat upstream models use Vertex generateContent and streamGenerateContent
supported google/* embedding upstream models are served through the public /v1/embeddings path, using Vertex :predict for the text-embedding models and :embedContent for google/gemini-embedding-2
anthropic/* upstream models use Anthropic-on-Vertex rawPredict and streamRawPredict
/v1/responses is not implemented for gcp_vertex routes in this slice

Vertex routes require Google Cloud authentication with the https://www.googleapis.com/auth/cloud-platform scope. The provider supports Application Default Credentials, service-account JSON from a mounted path, and static bearer tokens for constrained environments.

The auth.mode: service_account examples on this page configure upstream Google Cloud credentials that the gateway uses when it calls Vertex. They are not gateway service accounts, do not grant callers access to /v1/*, and do not participate in gateway team service-account management.

Provider

yaml

providers:
  - id: vertex-global
    type: gcp_vertex
    project_id: env.GCP_PROJECT_ID
    location: global
    auth:
      mode: adc
    display:
      label: Google Vertex AI
      icon_key: vertexai

api_host is optional. When omitted, the gateway uses aiplatform.googleapis.com, which is the right default for the global endpoint. For Vertex multi-region endpoints, set api_host explicitly to aiplatform.us.rep.googleapis.com or aiplatform.eu.rep.googleapis.com. For a regional endpoint, set it to the regional Vertex host such as us-east5-aiplatform.googleapis.com. Anthropic-on-Vertex pricing is currently supported only for location: global.
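
The host rules above can be sketched as a small helper for choosing an api_host value when writing provider config. This is an illustrative sketch only: the function name suggested_api_host is hypothetical and is not part of the gateway, and it assumes that us and eu are the only multi-region locations, as in the examples on this page.

```python
def suggested_api_host(location: str) -> str:
    """Return the Vertex host this page recommends for a location (hypothetical helper)."""
    if location == "global":
        # Matches the gateway default when api_host is omitted.
        return "aiplatform.googleapis.com"
    if location in ("us", "eu"):
        # Multi-region endpoints use the rep.googleapis.com hosts.
        return f"aiplatform.{location}.rep.googleapis.com"
    # Anything else is treated as a regional location such as us-east5.
    return f"{location}-aiplatform.googleapis.com"


for loc in ("global", "eu", "us-east5"):
    print(loc, "->", suggested_api_host(loc))
```

The gateway itself only defaults the global host; for every other location the value still has to be written into api_host explicitly.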

Service-account and bearer examples:

yaml

providers:
  - id: vertex-service-account
    type: gcp_vertex
    project_id: env.GCP_PROJECT_ID
    location: us
    api_host: aiplatform.us.rep.googleapis.com
    auth:
      mode: service_account
      credentials_path: /var/run/secrets/gcp/service-account.json

  - id: vertex-bearer
    type: gcp_vertex
    project_id: env.GCP_PROJECT_ID
    location: us-central1
    api_host: us-central1-aiplatform.googleapis.com
    auth:
      mode: bearer
      token: env.GCP_VERTEX_ACCESS_TOKEN

For service-account JSON:

provision the Google service account in the target project
grant the least-privilege Vertex AI permissions needed for the configured models
mount the JSON as a file and point credentials_path at that mounted path
rotate the JSON key, or migrate to ADC or workload identity, outside the gateway, then restart or reload the gateway path that reads it

Do not put the JSON document itself in gateway.yaml. Use a mounted secret path or a runtime identity mechanism such as ADC.
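
Before pointing credentials_path at a mounted file, a deployment script can run a quick sanity check. The sketch below is an assumption about what such a preflight might look like, not something the gateway ships: check_service_account_file is a hypothetical name, and the fields it checks are the ones normally present in a Google service-account key file.

```python
import json
from pathlib import Path

# Fields normally present in a Google service-account key file.
REQUIRED_KEYS = ("type", "project_id", "client_email", "private_key")


def check_service_account_file(path: str) -> list:
    """Return a list of problems found in the mounted key file (empty means OK)."""
    file = Path(path)
    if not file.is_file():
        return [f"{path} is not a readable file"]
    try:
        doc = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if not doc.get(key)]
    if doc.get("type") and doc["type"] != "service_account":
        problems.append(f"unexpected type: {doc['type']}")
    return problems
```

The check never prints the key material, so it is safe to run in deployment logs.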

Model Identity

Use publisher-qualified upstream_model values:

Google models: google/<model-id>
Anthropic models: anthropic/<model-id>

The publisher prefix selects the request mapper and pricing family. The model ID after the slash is passed to the Vertex endpoint path.
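
As a rough illustration of how the prefix and model ID end up in a request, the sketch below splits an upstream_model value and builds a chat endpoint URL. The function name vertex_chat_endpoint is hypothetical, and the URL layout follows the public Vertex REST pattern for publisher models; whether the gateway assembles the URL exactly this way is an assumption. Embedding routes are left out.

```python
def vertex_chat_endpoint(api_host: str, project_id: str, location: str,
                         upstream_model: str, stream: bool) -> str:
    """Build a Vertex chat endpoint URL from a publisher-qualified model (sketch)."""
    publisher, sep, model_id = upstream_model.partition("/")
    if not sep or not model_id:
        raise ValueError("upstream_model must be publisher-qualified")
    if publisher == "google":
        method = "streamGenerateContent" if stream else "generateContent"
    elif publisher == "anthropic":
        method = "streamRawPredict" if stream else "rawPredict"
    else:
        raise ValueError(f"unsupported publisher: {publisher}")
    # The model ID after the slash goes into the endpoint path unchanged.
    return (
        f"https://{api_host}/v1/projects/{project_id}/locations/{location}"
        f"/publishers/{publisher}/models/{model_id}:{method}"
    )


print(vertex_chat_endpoint("aiplatform.googleapis.com", "my-project", "global",
                           "anthropic/claude-opus-4-7", stream=False))
```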

Examples verified against Anthropic and Google Cloud docs on 2026-05-01:

Use case	Gateway model id	Vertex `upstream_model`	Notes
Latest high-capability Claude	`claude-opus-vertex`	`anthropic/claude-opus-4-7`	Claude Opus 4.7 is available through Anthropic-on-Vertex and supports adaptive thinking.
Claude coding and agent workloads	`claude-sonnet-vertex`	`anthropic/claude-sonnet-4-6`	Claude Sonnet 4.6 supports adaptive thinking with effort.
Older pinned Claude	`claude-sonnet-45-vertex`	`anthropic/claude-sonnet-4-5@20250929`	Versioned Anthropic model IDs use the `@YYYYMMDD` suffix on Vertex.
Gemini chat	`gemini-flash-vertex`	`google/gemini-2.0-flash`	Uses the Vertex Google publisher request shape.
Gemini embeddings	`gemini-embedding-vertex`	`google/gemini-embedding-001`	Uses Vertex text embeddings `:predict` through `/v1/embeddings`.
Gemini Embedding 2	`gemini-embedding-2-vertex`	`google/gemini-embedding-2`	Uses Vertex `:embedContent` for text-only OpenAI-compatible embeddings.
Vertex text embeddings	`text-embedding-vertex`	`google/text-embedding-005`	Older text embedding model using the Vertex text-embedding `:predict` contract.
Vertex multilingual embeddings	`text-multilingual-embedding-vertex`	`google/text-multilingual-embedding-002`	Multilingual text embedding model using the Vertex text-embedding `:predict` contract.

Google documents that Claude model availability varies by endpoint and region. Prefer global when your residency policy allows it; use us, eu, or a regional location when you need a geography-specific processing boundary.

Claude Example

Anthropic-on-Vertex uses the Anthropic Messages body shape with Vertex transport requirements:

the model stays in the endpoint path, not the JSON request body
the body includes anthropic_version: "vertex-2023-10-16"
non-streaming requests use rawPredict
streaming requests use streamRawPredict

yaml

models:
  - id: claude-opus-vertex
    description: Claude Opus on Google Vertex AI
    tags: [vertex, claude, reasoning]
    routes:
      - provider: vertex-global
        upstream_model: anthropic/claude-opus-4-7
        capabilities:
          chat_completions: true
          responses: false
          embeddings: false
          stream: true
          tools: true
          vision: false
          json_schema: false

Native Claude invocation requires max_tokens. If callers omit it, the gateway currently supplies max_tokens: 1024 for Anthropic-on-Vertex routes.
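
The transport requirements and the max_tokens default described in this section can be sketched as one transformation. This is a minimal sketch under stated assumptions: to_vertex_anthropic_request is a hypothetical name, and the input is assumed to be an Anthropic Messages-shaped body that still carries a model field.

```python
import copy

ANTHROPIC_VERSION = "vertex-2023-10-16"
DEFAULT_MAX_TOKENS = 1024  # default this page says the gateway currently supplies


def to_vertex_anthropic_request(messages_body: dict) -> tuple:
    """Return (model_id, method, body) for an Anthropic-on-Vertex call (sketch)."""
    body = copy.deepcopy(messages_body)
    # The model travels in the endpoint path, so it leaves the JSON body.
    model_id = body.pop("model")
    body["anthropic_version"] = ANTHROPIC_VERSION
    # Native Claude invocation requires max_tokens; fill it in only when omitted.
    body.setdefault("max_tokens", DEFAULT_MAX_TOKENS)
    method = "streamRawPredict" if body.get("stream") else "rawPredict"
    return model_id, method, body
```

Deep-copying the input keeps the caller's request untouched, which matters when the same request may be retried on another route.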

Anthropic-on-Vertex routes can enable tools: true when the upstream Claude model supports tool use. The gateway maps OpenAI Chat Completions function tools, assistant tool_calls, tool-result continuations, and streaming tool-use deltas to and from the Anthropic Messages shape used by Vertex. Keep vision: false unless you have tested image/document content blocks for the exact route; the Anthropic-on-Vertex mapper still rejects non-text content blocks in this slice.

Claude Thinking Compatibility

For Anthropic-on-Vertex, OpenAI-shaped reasoning_effort maps to Anthropic Messages output_config.effort without forwarding the OpenAI-only field. The gateway also applies model-aware thinking policy before sending the Vertex request.

Adaptive example for Claude Opus 4.7:

json

{
  "anthropic_version": "vertex-2023-10-16",
  "max_tokens": 16000,
  "thinking": {
    "type": "adaptive"
  },
  "output_config": {
    "effort": "xhigh"
  },
  "messages": [
    {
      "role": "user",
      "content": "Review this implementation plan."
    }
  ]
}

Gateway callers can request the same shape with OpenAI-compatible fields:

json

{
  "model": "claude-opus-vertex",
  "max_tokens": 16000,
  "reasoning_effort": "xhigh",
  "messages": [
    {
      "role": "user",
      "content": "Review this implementation plan."
    }
  ]
}

The gateway sends thinking: { "type": "adaptive" } and output_config.effort upstream, and removes reasoning_effort.
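
The mapping between the two request shapes above can be sketched as follows. The function name map_adaptive_effort is hypothetical, and the sketch passes messages through unchanged, which only holds for simple text messages like the one in the example; the real mapper also translates message content.

```python
def map_adaptive_effort(request: dict) -> dict:
    """Map an OpenAI-shaped request to the adaptive Anthropic-on-Vertex body (sketch)."""
    # Drop the gateway model id and the OpenAI-only field.
    body = {k: v for k, v in request.items() if k not in ("model", "reasoning_effort")}
    effort = request.get("reasoning_effort")
    if effort is not None:
        body["thinking"] = {"type": "adaptive"}
        body["output_config"] = {"effort": effort}
    body["anthropic_version"] = "vertex-2023-10-16"
    return body


upstream = map_adaptive_effort({
    "model": "claude-opus-vertex",
    "max_tokens": 16000,
    "reasoning_effort": "xhigh",
    "messages": [{"role": "user", "content": "Review this implementation plan."}],
})
print(sorted(upstream))
```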

Model behavior:

Model family	Gateway behavior
Claude Opus 4.7 and later	`reasoning_effort` or `reasoning.effort` maps to `thinking: { "type": "adaptive" }` plus `output_config.effort`. Manual `thinking.type: "enabled"` and `budget_tokens` are rejected. Non-default `temperature`, `top_p`, and `top_k` are rejected; default `temperature: 1` and `top_p: 1` are omitted.
Claude Opus 4.6 and Claude Sonnet 4.6	`reasoning_effort` maps to adaptive thinking and `output_config.effort`. Caller-supplied manual budgets remain pass-through because Anthropic still accepts them, but they are deprecated upstream.
Claude Mythos Preview	Adaptive thinking is the default when `thinking` is unset. `reasoning_effort` maps to `output_config.effort`; `thinking.type: "disabled"` is rejected.
Claude Opus 4.5	Adaptive thinking is rejected. `reasoning_effort` maps to `output_config.effort` only when a manual thinking budget is also supplied.
Claude Sonnet/Haiku 4.5 and older Claude models	Adaptive thinking is rejected. These models require an explicit manual budget from `reasoning.budget_tokens`, `reasoning_budget_tokens`, `thinking_budget_tokens`, or caller-supplied `thinking.type: "enabled"` with `budget_tokens`; the gateway does not add `output_config.effort`.

Manual budget example for an older Claude model:

json

{
  "model": "claude-sonnet-45-vertex",
  "max_tokens": 8192,
  "reasoning": {
    "effort": "medium",
    "budget_tokens": 2048
  },
  "messages": [
    {
      "role": "user",
      "content": "Analyze this migration risk."
    }
  ]
}

For Claude Sonnet 4.5, the gateway sends manual thinking.type: "enabled" with budget_tokens and omits output_config.effort. For Claude Opus 4.5, it sends the manual budget and output_config.effort.
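
The difference between the two 4.5 families can be sketched as a small function. The family labels "sonnet-4.5" and "opus-4.5" and the name map_manual_budget are hypothetical, and the sketch covers only the case where a manual budget is supplied.

```python
def map_manual_budget(family: str, reasoning: dict) -> dict:
    """Return the thinking-related upstream fields for a manual-budget request (sketch)."""
    budget = reasoning.get("budget_tokens")
    if budget is None:
        raise ValueError("this sketch only covers requests with a manual budget")
    fields = {"thinking": {"type": "enabled", "budget_tokens": budget}}
    # Only Claude Opus 4.5 also receives output_config.effort alongside the budget.
    if family == "opus-4.5" and "effort" in reasoning:
        fields["output_config"] = {"effort": reasoning["effort"]}
    return fields


reasoning = {"effort": "medium", "budget_tokens": 2048}
print(map_manual_budget("sonnet-4.5", reasoning))
print(map_manual_budget("opus-4.5", reasoning))
```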

Chat Completions hides Claude thinking from normal content and delta.content. Native Anthropic thinking, redacted_thinking, thinking_delta, and signature_delta blocks are preserved under provider_metadata.gcp_vertex.reasoning for debugging and provider continuity. The gateway does not yet rehydrate that provider metadata into future Anthropic content blocks when callers send tool results. Anthropic documents that tool-use continuations with thinking may require complete unmodified thinking blocks, so gateway-managed replay remains tracked by issue #140.

Gemini Example

Google publisher routes use Vertex generateContent and streamGenerateContent.

yaml

models:
  - id: gemini-flash-vertex
    description: Gemini Flash on Google Vertex AI
    tags: [vertex, gemini]
    routes:
      - provider: vertex-global
        upstream_model: google/gemini-2.0-flash
        capabilities:
          chat_completions: true
          responses: false
          embeddings: false
          stream: true
          tools: false
          vision: true
          json_schema: false

Vertex Google multimodal inputs currently accept gs:// image and file URIs through OpenAI-compatible typed content. Inline/base64 data and remote HTTP URLs are not supported in this gateway slice.
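
Callers can screen requests for unsupported media references before sending them. The sketch below is a client-side check, not gateway code: unsupported_image_urls is a hypothetical name, and it inspects only the standard OpenAI image_url content part, since the exact typed shape the gateway accepts for file URIs is not shown on this page.

```python
def unsupported_image_urls(messages: list) -> list:
    """Collect image URLs that are not gs:// URIs (client-side sketch)."""
    rejected = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no media parts
        for part in content:
            if part.get("type") != "image_url":
                continue
            url = part.get("image_url", {}).get("url", "")
            if not url.startswith("gs://"):
                rejected.append(url)
    return rejected


messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe these images."},
        {"type": "image_url", "image_url": {"url": "gs://my-bucket/diagram.png"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ],
}]
print(unsupported_image_urls(messages))
```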

Text Embeddings Example

Native Vertex text embeddings are exposed through the OpenAI-compatible gateway endpoint:

text

POST /v1/embeddings

Use an embedding-only route. Do not make a Gemini chat route embedding-capable just because it also uses the google/* publisher prefix.

yaml

models:
  - id: gemini-embedding
    description: Gemini embeddings on Vertex AI
    tags: [vertex, embeddings]
    routes:
      - provider: vertex-global
        upstream_model: google/gemini-embedding-001
        capabilities:
          chat_completions: false
          responses: false
          embeddings: true
          stream: false
          tools: false
          vision: false
          json_schema: false

Supported native Vertex text-embedding upstream models:

Upstream model	Default/maximum output dimensions	Notes
`google/gemini-embedding-001`	3072	Supports lower `dimensions` values through Vertex `outputDimensionality`. The gateway fans out array input as independent embedding operations when needed to preserve OpenAI array semantics.
`google/gemini-embedding-2`	3072	Uses Vertex `:embedContent`. The gateway supports text-only OpenAI-compatible embeddings; image, audio, video, and PDF multimodal inputs remain unsupported on `/v1/embeddings`. `task_type`, `input_type`, `title`, and `auto_truncate` are not accepted for this model; put task instructions in the input text.
`google/text-embedding-005`	768	Uses the same Vertex `:predict` text-embedding contract.
`google/text-multilingual-embedding-002`	768	Uses the same Vertex `:predict` text-embedding contract.

Request example:

bash

curl "$OCEANS_BASE_URL/v1/embeddings" \
  -H "Authorization: Bearer $OCEANS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-embedding",
    "input": ["search query", "document text"],
    "dimensions": 768,
    "task_type": "SEMANTIC_SIMILARITY",
    "encoding_format": "float"
  }'

OpenAI SDK example:

python

from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="...")

response = client.embeddings.create(
    model="gemini-embedding",
    input=["search query", "document text"],
    dimensions=768,
    extra_body={
        "task_type": "SEMANTIC_SIMILARITY",
        "auto_truncate": False,
    },
)

Parameter mapping:

Public request field	Vertex field	Gateway behavior
`input: "text"`	`instances[].content` for `:predict`; `content.parts[].text` for `google/gemini-embedding-2` `:embedContent`	Returns one embedding with `index: 0`. Empty strings are rejected locally.
`input: ["a", "b"]`	independent Vertex requests	Returns one embedding per input in original order. Empty arrays, nested arrays, token arrays, non-string values, and multimodal payloads are rejected locally.
`dimensions`	`parameters.outputDimensionality` for `:predict`; `embedContentConfig.outputDimensionality` for `google/gemini-embedding-2`	Must be a positive integer within the supported model maximum.
`output_dimensionality` / `outputDimensionality`	Same as `dimensions`	Provider-specific aliases; conflicting aliases are rejected locally.
`encoding_format: "float"` or omitted	n/a	Accepted. `base64` is rejected locally.
`task_type`	`instances[].task_type` for `:predict` models only	Must be one of Google's supported task enum values. Rejected for `google/gemini-embedding-2`; put task instructions in the input text.
`input_type`	Alias for `task_type` for `:predict` models only	Conflicts are rejected. Rejected for `google/gemini-embedding-2`.
`title`	`instances[].title` for `:predict` models only	Accepted only for retrieval-document embeddings. Rejected for `google/gemini-embedding-2`.
`auto_truncate` / `autoTruncate`	`parameters.autoTruncate` for `:predict` models only	Boolean. When `false`, overlong input is left for Vertex to reject instead of truncating. Rejected for `google/gemini-embedding-2`.

Allowed task types are RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, FACT_VERIFICATION, and CODE_RETRIEVAL_QUERY.
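
For the :predict models, the parameter mapping table can be sketched as a function that builds one Vertex request body per input string. The name build_predict_body is hypothetical, the per-model maximum for dimensions is not checked, and the title rule is read here as requiring task_type RETRIEVAL_DOCUMENT, which is an interpretation of the table rather than confirmed gateway behavior.

```python
ALLOWED_TASK_TYPES = {
    "RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT", "SEMANTIC_SIMILARITY", "CLASSIFICATION",
    "CLUSTERING", "QUESTION_ANSWERING", "FACT_VERIFICATION", "CODE_RETRIEVAL_QUERY",
}


def build_predict_body(text, dimensions=None, task_type=None, title=None,
                       auto_truncate=None) -> dict:
    """Build a Vertex :predict text-embedding body for a single input string (sketch)."""
    if not isinstance(text, str) or text == "":
        raise ValueError("input must be a non-empty string")
    instance = {"content": text}
    if task_type is not None:
        if task_type not in ALLOWED_TASK_TYPES:
            raise ValueError(f"unsupported task_type: {task_type}")
        instance["task_type"] = task_type
    if title is not None:
        if task_type != "RETRIEVAL_DOCUMENT":
            raise ValueError("title is only accepted for retrieval-document embeddings")
        instance["title"] = title
    parameters = {}
    if dimensions is not None:
        if isinstance(dimensions, bool) or not isinstance(dimensions, int) or dimensions <= 0:
            raise ValueError("dimensions must be a positive integer")
        parameters["outputDimensionality"] = dimensions
    if auto_truncate is not None:
        parameters["autoTruncate"] = bool(auto_truncate)
    body = {"instances": [instance]}
    if parameters:
        body["parameters"] = parameters
    return body


# Array input becomes independent requests, one body per string, in original order.
bodies = [build_predict_body(t, dimensions=768, task_type="SEMANTIC_SIMILARITY")
          for t in ["search query", "document text"]]
```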

Usage, pricing, and budgets:

The gateway uses real Vertex token counts only: predictions[].embeddings.statistics.token_count for :predict models and usageMetadata.promptTokenCount for google/gemini-embedding-2. It does not convert character counts or byte counts into tokens.
When token counts and exact pricing are available, embedding spend is charged through the same user, service-account, and user-model budgets as other gateway traffic.
If Vertex omits token counts, the ledger row is usage_missing; if exact catalog pricing is unavailable, the row is unpriced. Both remain visible in reporting but do not consume budgets.
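
The ledger rules above reduce to a short decision, sketched below. The function name classify_embedding_row, the "priced" status label, and the per-token price input are all hypothetical; this page names only usage_missing and unpriced. Checking usage before pricing when both are missing is also an assumption.

```python
from typing import Optional, Tuple


def classify_embedding_row(token_count: Optional[int],
                           price_per_token: Optional[float]) -> Tuple[str, Optional[float]]:
    """Return (status, budget_charge) for one embedding ledger row (sketch)."""
    if token_count is None:
        # Vertex returned no usable token count: visible in reporting, not charged.
        return "usage_missing", None
    if price_per_token is None:
        # No exact catalog price for this model and location: visible, not charged.
        return "unpriced", None
    return "priced", token_count * price_per_token
```

Only the last branch produces a charge, so rows in the first two states never consume a budget.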

Troubleshooting:

Symptom	Check
`/v1/embeddings` returns a capability or invalid-request error	Confirm the selected route has `embeddings: true` and uses one of the supported embedding upstream models, not a Gemini chat model.
`encoding_format` fails	Use `float`; native Vertex `base64` encoding is not implemented.
Token-array or nested-array input fails	Send text strings. OpenAI token arrays cannot be translated safely to Vertex text content.
Spend row is `usage_missing`	Vertex did not return usable token counts, so the request is visible but not budget-consuming.
Spend row is `unpriced`	The pricing catalog did not have an exact supported price for the selected Vertex model/location.

Operational Notes

Keep responses: false on all Vertex routes. Keep embeddings: false on Vertex chat routes and enable embeddings: true only on explicit google/gemini-embedding-001, google/gemini-embedding-2, google/text-embedding-005, or google/text-multilingual-embedding-002 routes.
Use upstream_model: anthropic/<model-id> for Claude and upstream_model: google/<model-id> for Gemini; unqualified model IDs fail at the gateway edge.
Vertex AI limits Anthropic request payloads to 30 MB. Large documents and many images can hit that byte limit before the model token limit.
Keep json_schema: false unless a route has explicit provider-specific overrides and tests.
Use extra_body only for additive provider fields you have tested for the exact publisher and model family.
Anthropic-on-Vertex routes may set tools: true for tested Claude tool-use models. Keep vision: false unless you have gateway fixtures for multimodal Anthropic content blocks. Upstream Claude model capability is not enough by itself; route capability flags should reflect the gateway mapper and tests.
Check Anthropic and Google Cloud model pages before adding a new Claude route; model IDs, endpoint availability, context windows, and retirement dates vary by model and location.
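
Some of the notes above are mechanical enough to check in a config review script. The sketch below is one possible lint over a parsed route entry; lint_vertex_route is a hypothetical name, the route is assumed to be a plain dict with the same keys as the YAML examples on this page, and only the responses, embeddings, and upstream_model rules are covered.

```python
EMBEDDING_MODELS = {
    "google/gemini-embedding-001",
    "google/gemini-embedding-2",
    "google/text-embedding-005",
    "google/text-multilingual-embedding-002",
}


def lint_vertex_route(route: dict) -> list:
    """Return a list of rule violations for one Vertex route entry (sketch)."""
    problems = []
    upstream = route.get("upstream_model", "")
    caps = route.get("capabilities", {})
    publisher, sep, model_id = upstream.partition("/")
    if not sep or publisher not in ("google", "anthropic") or not model_id:
        problems.append("upstream_model must be google/<model-id> or anthropic/<model-id>")
    if caps.get("responses"):
        problems.append("responses must stay false on Vertex routes")
    if caps.get("embeddings") and upstream not in EMBEDDING_MODELS:
        problems.append("embeddings: true is only valid on supported embedding models")
    return problems


route = {
    "provider": "vertex-global",
    "upstream_model": "anthropic/claude-opus-4-7",
    "capabilities": {"chat_completions": True, "responses": False, "embeddings": False},
}
print(lint_vertex_route(route))
```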

Validation

Validate documentation-only edits with mise run docs-check. For runtime Vertex adapter changes, run cargo test -p gateway-providers vertex::tests and cargo clippy -p gateway-providers --all-targets -- -D warnings.
