Admin Runbooks

This page is action-oriented. It is not the place for broad topology or config reference detail.

First Deploy

Compose

copy deploy/.env.example to deploy/.env
set image tags and secret values
inspect the mounted config at ../deploy/config/gateway.yaml
start the stack:

bash

docker compose -f deploy/compose.yaml up -d

confirm the containers are healthy
call /healthz and /readyz
confirm whether admin bootstrap is enabled in the mounted config before assuming /admin is ready
confirm the seeded gateway key works for /v1/models

If the deploy path is meant to support the admin UI on first boot, the mounted config needs a real bootstrap-admin plan or a pre-existing admin row.

Helm

create the namespace and runtime secrets outside the chart
render the intended values:

bash

helm template oceans-llm deploy/helm/oceans-llm --values values.yaml

confirm ingress routes only target the gateway service
confirm gateway.config.database.url matches the selected database mode
install the chart:

bash

helm install oceans-llm oci://ghcr.io/ahstn/charts/oceans-llm \
  --namespace <namespace> \
  --version <version> \
  --values values.yaml

confirm the migration Job completed
confirm gateway and admin UI pods are ready
call /healthz and /readyz through the gateway service or ingress
confirm bootstrap-admin and seed-config Jobs were enabled only when intended
inspect completed hook Job logs before TTL cleanup if migration or bootstrap behavior needs review

Upgrade Flow

Compose

pick the target image tags
review release notes and image caveats
confirm database backup or recreate policy for the target environment
update deploy/.env
restart the stack with the new tags
recheck /readyz
recheck admin login or seeded API-key access
spot-check one live /v1/* request

If the change touches admin APIs, also recheck the live admin-backed pages rather than only the public API.

Helm

pick the target chart version
review release notes, chart values changes, and image caveats
confirm database backup or recreate policy for the target environment
render the upgrade:

bash

helm template oceans-llm oci://ghcr.io/ahstn/charts/oceans-llm \
  --version <version> \
  --values values.yaml

apply the upgrade:

bash

helm upgrade oceans-llm oci://ghcr.io/ahstn/charts/oceans-llm \
  --namespace <namespace> \
  --version <version> \
  --values values.yaml

confirm the migration hook Job completed
recheck gateway rollout, /readyz, admin login, and one live /v1/* request
inspect completed hook Job logs before TTL cleanup if the upgrade changed database or seed behavior

If the upgrade fails after chart rendering but before pods are healthy, inspect hook Jobs first, then the gateway deployment events.

Helm Rollback

inspect revisions:

bash

helm history oceans-llm --namespace <namespace>

confirm the target revision and database compatibility
roll back:

bash

helm rollback oceans-llm <revision> --namespace <namespace>

confirm the gateway deployment becomes ready
recheck /readyz, admin login, and one live /v1/* request

Do not treat Helm rollback as a database rollback. If a migration already changed the database, review the migration notes before rolling application code back.

Helm Scheduling and HA Checks

For HA gateway installs:

confirm gateway.replicaCount > 1 or autoscaling.minReplicas > 1
confirm the rendered PodDisruptionBudget matches the intended disruption budget
confirm scheduling.topologySpreadConstraints and affinity rules do not make pods unschedulable
if using Karpenter or another dynamic node provisioner, confirm node selectors, tolerations, and priority class match available node pools
confirm HPA metrics are available before relying on autoscaling behavior

Failed Migration Recovery

Start with the least destructive path.

inspect gateway logs
run the explicit migrate command against the active config:

bash

mise run gateway-migrate

confirm the database URL points at the intended backend
confirm the process did not start with migrations disabled

If the migration error says database reset required, the running database carries pre-baseline history that this release no longer accepts. Recreate the libsql/Postgres database, then rerun migrations and seeding/bootstrap steps against the fresh V17 baseline.

If the error is not a reset-required failure, stop and inspect it before retrying. Do not assume manual repair is safer than recreation.

Work through these checks in order:

confirm /admin is being served through the gateway, not only through the UI server on :3001
confirm bootstrap admin is enabled in the active config if this is a fresh environment
confirm the bootstrap admin command against the active config:

bash

mise run gateway-bootstrap-admin

confirm the expected first-login rule
- local config does not force password rotation
- production-shaped local config does force password rotation
confirm the session is not simply expired or stale

If the environment relies on SSO, also review oidc-and-sso-status.md and confirm the active OIDC/OAuth provider, public base URL, callback URL, and client secret.

Provider Auth Failure

Provider auth failures usually come from config shape or missing secrets.

confirm the provider exists in the active config
confirm the secret references resolve in the runtime environment
confirm openai_compat providers have a supported pricing_provider_id
confirm Vertex routes use <publisher>/<model_id> in upstream_model
confirm the route is enabled and has positive weight
confirm the model is not only visible in /v1/models, but actually viable for the requested operation

If the symptom is “model is visible but fails,” follow request-lifecycle-and-failure-modes.md.

Missing OTLP Collector

The checked-in deploy path does not ship a collector by default.

If OTLP export is configured but no collector is reachable:

inspect gateway startup logs
confirm server.otel_endpoint and server.otel_metrics_endpoint
confirm the collector address is reachable from the gateway container
decide whether the environment should:
- wire a real collector, or
- run without one and rely on logs plus request-log storage

The request-log admin APIs can still work without a collector. OTLP export and request-log persistence are related, but they are not the same dependency.

For Helm installs, wire collector access through gateway.config.server.otel_endpoint, gateway.config.server.otel_metrics_endpoint, and observability.* values. The chart does not install a collector.

Request-Log Retention Purge

Use the supported purge command instead of deleting request-log tables by hand.

Preview a purge:

bash

mise run gateway-purge-request-logs-dry-run
mise run gateway-purge-request-logs-dry-run-prod

Apply it:

bash

mise run gateway-purge-request-logs
mise run gateway-purge-request-logs-prod

Supported windows are 1d, 3d, and 7d; set RETENTION=1d|3d|7d before the mise task to override the 7d default. The default admin choice is 7d; use 1d or 3d only when the environment has tight storage requirements and admins are comfortable losing request-detail history quickly.

The purge removes old parent request-log rows and their detail children, including payloads, caller tags, and provider execution attempts. It does not remove usage_cost_events, so spend and budget reporting remain ledger-backed after old request-log detail is gone.

Recurring purge is off by default. If enabled in config, use request_logging.purge.schedule with a daily 5-field cron expression and rely on the runtime UTC-day guard as a backstop, not as the primary scheduler. Each gateway process starts its own recurring purge loop, so HA deployments should enable recurring purge on only the intended process or accept that every replica will independently evaluate the same retention schedule.

Secret Rotation Checkpoints

When rotating secrets, check the dependent path instead of rotating blindly.

Gateway API key

update the config source
restart or reseed as needed
verify /v1/models with the new key
verify the old key fails if revocation was intended

For service-account callers, use a gateway API key attached to an explicit service account with a narrow model grant set and an active service-account budget. Name the key after the workload, keep the raw secret in your secret manager, and rotate by creating a replacement key before revoking the old one.

Bootstrap admin password

rotate through the admin UI or the normal auth flow
confirm the new password works
confirm the old password does not

Provider token or service account

update the runtime secret source
restart or reload the affected service path
run one live request through the affected provider
confirm request logs show the expected provider key

Provider service-account credentials are upstream cloud credentials, not gateway caller identities. Rotating a Vertex service-account JSON file or an AWS IAM role changes provider access only; it does not rotate gateway API keys used by clients.

What This Page Does Not Own

compose file syntax: ../deploy/README.md
Kubernetes chart contract: kubernetes-and-helm.md
startup and first-access rules: runtime-bootstrap-and-access.md
topology and same-origin contract: deploy-and-operations.md
identity lifecycle rules: identity-and-access.md
request-log payload policy: observability-and-request-logs.md

AWS Bedrock

Observability and Request Logs

Admin Runbooks

First Deploy

Compose

Helm

Upgrade Flow

Compose

Helm

Helm Rollback

Helm Scheduling and HA Checks

Failed Migration Recovery

Provider Auth Failure

Missing OTLP Collector

Request-Log Retention Purge

Secret Rotation Checkpoints

Gateway API key

Bootstrap admin password

Provider token or service account

What This Page Does Not Own

Admin Runbooks ​

First Deploy ​

Compose ​

Helm ​

Upgrade Flow ​

Compose ​

Helm ​

Helm Rollback ​

Helm Scheduling and HA Checks ​

Failed Migration Recovery ​

Broken Admin Login ​

Provider Auth Failure ​

Missing OTLP Collector ​

Request-Log Retention Purge ​

Secret Rotation Checkpoints ​

Gateway API key ​

Bootstrap admin password ​

Provider token or service account ​

What This Page Does Not Own ​

Admin Runbooks

First Deploy

Compose

Helm

Upgrade Flow

Compose

Helm

Helm Rollback

Helm Scheduling and HA Checks

Failed Migration Recovery

Broken Admin Login

Provider Auth Failure

Missing OTLP Collector

Request-Log Retention Purge

Secret Rotation Checkpoints

Gateway API key

Bootstrap admin password

Provider token or service account

What This Page Does Not Own