Admin Runbooks
See also: Runtime Bootstrap and Access, Deploy and Operations, Deploy, Configuration Reference, Identity and Access, Observability and Request Logs
This page is action-oriented. It is not the place for broad topology or config reference detail.
First Deploy
Compose
- copy deploy/.env.example to deploy/.env
- set image tags and secret values
- inspect the mounted config at ../deploy/config/gateway.yaml
- start the stack:
bash
docker compose -f deploy/compose.yaml up -d
- confirm the containers are healthy
- call /healthz and /readyz
- confirm whether admin bootstrap is enabled in the mounted config before assuming /admin is ready
- confirm the seeded gateway key works for /v1/models
If the deploy path is meant to support the admin UI on first boot, the mounted config needs a real bootstrap-admin plan or a pre-existing admin row.
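The endpoint and key checks above can be scripted. This is a minimal sketch, not the project's own tooling: GATEWAY_URL and GATEWAY_KEY are hypothetical variable names, and sending the key as a Bearer token is an assumption about how the gateway reads API keys.

```shell
# Print the HTTP status code for a path, optionally sending an API key.
# Assumption: the gateway accepts the key as "Authorization: Bearer <key>".
probe() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $2" "${GATEWAY_URL}$1"
  else
    curl -s -o /dev/null -w '%{http_code}' "${GATEWAY_URL}$1"
  fi
}

# Run the first-deploy checks in order and stop at the first failure.
first_deploy_checks() {
  : "${GATEWAY_URL:?set GATEWAY_URL to the gateway base URL}"
  for path in /healthz /readyz; do
    status="$(probe "$path")"
    echo "${path} -> ${status}"
    [ "$status" = "200" ] || return 1
  done
  status="$(probe /v1/models "${GATEWAY_KEY:-}")"
  echo "/v1/models -> ${status}"
  [ "$status" = "200" ]
}
```

Call first_deploy_checks after the containers report healthy. A non-zero exit means the last printed path is the one to investigate.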
Helm
- create the namespace and runtime secrets outside the chart
- render the intended values:
bash
helm template oceans-llm deploy/helm/oceans-llm --values values.yaml
- confirm ingress routes only target the gateway service
- confirm gateway.config.database.url matches the selected database mode
- install the chart:
bash
helm install oceans-llm oci://ghcr.io/ahstn/charts/oceans-llm \
--namespace <namespace> \
--version <version> \
--values values.yaml
- confirm the migration Job completed
- confirm gateway and admin UI pods are ready
- call /healthz and /readyz through the gateway service or ingress
- confirm bootstrap-admin and seed-config Jobs were enabled only when intended
- inspect completed hook Job logs before TTL cleanup if migration or bootstrap behavior needs review
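Hook Job logs disappear once TTL cleanup removes the Jobs, so it can help to capture them right after install. The sketch below loops over every Job in the namespace. That is an assumption, because this page does not list the chart's Job names, and the dump_job_logs helper itself is hypothetical.

```shell
# Print logs for every Job in a namespace so they can be saved before
# TTL cleanup. Iterating over all Jobs is an assumption: narrow the
# selection if the namespace holds unrelated Jobs.
dump_job_logs() {
  ns="$1"
  kubectl get jobs --namespace "$ns" -o name | while read -r job; do
    echo "=== ${job} ==="
    # stdin is redirected so kubectl cannot consume the Job list
    kubectl logs --namespace "$ns" --all-containers=true "$job" </dev/null \
      || echo "(no logs available for ${job})"
  done
}
```

For example, `dump_job_logs <namespace> > hook-jobs.log` keeps a copy for later review.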
Upgrade Flow
Compose
- pick the target image tags
- review release notes and image caveats
- confirm database backup or recreate policy for the target environment
- update deploy/.env
- restart the stack with the new tags
- recheck /readyz
- recheck admin login or seeded API-key access
- spot-check one live /v1/* request
If the change touches admin APIs, also recheck the live admin-backed pages rather than only the public API.
Helm
- pick the target chart version
- review release notes, chart values changes, and image caveats
- confirm database backup or recreate policy for the target environment
- render the upgrade:
bash
helm template oceans-llm oci://ghcr.io/ahstn/charts/oceans-llm \
--version <version> \
--values values.yaml
- apply the upgrade:
bash
helm upgrade oceans-llm oci://ghcr.io/ahstn/charts/oceans-llm \
--namespace <namespace> \
--version <version> \
--values values.yaml
- confirm the migration hook Job completed
- recheck gateway rollout, /readyz, admin login, and one live /v1/* request
- inspect completed hook Job logs before TTL cleanup if the upgrade changed database or seed behavior
If the upgrade fails after chart rendering but before pods are healthy, inspect hook Jobs first, then the gateway deployment events.
Helm Rollback
- inspect revisions:
bash
helm history oceans-llm --namespace <namespace>
- confirm the target revision and database compatibility
- roll back:
bash
helm rollback oceans-llm <revision> --namespace <namespace>
- confirm the gateway deployment becomes ready
- recheck /readyz, admin login, and one live /v1/* request
Do not treat Helm rollback as a database rollback. If a migration already changed the database, review the migration notes before rolling application code back.
Helm Scheduling and HA Checks
For HA gateway installs:
- confirm gateway.replicaCount > 1 or autoscaling.minReplicas > 1
- confirm the rendered PodDisruptionBudget matches the intended disruption budget
- confirm scheduling.topologySpreadConstraints and affinity rules do not make pods unschedulable
- if using Karpenter or another dynamic node provisioner, confirm node selectors, tolerations, and priority class match available node pools
- confirm HPA metrics are available before relying on autoscaling behavior
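The replica rule in the first check can be written as a small predicate. This is a sketch only: is_ha_install is a hypothetical helper, and reading the two numbers out of values.yaml is deliberately left out.

```shell
# Succeeds when the values describe more than one gateway replica.
# $1 is gateway.replicaCount, $2 is autoscaling.minReplicas.
# A missing argument is treated as "not HA" (an assumption of this sketch).
is_ha_install() {
  replica_count="${1:-1}"
  min_replicas="${2:-0}"
  [ "$replica_count" -gt 1 ] || [ "$min_replicas" -gt 1 ]
}
```

When it succeeds, `kubectl get pdb,hpa --namespace <namespace>` shows the rendered disruption budget and autoscaler, and `kubectl get pods --namespace <namespace> --field-selector=status.phase=Pending` lists pods the scheduler could not place.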
Failed Migration Recovery
Start with the least destructive path.
- inspect gateway logs
- run the explicit migrate command against the active config:
bash
mise run gateway-migrate
- confirm the database URL points at the intended backend
- confirm the process did not start with migrations disabled
If the migration error says database reset required, the running database carries pre-baseline history that this release no longer accepts. Recreate the libsql/Postgres database, then rerun migrations and seeding/bootstrap steps against the fresh V17 baseline.
If the error is not a reset-required failure, stop and inspect it before retrying. Do not assume manual repair is safer than recreation.
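The two branches above can be told apart mechanically. The sketch below reads captured migration output and prints which path applies. The match string is taken from the paragraph above, and the exact wording and casing of the real error are an assumption, so the match ignores case. classify_migration_failure is a hypothetical helper.

```shell
# Read migration output on stdin and print the suggested next step.
classify_migration_failure() {
  if grep -qi 'database reset required'; then
    echo "recreate"   # pre-baseline history: recreate the database, then rerun
  else
    echo "inspect"    # anything else: stop and read the error before retrying
  fi
}

# Usage sketch (only meaningful when the migrate command fails):
#   if ! output="$(mise run gateway-migrate 2>&1)"; then
#     printf '%s\n' "$output" | classify_migration_failure
#   fi
```

The helper only sorts the failure into a bucket. It does not recreate anything, which keeps the destructive step a deliberate manual decision.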
Broken Admin Login
Work through these checks in order:
- confirm /admin is being served through the gateway, not only through the UI server on :3001
- confirm bootstrap admin is enabled in the active config if this is a fresh environment
- confirm the bootstrap admin command against the active config:
bash
mise run gateway-bootstrap-admin
- confirm the expected first-login rule:
  - local config does not force password rotation
  - production-shaped local config does force password rotation
- confirm the session is not simply expired or stale
If the environment relies on SSO, also review oidc-and-sso-status.md and confirm the active OIDC/OAuth provider, public base URL, callback URL, and client secret.
Provider Auth Failure
Provider auth failures usually come from config shape or missing secrets.
- confirm the provider exists in the active config
- confirm the secret references resolve in the runtime environment
- confirm openai_compat providers have a supported pricing_provider_id
- confirm Vertex routes use <publisher>/<model_id> in upstream_model
- confirm the route is enabled and has positive weight
- confirm the model is not only visible in /v1/models, but actually viable for the requested operation
If the symptom is “model is visible but fails,” follow request-lifecycle-and-failure-modes.md.
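The Vertex upstream_model shape is easy to get wrong by hand. The sketch below checks a value against the <publisher>/<model_id> form. It assumes exactly one slash with text on both sides. The gateway's own validation may be stricter or looser, and is_vertex_upstream_model is a hypothetical helper.

```shell
# Succeeds when the value looks like <publisher>/<model_id>.
is_vertex_upstream_model() {
  case "$1" in
    */*/*) return 1 ;;   # more than one slash
    /*|*/) return 1 ;;   # empty publisher or empty model id
    */*)   return 0 ;;   # exactly one slash, both sides non-empty
    *)     return 1 ;;   # no slash at all
  esac
}
```

For example, `is_vertex_upstream_model "publisher/model-id"` succeeds, while a bare `model-id` fails.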
Missing OTLP Collector
The checked-in deploy path does not ship a collector by default.
If OTLP export is configured but no collector is reachable:
- inspect gateway startup logs
- confirm server.otel_endpoint and server.otel_metrics_endpoint
- confirm the collector address is reachable from the gateway container
- decide whether the environment should:
  - wire a real collector, or
  - run without one and rely on logs plus request-log storage
The request-log admin APIs can still work without a collector. OTLP export and request-log persistence are related, but they are not the same dependency.
For Helm installs, wire collector access through gateway.config.server.otel_endpoint, gateway.config.server.otel_metrics_endpoint, and observability.* values. The chart does not install a collector.
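To test reachability, the configured endpoint first has to be split into a host and a port. The helper below is a sketch under stated assumptions: it expects an explicit port, guesses no default, and does not handle IPv6 literals. endpoint_host_port is a hypothetical name.

```shell
# Split an endpoint such as http://collector:4317/path into "host port".
# Fails when the endpoint carries no explicit port.
endpoint_host_port() {
  hostport="${1#*://}"        # drop the scheme if present
  hostport="${hostport%%/*}"  # drop any path
  case "$hostport" in
    *:*) echo "${hostport%:*} ${hostport##*:}" ;;
    *)   return 1 ;;
  esac
}
```

From inside the gateway container, and only if nc is installed there, the result can feed a TCP probe such as `set -- $(endpoint_host_port "<endpoint>") && nc -z -w 3 "$1" "$2"`.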
Request-Log Retention Purge
Use the supported purge command instead of deleting request-log tables by hand.
Preview a purge:
bash
mise run gateway-purge-request-logs-dry-run
mise run gateway-purge-request-logs-dry-run-prod
Apply it:
bash
mise run gateway-purge-request-logs
mise run gateway-purge-request-logs-prod
Supported windows are 1d, 3d, and 7d; set RETENTION=1d|3d|7d before the mise task to override the 7d default. The default admin choice is 7d; use 1d or 3d only when the environment has tight storage requirements and admins are comfortable losing request-detail history quickly.
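A thin wrapper can reject unsupported windows before the task runs. This is an illustrative sketch: RETENTION and the task name come from this page, while purge_dry_run is a hypothetical helper.

```shell
# Run the purge dry run with a validated retention window (default 7d).
purge_dry_run() {
  retention="${1:-7d}"
  case "$retention" in
    1d|3d|7d) ;;
    *) echo "unsupported retention window: ${retention}" >&2; return 2 ;;
  esac
  RETENTION="$retention" mise run gateway-purge-request-logs-dry-run
}
```

`purge_dry_run 3d` previews a three-day window, and `purge_dry_run 30d` fails before any task is invoked.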
The purge removes old parent request-log rows and their detail children, including payloads, caller tags, and provider execution attempts. It does not remove usage_cost_events, so spend and budget reporting remain ledger-backed after old request-log detail is gone.
Recurring purge is off by default. If enabled in config, use request_logging.purge.schedule with a daily 5-field cron expression and rely on the runtime UTC-day guard as a backstop, not as the primary scheduler. Each gateway process starts its own recurring purge loop, so HA deployments should enable recurring purge on only the intended process or accept that every replica will independently evaluate the same retention schedule.
Secret Rotation Checkpoints
When rotating secrets, check the dependent path instead of rotating blindly.
Gateway API key
- update the config source
- restart or reseed as needed
- verify /v1/models with the new key
- verify the old key fails if revocation was intended
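The two verification steps can be checked together. The sketch below makes three assumptions that this page does not confirm: GATEWAY_URL is a hypothetical variable, the key is sent as a Bearer token, and a revoked key is answered with 401 or 403.

```shell
# Print the HTTP status /v1/models returns for a given key.
models_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $1" "${GATEWAY_URL}/v1/models"
}

# Succeeds when the new key is accepted and the old key is rejected.
# Assumption: rejection shows up as 401 or 403; any other status,
# including a connection failure, counts as "not verified".
rotation_ok() {
  new_key="$1"
  old_key="$2"
  [ "$(models_status "$new_key")" = "200" ] || return 1
  case "$(models_status "$old_key")" in
    401|403) return 0 ;;
    *)       return 1 ;;
  esac
}
```

Skip the old-key half when the old key is meant to stay valid during a transition period.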
For service-account callers, use a gateway API key attached to an explicit service account with a narrow model grant set and an active service-account budget. Name the key after the workload, keep the raw secret in your secret manager, and rotate by creating a replacement key before revoking the old one.
Bootstrap admin password
- rotate through the admin UI or the normal auth flow
- confirm the new password works
- confirm the old password does not
Provider token or service account
- update the runtime secret source
- restart or reload the affected service path
- run one live request through the affected provider
- confirm request logs show the expected provider key
Provider service-account credentials are upstream cloud credentials, not gateway caller identities. Rotating a Vertex service-account JSON file or an AWS IAM role changes provider access only; it does not rotate gateway API keys used by clients.
What This Page Does Not Own
- compose file syntax: ../deploy/README.md
- Kubernetes chart contract: kubernetes-and-helm.md
- startup and first-access rules: runtime-bootstrap-and-access.md
- topology and same-origin contract: deploy-and-operations.md
- identity lifecycle rules: identity-and-access.md
- request-log payload policy: observability-and-request-logs.md
