Security model

This document describes the trust boundaries of the OpenInfra control plane + agent system, the assets it protects, the threats against each boundary, and the mitigations in place today. It is deliberately honest about residual and accepted risks — a threat model that only lists wins is not useful for the next person hardening the system.

Scope: the live worldwide deployment shape — one control plane on a public VPS (cp.seppelabs.com), provider agents that dial out from untrusted networks, and tenants that submit container workloads over a public REST API. See deploy/seppelabs/README.md for the deployment runbook and ARCHITECTURE.md for the full system design.


            ┌──────────────────────── PUBLIC INTERNET ────────────────────────┐
            │                                                                 │
Tenant ─────┼── HTTPS (Caddy TLS) ──▶ :443 ──────┐                            │
(API key)   │                                    │                            │
            │                               ┌────▼─ control plane (VPS) ────┐ │
Agent ──────┼── gRPC mTLS ──────────▶ :9090 │ api-server (chi REST :8080)   │ │
(client     │   (outbound dial)             │ gRPC server :9090             │ │
 cert)      │                               │ postgres (compose net only)   │ │
            │                               │ registry (compose net only)   │ │
            │                               │ pg-backup ──▶ Backblaze B2    │ │
            │                               └───────────────────────────────┘ │
            └─────────────────────────────────────────────────────────────────┘

Provider host (e.g. Kuma, home box) runs the agent + tenant containers
  │ docker.sock mounted
  │ tenant images pulled & executed here

Trust boundaries (where data crosses a privilege level):

| #  | Boundary                                | Crossing                           | Authentication               |
|----|-----------------------------------------|------------------------------------|------------------------------|
| B1 | Tenant → control plane                  | public HTTPS REST                  | API key (oi_…) or admin JWT  |
| B2 | Agent → control plane                   | public gRPC                        | mutual TLS (client cert)     |
| B3 | Control plane → provider host           | workload dispatch over B2’s stream | (rides the mTLS channel)     |
| B4 | Tenant workload → provider host kernel  | container runtime                  | OS/container isolation       |
| B5 | Control plane → Postgres / registry     | compose-internal                   | not published to host        |
| B6 | Control plane → Backblaze B2            | outbound HTTPS                     | scoped application key       |
| B7 | Control plane → Solana devnet           | outbound RPC                       | treasury keypair             |

What an attacker would want, ranked by blast radius:

  1. The settlement ledger (transactions, double-entry). Tampering = theft of credits / unbilled compute. Integrity is paramount.
  2. Tenant secrets injected into workloads (DB URLs, cloud creds in env_template / the per-dispatch secrets map). Confidentiality.
  3. API keys & JWT secret. A leaked tenant key spends that tenant’s credits and runs workloads as them; a leaked JWT_SECRET forges admin sessions → full control-plane compromise.
  4. The mTLS CA private key. Signs agent client certs; leak = an attacker can register a rogue provider host.
  5. The Postgres database (everything above at rest).
  6. The Solana treasury keypair (on-chain settlement authority).
  7. Provider host integrity — tenants run code on someone else’s box.

3. Boundary-by-boundary threats & mitigations

B1 — Tenant → control plane (public REST)

Authentication. Bearer token. A token that LooksLikeKey (oi_ prefix) is routed to the API-key branch; any failure there (revoked, expired, suspended tenant) returns 401 and does not fall through to the JWT path (internal/api/middleware/apikey.go, AuthEither). This closes the “present a stolen API key as a JWT” confusion class.
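The no-fall-through rule can be sketched as below. This is a minimal illustration, not the code in internal/api/middleware/apikey.go: authEither, looksLikeKey, and the Verifier type are hypothetical stand-ins, and the real middleware operates on HTTP requests rather than bare strings.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrUnauthorized is the single generic error that maps to HTTP 401.
var ErrUnauthorized = errors.New("unauthorized")

// looksLikeKey mirrors the oi_ prefix check (hypothetical helper).
func looksLikeKey(token string) bool { return strings.HasPrefix(token, "oi_") }

// Verifier resolves a bearer token to a tenant ID or returns an error.
type Verifier func(token string) (tenantID string, err error)

// authEither picks exactly one branch per token. An oi_ token that fails
// API-key verification is rejected outright and never retried as a JWT.
func authEither(token string, verifyKey, verifyJWT Verifier) (string, error) {
	if looksLikeKey(token) {
		tenant, err := verifyKey(token)
		if err != nil {
			return "", ErrUnauthorized // no fall-through to the JWT path
		}
		return tenant, nil
	}
	tenant, err := verifyJWT(token)
	if err != nil {
		return "", ErrUnauthorized
	}
	return tenant, nil
}

func main() {
	revokedKey := func(string) (string, error) { return "", errors.New("revoked") }
	permissiveJWT := func(string) (string, error) { return "admin", nil }

	// Even with a JWT verifier that would accept anything, the revoked
	// key is rejected because the JWT branch is never consulted.
	_, err := authEither("oi_live_abc", revokedKey, permissiveJWT)
	fmt.Println(err) // unauthorized
}
```

Choosing the branch from the token's shape before verifying anything is what removes the confusion class: no input can be evaluated by both verifiers.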

  • API keys are 16 bytes from crypto/rand, formatted oi_{env}_{base32}, stored only as a SHA-256 hash; lookup is by hash, comparison is constant-time (crypto/subtle). Plaintext is shown once at mint time and never persisted (internal/auth/apikey).
  • Keys carry scopes; RequireScope (403) narrows machine principals. Admin JWT sessions carry implicit-all — the intended delegation model (humans use the portal with full-power sessions, integrations use least-privilege scoped keys).
  • Per-tenant isolation: AuthEither always resolves TenantIDKey, so every handler scopes queries to the caller’s tenant regardless of auth branch. Threat: IDOR / cross-tenant access — mitigated by tenant scoping at the handler layer; this is the single most important invariant to preserve when adding endpoints (see §6 checklist).
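The key lifecycle described in the first bullet (mint from crypto/rand, persist only the SHA-256 hash, compare in constant time) might look roughly like this. It is a sketch under assumptions: mintKey and verifyKey are hypothetical names, unpadded standard base32 is a guess at the encoding, and the real internal/auth/apikey package looks the key up by hash in the database rather than comparing against one in-memory value.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base32"
	"fmt"
)

// mintKey returns the plaintext key (shown once to the caller) and the
// SHA-256 hash, which is the only value that should be persisted.
func mintKey(env string) (plaintext string, hash [32]byte, err error) {
	raw := make([]byte, 16) // 128 bits of entropy
	if _, err = rand.Read(raw); err != nil {
		return "", hash, err
	}
	// Assumption: unpadded standard base32; the real alphabet may differ.
	enc := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(raw)
	plaintext = "oi_" + env + "_" + enc
	hash = sha256.Sum256([]byte(plaintext))
	return plaintext, hash, nil
}

// verifyKey hashes the presented token and compares it with the stored
// hash in constant time.
func verifyKey(presented string, stored [32]byte) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(sum[:], stored[:]) == 1
}

func main() {
	key, hash, err := mintKey("live")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key), verifyKey(key, hash), verifyKey(key+"x", hash)) // 34 true false
}
```

Because the stored value is a hash of a high-entropy random key, a database leak does not reveal usable keys, and a fast hash such as SHA-256 is sufficient (no password-style stretching is needed).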

Transport. Caddy terminates TLS with an auto-provisioned Let’s Encrypt cert. HTTP→HTTPS redirect; only 80/443/9090 are open on the firewall (Postgres 5432 and registry 5000 are not host-published).

Input handling. Customer-supplied secret maps for batch jobs are validated by internal/secretrules against a manifest-declared spec (required keys, pattern/enum/int-range, url/url_list). URL rules can require an SSRF guard that rejects loopback, link-local, RFC1918, ULA, and “this network” targets (ssrf.go) — so a tenant cannot coerce a workload into fetching http://169.254.169.254/… cloud metadata.
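As an illustration of the address classes such a guard rejects, here is a small sketch built on the standard net/netip package. It is not the project's ssrf.go: blockedAddr and blockedIP are hypothetical names, and the sketch only classifies an IP literal. A production guard also has to check every address a hostname resolves to at connection time, otherwise DNS can be used to sidestep the check.

```go
package main

import (
	"fmt"
	"net/netip"
)

// thisNetwork is the IPv4 "this network" block, 0.0.0.0/8.
var thisNetwork = netip.MustParsePrefix("0.0.0.0/8")

// blockedAddr reports whether addr falls in a range an SSRF guard should
// refuse: loopback, link-local, RFC1918, ULA, or "this network".
func blockedAddr(addr netip.Addr) bool {
	addr = addr.Unmap() // treat ::ffff:10.0.0.1 like 10.0.0.1
	return addr.IsLoopback() ||
		addr.IsLinkLocalUnicast() || // 169.254.0.0/16 and fe80::/10
		addr.IsPrivate() || // RFC1918 and fc00::/7 (ULA)
		addr.IsUnspecified() ||
		thisNetwork.Contains(addr)
}

// blockedIP parses an IP literal and fails closed on anything unparseable.
func blockedIP(s string) (bool, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return true, err
	}
	return blockedAddr(addr), nil
}

func main() {
	for _, ip := range []string{"169.254.169.254", "10.0.0.1", "8.8.8.8"} {
		blocked, _ := blockedIP(ip)
		fmt.Println(ip, blocked)
	}
}
```

The cloud metadata address 169.254.169.254 is link-local, so it is caught by the IsLinkLocalUnicast check.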

Residual / accepted:

  • Rate limiting is not yet enforced at the edge. A leaked key or a hostile tenant can hammer the API. rules/common/security.md calls for per-endpoint rate limiting; this is an open item (see §7).
  • DoS via expensive parsing on unauthenticated endpoints. The public /register handler calls mail.ParseAddress; govulncheck flagged GO-2025-4006 (CPU-DoS in that function) as reachable. Mitigated by pinning the build toolchain to a patched Go (go.mod: toolchain go1.25.11) and enforcing it in CI (govulncheck job). The live binary clears this on its next rebuild+redeploy (Dockerfile floats golang:1.25-alpine to the latest patch).

B2 — Agent → control plane (gRPC, mutual TLS)

The agent dials out; the control plane never initiates a connection to a provider host, so providers need no inbound exposure or VPN. The channel is mutual TLS: the agent verifies the server against the openinfra-server SAN (so the public hostname need not be in the cert), and the server verifies the agent’s client cert against the project CA (internal/certs, cmd/gen-certs). Onboarding may additionally require an invite token.
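The two halves of the mutual-TLS handshake can be sketched with the standard crypto/tls package. This is an assumption-laden outline rather than the project's internal/certs code: agentTLS, serverTLS, serverSAN, and clientAuthPolicy are hypothetical names, and the TLS 1.3 minimum is a choice made for this example, not something the document states.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
)

// serverSAN is the name the agent expects in the server certificate,
// independent of the public hostname it dials.
const serverSAN = "openinfra-server"

// clientAuthPolicy makes the server demand and verify a client cert.
const clientAuthPolicy = tls.RequireAndVerifyClientCert

// agentTLS builds the agent-side config: trust the project CA, present
// the agent's client cert, and verify the server against serverSAN.
func agentTLS(ca *x509.CertPool, certs []tls.Certificate) *tls.Config {
	return &tls.Config{
		RootCAs:      ca,
		Certificates: certs,
		ServerName:   serverSAN,
		MinVersion:   tls.VersionTLS13, // assumption for this sketch
	}
}

// serverTLS builds the control-plane config: present the server cert and
// accept only clients whose cert chains to the project CA.
func serverTLS(ca *x509.CertPool, certs []tls.Certificate) *tls.Config {
	return &tls.Config{
		ClientCAs:    ca,
		Certificates: certs,
		ClientAuth:   clientAuthPolicy,
		MinVersion:   tls.VersionTLS13, // assumption for this sketch
	}
}

func main() {
	pool := x509.NewCertPool() // empty pool: enough to show the wiring
	agent := agentTLS(pool, nil)
	server := serverTLS(pool, nil)
	fmt.Println(agent.ServerName, server.ClientAuth == clientAuthPolicy) // openinfra-server true
}
```

Setting ServerName explicitly is what lets the agent dial any public hostname while still pinning verification to the SAN in the server certificate.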

Residual / accepted:

  • CA key custody. The CA private key signs all agent certs. It is gitignored and lives only in deploy/seppelabs/certs/. Leak = rogue host registration. No HSM; rotation is manual. Accepted at current scale; revisit before onboarding third-party providers.
  • No cert revocation list (CRL/OCSP). A compromised agent cert is valid until expiry. Mitigation today is operational (rotate the CA / re-issue). Open item for multi-provider scale.

B3/B4 — Workload execution on the provider host

This is the highest-trust boundary in a DePIN system: the provider runs tenant-supplied container images on their own hardware.

  • The agent mounts docker.sock to launch workloads. Tenant code runs in a container, not on the host directly, but container ≠ VM isolation.
  • OPENINFRA_NETWORK_MODE can place a workload in another container’s network namespace (e.g. container:coledex-tailscale) to reach a private DB. This is powerful and is set by the service manifest (operator-controlled), not by the tenant request.

Residual / accepted:

  • Container escape (kernel/runtime 0-day) would compromise the provider host. Today’s tenants are first-party (Coledex), so this is an accepted risk. Before running untrusted third-party images, this boundary needs hardening: rootless/userns runtimes, seccomp/AppArmor profiles, or gVisor/Kata (VM-isolation; internal/agent/vm exists as a seam). Do not onboard untrusted tenant images until then.
  • A workload joining another container’s netns can talk to whatever that container can reach. Manifests granting container: network mode are a privilege grant and should be reviewed like one.

Neither Postgres nor the registry (B5) is published to the host; they exist only on the compose network, reachable solely by the api-server. Compromise of B5 requires first compromising the control-plane container or the box.

Residual: secrets and the ledger live in Postgres in plaintext at rest (no column encryption). Disk-level protection is the VPS provider’s; the off-box backups (B6) are a separate confidentiality surface.

The pg-backup sidecar (B6) streams a daily pg_dump to a private B2 bucket via an application key scoped to that one bucket. The VPS keeps no local copy. Optional age encryption (BACKUP_AGE_RECIPIENT) encrypts dumps before upload; the private key is kept offline.

Residual / accepted:

  • Burn-in runs unencrypted for easier restore verification. The dump contains tenant secrets and the ledger. Enabling age encryption is an open item before this is considered hardened (see §7).
  • B2 key compromise exposes all historical dumps. Scope the key to the bucket (done) and rotate on suspicion.

B7 — On-chain settlement (Solana devnet)

Settlement currently runs on devnet with a treasury keypair. Loss of the keypair affects devnet settlement only; no mainnet funds are at risk today. Promotion to mainnet is a separate, deliberate hardening exercise (key custody, multisig) — out of scope here.


The ledger is double-entry: every workload posts balanced debits/credits (tenant→suspense, suspense→provider, platform fee). Invariants:

  • Settlement writes go through the ledger code path, never ad-hoc SQL.
  • A tenant cannot post entries directly: they submit workloads; metering + settlement are server-side.
  • Backups make the ledger recoverable: deploy/seppelabs/restore-drill.sh is the tested path that restores the newest B2 dump into a throwaway Postgres and asserts row counts. An untested backup is not a backup.
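The double-entry invariant can be expressed as a check that every posting's debits equal its credits. The sketch below is illustrative only: the Entry type, the account names, the 90/10 fee split, and the use of integer minor units are assumptions, not the project's ledger schema.

```go
package main

import "fmt"

// Entry is one line of a posting. Amounts are integer minor units so
// that sums are exact (assumption: the real ledger may differ).
type Entry struct {
	Account string
	Debit   int64
	Credit  int64
}

// balanced reports whether a posting is non-empty, has no negative
// amounts, and its total debits equal its total credits.
func balanced(entries []Entry) bool {
	var debit, credit int64
	for _, e := range entries {
		if e.Debit < 0 || e.Credit < 0 {
			return false
		}
		debit += e.Debit
		credit += e.Credit
	}
	return len(entries) > 0 && debit == credit
}

func main() {
	// Hypothetical posting for a workload costing 100 units:
	// tenant -> suspense, then suspense -> provider plus platform fee.
	posting := []Entry{
		{Account: "tenant", Debit: 100},
		{Account: "suspense", Credit: 100},
		{Account: "suspense", Debit: 100},
		{Account: "provider", Credit: 90},
		{Account: "platform_fee", Credit: 10},
	}
	fmt.Println(balanced(posting)) // true

	// Dropping the fee line leaves 10 units unaccounted for.
	fmt.Println(balanced(posting[:4])) // false
}
```

Running a check like this inside the single ledger code path is one way to make the "never ad-hoc SQL" rule enforceable: a write that skips the path also skips the check.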

| Secret                  | Where                                   | Protection                               |
|-------------------------|-----------------------------------------|------------------------------------------|
| JWT_SECRET              | control-plane env (.env, gitignored)    | 32+ random bytes; signs admin sessions   |
| POSTGRES_PASSWORD       | .env                                    | not host-exposed; compose net only       |
| Tenant API keys         | DB (SHA-256 hash) + tenant’s .secrets/  | hashed at rest; shown once               |
| mTLS CA + certs         | deploy/seppelabs/certs/ (gitignored)    | filesystem perms; manual rotation        |
| Tenant workload secrets | dispatched per workload                 | validated (secretrules); injected as env |
| B2 application key      | .env                                    | scoped to the backup bucket              |
| Solana treasury key     | control-plane volume                    | devnet only today                        |

Rules enforced in CI / review:

  • No hardcoded secrets in source (see rules/common/security.md).
  • .env and certs/ are gitignored; .env.example ships placeholders.
  • Build toolchain pinned to a patched Go; govulncheck blocks reachable stdlib/dependency CVEs in CI and at release.

6. Checklist when adding an endpoint or handler

Preserve these invariants — they are the load-bearing parts of the model:

  • Scope every query to GetTenantID(ctx); never trust a tenant id from the request body/path (prevents cross-tenant IDOR).
  • Put machine-facing routes behind RequireScope with the least scope that works; don’t reach for admin-only/JWT unless it’s a human portal action.
  • Validate all external input at the boundary; for any outbound fetch derived from tenant input, apply the SSRF guard.
  • Never log secrets, API-key plaintext, or full tokens.
  • Return generic auth errors (no “user not found” vs “bad password” oracles).
  • Add a test that an unauthenticated / wrong-tenant / wrong-scope caller is rejected.
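The first and last checklist items can be illustrated together: take the tenant from the authenticated context only, and return the same error for a missing record and a wrong-tenant record. This is a sketch with hypothetical names (store, Workload, withTenant, getTenantID, ctxFor); the project's GetTenantID and its storage layer will have different signatures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type tenantKey struct{}

// withTenant is what the auth middleware would do after a successful login.
func withTenant(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, tenantKey{}, id)
}

// getTenantID reads the tenant set by the middleware; an empty or missing
// value counts as unauthenticated.
func getTenantID(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(tenantKey{}).(string)
	return id, ok && id != ""
}

// ctxFor is a small helper for examples and tests.
func ctxFor(tenant string) context.Context {
	return withTenant(context.Background(), tenant)
}

var errNotFound = errors.New("not found")

type Workload struct{ ID, TenantID string }

// store is an in-memory stand-in for the database, keyed by workload ID.
type store map[string]Workload

// getWorkload scopes the lookup to the caller's tenant. The tenant is
// never taken from the request, and a wrong-tenant hit is
// indistinguishable from a miss.
func (s store) getWorkload(ctx context.Context, id string) (Workload, error) {
	tenant, ok := getTenantID(ctx)
	if !ok {
		return Workload{}, errNotFound
	}
	w, found := s[id]
	if !found || w.TenantID != tenant {
		return Workload{}, errNotFound
	}
	return w, nil
}

func main() {
	s := store{"w1": {ID: "w1", TenantID: "tenant-a"}}
	_, errOwner := s.getWorkload(ctxFor("tenant-a"), "w1")
	_, errOther := s.getWorkload(ctxFor("tenant-b"), "w1")
	fmt.Println(errOwner, errOther) // <nil> not found
}
```

A table-driven test over owner, wrong-tenant, and unauthenticated callers, as in the usage above, is a cheap way to keep the invariant from regressing when endpoints are added.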

Open items (§7). Honest backlog: none are blocking first-party (Coledex) operation, but each is required before the corresponding expansion:

  1. Edge rate limiting on the public API (per-key + per-IP). Required before exposing self-serve signup widely.
  2. age-encrypt backups (BACKUP_AGE_RECIPIENT) with an offline/hardware-stored key. Required to consider B6 hardened.
  3. Workload isolation hardening (rootless/userns, seccomp/AppArmor, or gVisor/Kata). Required before running untrusted third-party tenant images at B4.
  4. Agent cert revocation (CRL/OCSP or short-lived certs + rotation). Required before onboarding third-party providers at B2.
  5. CA key custody (HSM/KMS, documented rotation). Same trigger as #4.
  6. Alertmanager + paging so the security/health alert rules actually notify (currently visible-only). See deploy/seppelabs/alerts.yml.
  7. Mainnet settlement hardening (key custody, multisig) before B7 leaves devnet.
  8. gosec static analysis in CI to complement govulncheck (vuln scanning ≠ static analysis).

Maintained alongside the code. When you change an auth path, a trust boundary, or a secret’s handling, update the relevant section here in the same PR.