Security model

This document describes the trust boundaries of the OpenInfra control plane + agent system, the assets it protects, the threats against each boundary, and the mitigations in place today. It is deliberately honest about residual and accepted risks — a threat model that only lists wins is not useful for the next person hardening the system.

Scope: the live worldwide deployment shape — one control plane on a public VPS (cp.seppelabs.com), provider agents that dial out from untrusted networks, and tenants that submit container workloads over a public REST API. See deploy/seppelabs/README.md for the deployment runbook and ARCHITECTURE.md for the full system design.

1. System overview & trust boundaries

            ┌─────────────────────── PUBLIC INTERNET ───────────────────────┐
            │                                                                │
  Tenant ───┼── HTTPS (Caddy TLS) ──▶  :443  ┐                               │
 (API key)  │                                │                               │
            │                          ┌──────▼─────── control plane (VPS) ──┐│
  Agent ────┼── gRPC mTLS ──────────▶  :9090 │  api-server (chi REST :8080)  ││
 (client    │   (outbound dial)        │     │  gRPC server      :9090       ││
  cert)     │                          │     │  postgres   (compose net only)││
            │                          │     │  registry   (compose net only)││
            │                          │     │  pg-backup ──▶ Backblaze B2    ││
            └──────────────────────────┘     └───────────────────────────────┘│
                                                                               │
  Provider host (e.g. Kuma, home box) runs the agent + tenant containers ──────┘
       │  docker.sock mounted │ tenant images pulled & executed here

Trust boundaries (where data crosses a privilege level):

#	Boundary	Crossing	Authentication
B1	Tenant → control plane	public HTTPS REST	API key (`oi_…`) or admin JWT
B2	Agent → control plane	public gRPC	mutual TLS (client cert)
B3	Control plane → provider host	workload dispatch over B2’s stream	(rides the mTLS channel)
B4	Tenant workload → provider host kernel	container runtime	OS/container isolation
B5	Control plane → Postgres / registry	compose-internal	not published to host
B6	Control plane → Backblaze B2	outbound HTTPS	scoped application key
B7	Control plane → Solana devnet	outbound RPC	treasury keypair

2. Assets

What an attacker would want, ranked by blast radius:

The settlement ledger (transactions, double-entry). Tampering = theft of credits / unbilled compute. Integrity is paramount.
Tenant secrets injected into workloads (DB URLs, cloud creds in env_template / the per-dispatch secrets map). Confidentiality.
API keys & JWT secret. A leaked tenant key spends that tenant’s credits and runs workloads as them; a leaked JWT_SECRET forges admin sessions → full control-plane compromise.
The mTLS CA private key. Signs agent client certs; leak = an attacker can register a rogue provider host.
The Postgres database (everything above at rest).
The Solana treasury keypair (on-chain settlement authority).
Provider host integrity — tenants run code on someone else’s box.

3. Boundary-by-boundary threats & mitigations

B1 — Tenant → control plane (public REST)

Authentication. Bearer token. A token that LooksLikeKey (oi_ prefix) is routed to the API-key branch; any failure there (revoked, expired, suspended tenant) returns 401 and does not fall through to the JWT path (internal/api/middleware/apikey.go, AuthEither). This closes the “present a stolen API key as a JWT” confusion class.
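The no-fall-through routing described above can be sketched as follows. This is a minimal illustration, not the code in internal/api/middleware/apikey.go: the function names, the verifier signatures, and the string tenant id are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errUnauthorized is the single generic error surfaced to the caller (maps to 401).
var errUnauthorized = errors.New("unauthorized")

// looksLikeKey mirrors the prefix check described in the text (hypothetical name).
func looksLikeKey(token string) bool { return strings.HasPrefix(token, "oi_") }

// authEither routes a bearer token to exactly one verifier. The verifier
// signatures are assumptions: each returns the tenant id or an error.
func authEither(token string, verifyKey, verifyJWT func(string) (string, error)) (string, error) {
	if looksLikeKey(token) {
		tenant, err := verifyKey(token)
		if err != nil {
			// Any API-key failure ends here; the JWT path is never tried.
			return "", errUnauthorized
		}
		return tenant, nil
	}
	tenant, err := verifyJWT(token)
	if err != nil {
		return "", errUnauthorized
	}
	return tenant, nil
}

func main() {
	revoked := func(string) (string, error) { return "", errors.New("key revoked") }
	lenientJWT := func(string) (string, error) { return "tenant-a", nil }

	// A revoked key is rejected even though the JWT verifier would accept anything.
	_, err := authEither("oi_live_abc", revoked, lenientJWT)
	fmt.Println(err)
}
```

Both branches resolve a tenant id on success, which is what lets handlers scope queries the same way regardless of how the caller authenticated.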

API keys are 16 bytes from crypto/rand, formatted oi_{env}_{base32}, stored only as a SHA-256 hash; lookup is by hash, comparison is constant-time (crypto/subtle). Plaintext is shown once at mint time and never persisted (internal/auth/apikey).
Keys carry scopes; RequireScope (403) narrows machine principals. Admin JWT sessions carry implicit-all — the intended delegation model (humans use the portal with full-power sessions, integrations use least-privilege scoped keys).
Per-tenant isolation: AuthEither always resolves TenantIDKey, so every handler scopes queries to the caller’s tenant regardless of auth branch. Threat: IDOR / cross-tenant access — mitigated by tenant scoping at the handler layer; this is the single most important invariant to preserve when adding endpoints (see §6 checklist).
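As an illustration of the API-key handling described above (16 random bytes, SHA-256 at rest, constant-time comparison), here is a minimal sketch. It is not the code in internal/auth/apikey: the function names are hypothetical, and the unpadded standard base32 alphabet is an assumption, since the text only says "base32".

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base32"
	"fmt"
)

// mintKey returns the plaintext key (to be shown once) and the SHA-256 hash
// that would be persisted. The plaintext itself is never stored.
func mintKey(env string) (string, [sha256.Size]byte, error) {
	raw := make([]byte, 16) // 128 bits from the OS CSPRNG
	if _, err := rand.Read(raw); err != nil {
		return "", [sha256.Size]byte{}, err
	}
	// Assumption: standard alphabet, no padding.
	body := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(raw)
	plaintext := "oi_" + env + "_" + body
	return plaintext, sha256.Sum256([]byte(plaintext)), nil
}

// verifyKey hashes the presented token and compares it to the stored hash
// in constant time.
func verifyKey(presented string, stored [sha256.Size]byte) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(sum[:], stored[:]) == 1
}

func main() {
	plaintext, hash, err := mintKey("live")
	if err != nil {
		panic(err)
	}
	fmt.Println("show once:", plaintext)
	fmt.Printf("persist:   %x\n", hash)
	fmt.Println("verifies: ", verifyKey(plaintext, hash))
}
```

Because the stored value is a hash of a high-entropy random key, a database leak does not reveal usable keys, and a plain (unsalted) SHA-256 is adequate in a way it would not be for human-chosen passwords.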

Transport. Caddy terminates TLS with an auto-provisioned Let’s Encrypt cert. HTTP→HTTPS redirect; only 80/443/9090 are open on the firewall (Postgres 5432 and registry 5000 are not host-published).

Input handling. Customer-supplied secret maps for batch jobs are validated by internal/secretrules against a manifest-declared spec (required keys, pattern/enum/int-range, url/url_list). URL rules can require an SSRF guard that rejects loopback, link-local, RFC1918, ULA, and “this network” targets (ssrf.go) — so a tenant cannot coerce a workload into fetching http://169.254.169.254/… cloud metadata.
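A rough sketch of the address classes such an SSRF guard rejects, using only the Go standard library. This is not the contents of ssrf.go: the function names, the injected resolver, and the http/https scheme allowlist are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
	"net/url"
)

// thisNetwork is 0.0.0.0/8, the "this network" block.
var thisNetwork = netip.MustParsePrefix("0.0.0.0/8")

// blockedAddr reports whether an address falls in a non-public class.
func blockedAddr(a netip.Addr) bool {
	a = a.Unmap() // treat ::ffff:127.0.0.1 like 127.0.0.1
	return a.IsLoopback() ||
		a.IsLinkLocalUnicast() || // includes 169.254.0.0/16 and fe80::/10
		a.IsLinkLocalMulticast() ||
		a.IsPrivate() || // RFC 1918 and IPv6 ULA (fc00::/7)
		a.IsUnspecified() ||
		thisNetwork.Contains(a)
}

// checkURL rejects URLs whose host is, or resolves to, a blocked address.
// resolve is injected so the sketch runs without DNS (hypothetical parameter).
func checkURL(raw string, resolve func(host string) ([]netip.Addr, error)) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("scheme %q not allowed", u.Scheme)
	}
	host := u.Hostname()
	if host == "" {
		return fmt.Errorf("url has no host")
	}
	var addrs []netip.Addr
	if lit, perr := netip.ParseAddr(host); perr == nil {
		addrs = []netip.Addr{lit}
	} else {
		resolved, rerr := resolve(host)
		if rerr != nil {
			return fmt.Errorf("resolve %q: %w", host, rerr)
		}
		addrs = resolved
	}
	if len(addrs) == 0 {
		return fmt.Errorf("host %q has no addresses", host)
	}
	for _, a := range addrs {
		if blockedAddr(a) {
			return fmt.Errorf("target %s is not publicly routable", a)
		}
	}
	return nil
}

// fakeResolve stands in for DNS in this example.
func fakeResolve(host string) ([]netip.Addr, error) {
	switch host {
	case "example.com":
		return []netip.Addr{netip.MustParseAddr("93.184.216.34")}, nil
	case "internal.example":
		return []netip.Addr{netip.MustParseAddr("10.0.0.5")}, nil
	}
	return nil, fmt.Errorf("no such host %q", host)
}

func main() {
	for _, raw := range []string{
		"https://example.com/feed",
		"http://169.254.169.254/latest/meta-data/",
		"http://internal.example/db",
	} {
		fmt.Printf("%-45s %v\n", raw, checkURL(raw, fakeResolve))
	}
}
```

A validation-time check like this does not cover DNS rebinding on its own, because the name can resolve differently when the workload later connects. Checking the address again at dial time is the usual complement.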

Residual / accepted:

Rate limiting is not yet enforced at the edge. A leaked key or a hostile tenant can hammer the API. rules/common/security.md calls for per-endpoint rate limiting; this is an open item (see §7).
DoS via expensive parsing on unauthenticated endpoints. The public /register handler calls mail.ParseAddress; govulncheck flagged GO-2025-4006 (CPU-DoS in that function) as reachable. Mitigated by pinning the build toolchain to a patched Go (go.mod → toolchain go1.25.11) and enforcing it in CI (govulncheck job). The live binary clears this on its next rebuild+redeploy (Dockerfile floats golang:1.25-alpine to the latest patch).
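Since rate limiting is still an open item, the following is only one possible shape for it: a per-key token bucket built on the standard library. Nothing here exists in the codebase; the type and function names, the rate, and the burst size are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// limiter holds one token bucket per key (an API key id or a client IP).
type limiter struct {
	mu      sync.Mutex
	rate    float64 // tokens refilled per second
	burst   float64 // bucket capacity
	buckets map[string]*bucket
}

func newLimiter(ratePerSec, burst float64) *limiter {
	return &limiter{rate: ratePerSec, burst: burst, buckets: make(map[string]*bucket)}
}

// allow spends one token for key at time now, refilling first.
// The clock is passed in so behaviour is deterministic in tests.
func (l *limiter) allow(key string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[key] = b
	}
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(l.burst, b.tokens+elapsed*l.rate)
		b.last = now
	}
	if b.tokens < 1 {
		return false // caller would respond 429
	}
	b.tokens--
	return true
}

// at builds a timestamp n seconds after the Unix epoch (demo helper).
func at(n int) time.Time { return time.Unix(int64(n), 0) }

func main() {
	l := newLimiter(1, 2) // 1 request/second, burst of 2
	for i := 0; i < 3; i++ {
		fmt.Println("t=0s:", l.allow("key-1", at(0)))
	}
	fmt.Println("t=1s:", l.allow("key-1", at(1)))
}
```

A production version would also need to evict idle buckets so the map cannot grow without bound, and would apply the same check keyed by IP on unauthenticated routes such as /register.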

B2 — Agent → control plane (gRPC, mutual TLS)

The agent dials out; the control plane never initiates a connection to a provider host, so providers need no inbound exposure or VPN. The channel is mutual TLS: the agent verifies the server against the openinfra-server SAN (so the public hostname need not be in the cert), and the server verifies the agent’s client cert against the project CA (internal/certs, cmd/gen-certs). Onboarding may additionally require an invite token.
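The two halves of that mutual-TLS channel could be configured roughly as below. This sketch uses only crypto/tls; it is not the code in internal/certs, and the function names and the TLS 1.2 floor are assumptions. In a gRPC setup, configs like these are typically wrapped with credentials.NewTLS.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
)

// serverSAN is the name the agent verifies, independent of the dialed hostname.
const serverSAN = "openinfra-server"

// serverTLSConfig requires every agent to present a cert signed by the project CA.
func serverTLSConfig(serverCert tls.Certificate, projectCA *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{serverCert},
		ClientCAs:    projectCA,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		MinVersion:   tls.VersionTLS12, // assumption: floor chosen for the example
	}
}

// agentTLSConfig presents the agent's client cert and verifies the server
// certificate against serverSAN rather than the public hostname.
func agentTLSConfig(agentCert tls.Certificate, projectCA *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{agentCert},
		RootCAs:      projectCA,
		ServerName:   serverSAN,
		MinVersion:   tls.VersionTLS12,
	}
}

// demoConfigs builds both configs with empty placeholders so the example
// runs without certificate files on disk.
func demoConfigs() (server, agent *tls.Config) {
	pool := x509.NewCertPool()
	return serverTLSConfig(tls.Certificate{}, pool), agentTLSConfig(tls.Certificate{}, pool)
}

func main() {
	server, agent := demoConfigs()
	fmt.Println("server client auth:", server.ClientAuth)
	fmt.Println("agent verifies SAN:", agent.ServerName)
}
```

Setting ServerName on the agent side is what allows the control plane to move between public hostnames without re-issuing the server certificate. InsecureSkipVerify stays false on both sides.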

Residual / accepted:

CA key custody. The CA private key signs all agent certs. It is gitignored and lives only in deploy/seppelabs/certs/. Leak = rogue host registration. No HSM; rotation is manual. Accepted at current scale; revisit before onboarding third-party providers.
No certificate revocation (CRL/OCSP). A compromised agent cert is valid until expiry. Mitigation today is operational (rotate the CA / re-issue). Open item for multi-provider scale.

B3/B4 — Workload execution on the provider host

This is the highest-trust boundary in a DePIN system: the provider runs tenant-supplied container images on their own hardware.

The agent mounts docker.sock to launch workloads. Tenant code runs in a container, not on the host directly, but container ≠ VM isolation.
OPENINFRA_NETWORK_MODE can place a workload in another container’s network namespace (e.g. container:coledex-tailscale) to reach a private DB. This is powerful and is set by the service manifest (operator-controlled), not by the tenant request.

Residual / accepted:

Container escape (kernel/runtime 0-day) would compromise the provider host. Today’s tenants are first-party (Coledex), so this is an accepted risk. Before running untrusted third-party images, this boundary needs hardening: rootless/userns runtimes, seccomp/AppArmor profiles, or gVisor/Kata (user-space kernel sandboxing or VM-level isolation; internal/agent/vm exists as a seam). Do not onboard untrusted tenant images until then.
A workload joining another container’s netns can talk to whatever that container can reach. Manifests granting container: network mode are a privilege grant and should be reviewed like one.

B5 — Datastores (Postgres, registry)

Neither is published to the host; both exist only on the compose network, reachable only by services on that network (the api-server and, for Postgres, the pg-backup sidecar). Compromise of B5 requires first compromising a control-plane container or the box.

Residual: secrets and the ledger live in Postgres in plaintext at rest (no column encryption). Disk-level protection is the VPS provider’s; the off-box backups (B6) are a separate confidentiality surface.

B6 — Off-box backups (Backblaze B2)

The pg-backup sidecar streams a daily pg_dump to a private B2 bucket via an application key scoped to that one bucket. The VPS keeps no local copy. Optional age encryption (BACKUP_AGE_RECIPIENT) encrypts dumps before upload; the private key is kept offline.

Residual / accepted:

During burn-in, dumps are uploaded unencrypted to make restore verification easier. Each dump contains tenant secrets and the ledger. Enabling age encryption is an open item before this is considered hardened (see §7).
Compromise of the Backblaze B2 application key exposes all historical dumps. Scope the key to the bucket (done) and rotate on suspicion.

B7 — On-chain settlement (Solana devnet)

Settlement currently runs on devnet with a treasury keypair. Loss of the keypair affects devnet settlement only; no mainnet funds are at risk today. Promotion to mainnet is a separate, deliberate hardening exercise (key custody, multisig) — out of scope here.

4. Integrity of the credits ledger

The ledger is double-entry: every workload posts balanced debits/credits (tenant→suspense, suspense→provider, platform fee). Invariants:

Settlement writes go through the ledger code path, never ad-hoc SQL.
A tenant cannot post entries directly; they submit workloads, and metering and settlement are server-side.
Backups make the ledger recoverable: deploy/seppelabs/restore-drill.sh is the tested path that restores the newest B2 dump into a throwaway Postgres and asserts row counts. An untested backup is not a backup.
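To make the double-entry invariant concrete, here is a small sketch of a balance check that a ledger code path could run before writing a posting. The types, account names, and sign convention (positive raises an account's balance, negative lowers it) are simplifying assumptions, not the project's schema.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is one leg of a posting. Amount is in minor units.
type entry struct {
	Account string
	Amount  int64
}

var errUnbalanced = errors.New("posting does not balance")

// validatePosting enforces the double-entry invariant: at least two legs,
// no zero-value legs, and all legs summing to exactly zero.
func validatePosting(entries []entry) error {
	if len(entries) < 2 {
		return errors.New("posting needs at least two entries")
	}
	var sum int64
	for _, e := range entries {
		if e.Amount == 0 {
			return fmt.Errorf("zero-amount entry on %q", e.Account)
		}
		sum += e.Amount
	}
	if sum != 0 {
		return errUnbalanced
	}
	return nil
}

// settle builds the legs for one workload: tenant to suspense, then suspense
// to provider with the platform fee split out. Account names are hypothetical.
func settle(tenant, provider string, cost, fee int64) []entry {
	return []entry{
		{Account: tenant, Amount: -cost},
		{Account: "suspense", Amount: cost},
		{Account: "suspense", Amount: -cost},
		{Account: provider, Amount: cost - fee},
		{Account: "platform_fee", Amount: fee},
	}
}

func main() {
	posting := settle("tenant:coledex", "provider:kuma", 1000, 50)
	fmt.Println("balanced posting:", validatePosting(posting))

	posting[3].Amount += 1 // tamper with the provider leg
	fmt.Println("tampered posting:", validatePosting(posting))
}
```

Routing every settlement write through a check like this is the reason ad-hoc SQL against the ledger tables is ruled out: a direct UPDATE bypasses the balance check entirely.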

5. Secrets handling

Secret	Where	Protection
`JWT_SECRET`	control-plane env (`.env`, gitignored)	32+ random bytes; signs admin sessions
`POSTGRES_PASSWORD`	`.env`	not host-exposed; compose net only
Tenant API keys	DB (SHA-256 hash) + tenant’s `.secrets/`	hashed at rest; shown once
mTLS CA + certs	`deploy/seppelabs/certs/` (gitignored)	filesystem perms; manual rotation
Tenant workload secrets	dispatched per workload	validated (`secretrules`); injected as env
B2 application key	`.env`	scoped to the backup bucket
Solana treasury key	control-plane volume	devnet only today

Rules enforced in CI / review:

No hardcoded secrets in source (see rules/common/security.md).
.env and certs/ are gitignored; .env.example ships placeholders.
Build toolchain pinned to a patched Go; govulncheck blocks reachable stdlib/dependency CVEs in CI and at release.

6. Checklist when adding an endpoint or handler

Preserve these invariants — they are the load-bearing parts of the model:

Scope every query to GetTenantID(ctx); never trust a tenant id from the request body/path (prevents cross-tenant IDOR).
Put machine-facing routes behind RequireScope with the least scope that works; don’t reach for admin-only/JWT unless it’s a human portal action.
Validate all external input at the boundary; for any outbound fetch derived from tenant input, apply the SSRF guard.
Never log secrets, API-key plaintext, or full tokens.
Return generic auth errors (no “user not found” vs “bad password” oracles).
Add a test that an unauthenticated / wrong-tenant / wrong-scope caller is rejected.
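The first and last checklist items can be illustrated together. This is a hedged sketch: GetTenantID is named in the checklist, but its real signature is not shown in this document, so the lower-case helpers, the context key, and the in-memory store below are all stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type tenantKey struct{}

// ctxFor returns a context carrying the authenticated tenant id, as the
// auth middleware would (demo helper).
func ctxFor(tenant string) context.Context {
	return context.WithValue(context.Background(), tenantKey{}, tenant)
}

// getTenantID is a stand-in for the project's GetTenantID(ctx).
func getTenantID(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(tenantKey{}).(string)
	return id, ok && id != ""
}

type workload struct{ ID, TenantID string }

// errNotFound is returned for missing and foreign rows alike, so the
// response does not reveal whether another tenant's id exists.
var errNotFound = errors.New("not found")

type store struct{ rows []workload }

// getWorkload filters by the tenant from the context, never by a tenant id
// taken from the request path or body.
func (s store) getWorkload(ctx context.Context, id string) (workload, error) {
	tenant, ok := getTenantID(ctx)
	if !ok {
		return workload{}, errNotFound
	}
	for _, w := range s.rows {
		if w.ID == id && w.TenantID == tenant {
			return w, nil
		}
	}
	return workload{}, errNotFound
}

func main() {
	s := store{rows: []workload{{ID: "w1", TenantID: "tenant-a"}}}

	_, err := s.getWorkload(ctxFor("tenant-a"), "w1")
	fmt.Println("owner:       ", err)
	_, err = s.getWorkload(ctxFor("tenant-b"), "w1")
	fmt.Println("wrong tenant:", err)
	_, err = s.getWorkload(context.Background(), "w1")
	fmt.Println("no auth:     ", err)
}
```

The wrong-tenant and unauthenticated cases printed by main are the ones the checklist asks every new endpoint to cover with a test.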

7. Open security items (tracked)

Honest backlog — none are blocking first-party (Coledex) operation, but each is required before the corresponding expansion:

Edge rate limiting on the public API (per-key + per-IP). Required before exposing self-serve signup widely.
age-encrypt backups (BACKUP_AGE_RECIPIENT) with an offline/hardware-stored key. Required to consider B6 hardened.
Workload isolation hardening (rootless/userns, seccomp/AppArmor, or gVisor/Kata). Required before running untrusted third-party tenant images at B4.
Agent cert revocation (CRL/OCSP or short-lived certs + rotation). Required before onboarding third-party providers at B2.
CA key custody (HSM/KMS, documented rotation). Same trigger as agent cert revocation: required before onboarding third-party providers at B2.
Alertmanager + paging so the security/health alert rules actually notify (currently visible-only). See deploy/seppelabs/alerts.yml.
Mainnet settlement hardening (key custody, multisig) before B7 leaves devnet.
gosec static analysis in CI to complement govulncheck (vuln scanning ≠ static analysis).

Maintained alongside the code. When you change an auth path, a trust boundary, or a secret’s handling, update the relevant section here in the same PR.