| Category | Status | Created | Author |
|---|---|---|---|
| Job Agents | Draft | 2026-06-02 | Aditya Choudhari |
Summary
Provide a REST API that lets an external provider act as a job agent by pulling the jobs assigned to it, executing them, and reporting status back, rather than ctrlplane pushing work into the provider’s environment. The work is split into two parts:
- V1 delivers the pull contract: an agent polls for queued jobs, claims one atomically (at most once), runs it, and reports status. A new queued job status marks a job as claimable. Polling is a side-effect-free list; a separate claim call transitions the job and returns its execution context.
- V2 adds crash recovery: a lease, a heartbeat endpoint, and a reaper that returns abandoned jobs to the queue. V2 is purely additive; V1 is shippable and useful on its own.
Motivation
Ctrlplane’s existing job agents are push / dispatch-style. The workspace engine initiates execution inside the agent’s system: ArgoCD syncs an Application, GitHub Actions runs a workflow, Terraform Cloud applies a plan. In each case ctrlplane reaches outbound into the agent’s environment. This does not fit an external provider that:
- cannot (or should not) be reached inbound by ctrlplane, and
- wants to integrate generically over HTTP rather than through a bespoke, per-system integration.
Proposal
Model: producer / consumer
A push agent’s dispatch step both produces the job and delivers it (fires the workflow). A pull agent splits these:
- ctrlplane produces the job and marks it claimable.
- the external agent consumes it by polling, claiming, and running it.
Job status: queued
A new value queued is added to the job_status enum
(packages/db/src/schema/job.ts). It means: ctrlplane has finished preparing
the job, and it is available for an agent to claim.
queued is semantically distinct from the existing states:
- pending: created, not yet processed by the dispatch controller.
- queued: prepared, waiting for an agent to claim it.
- in_progress: claimed by an agent and executing.

The new value must also be added to the @ctrlplane/validators job statuses, the dbToOapiStatus / oapiToDbStatus
maps in apps/api/src/routes/v1/workspaces/jobs.ts, the OpenAPI JobStatus
schema, and the workspace-engine oapi enum plus its sqlc mappings.
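
To illustrate where queued sits in the lifecycle, the sketch below models the three states discussed here as a transition table. It is a simplified model under stated assumptions, not the contents of packages/db/src/schema/job.ts: the type and function names are hypothetical, terminal states and any other existing statuses are omitted, and the in_progress to queued edge stands for the requeue that is manual in V1.

```typescript
// Hypothetical model of the http-pull lifecycle. Only the three statuses
// discussed above are included; terminal states are deliberately omitted.
type JobStatus = "pending" | "queued" | "in_progress";

// Transitions an http-pull job may take in this sketch.
const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  pending: ["queued"], // Dispatch marks the prepared job claimable
  queued: ["in_progress"], // a successful claim
  in_progress: ["queued"], // requeue: manual in V1, automated by the V2 reaper
};

// Returns true when the sketch permits moving a job from `from` to `to`.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}
```

In this model a pull job can never jump from pending straight to in_progress; it has to pass through queued, which is the state an agent can observe and claim.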
Agent type: http-pull
A new agent type http-pull is registered in the workspace engine’s job agent
registry (apps/workspace-engine/pkg/jobagents/, registered in
apps/workspace-engine/svc/controllers/jobdispatch/controller.go). It
implements types.Dispatchable. Its Dispatch does not push to an external
system; it transitions the job to queued. The rest of the flow is unchanged: the job is created as pending, enqueued for
dispatch, the controller creates verification specs as it does for every agent,
and the Dispatch call marks the job queued instead of pushing.
Verifications
Verifications are created by the dispatch controller at dispatch time, exactly as they are for the ArgoCD and Terraform Cloud agents. No change is made to the verification flow. As with those agents, verification metrics begin measuring when created rather than when execution starts. For http-pull this means
measurements can begin before an agent claims the job; this matches existing
behavior and is accepted for V1. See Open Questions.
Poll endpoint (V1)
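The agent polls GET .../job-agents/{id}/jobs?status=queued, which lists the jobs queued for that agent. The route lives in apps/api/src/routes/v1/workspaces/job-agents.ts.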
The list response is intentionally lightweight — job id, deployment,
environment, resource, and created_at — and omits dispatch_context.
Resolved variables (including secret-flagged ones) are not returned here, so a
poll never broadcasts secrets for every queued job to every agent. Context is
returned only on claim, and only to the agent that wins it.
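
The projection described above can be sketched as a pure function over job rows. This is an in-memory illustration only: the row shape, the camelCase field names, and the listQueuedJobs function are assumptions made for the example, not the actual handler or schema.

```typescript
// Hypothetical row shape; field names are assumptions for illustration.
interface JobRow {
  id: string;
  jobAgentId: string;
  status: string;
  deploymentId: string;
  environmentId: string;
  resourceId: string;
  createdAt: string;
  dispatchContext: Record<string, unknown>; // may hold secret-flagged values
}

// The lightweight item returned by the poll: no dispatch context.
interface QueuedJobSummary {
  id: string;
  deploymentId: string;
  environmentId: string;
  resourceId: string;
  createdAt: string;
}

// Side-effect-free list of the jobs queued for one agent.
// Nothing is mutated, so polling repeatedly never changes job state.
function listQueuedJobs(rows: JobRow[], jobAgentId: string): QueuedJobSummary[] {
  return rows
    .filter((row) => row.jobAgentId === jobAgentId && row.status === "queued")
    .map((row) => ({
      id: row.id,
      deploymentId: row.deploymentId,
      environmentId: row.environmentId,
      resourceId: row.resourceId,
      createdAt: row.createdAt,
    }));
}
```

Building the summary field by field, rather than deleting dispatchContext from a copy of the row, means a field added to the row later is not exposed by the poll unless someone lists it explicitly.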
Claim endpoint (V1)
POST .../job-agents/{id}/jobs/{jobId}/claim transitions the job queued → in_progress and returns the
full job, including dispatch_context. Because the poll has no side effects,
this is the single mutating step that hands a job to an agent.
The claim is a conditional update guarded on the current status. Postgres row
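locking, not the transaction boundary, provides the at-most-once guarantee. When two
claims race for the same job, the row lock serializes them: the first matches status = 'queued' and wins; the second re-evaluates the predicate after the first commits, matches zero rows,
and receives 409 Conflict. No SELECT ... FOR UPDATE SKIP LOCKED scan is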
needed because the agent names the job id explicitly — the status = 'queued'
predicate does the work the locking scan did in the next-job design. The
reconcile work queue uses the same conditional-claim shape
(ClaimReconcileWorkItems).
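
The conditional claim can be modelled in memory to show the outcome of two overlapping claims. The sketch below is an illustration, not the handler: claimJob, the Job shape, and the column names in the SQL comment are assumptions, and a Map stands in for the table. In Postgres the row lock plays the role that single-threaded execution plays here.

```typescript
// SQL shape this models (column names are assumptions, not the RFC's statement):
//   UPDATE job SET status = 'in_progress'
//   WHERE id = $1 AND job_agent_id = $2 AND status = 'queued'
//   RETURNING *;

interface Job {
  id: string;
  jobAgentId: string;
  status: "queued" | "in_progress";
  dispatchContext: Record<string, unknown>;
}

type ClaimResult = { httpStatus: 200; job: Job } | { httpStatus: 409 };

// Flip the job to in_progress only if it is still queued for this agent.
// "Zero rows matched" maps to 409 Conflict; a real handler might instead
// return 404 for a job id that does not exist at all.
function claimJob(jobs: Map<string, Job>, jobAgentId: string, jobId: string): ClaimResult {
  const job = jobs.get(jobId);
  if (job === undefined || job.jobAgentId !== jobAgentId || job.status !== "queued") {
    return { httpStatus: 409 };
  }
  job.status = "in_progress";
  return { httpStatus: 200, job };
}
```

Calling claimJob twice for the same job returns 200 with the full job the first time and 409 the second time, which is also what a client retry of a successful claim would observe.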
Job payload
The claim response returns the job as-is; the poll response omits it. The job’s dispatch_context column is a self-contained execution snapshot
already populated at job creation — deployment, environment, resource, release,
version, resolved inputs, and variables. No joins or additional assembly are
required; the existing toJobResponse shape already emits jobAgentConfig and
dispatchContext.
Note: dispatch_context includes resolved variable values, so secret-flagged
variables are returned to the external agent. Returning context only on claim —
not on poll — limits this exposure to the one job the agent actually runs,
rather than every queued job a poll would list. This data otherwise never leaves
ctrlplane for push agents. The endpoint must be served over TLS; per-agent
authentication is addressed under V2.
Status reporting
Status reporting reuses the existing endpoint, PUT .../jobs/{jobId}/status. It records the reported status, sets completed_at on terminal states, and
enqueues a desired-release evaluation to advance the release. No new endpoint is
required for V1.
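
Putting poll, claim, and status reporting together, one iteration of an agent's loop might look like the sketch below. The PullApi interface is a hypothetical wrapper around the three HTTP calls (it is not an existing client library), and the status strings "successful" and "failure" are placeholders for whatever terminal values the status endpoint accepts.

```typescript
interface QueuedJobSummary {
  id: string;
}

interface ClaimedJob {
  id: string;
  dispatchContext: Record<string, unknown>;
}

// Hypothetical transport: each method wraps one of the three HTTP calls.
interface PullApi {
  listQueued(): Promise<QueuedJobSummary[]>; // GET  .../jobs?status=queued
  claim(jobId: string): Promise<ClaimedJob | "conflict">; // POST .../claim
  reportStatus(jobId: string, status: string): Promise<void>; // PUT .../status
}

// Poll once, claim the first job this agent wins, run it, and report.
// Returns the id of the job that was run, or null if nothing was claimed.
async function runOnce(
  api: PullApi,
  execute: (job: ClaimedJob) => Promise<boolean>,
): Promise<string | null> {
  const queued = await api.listQueued();
  for (const summary of queued) {
    const claimed = await api.claim(summary.id);
    if (claimed === "conflict") continue; // another agent won; try the next job
    let ok = false;
    try {
      ok = await execute(claimed);
    } catch {
      ok = false; // a thrown error is reported as a failed job
    }
    // Placeholder status values; the real enum may differ.
    await api.reportStatus(claimed.id, ok ? "successful" : "failure");
    return claimed.id;
  }
  return null;
}
```

A 409 on claim is treated as a normal outcome rather than an error, since it only means another claimant got there first.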
Authentication (V1)
V1 uses the existing x-api-key authentication and verifies that the target job
agent belongs to the authenticated workspace. Per-agent credentials are
addressed under V2.
Concurrency
The issue identifies two failure modes. V1 addresses the first; V2 addresses the second.
- Double-pickup: handled by the conditional claim above. A job is handed out at most once, even under overlapping claims or client retries; losers get 409 Conflict.
- Crash mid-job: not handled in V1. If an agent claims a job and dies, the job remains in_progress. Recovery is a manual transition back to queued (the same transition V2 automates). V2 adds automatic recovery.
V1 implementation surface
| Area | Change |
|---|---|
| job_status enum | add queued (schema + migration, validators, API status maps, OpenAPI, oapi/sqlc) |
| Agent type | new http-pull package; Dispatch sets queued; register in jobdispatch |
| Poll endpoint | GET .../job-agents/{id}/jobs?status=queued; lists queued jobs, no context; OpenAPI |
| Claim endpoint | POST .../job-agents/{id}/jobs/{jobId}/claim; conditional queued→in_progress, returns context; OpenAPI |
| Status reporting | reuse PUT .../jobs/{jobId}/status |
| Eligibility | unchanged |
| Dispatch flow | unchanged except the http-pull Dispatch body |
| Auth | reuse x-api-key + workspace ownership check |
V2: Lease, Heartbeat, and Reclaim (add-on)
V2 adds crash recovery. It is additive in the strongest sense: a new table, two endpoints, and a periodic sweep. The job table is not modified at all;
the queued enum value was already added in V1.
Claim table
Lease state lives in a dedicated job_claim table rather than as columns on
job. job.status remains the state machine: the claim still flips
queued → in_progress on job, but the lease lifecycle and the high-frequency
heartbeat writes are isolated to this narrow table. The motivation is write
locality: heartbeats are the most frequent write in this feature (every in-flight
job, every interval), and job is a hot, heavily-joined table with several
indexes and an updated_at trigger. Keeping heartbeats off job avoids index
churn and MVCC bloat on the read path. claim_id is a fencing token, populated
for free.
Lease
The claim records lease state in job_claim in the same statement that flips
the job to in_progress, using a CTE so it remains a single atomic operation. The claim response returns
lease_seconds so the agent can choose a heartbeat interval.
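
The effect of the V2 claim can be sketched in memory: the status flip and the job_claim row are written together or not at all. Everything named here is an assumption for illustration, including the JobClaim fields, the 60-second LEASE_SECONDS default, and the caller-supplied claim id; in the database the CTE is what makes the two writes one atomic statement.

```typescript
interface Job {
  id: string;
  status: "queued" | "in_progress";
}

// Hypothetical job_claim row; field names are assumptions.
interface JobClaim {
  jobId: string;
  claimId: string; // fencing token
  leaseExpiresAtMs: number;
}

// Hypothetical global default lease.
const LEASE_SECONDS = 60;

interface LeaseGrant {
  claimId: string;
  leaseSeconds: number;
}

// Claim a queued job and record its lease in one step.
// Returns null when the job is not claimable (the 409 case).
function claimWithLease(
  jobs: Map<string, Job>,
  claims: Map<string, JobClaim>,
  jobId: string,
  claimId: string,
  nowMs: number,
): LeaseGrant | null {
  const job = jobs.get(jobId);
  if (job === undefined || job.status !== "queued") return null;
  job.status = "in_progress";
  claims.set(jobId, { jobId, claimId, leaseExpiresAtMs: nowMs + LEASE_SECONDS * 1000 });
  return { claimId, leaseSeconds: LEASE_SECONDS };
}
```

With leaseSeconds in the response, an agent could heartbeat at some fraction of the lease (for example every third of it) so that a single missed heartbeat does not expire the claim.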
Heartbeat
The heartbeat endpoint, POST .../jobs/{jobId}/heartbeat, extends the lease. It writes only job_claim, never job.
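
A heartbeat that touches only the claim row can be sketched as follows. The JobClaim fields and the heartbeat function are hypothetical, and whether a late heartbeat should still be accepted when the lease has expired but the reaper has not yet run is a design choice this sketch does not settle; here any existing claim row is extended.

```typescript
// Hypothetical job_claim row; field names are assumptions.
interface JobClaim {
  jobId: string;
  claimId: string;
  leaseExpiresAtMs: number;
}

// Extend the lease for a job. Only the claims map is written; the job
// record itself is never read or modified by a heartbeat.
// Returns false when there is no claim row (never leased, or already reaped).
function heartbeat(
  claims: Map<string, JobClaim>,
  jobId: string,
  nowMs: number,
  leaseSeconds: number,
): boolean {
  const claim = claims.get(jobId);
  if (claim === undefined) return false;
  claim.leaseExpiresAtMs = nowMs + leaseSeconds * 1000;
  return true;
}
```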
Reaper
A periodic sweep returns abandoned claims to the queue, deleting the expired claim and flipping the job back to queued in one statement.
The sweep has the same shape as CleanupExpiredClaims.
Reclaim is opt-in by construction: only jobs that have a job_claim row are
ever swept. A job claimed without recording lease state — or any V1-era agent
that never engages the lease protocol — has no claim row and is never reclaimed,
preserving V1 behavior after V2 ships. When a job reaches a terminal status, its
job_claim row is removed.
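
The sweep and its opt-in property can be sketched in memory. The names below (reapExpiredClaims, the JobClaim fields) are hypothetical, and two Maps stand in for the job and job_claim tables; the point shown is that the sweep starts from claim rows, so a job without one is never touched.

```typescript
interface Job {
  id: string;
  status: "queued" | "in_progress";
}

// Hypothetical job_claim row; field names are assumptions.
interface JobClaim {
  jobId: string;
  claimId: string;
  leaseExpiresAtMs: number;
}

// Delete every expired claim and return its job to the queue.
// Returns the ids of the jobs that were requeued.
function reapExpiredClaims(
  jobs: Map<string, Job>,
  claims: Map<string, JobClaim>,
  nowMs: number,
): string[] {
  // Collect first, then mutate, so iteration is not affected by deletes.
  const expired: string[] = [];
  claims.forEach((claim, jobId) => {
    if (claim.leaseExpiresAtMs <= nowMs) expired.push(jobId);
  });

  const requeued: string[] = [];
  for (const jobId of expired) {
    claims.delete(jobId);
    const job = jobs.get(jobId);
    if (job !== undefined && job.status === "in_progress") {
      job.status = "queued";
      requeued.push(jobId);
    }
  }
  return requeued;
}
```

The status check on requeue guards against a job that reached a terminal state but whose claim row had not yet been removed.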
Reclaim and double-run
When a lease expires, the job returns to queued and becomes claimable again.
The reaper cannot distinguish a crashed agent from one that is alive but quiet
for longer than the lease, so a long pause can cause a job to be reclaimed and
run twice. A generous lease relative to the heartbeat interval reduces this
window but does not close it.
If exactly-once execution is required, the claim_id fencing token is returned
on claim, echoed by the agent on heartbeat and status, and a write carrying a
stale claim_id (one whose claim row was already reclaimed and superseded) is
rejected. The token exists in the schema from the start; enforcing it is
optional.
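
If fencing is enforced, the check itself is small. The sketch below assumes a hypothetical isCurrentClaim guard that heartbeat and status handlers would call before writing, with a Map standing in for job_claim; it is one possible shape, not a committed design.

```typescript
// Hypothetical job_claim row; field names are assumptions.
interface JobClaim {
  jobId: string;
  claimId: string; // fencing token issued at claim time
}

// Accept a heartbeat or status write only if it carries the claim_id of the
// job's current claim row. A token from a reclaimed claim no longer matches.
function isCurrentClaim(
  claims: Map<string, JobClaim>,
  jobId: string,
  claimId: string,
): boolean {
  const current = claims.get(jobId);
  return current !== undefined && current.claimId === claimId;
}
```

For example, if agent A's claim is reaped and agent B then claims the same job, A's late status write still carries the old token and is rejected, while B's writes are accepted. This bounds the effect of a double run on ctrlplane's own state; it cannot undo side effects the stale agent already performed externally.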
V2 implementation surface
| Area | Change |
|---|---|
| job schema | none; job is not modified |
| job_claim | new table (single CREATE TABLE, no change to job) |
| Claim | record job_claim row in the claim CTE; return lease_seconds + claim_id |
| Heartbeat | new POST .../jobs/{jobId}/heartbeat, writes only job_claim; OpenAPI path |
| Reaper | periodic sweep deleting expired claims and returning jobs to queued |
| Terminal status | remove the job_claim row when a job reaches a terminal state |
| Optional | per-agent lease config, claim_id fencing enforcement, per-agent tokens |
Migration
- V1 adds the queued value to the job_status enum. V2 adds a new job_claim table and does not modify job. Both are additive; existing jobs are unaffected.
- The dispatch controller, eligibility logic, and promotion lifecycle are unchanged except for recognizing the queued status and the http-pull agent’s Dispatch body.
- The status-reporting endpoint is reused unchanged. The poll and claim endpoints (V1) and heartbeat endpoint (V2) are new and do not alter existing endpoints.
- The V2 reaper only acts on jobs that have a job_claim row, so introducing it does not change the behavior of any agent that does not heartbeat.
Open Questions
- Long-poll vs. plain poll. V1 uses a plain poll: the list endpoint returns immediately with the current set of queued jobs (possibly empty). A long-poll variant (hold the request open until a job appears or a timeout elapses, bounded by a server-enforced maximum) reduces idle polling and is a candidate for V2. Backpressure and fairness limits on held connections are open.
- Verification timing. Verifications begin measuring when created (at dispatch), which for a pull agent can precede the claim by an unbounded queue wait. For long verification windows this is harmless; a short window could complete before the agent claims the job. If this becomes a problem, verification creation can be moved to the claim transition, or measurement can be gated on the job reaching in_progress. Deferred until needed.
- Lease configuration. Should the lease duration be per-agent (job_agent.config, bounded) or a single global default? A global default is the V2 starting point; per-agent is a later refinement for agents with different reliability characteristics.
AI Generated Questions
- Agent registration. A job is only routed to an agent that already exists and is matched by a deployment’s jobAgentSelector. Should an external agent be able to self-register its job_agent row and credentials via the API, or must agents be pre-provisioned by an operator?
- Per-agent authentication. V1 reuses x-api-key. V2 should issue a per-agent credential at registration so an agent authenticates as itself and can claim only its own jobs. What is the token model and rotation story?
- Fencing. Should V2 include a fencing token from the start, or add it only if double-run under lease expiry proves to be a real problem for the workloads pull agents run?