RFC 0014: Pull-Based Job Agent API

Category	Status	Created	Author
Job Agents	Draft	2026-06-02	Aditya Choudhari

Summary

Provide a REST API that lets an external provider act as a job agent by pulling the jobs assigned to it, executing them, and reporting status back — rather than ctrlplane pushing work into the provider’s environment. The work is split into two parts:

V1 delivers the pull contract: an agent polls for queued jobs, claims one atomically (at most once), runs it, and reports status. A new queued job status marks a job as claimable. Polling is a side-effect-free list; a separate claim call transitions the job and returns its execution context.
V2 adds crash recovery: a lease, a heartbeat endpoint, and a reaper that returns abandoned jobs to the queue. V2 is purely additive — V1 is shippable and useful on its own.

Motivation

Ctrlplane’s existing job agents are push / dispatch-style. The workspace engine initiates execution inside the agent’s system: ArgoCD syncs an Application, GitHub Actions runs a workflow, Terraform Cloud applies a plan. In each case ctrlplane reaches outbound into the agent’s environment. This does not fit an external provider that:

cannot (or should not) be reached inbound by ctrlplane, and
wants to integrate generically over HTTP rather than through a bespoke, per-system integration.

There is currently no generic way for such a provider to pull the jobs assigned to its job agent and run them. This RFC adds that path while reusing the existing job model, status-reporting endpoint, and verification flow.

Proposal

Model: producer / consumer

A push agent’s dispatch step both produces the job and delivers it (fires the workflow). A pull agent splits these:

ctrlplane produces the job and marks it claimable.
the external agent consumes it by polling, claiming, and running it.

The job row in Postgres is the queue. The dispatch controller is the producer; the agent’s poll discovers work and its claim takes delivery.

Job status: `queued`

A new value queued is added to the job_status enum (packages/db/src/schema/job.ts). It means: ctrlplane has finished preparing the job, and it is available for an agent to claim.

ALTER TYPE job_status ADD VALUE 'queued' AFTER 'pending';

The lifecycle for a pull-agent job:

queued ───claim───► in_progress ───report───► successful / failure

queued is semantically distinct from the existing states:

pending — created, not yet processed by the dispatch controller.
queued — prepared, waiting for an agent to claim it.
in_progress — claimed by an agent and executing.
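
The lifecycle above can be read as a small transition table. The following is a minimal sketch in Go, assuming only the statuses named in this RFC; the type, the constant names, and the table itself are illustrative and are not how ctrlplane necessarily stores or validates transitions. The in_progress → queued entry covers recovery of an abandoned job (manual in V1, automated by the V2 reaper).

```go
package main

// JobStatus mirrors the job_status values that matter to a pull agent.
type JobStatus string

const (
	Pending    JobStatus = "pending"
	Queued     JobStatus = "queued"
	InProgress JobStatus = "in_progress"
	Successful JobStatus = "successful"
	Failure    JobStatus = "failure"
)

// pullTransitions lists the moves this RFC describes for an http-pull job.
// Hypothetical table: other job_status values and agents are out of scope.
var pullTransitions = map[JobStatus][]JobStatus{
	// Dispatch marks the job claimable.
	Pending: {Queued},
	// An agent wins the claim.
	Queued: {InProgress},
	// The agent reports a result, or the job is returned to the queue.
	InProgress: {Successful, Failure, Queued},
}

// CanTransition reports whether from -> to is one of the listed moves.
func CanTransition(from, to JobStatus) bool {
	for _, next := range pullTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```

Under this table a job cannot skip from pending straight to in_progress, and a terminal status has no outgoing moves.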

The new value must be mirrored everywhere the enum is represented: the @ctrlplane/validators job statuses, the dbToOapiStatus / oapiToDbStatus maps in apps/api/src/routes/v1/workspaces/jobs.ts, the OpenAPI JobStatus schema, and the workspace-engine oapi enum plus its sqlc mappings.

Agent type: `http-pull`

A new agent type http-pull is registered in the workspace engine’s job agent registry (apps/workspace-engine/pkg/jobagents/, registered in apps/workspace-engine/svc/controllers/jobdispatch/controller.go). It implements types.Dispatchable. Its Dispatch does not push to an external system; it transitions the job to queued:

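package httppull

import "context"

// The types and oapi packages are workspace-engine internals; their imports
// and the HttpPull struct definition are omitted from this excerpt.

// Compile-time check that HttpPull satisfies the dispatch interface.
var _ types.Dispatchable = &HttpPull{}

func (a *HttpPull) Type() string { return "http-pull" }

// Dispatch pushes nothing to an external system. It only marks the job
// queued so that a polling agent can claim it.
func (a *HttpPull) Dispatch(ctx context.Context, job *oapi.Job) error {
    return a.setter.UpdateJob(ctx, job.Id, oapi.JobStatusQueued, "", nil)
}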

This keeps the dispatch pipeline uniform. Eligibility and the dispatch controller are otherwise unchanged: a job is created pending, enqueued for dispatch, the controller creates verification specs as it does for every agent, and the Dispatch call marks the job queued instead of pushing.

Verifications

Verifications are created by the dispatch controller at dispatch time, exactly as they are for the ArgoCD and Terraform Cloud agents. No change is made to the verification flow. As with those agents, verification metrics begin measuring when created rather than when execution starts. For http-pull this means measurements can begin before an agent claims the job; this matches existing behavior and is accepted for V1. See Open Questions.

Poll endpoint (V1)

GET /v1/workspaces/{workspaceId}/job-agents/{jobAgentId}/jobs?status=queued

Returns all jobs for the agent in the requested status. This is a plain, side-effect-free poll: it lists what is claimable but claims nothing. The agent picks a job from the list and claims it with a separate call. Added to apps/api/src/routes/v1/workspaces/job-agents.ts. The list response is intentionally lightweight — job id, deployment, environment, resource, and created_at — and omits dispatch_context. Resolved variables (including secret-flagged ones) are not returned here, so a poll never broadcasts secrets for every queued job to every agent. Context is returned only on claim, and only to the agent that wins it.

Claim endpoint (V1)

POST /v1/workspaces/{workspaceId}/job-agents/{jobAgentId}/jobs/{jobId}/claim

Atomically transitions a specific job queued → in_progress and returns the full job, including dispatch_context. Because the poll has no side effects, this is the single mutating step that hands a job to an agent. The claim is a conditional update guarded on the current status. Postgres row locking — not the transaction boundary — provides the at-most-once guarantee:

UPDATE job
SET status = 'in_progress', started_at = now()
WHERE id = $1 AND status = 'queued' AND job_agent_id = $2
RETURNING *;

If two agents claim the same job id concurrently, the row lock serializes them and only the first still sees status = 'queued'; the second matches zero rows and receives 409 Conflict. No SELECT ... FOR UPDATE SKIP LOCKED scan is needed because the agent names the job id explicitly: the status = 'queued' predicate does the work that a locking scan would do in a design where the server hands out the next available job. The reconcile work queue uses the same conditional-claim shape (ClaimReconcileWorkItems).
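
The at-most-once behavior of the conditional claim can be illustrated without a database. The sketch below is an in-memory stand-in, not the ctrlplane implementation: a mutex plays the role of the Postgres row lock, and the Store type, its methods, and RaceClaims are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"sync"
)

// ErrConflict is what the API would surface as 409 Conflict.
var ErrConflict = errors.New("job is not claimable")

// Store is an in-memory stand-in for the job table. The mutex plays the part
// of the row lock that serializes concurrent UPDATEs on one row.
type Store struct {
	mu     sync.Mutex
	status map[string]string // job id -> status
	agent  map[string]string // job id -> job_agent_id
}

func NewStore() *Store {
	return &Store{status: map[string]string{}, agent: map[string]string{}}
}

// Enqueue adds a queued job owned by agentID.
func (s *Store) Enqueue(jobID, agentID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[jobID] = "queued"
	s.agent[jobID] = agentID
}

// Claim mirrors the conditional UPDATE: it succeeds only while the job is
// still queued and belongs to agentID; otherwise "zero rows match".
func (s *Store) Claim(jobID, agentID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[jobID] != "queued" || s.agent[jobID] != agentID {
		return ErrConflict
	}
	s.status[jobID] = "in_progress"
	return nil
}

// RaceClaims fires n concurrent claims for one job and counts the winners.
func RaceClaims(s *Store, jobID, agentID string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.Claim(jobID, agentID) == nil {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return winners
}
```

However many goroutines race, the status check and the status change happen under one lock, so exactly one caller observes queued.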

Job payload

The claim response returns the job as-is; the poll response omits dispatch_context. The job’s dispatch_context column is a self-contained execution snapshot already populated at job creation: deployment, environment, resource, release, version, resolved inputs, and variables. No joins or additional assembly are required; the existing toJobResponse shape already emits jobAgentConfig and dispatchContext.

Note: dispatch_context includes resolved variable values, so secret-flagged variables are returned to the external agent. For push agents this data never leaves ctrlplane. Returning context only on claim, not on poll, limits the exposure to the one job the agent actually runs rather than every queued job a poll would list. The endpoint must be served over TLS; per-agent authentication is addressed under V2.

Status reporting

Status reporting reuses the existing endpoint:

PUT /v1/workspaces/{workspaceId}/jobs/{jobId}/status

It already records the status, sets completed_at on terminal states, and enqueues a desired-release evaluation to advance the release. No new endpoint is required for V1.

Authentication (V1)

V1 uses the existing x-api-key authentication and verifies that the target job agent belongs to the authenticated workspace. Per-agent credentials are addressed under V2.

Concurrency

The issue identifies two failure modes. V1 addresses the first; V2 addresses the second.

Double-pickup — handled by the conditional claim above. A job is handed out at most once, even under overlapping claims or client retries; losers get 409 Conflict.
Crash mid-job — not handled in V1. If an agent claims a job and dies, the job remains in_progress. Recovery is a manual transition back to queued (the same transition V2 automates). V2 adds automatic recovery.

V1 implementation surface

Area	Change
`job_status` enum	add `queued` (schema + migration, validators, API status maps, OpenAPI, oapi/sqlc)
Agent type	new `http-pull` package; `Dispatch` sets `queued`; register in `jobdispatch`
Poll endpoint	`GET .../job-agents/{id}/jobs?status=queued`; lists queued jobs, no context; OpenAPI
Claim endpoint	`POST .../job-agents/{id}/jobs/{jobId}/claim`; conditional `queued→in_progress`, returns context; OpenAPI
Status reporting	reuse `PUT .../jobs/{jobId}/status`
Eligibility	unchanged
Dispatch flow	unchanged except the `http-pull` `Dispatch` body
Auth	reuse `x-api-key` + workspace ownership check

V2: Lease, Heartbeat, and Reclaim (add-on)

V2 adds crash recovery. It is additive in the strongest sense: a new table, one new endpoint (heartbeat), an extended claim statement, and a periodic sweep. The job table is not modified at all; the queued enum value was already added in V1.

Claim table

Lease state lives in a dedicated job_claim table rather than as columns on job:

CREATE TABLE job_claim (
  job_id           uuid PRIMARY KEY REFERENCES job(id) ON DELETE CASCADE,
  job_agent_id     uuid NOT NULL,
  claimed_at       timestamptz NOT NULL DEFAULT now(),
  claim_expires_at timestamptz NOT NULL,
  claim_id         uuid NOT NULL DEFAULT gen_random_uuid()
);

The job’s status remains the state machine — the claim still flips queued → in_progress on job — but the lease lifecycle and the high-frequency heartbeat writes are isolated to this narrow table. The motivation is write locality: heartbeats are the most frequent write in this feature (every in-flight job, every interval), and job is a hot, heavily-joined table with several indexes and an updated_at trigger. Keeping heartbeats off job avoids index churn and MVCC bloat on the read path. claim_id is a fencing token, populated for free.

Lease

The claim records lease state in job_claim in the same statement that flips the job to in_progress, using a CTE so it remains a single atomic operation:

-- $1 = job id, $2 = job agent id, $3 = lease duration in seconds
WITH claimed AS (
  UPDATE job SET status = 'in_progress', started_at = now()
  WHERE id = $1 AND status = 'queued' AND job_agent_id = $2
  RETURNING id
)
INSERT INTO job_claim (job_id, job_agent_id, claim_expires_at)
SELECT id, $2, now() + make_interval(secs => $3) FROM claimed
RETURNING *;

The lease is a liveness window, not an execution deadline. A job may run far longer than the lease as long as the agent keeps the claim alive. The claim response advertises lease_seconds so the agent can choose a heartbeat interval.

Heartbeat

POST /v1/workspaces/{workspaceId}/jobs/{jobId}/heartbeat

Extends the lease. This touches only job_claim, never job:

-- $1 = job id, $2 = lease duration in seconds
UPDATE job_claim
SET claim_expires_at = now() + make_interval(secs => $2)
WHERE job_id = $1;

The agent calls this periodically while executing. The interval is the agent’s choice (a fraction of the advertised lease); the server does not store it.

Reaper

A periodic sweep returns abandoned claims to the queue — deleting the expired claim and flipping the job back to queued in one statement:

WITH expired AS (
  DELETE FROM job_claim WHERE claim_expires_at < now() RETURNING job_id
)
UPDATE job SET status = 'queued'
WHERE id IN (SELECT job_id FROM expired) AND status = 'in_progress';

Expiry is detected by this sweep, not by an event at the exact expiry time. The sweep mirrors the reconcile queue’s CleanupExpiredClaims. Reclaim is opt-in by construction: only jobs that have a job_claim row are ever swept. A job claimed without recording lease state — or any V1-era agent that never engages the lease protocol — has no claim row and is never reclaimed, preserving V1 behavior after V2 ships. When a job reaches a terminal status, its job_claim row is removed.

Reclaim and double-run

When a lease expires, the job returns to queued and becomes claimable again. The reaper cannot distinguish a crashed agent from one that is alive but quiet for longer than the lease, so a long pause can cause a job to be reclaimed and run twice. A generous lease relative to the heartbeat interval reduces this window but does not close it. Where a double-run must not be allowed to overwrite the current holder’s results, the claim_id fencing token is returned on claim and echoed by the agent on heartbeat and status, and any write carrying a stale claim_id (one whose claim row was already reclaimed and superseded) is rejected. Fencing rejects the stale agent’s writes to ctrlplane; it cannot undo side effects that agent already produced, so it is not by itself exactly-once execution. The token exists in the schema from the start; enforcing it is optional.
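
The fencing check itself is small. The sketch below models it in memory under stated assumptions: the Fence type and its methods are hypothetical, and a counter replaces gen_random_uuid() so the example is deterministic. A new claim supersedes the previous token, after which writes echoing the old token are refused.

```go
package main

import (
	"errors"
	"strconv"
)

// ErrStaleClaim is returned when a write carries a superseded claim_id.
var ErrStaleClaim = errors.New("stale claim_id")

// Fence tracks the live claim_id per job, as the job_claim row would.
type Fence struct {
	current map[string]string // job id -> live claim_id
	next    int
}

func NewFence() *Fence { return &Fence{current: map[string]string{}} }

// Claim issues a fresh token for jobID, superseding any earlier one.
// A counter stands in for gen_random_uuid() to keep the sketch deterministic.
func (f *Fence) Claim(jobID string) string {
	f.next++
	token := "claim-" + strconv.Itoa(f.next)
	f.current[jobID] = token
	return token
}

// Check accepts a heartbeat or status write only if it echoes the live token.
func (f *Fence) Check(jobID, claimID string) error {
	if live, ok := f.current[jobID]; !ok || live != claimID {
		return ErrStaleClaim
	}
	return nil
}
```

If an agent pauses past its lease and the job is reclaimed, the paused agent's later status report carries the old token and fails Check, while the new holder's writes pass.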

V2 implementation surface

Area	Change
`job` schema	none — `job` is not modified
`job_claim`	new table (single `CREATE TABLE`, no change to `job`)
Claim	record `job_claim` row in the claim CTE; return `lease_seconds` + `claim_id`
Heartbeat	new `POST .../jobs/{jobId}/heartbeat`, writes only `job_claim`; OpenAPI path
Reaper	periodic sweep deleting expired claims and returning jobs to `queued`
Terminal status	remove the `job_claim` row when a job reaches a terminal state
Optional	per-agent lease config, `claim_id` fencing enforcement, per-agent tokens

Migration

V1 adds the queued value to the job_status enum. V2 adds a new job_claim table and does not modify job. Both are additive; existing jobs are unaffected.
The dispatch controller, eligibility logic, and promotion lifecycle are unchanged except for recognizing the queued status and the http-pull agent’s Dispatch body.
The status-reporting endpoint is reused unchanged. The poll and claim endpoints (V1) and heartbeat endpoint (V2) are new and do not alter existing endpoints.
The V2 reaper only acts on jobs that have a job_claim row, so introducing it does not change the behavior of any agent that does not heartbeat.

Open Questions

Long-poll vs. plain poll. V1 uses a plain poll: the list endpoint returns immediately with the current set of queued jobs (possibly empty). A long-poll variant (hold the request open until a job appears or a timeout elapses, bounded by a server-enforced maximum) reduces idle polling and is a candidate for V2. Backpressure and fairness limits on held connections are open.
Verification timing. Verifications begin measuring when created (at dispatch), which for a pull agent can precede the claim by an unbounded queue wait. For long verification windows this is harmless; a short window could complete before the agent claims the job. If this becomes a problem, verification creation can be moved to the claim transition, or measurement can be gated on the job reaching in_progress. Deferred until needed.
Lease configuration. Should the lease duration be per-agent (job_agent.config, bounded) or a single global default? A global default is the V2 starting point; per-agent is a later refinement for agents with different reliability characteristics.

AI Generated Questions

Agent registration. A job is only routed to an agent that already exists and is matched by a deployment’s jobAgentSelector. Should an external agent be able to self-register its job_agent row and credentials via the API, or must agents be pre-provisioned by an operator?
Per-agent authentication. V1 reuses x-api-key. V2 should issue a per-agent credential at registration so an agent authenticates as itself and can claim only its own jobs. What is the token model and rotation story?
Fencing. The claim_id token exists in the job_claim schema from the start. Should V2 also enforce it from the start, or add enforcement only if double-run under lease expiry proves to be a real problem for the workloads pull agents run?
