docs: add DAG and worker system specifications

DAG.md describes the two-tier dependency graph (BOM DAG + feature DAG),
node/edge data model, validation states, dirty propagation, forward/backward
cone queries, DAG sync payload format, and REST API.

WORKERS.md describes the general-purpose async compute job system: YAML job
definitions, job lifecycle (pending→claimed→running→completed/failed),
runner registration and authentication, claim semantics (SELECT FOR UPDATE
SKIP LOCKED), timeout enforcement, SSE events, and REST API.
Forbes
2026-02-14 13:03:48 -06:00
parent 376fa3db31
commit 9a8b3150ff
2 changed files with 610 additions and 0 deletions

docs/WORKERS.md Normal file

@@ -0,0 +1,364 @@
# Worker System Specification
**Status:** Draft
**Last Updated:** 2026-02-13
---
## 1. Purpose
The worker system provides async compute job execution for Silo. Jobs are defined as YAML files, managed by the Silo server, and executed by external runner processes. The system is general-purpose -- while DAG validation is the first use case, it supports any compute workload: geometry export, thumbnail rendering, FEA/CFD batch jobs, report generation, and data migration.
---
## 2. Architecture
```
YAML Job Definitions (files on disk, version-controllable)
|
v
Silo Server (parser, scheduler, state machine, REST API, SSE events)
|
v
Runners (silorunner binary, polls via REST, executes Headless Create)
```
**Three layers:**
1. **Job definitions** -- YAML files in a configurable directory (default `/etc/silo/jobdefs`). Each file defines a job type: what triggers it, what it operates on, what computation to perform, and what runner capabilities are required. These are the source of truth and can be version-controlled alongside other Silo config.
2. **Silo server** -- Parses YAML definitions on startup and upserts them into the `job_definitions` table. Creates job instances when triggers fire (revision created, BOM changed, manual). Manages job lifecycle, enforces timeouts, and broadcasts status via SSE.
3. **Runners** -- Separate `silorunner` processes that authenticate with Silo via API tokens, poll for available jobs, claim them atomically, execute the compute, and report results. A runner host must have Headless Create and silo-mod installed for geometry jobs.
---
## 3. Job Definition Format
Job definitions are YAML files with the following structure:
```yaml
job:
  name: assembly-validate
  version: 1
  description: "Validate assembly by rebuilding its dependency subgraph"
  trigger:
    type: revision_created      # revision_created, bom_changed, manual, schedule
    filter:
      item_type: assembly       # only trigger for assemblies
  scope:
    type: assembly              # item, assembly, project
  compute:
    type: validate              # validate, rebuild, diff, export, custom
    command: create-validate    # runner-side command identifier
    args:                       # passed to runner as JSON
      rebuild_mode: incremental
      check_interference: true
  runner:
    tags: [create]              # required runner capabilities
  timeout: 900                  # seconds before job is marked failed (default 600)
  max_retries: 2                # retry count on failure (default 1)
  priority: 50                  # lower = higher priority (default 100)
```
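As a rough illustration of how a parsed definition might be checked, the sketch below validates the enumerated `type` fields and fills in the defaults named in the comments above. It assumes the YAML has already been loaded into a plain dict (the contents of the `job:` key) and that `timeout`, `max_retries`, and `priority` sit directly under `job`. The function name `normalize_job` is hypothetical and not part of Silo.

```python
# Allowed values and defaults are taken from the comments in the YAML above.
TRIGGER_TYPES = {"revision_created", "bom_changed", "manual", "schedule"}
SCOPE_TYPES = {"item", "assembly", "project"}
COMPUTE_TYPES = {"validate", "rebuild", "diff", "export", "custom"}
DEFAULTS = {"timeout": 600, "max_retries": 1, "priority": 100}


def normalize_job(job: dict) -> dict:
    """Validate the enumerated fields and fill in the documented defaults."""
    checks = [
        ("trigger", TRIGGER_TYPES),
        ("scope", SCOPE_TYPES),
        ("compute", COMPUTE_TYPES),
    ]
    for section, allowed in checks:
        value = job[section]["type"]
        if value not in allowed:
            raise ValueError(f"{section}.type: unsupported value {value!r}")
    # Keys present in the definition win over the defaults.
    return {**DEFAULTS, **job}


step_export = normalize_job({
    "name": "part-export-step",
    "trigger": {"type": "manual"},
    "scope": {"type": "item"},
    "compute": {"type": "export", "command": "create-export"},
    "timeout": 300,
})
print(step_export["timeout"], step_export["max_retries"], step_export["priority"])
# prints: 300 1 100
```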
### 3.1 Trigger Types
| Type | Description |
|------|-------------|
| `revision_created` | Fires when a new revision is created on an item matching the filter |
| `bom_changed` | Fires when a BOM merge completes |
| `manual` | Only triggered via `POST /api/jobs` |
| `schedule` | Future: cron-like scheduling (not yet implemented) |
### 3.2 Trigger Filters
The `filter` map supports key-value matching against item properties:
| Key | Description |
|-----|-------------|
| `item_type` | Match item type: `part`, `assembly`, `drawing`, etc. |
| `schema` | Match schema name |
All filter keys must match for the trigger to fire. An empty filter matches all items.
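The matching rule above (every key must match, an empty filter matches everything) can be sketched in a few lines. This is an illustration only: items are modelled as plain dicts, and the `schema` value `"default"` is a made-up example.

```python
def filter_matches(trigger_filter: dict, item: dict) -> bool:
    """True when every filter key equals the item's property of the same name."""
    # all() over an empty filter is True, so an empty filter matches any item.
    return all(item.get(key) == expected for key, expected in trigger_filter.items())


assembly = {"item_type": "assembly", "schema": "default"}  # hypothetical item
print(filter_matches({"item_type": "assembly"}, assembly))                  # True
print(filter_matches({}, assembly))                                         # True
print(filter_matches({"item_type": "assembly", "schema": "x"}, assembly))  # False
```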
### 3.3 Scope Types
| Type | Description |
|------|-------------|
| `item` | Job operates on a single item |
| `assembly` | Job operates on an assembly and its BOM tree |
| `project` | Job operates on all items in a project |
### 3.4 Compute Commands
The `command` field identifies what the runner should execute. Built-in commands:
| Command | Description |
|---------|-------------|
| `create-validate` | Open file in Headless Create, rebuild features, report validation results |
| `create-export` | Open file, export to specified format (STEP, IGES, 3MF) |
| `create-dag-extract` | Open file, extract feature DAG, output as JSON |
| `create-thumbnail` | Open file, render thumbnail image |
Custom commands can be added by extending silo-mod's `silo.runner` module.
---
## 4. Job Lifecycle
```
pending → claimed → running → completed
                            → failed
                            → cancelled
```
| State | Description |
|-------|-------------|
| `pending` | Job created, waiting for a runner to claim it |
| `claimed` | Runner has claimed the job. `expires_at` is set. |
| `running` | Runner has started execution (reported via progress update) |
| `completed` | Runner reported success. `result` JSONB contains output. |
| `failed` | Runner reported failure, timeout expired, or max retries exceeded |
| `cancelled` | Job was cancelled via `POST /api/jobs/{jobID}/cancel` before completion |
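
One way to read the lifecycle is as a transition table. The sketch below is an interpretation, not the server's actual state machine: `claimed → failed` is included because the timeout sweeper acts on claimed jobs, and `pending → cancelled` because the cancel endpoint accepts pending jobs. Whether a `claimed` job can be cancelled is not stated, so that edge is left out.

```python
# Allowed transitions as inferred from this section; terminal states map to empty sets.
ALLOWED = {
    "pending": {"claimed", "cancelled"},
    "claimed": {"running", "failed"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """True when moving from `current` to `target` is allowed by the table."""
    return target in ALLOWED.get(current, set())


print(can_transition("pending", "claimed"))    # True
print(can_transition("completed", "running"))  # False
```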
### 4.1 Claim Semantics
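Runners claim jobs via `POST /api/runner/claim`. The server uses PostgreSQL's `SELECT FOR UPDATE SKIP LOCKED` so that each pending job is claimed by at most one runner: a row already locked by one concurrent claim is skipped by the others instead of blocking them.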
```sql
WITH claimable AS (
    SELECT id FROM jobs
    WHERE status = 'pending'
      AND runner_tags <@ $2::text[]
    ORDER BY priority ASC, created_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs SET
    status = 'claimed',
    runner_id = $1,
    claimed_at = now(),
    expires_at = now() + (timeout_seconds || ' seconds')::interval
FROM claimable
WHERE jobs.id = claimable.id
RETURNING jobs.*;
```
The `runner_tags <@ $2::text[]` condition ensures the runner has all tags required by the job. A runner with tags `["create", "linux", "gpu"]` can claim a job requiring `["create"]`, but not one requiring `["create", "windows"]`.
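The selection logic of the query (tag containment, then lowest `priority`, then oldest `created_at`) can be mirrored in memory to make the ordering concrete. This sketch deliberately ignores row locking, which only the database provides. The `Job` dataclass and `pick_claimable` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: str
    priority: int
    created_at: float          # any sortable timestamp
    runner_tags: frozenset     # tags the job requires
    status: str = "pending"


def pick_claimable(jobs, runner_tags):
    """Return the job the claim query would select for this runner, or None."""
    tags = set(runner_tags)
    # `job.runner_tags <= tags` is the set form of `runner_tags <@ $2`.
    candidates = [j for j in jobs if j.status == "pending" and j.runner_tags <= tags]
    return min(candidates, key=lambda j: (j.priority, j.created_at), default=None)


queue = [
    Job("a", priority=10, created_at=1.0, runner_tags=frozenset({"create", "windows"})),
    Job("b", priority=100, created_at=2.0, runner_tags=frozenset({"create"})),
    Job("c", priority=50, created_at=3.0, runner_tags=frozenset({"create"})),
]
print(pick_claimable(queue, ["create", "linux", "gpu"]).id)  # c
```

Job `a` has the best priority but requires `windows`, so the Linux runner skips it and receives `c`.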
### 4.2 Timeout Enforcement
A background sweeper runs every 30 seconds (configurable via `jobs.job_timeout_check`) and marks expired jobs as failed:
```sql
UPDATE jobs SET status = 'failed', error_message = 'job timed out'
WHERE status IN ('claimed', 'running')
AND expires_at < now();
```
### 4.3 Retry
When a job fails and `retry_count < max_retries`, a new job is created with the same definition and scope, with `retry_count` incremented.
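A minimal sketch of the retry rule, assuming jobs are plain dicts and that `retry_count` starts at 0 on the first attempt (the starting value is not stated in this spec). The field names `definition_name` and `item_id` are borrowed from the `job.created` event payload, and `make_retry` is a hypothetical helper.

```python
def make_retry(failed_job: dict):
    """Return a new pending job for a failed one, or None when retries are exhausted."""
    if failed_job["retry_count"] >= failed_job["max_retries"]:
        return None
    return {
        "definition_name": failed_job["definition_name"],
        "item_id": failed_job["item_id"],
        "max_retries": failed_job["max_retries"],
        "retry_count": failed_job["retry_count"] + 1,
        "status": "pending",
    }


first = {"definition_name": "assembly-validate", "item_id": "item-1",
         "max_retries": 2, "retry_count": 0, "status": "failed"}
second = make_retry(first)
third = make_retry(second)
print(second["retry_count"], third["retry_count"], make_retry(third))  # 1 2 None
```

Under this reading, `max_retries: 2` allows up to three attempts in total.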
---
## 5. Runners
### 5.1 Registration
Runners are registered via `POST /api/runners` (admin only). The server generates a token (shown once) and stores the SHA-256 hash in the `runners` table. This follows the same pattern as API tokens in `internal/auth/token.go`.
### 5.2 Authentication
Runners authenticate via `Authorization: Bearer silo_runner_<token>`. A dedicated `RequireRunnerAuth` middleware validates the token against the `runners` table and injects a `RunnerIdentity` into the request context.
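The registration and authentication steps above can be sketched as follows. Silo's real implementation may differ: the token length, the choice to hash the full prefixed token, and all function names here are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

TOKEN_PREFIX = "silo_runner_"


def issue_runner_token() -> Tuple[str, str]:
    """Return (plaintext token shown once, hex SHA-256 digest to store)."""
    token = TOKEN_PREFIX + secrets.token_hex(32)  # 32 random bytes: an assumed length
    return token, hashlib.sha256(token.encode("utf-8")).hexdigest()


def token_from_header(authorization: str) -> Optional[str]:
    """Extract the runner token from an Authorization header value."""
    scheme, _, value = authorization.partition(" ")
    if scheme.lower() != "bearer" or not value.startswith(TOKEN_PREFIX):
        return None
    return value


def verify(authorization: str, stored_digest: str) -> bool:
    """Hash the presented token and compare it to the stored digest in constant time."""
    token = token_from_header(authorization)
    if token is None:
        return False
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_digest)


token, digest = issue_runner_token()
print(verify("Bearer " + token, digest))  # True
```

Only the digest needs to be persisted, so a leaked `runners` table does not reveal usable tokens.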
### 5.3 Heartbeat
Runners send `POST /api/runner/heartbeat` every 30 seconds. The server updates `last_heartbeat` and sets `status = 'online'`. A background sweeper marks runners as `offline` if their heartbeat is older than `runner_timeout` seconds (default 90).
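As an illustration of the offline rule, the sketch below derives a runner's status from its last heartbeat. `runner_status` is a hypothetical helper, and "older than" is read as strictly greater than `runner_timeout`.

```python
from datetime import datetime, timedelta, timezone


def runner_status(last_heartbeat: datetime, now: datetime, runner_timeout: int = 90) -> str:
    """Return 'offline' when the last heartbeat is older than runner_timeout seconds."""
    age = (now - last_heartbeat).total_seconds()
    return "offline" if age > runner_timeout else "online"


now = datetime(2026, 2, 13, 12, 0, 0, tzinfo=timezone.utc)
print(runner_status(now - timedelta(seconds=60), now))  # online
print(runner_status(now - timedelta(seconds=91), now))  # offline
```

With a 30-second heartbeat and the default 90-second timeout, a runner can miss two consecutive heartbeats and still be reported as online.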
### 5.4 Tags
Each runner declares capability tags (e.g., `["create", "linux", "gpu"]`). Jobs require specific tags via the `runner.tags` field in their YAML definition. A runner can only claim jobs whose required tags are a subset of the runner's tags.
### 5.5 Runner Config
The `silorunner` binary reads its config from a YAML file:
```yaml
server_url: "https://silo.example.com"
token: "silo_runner_abc123..."
name: "worker-01"
tags: ["create", "linux"]
poll_interval: 5 # seconds between claim attempts
create_path: "/usr/bin/create" # path to Headless Create binary (with silo-mod installed)
```
The same settings can also be supplied via environment variables: `SILO_RUNNER_SERVER_URL`, `SILO_RUNNER_TOKEN`, etc.
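A small sketch of how file settings and environment variables could be combined. It assumes environment variables take precedence over the file, which this spec does not state, and it covers only the two variable names given above. `apply_env` is a hypothetical helper that takes the already-parsed config as a dict.

```python
import os

# Only the two variables named in this section are mapped.
ENV_OVERRIDES = {
    "server_url": "SILO_RUNNER_SERVER_URL",
    "token": "SILO_RUNNER_TOKEN",
}


def apply_env(config: dict, environ=None) -> dict:
    """Return a copy of `config` with values replaced by any set environment variables."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for key, var in ENV_OVERRIDES.items():
        if var in environ:
            merged[key] = environ[var]
    return merged


file_config = {"server_url": "https://silo.example.com", "token": "from-file", "name": "worker-01"}
merged = apply_env(file_config, {"SILO_RUNNER_TOKEN": "from-env"})
print(merged["token"], merged["server_url"])  # from-env https://silo.example.com
```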
### 5.6 Deployment
Runner prerequisites:
- `silorunner` binary (built from `cmd/silorunner/`)
- Headless Create (Kindred's fork of FreeCAD) with silo-mod workbench installed
- Network access to Silo server API
Runners can be deployed as:
- Bare metal processes alongside Create installations
- Docker containers with Create pre-installed
- Scaled horizontally by registering multiple runners with different names
---
## 6. Job Log
Each job has an append-only log stored in the `job_log` table. Runners append entries via `POST /api/runner/jobs/{jobID}/log`:
```json
{
  "level": "info",
  "message": "Rebuilding Pad003...",
  "metadata": {"node_key": "Pad003", "progress_pct": 45}
}
```
Log levels: `debug`, `info`, `warn`, `error`.
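On the runner side, a log entry body could be assembled as in the sketch below. `log_entry` is a hypothetical helper, and treating `metadata` as optional is an assumption.

```python
import json

LOG_LEVELS = ("debug", "info", "warn", "error")


def log_entry(level: str, message: str, **metadata) -> str:
    """Build the JSON body for one log entry, rejecting unknown levels."""
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    entry = {"level": level, "message": message}
    if metadata:
        entry["metadata"] = metadata
    return json.dumps(entry)


body = log_entry("info", "Rebuilding Pad003...", node_key="Pad003", progress_pct=45)
print(body)
```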
---
## 7. SSE Events
All job lifecycle transitions are broadcast via Silo's SSE broker. Clients subscribe to `/api/events` and receive:
| Event Type | Payload | When |
|------------|---------|------|
| `job.created` | `{id, definition_name, item_id, status, priority}` | Job created |
| `job.claimed` | `{id, runner_id, runner_name}` | Runner claims job |
| `job.progress` | `{id, progress, progress_message}` | Runner reports progress (0-100) |
| `job.completed` | `{id, result_summary, duration_seconds}` | Job completed successfully |
| `job.failed` | `{id, error_message}` | Job failed |
| `job.cancelled` | `{id, cancelled_by}` | Admin cancelled job |
| `runner.online` | `{id, name, tags}` | First heartbeat received after the runner was offline |
| `runner.offline` | `{id, name}` | No heartbeat received within `runner_timeout` seconds |
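
For a sense of what a client might do with this stream, the sketch below parses Server-Sent Events text into `(event_type, payload)` pairs. It assumes Silo puts the event type in the SSE `event:` field and a JSON object in `data:`, which this spec does not spell out. `parse_sse` is a hypothetical helper that works on any iterable of lines.

```python
import json


def parse_sse(lines):
    """Yield (event_type, payload) for each complete event in an SSE line stream."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event or "message", json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


stream = [
    "event: job.progress\n",
    'data: {"id": "j1", "progress": 45, "progress_message": "Rebuilding Pad003..."}\n',
    "\n",
    ": keep-alive\n",
    "event: job.completed\n",
    'data: {"id": "j1", "duration_seconds": 12}\n',
    "\n",
]
for event_type, payload in parse_sse(stream):
    print(event_type, payload["id"])
```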
---
## 8. REST API
### 8.1 Job Endpoints (user-facing, require auth)
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/api/jobs` | viewer | List jobs (filterable by status, item, definition) |
| `GET` | `/api/jobs/{jobID}` | viewer | Get job details |
| `GET` | `/api/jobs/{jobID}/logs` | viewer | Get job log entries |
| `POST` | `/api/jobs` | editor | Manually trigger a job |
| `POST` | `/api/jobs/{jobID}/cancel` | editor | Cancel a pending/running job |
### 8.2 Job Definition Endpoints
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/api/job-definitions` | viewer | List loaded definitions |
| `GET` | `/api/job-definitions/{name}` | viewer | Get specific definition |
| `POST` | `/api/job-definitions/reload` | admin | Re-read YAML from disk |
### 8.3 Runner Management Endpoints (admin)
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/api/runners` | admin | List registered runners |
| `POST` | `/api/runners` | admin | Register runner (returns token) |
| `DELETE` | `/api/runners/{runnerID}` | admin | Delete runner |
### 8.4 Runner-Facing Endpoints (runner token auth)
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/api/runner/heartbeat` | runner | Send heartbeat |
| `POST` | `/api/runner/claim` | runner | Claim next available job |
| `PUT` | `/api/runner/jobs/{jobID}/progress` | runner | Report progress |
| `POST` | `/api/runner/jobs/{jobID}/complete` | runner | Report completion with result |
| `POST` | `/api/runner/jobs/{jobID}/fail` | runner | Report failure |
| `POST` | `/api/runner/jobs/{jobID}/log` | runner | Append log entry |
| `PUT` | `/api/runner/jobs/{jobID}/dag` | runner | Sync DAG results after compute |
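
The runner-facing endpoints suggest a simple claim, execute, report cycle. The sketch below shows one iteration against an abstract `client` object; the method names, their arguments, and the shape of the claimed job are all assumptions, since request and response bodies are not defined in this section. `FakeClient` exists only to make the example runnable without a server.

```python
def run_once(client, execute):
    """Claim one job and drive it to a final state. Returns that state, or None if idle."""
    job = client.claim()                        # POST /api/runner/claim
    if job is None:
        return None
    job_id = job["id"]
    client.progress(job_id, 0, "starting")      # PUT  /api/runner/jobs/{jobID}/progress
    try:
        result = execute(job)
    except Exception as exc:
        client.fail(job_id, str(exc))           # POST /api/runner/jobs/{jobID}/fail
        return "failed"
    client.complete(job_id, result)             # POST /api/runner/jobs/{jobID}/complete
    return "completed"


class FakeClient:
    """In-memory stand-in that records calls instead of issuing HTTP requests."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.calls = []

    def claim(self):
        self.calls.append(("claim",))
        return self.jobs.pop(0) if self.jobs else None

    def progress(self, job_id, pct, message):
        self.calls.append(("progress", job_id, pct))

    def complete(self, job_id, result):
        self.calls.append(("complete", job_id, result))

    def fail(self, job_id, error):
        self.calls.append(("fail", job_id, error))


client = FakeClient([{"id": "j1"}])
print(run_once(client, lambda job: {"ok": True}))  # completed
print(run_once(client, lambda job: {"ok": True}))  # None
```

A real runner would call this in a loop, sleeping for `poll_interval` seconds whenever it returns `None`.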
---
## 9. Configuration
Add to `config.yaml`:
```yaml
jobs:
  directory: /etc/silo/jobdefs    # path to YAML job definitions
  runner_timeout: 90              # seconds before marking runner offline
  job_timeout_check: 30           # seconds between timeout sweeps
  default_priority: 100           # default job priority
```
---
## 10. Example Job Definitions
### Assembly Validation
```yaml
job:
  name: assembly-validate
  version: 1
  description: "Validate assembly by rebuilding its dependency subgraph"
  trigger:
    type: revision_created
    filter:
      item_type: assembly
  scope:
    type: assembly
  compute:
    type: validate
    command: create-validate
    args:
      rebuild_mode: incremental
      check_interference: true
  runner:
    tags: [create]
  timeout: 900
  max_retries: 2
  priority: 50
```
### STEP Export
```yaml
job:
  name: part-export-step
  version: 1
  description: "Export a part to STEP format"
  trigger:
    type: manual
  scope:
    type: item
  compute:
    type: export
    command: create-export
    args:
      format: step
      output_key_template: "exports/{part_number}_rev{revision}.step"
  runner:
    tags: [create]
  timeout: 300
  max_retries: 1
  priority: 100
```
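The `output_key_template` in the STEP export definition uses `{part_number}` and `{revision}` placeholders. Assuming these follow the same brace syntax as Python's `str.format` (the spec does not define the template language), expansion could look like this. The part number `P-1001` is a made-up value.

```python
def render_output_key(template: str, part_number: str, revision: int) -> str:
    """Fill the {part_number} and {revision} placeholders in an output key template."""
    return template.format(part_number=part_number, revision=revision)


key = render_output_key("exports/{part_number}_rev{revision}.step", "P-1001", 3)
print(key)  # exports/P-1001_rev3.step
```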
---
## 11. References
- [DAG.md](DAG.md) -- Dependency DAG specification
- [MULTI_USER_EDITS.md](MULTI_USER_EDITS.md) -- Multi-user editing specification
- [ROADMAP.md](ROADMAP.md) -- Tier 0 Job Queue Infrastructure, Tier 1 Headless Create