Worker System Specification

Status: Draft
Last Updated: 2026-02-13


1. Purpose

The worker system provides async compute job execution for Silo. Jobs are defined as YAML files, managed by the Silo server, and executed by external runner processes. The system is general-purpose -- while DAG validation is the first use case, it supports any compute workload: geometry export, thumbnail rendering, FEA/CFD batch jobs, report generation, and data migration.


2. Architecture

```
YAML Job Definitions (files on disk, version-controllable)
         |
         v
Silo Server (parser, scheduler, state machine, REST API, SSE events)
         |
         v
Runners (silorunner binary, polls via REST, executes Headless Create)
```

Three layers:

  1. Job definitions -- YAML files in a configurable directory (default /etc/silo/jobdefs). Each file defines a job type: what triggers it, what it operates on, what computation to perform, and what runner capabilities are required. These are the source of truth and can be version-controlled alongside other Silo config.

  2. Silo server -- Parses YAML definitions on startup and upserts them into the job_definitions table. Creates job instances when triggers fire (revision created, BOM changed, manual). Manages job lifecycle, enforces timeouts, and broadcasts status via SSE.

  3. Runners -- Separate silorunner processes that authenticate with Silo via API tokens, poll for available jobs, claim them atomically, execute the compute, and report results. A runner host must have Headless Create and silo-mod installed for geometry jobs.


3. Job Definition Format

Job definitions are YAML files with the following structure:

```yaml
job:
  name: assembly-validate
  version: 1
  description: "Validate assembly by rebuilding its dependency subgraph"

  trigger:
    type: revision_created       # revision_created, bom_changed, manual, schedule
    filter:
      item_type: assembly        # only trigger for assemblies

  scope:
    type: assembly               # item, assembly, project

  compute:
    type: validate               # validate, rebuild, diff, export, custom
    command: create-validate     # runner-side command identifier
    args:                        # passed to runner as JSON
      rebuild_mode: incremental
      check_interference: true

  runner:
    tags: [create]               # required runner capabilities

  timeout: 900                   # seconds before job is marked failed (default 600)
  max_retries: 2                 # retry count on failure (default 1)
  priority: 50                   # lower = higher priority (default 100)
```

3.1 Trigger Types

| Type | Description |
|------|-------------|
| revision_created | Fires when a new revision is created on an item matching the filter |
| bom_changed | Fires when a BOM merge completes |
| manual | Only triggered via POST /api/jobs |
| schedule | Future: cron-like scheduling (not yet implemented) |

3.2 Trigger Filters

The filter map supports key-value matching against item properties:

| Key | Description |
|-----|-------------|
| item_type | Match item type: part, assembly, drawing, etc. |
| schema | Match schema name |

All filter keys must match for the trigger to fire. An empty filter matches all items.
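The matching rule can be sketched as follows (illustrative Python; the server implements this in Go, and the field names here are assumptions):

```python
def filter_matches(filter_map: dict, item: dict) -> bool:
    """Return True when every filter key equals the item's property.

    An empty filter matches all items, mirroring the rule above.
    """
    return all(item.get(key) == value for key, value in filter_map.items())


# Example: a revision_created trigger filtered to assemblies.
filter_matches({"item_type": "assembly"}, {"item_type": "assembly", "schema": "mech"})
```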

3.3 Scope Types

| Type | Description |
|------|-------------|
| item | Job operates on a single item |
| assembly | Job operates on an assembly and its BOM tree |
| project | Job operates on all items in a project |

3.4 Compute Commands

The command field identifies what the runner should execute. Built-in commands:

| Command | Description |
|---------|-------------|
| create-validate | Open file in Headless Create, rebuild features, report validation results |
| create-export | Open file, export to specified format (STEP, IGES, 3MF) |
| create-dag-extract | Open file, extract feature DAG, output as JSON |
| create-thumbnail | Open file, render thumbnail image |

Custom commands can be added by extending silo-mod's silo.runner module.


4. Job Lifecycle

```
pending → claimed → running → completed
                            → failed
                            → cancelled
```

| State | Description |
|-------|-------------|
| pending | Job created, waiting for a runner to claim it |
| claimed | Runner has claimed the job; expires_at is set |
| running | Runner has started execution (reported via progress update) |
| completed | Runner reported success; result JSONB contains output |
| failed | Runner reported failure, timeout expired, or max retries exceeded |
| cancelled | Admin cancelled the job before completion |
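The diagram implies a small transition table. A minimal sketch (illustrative Python; the exact legal-transition set, such as whether pending can move directly to failed, is an assumption, not confirmed by the spec):

```python
# Legal transitions implied by the lifecycle diagram above.
# Terminal states (completed, failed, cancelled) have no outgoing edges.
ALLOWED_TRANSITIONS = {
    "pending": {"claimed", "cancelled"},
    "claimed": {"running", "failed", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
}


def can_transition(current: str, target: str) -> bool:
    """Check whether a job may move from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```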

4.1 Claim Semantics

Runners claim jobs via POST /api/runner/claim. The server uses PostgreSQL's SELECT ... FOR UPDATE SKIP LOCKED so that each pending job is handed to at most one runner, even when many runners poll concurrently:

```sql
WITH claimable AS (
    SELECT id FROM jobs
    WHERE status = 'pending'
      AND runner_tags <@ $2::text[]
    ORDER BY priority ASC, created_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs SET
    status = 'claimed',
    runner_id = $1,
    claimed_at = now(),
    expires_at = now() + (timeout_seconds || ' seconds')::interval
FROM claimable
WHERE jobs.id = claimable.id
RETURNING jobs.*;
```

The runner_tags <@ $2::text[] condition ensures the runner has all tags required by the job. A runner with tags ["create", "linux", "gpu"] can claim a job requiring ["create"], but not one requiring ["create", "windows"].
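The array-containment check is equivalent to a set-subset test. A sketch of the same rule outside SQL (illustrative Python; the server relies on the SQL operator itself):

```python
def can_claim(required_tags: list[str], runner_tags: list[str]) -> bool:
    """Python equivalent of the SQL `runner_tags <@ $2::text[]` containment check:
    the job's required tags must be a subset of the runner's declared tags."""
    return set(required_tags) <= set(runner_tags)


# The example from the text: this runner can claim a job requiring
# ["create"], but not one requiring ["create", "windows"].
runner = ["create", "linux", "gpu"]
can_claim(["create"], runner)            # True
can_claim(["create", "windows"], runner) # False
```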

4.2 Timeout Enforcement

A background sweeper runs every 30 seconds (configurable via jobs.job_timeout_check) and marks expired jobs as failed:

```sql
UPDATE jobs SET status = 'failed', error_message = 'job timed out'
WHERE status IN ('claimed', 'running')
  AND expires_at < now();
```

4.3 Retry

When a job fails and retry_count < max_retries, a new job is created with the same definition and scope, with retry_count incremented.


5. Runners

5.1 Registration

Runners are registered via POST /api/runners (admin only). The server generates a token (shown once) and stores the SHA-256 hash in the runners table. This follows the same pattern as API tokens in internal/auth/token.go.
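The token flow can be sketched as follows (illustrative Python; the actual implementation lives in Go in internal/auth/token.go, and the token format here is an assumption):

```python
import hashlib
import secrets


def register_runner() -> tuple[str, str]:
    """Generate a runner token and the digest the server stores.

    The raw token is shown to the admin exactly once; only the SHA-256
    digest is persisted, so a database leak does not expose live tokens.
    """
    raw = "silo_runner_" + secrets.token_urlsafe(32)
    stored = hashlib.sha256(raw.encode()).hexdigest()
    return raw, stored


def verify(presented: str, stored: str) -> bool:
    """Check a presented bearer token against the stored digest."""
    return hashlib.sha256(presented.encode()).hexdigest() == stored
```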

5.2 Authentication

Runners authenticate via Authorization: Bearer silo_runner_<token>. A dedicated RequireRunnerAuth middleware validates the token against the runners table and injects a RunnerIdentity into the request context.

5.3 Heartbeat

Runners send POST /api/runner/heartbeat every 30 seconds. The server updates last_heartbeat and sets status = 'online'. A background sweeper marks runners as offline if their heartbeat is older than runner_timeout seconds (default 90).

5.4 Tags

Each runner declares capability tags (e.g., ["create", "linux", "gpu"]). Jobs require specific tags via the runner.tags field in their YAML definition. A runner can only claim jobs whose required tags are a subset of the runner's tags.

5.5 Runner Config

The silorunner binary reads its config from a YAML file:

```yaml
server_url: "https://silo.example.com"
token: "silo_runner_abc123..."
name: "worker-01"
tags: ["create", "linux"]
poll_interval: 5       # seconds between claim attempts
create_path: "/usr/bin/create"   # path to Headless Create binary (with silo-mod installed)
```

Or via environment variables: SILO_RUNNER_SERVER_URL, SILO_RUNNER_TOKEN, etc.
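The environment-over-file precedence can be sketched like this (illustrative Python; which keys are overridable, and the precedence order itself, are assumptions):

```python
import os

ENV_PREFIX = "SILO_RUNNER_"


def load_config(file_cfg: dict) -> dict:
    """Overlay SILO_RUNNER_* environment variables onto the file config.

    Environment variables win over file values; unset variables leave
    the file value in place.
    """
    cfg = dict(file_cfg)
    for key in ("server_url", "token", "name", "poll_interval", "create_path"):
        value = os.environ.get(ENV_PREFIX + key.upper())
        if value is not None:
            cfg[key] = int(value) if key == "poll_interval" else value
    return cfg
```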

5.6 Deployment

Runner prerequisites:

  • silorunner binary (built from cmd/silorunner/)
  • Headless Create (Kindred's fork of FreeCAD) with silo-mod workbench installed
  • Network access to Silo server API

Runners can be deployed as:

  • Bare-metal processes alongside Create installations
  • Docker containers with Create pre-installed
  • Multiple instances registered under different names, for horizontal scaling

6. Job Log

Each job has an append-only log stored in the job_log table. Runners append entries via POST /api/runner/jobs/{jobID}/log:

```json
{
  "level": "info",
  "message": "Rebuilding Pad003...",
  "metadata": {"node_key": "Pad003", "progress_pct": 45}
}
```

Log levels: debug, info, warn, error.


7. SSE Events

All job lifecycle transitions are broadcast via Silo's SSE broker. Clients subscribe to /api/events and receive:

| Event Type | Payload | When |
|------------|---------|------|
| job.created | {id, definition_name, item_id, status, priority} | Job created |
| job.claimed | {id, runner_id, runner_name} | Runner claims job |
| job.progress | {id, progress, progress_message} | Runner reports progress (0-100) |
| job.completed | {id, result_summary, duration_seconds} | Job completed successfully |
| job.failed | {id, error_message} | Job failed |
| job.cancelled | {id, cancelled_by} | Admin cancelled job |
| runner.online | {id, name, tags} | Runner heartbeat (first after offline) |
| runner.offline | {id, name} | Runner heartbeat timeout |
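A client consuming /api/events needs to split the text/event-stream body into frames. A minimal parser sketch (illustrative Python; a production client should also handle comments, `id:`, `retry:`, and multi-line `data:` fields per the SSE spec):

```python
def parse_sse(stream: str) -> list[dict]:
    """Parse text/event-stream frames into {event, data} dicts.

    Frames are separated by a blank line; each frame carries an
    `event:` type line and a `data:` payload line.
    """
    events = []
    for frame in stream.split("\n\n"):
        event = {}
        for line in frame.splitlines():
            if line.startswith("event:"):
                event["event"] = line[len("event:"):].strip()
            elif line.startswith("data:"):
                event["data"] = line[len("data:"):].strip()
        if event:
            events.append(event)
    return events
```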

8. REST API

8.1 Job Endpoints (user-facing, require auth)

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | /api/jobs | viewer | List jobs (filterable by status, item, definition) |
| GET | /api/jobs/{jobID} | viewer | Get job details |
| GET | /api/jobs/{jobID}/logs | viewer | Get job log entries |
| POST | /api/jobs | editor | Manually trigger a job |
| POST | /api/jobs/{jobID}/cancel | editor | Cancel a pending/running job |

8.2 Job Definition Endpoints

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | /api/job-definitions | viewer | List loaded definitions |
| GET | /api/job-definitions/{name} | viewer | Get specific definition |
| POST | /api/job-definitions/reload | admin | Re-read YAML from disk |

8.3 Runner Management Endpoints (admin)

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | /api/runners | admin | List registered runners |
| POST | /api/runners | admin | Register runner (returns token) |
| DELETE | /api/runners/{runnerID} | admin | Delete runner |

8.4 Runner-Facing Endpoints (runner token auth)

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| POST | /api/runner/heartbeat | runner | Send heartbeat |
| POST | /api/runner/claim | runner | Claim next available job |
| PUT | /api/runner/jobs/{jobID}/progress | runner | Report progress |
| POST | /api/runner/jobs/{jobID}/complete | runner | Report completion with result |
| POST | /api/runner/jobs/{jobID}/fail | runner | Report failure |
| POST | /api/runner/jobs/{jobID}/log | runner | Append log entry |
| PUT | /api/runner/jobs/{jobID}/dag | runner | Sync DAG results after compute |

9. Configuration

Add to config.yaml:

```yaml
jobs:
  directory: /etc/silo/jobdefs    # path to YAML job definitions
  runner_timeout: 90              # seconds before marking runner offline
  job_timeout_check: 30           # seconds between timeout sweeps
  default_priority: 100           # default job priority
```

10. Example Job Definitions

Assembly Validation

```yaml
job:
  name: assembly-validate
  version: 1
  description: "Validate assembly by rebuilding its dependency subgraph"
  trigger:
    type: revision_created
    filter:
      item_type: assembly
  scope:
    type: assembly
  compute:
    type: validate
    command: create-validate
    args:
      rebuild_mode: incremental
      check_interference: true
  runner:
    tags: [create]
  timeout: 900
  max_retries: 2
  priority: 50
```

STEP Export

```yaml
job:
  name: part-export-step
  version: 1
  description: "Export a part to STEP format"
  trigger:
    type: manual
  scope:
    type: item
  compute:
    type: export
    command: create-export
    args:
      format: step
      output_key_template: "exports/{part_number}_rev{revision}.step"
  runner:
    tags: [create]
  timeout: 300
  max_retries: 1
  priority: 100
```

11. References