solver/configs/dataset/synthetic.yaml
forbes-0023 f29060491e
feat(datagen): add dataset generation CLI with sharding and checkpointing
- Add solver/datagen/dataset.py with DatasetConfig, DatasetGenerator,
  ShardSpec/ShardResult dataclasses, parallel shard generation via
  ProcessPoolExecutor, checkpoint/resume support, index and stats output
- Add scripts/generate_synthetic.py CLI entry point with Hydra-first
  and argparse fallback modes
- Add minimal YAML parser (parse_simple_yaml) for config loading
  without PyYAML dependency
- Add progress display with tqdm fallback to print-based ETA
- Update configs/dataset/synthetic.yaml with shard_size, checkpoint_every
- Update solver/datagen/__init__.py with DatasetConfig, DatasetGenerator
  exports
- Add tests/datagen/test_dataset.py with 28 tests covering config,
  YAML parsing, seed derivation, end-to-end generation, resume,
  stats/index structure, determinism, and CLI integration

Closes #10
2026-02-03 08:44:31 -06:00


# Synthetic dataset generation config
name: synthetic
num_assemblies: 100000
output_dir: data/synthetic
shard_size: 1000
complexity_distribution:
  simple: 0.4 # 2-5 bodies
  medium: 0.4 # 6-15 bodies
  complex: 0.2 # 16-50 bodies
body_count:
  min: 2
  max: 50
templates:
  - chain
  - tree
  - loop
  - star
  - mixed
grounded_ratio: 0.5
seed: 42
num_workers: 4
checkpoint_every: 5
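
As a rough illustration of how these values fit together, the sketch below mirrors the config as a plain Python dict, computes the shard count, checks that the tier weights sum to 1, and samples a body count per assembly. It is not the code in solver/datagen/dataset.py: the helper names (`num_shards`, `validate`, `shard_rng`, `sample_body_count`), the per-shard seeding scheme, and the use of the inline-comment ranges as tier bounds are all assumptions made for this example.

```python
import math
import random

# Values copied from the config above (only the keys this sketch uses).
CONFIG = {
    "num_assemblies": 100000,
    "shard_size": 1000,
    "complexity_distribution": {"simple": 0.4, "medium": 0.4, "complex": 0.2},
    "body_count": {"min": 2, "max": 50},
    "seed": 42,
}

# Assumption: tier bounds are taken from the inline comments in the config.
TIER_RANGES = {"simple": (2, 5), "medium": (6, 15), "complex": (16, 50)}


def num_shards(cfg):
    """Shards needed to hold num_assemblies; the last shard may be partial."""
    return math.ceil(cfg["num_assemblies"] / cfg["shard_size"])


def validate(cfg):
    """Check that tier weights sum to 1 and tier bounds fit inside body_count."""
    total = sum(cfg["complexity_distribution"].values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"complexity weights sum to {total}, expected 1.0")
    lo, hi = cfg["body_count"]["min"], cfg["body_count"]["max"]
    for tier, (a, b) in TIER_RANGES.items():
        if not (lo <= a <= b <= hi):
            raise ValueError(f"tier {tier!r} range {a}-{b} outside {lo}-{hi}")


def shard_rng(cfg, shard_index):
    """Hypothetical per-shard RNG: seeded from the base seed and shard index."""
    return random.Random(f"{cfg['seed']}:{shard_index}")


def sample_body_count(cfg, rng):
    """Pick a complexity tier by weight, then a body count within that tier."""
    tiers = list(cfg["complexity_distribution"])
    weights = [cfg["complexity_distribution"][t] for t in tiers]
    tier = rng.choices(tiers, weights=weights, k=1)[0]
    a, b = TIER_RANGES[tier]
    return tier, rng.randint(a, b)


if __name__ == "__main__":
    validate(CONFIG)
    print("shards:", num_shards(CONFIG))
    rng = shard_rng(CONFIG, 0)
    print([sample_body_count(CONFIG, rng) for _ in range(3)])
```

Seeding each shard from the base seed plus its index is one way to keep shards reproducible regardless of which worker process generates them or whether a run is resumed from a checkpoint; the actual derivation used by the generator may differ.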