Dataset generation CLI #10

Closed
opened 2026-02-02 19:33:53 +00:00 by forbes · 0 comments
Owner

Summary

Implement scripts/generate_synthetic.py as the CLI entry point for synthetic dataset generation, per Phase 1.4 of the repository plan.

Features

Hydra configuration

  • Uses configs/dataset/synthetic.yaml as base config
  • Override any parameter via CLI: python scripts/generate_synthetic.py num_assemblies=50000 complexity_distribution.simple=0.6
  • Config groups: complexity distribution, body count ranges, template selection, geometric diversity
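
To make the override syntax concrete, here is a small stdlib-only illustration of what a dotted override such as complexity_distribution.simple=0.6 does to a nested config. This is not Hydra's parser (Hydra does this itself once @hydra.main is in place). The helper name apply_override and the base ratios are made up for the example.

```python
import ast
from copy import deepcopy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of config with one 'a.b.c=value' override applied (hypothetical helper)."""
    dotted_key, raw_value = override.split("=", 1)
    try:
        # Parse numbers, booleans, lists, etc.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything that is not a Python literal is kept as a plain string.
        value = raw_value

    result = deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return result


# Placeholder base config; the real values live in configs/dataset/synthetic.yaml.
base = {
    "num_assemblies": 100_000,
    "complexity_distribution": {"simple": 0.5, "medium": 0.3, "complex": 0.2},
}

cfg = base
for item in ["num_assemblies=50000", "complexity_distribution.simple=0.6"]:
    cfg = apply_override(cfg, item)

print(cfg)
```

Note that overriding a single ratio leaves the other two untouched, so the script would probably need to either renormalize the distribution or reject ratios that no longer add up.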

Generation parameters

  • num_assemblies: total assemblies to generate (target: 100k)
  • complexity_distribution: ratio of simple/medium/complex assemblies
  • body_count.min, body_count.max: body count range
  • templates: list of enabled templates (chain, tree, loop, star, mixed)
  • grounded_ratio: fraction of grounded assemblies
  • output_dir: where to write output files
  • num_workers: parallel generation workers
  • seed: random seed for reproducible generation
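
One possible way to pin these parameters down is a typed container with a validation pass that runs before any work starts. The sketch below is only an assumption about shape: the class name GenerationParams, the flattened body_count_min / body_count_max fields, and every default except the 100k target are placeholders, not values from the repository.

```python
from dataclasses import dataclass, field

KNOWN_TEMPLATES = ("chain", "tree", "loop", "star", "mixed")


@dataclass
class GenerationParams:
    """Hypothetical container mirroring the parameters listed above."""

    num_assemblies: int = 100_000  # target from this issue
    complexity_distribution: dict = field(
        default_factory=lambda: {"simple": 0.5, "medium": 0.3, "complex": 0.2}
    )  # placeholder ratios
    body_count_min: int = 2        # placeholder
    body_count_max: int = 50       # placeholder
    templates: tuple = KNOWN_TEMPLATES
    grounded_ratio: float = 0.5    # placeholder
    output_dir: str = "data/synthetic"  # placeholder
    num_workers: int = 4           # placeholder
    seed: int = 42                 # placeholder

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means the params look sane."""
        problems = []
        if self.num_assemblies <= 0:
            problems.append("num_assemblies must be positive")
        ratios = self.complexity_distribution.values()
        if any(r < 0 for r in ratios) or sum(ratios) <= 0:
            problems.append("complexity_distribution needs non-negative ratios with a positive sum")
        if not 1 <= self.body_count_min <= self.body_count_max:
            problems.append("body_count range must satisfy 1 <= min <= max")
        if not self.templates or set(self.templates) - set(KNOWN_TEMPLATES):
            problems.append("templates must be a non-empty subset of the known templates")
        if not 0.0 <= self.grounded_ratio <= 1.0:
            problems.append("grounded_ratio must be in [0, 1]")
        if self.num_workers < 1:
            problems.append("num_workers must be at least 1")
        return problems


defaults = GenerationParams()
bad = GenerationParams(
    body_count_min=10, body_count_max=5, templates=("chain", "spiral"), grounded_ratio=1.5
)
print(defaults.validate())
print(bad.validate())
```

Failing fast here avoids discovering a bad range or an unknown template name partway through a long run.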

Progress tracking and resumability

  • Progress bar (tqdm or rich) showing assemblies generated
  • Periodic checkpointing: write partial results every N assemblies
  • Resume from checkpoint: detect existing output and skip completed assemblies
  • Log file with generation parameters and progress
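
The checkpoint/resume behaviour could be sketched as follows, assuming a JSON checkpoint that records completed assembly indices. The file name, the completed_ids key, and the run helper are hypothetical, and actual generation is replaced by a placeholder. The checkpoint is written to a temporary file and renamed, so an interrupted write should not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_completed(checkpoint_path: Path) -> set[int]:
    """Read the set of finished assembly indices, or an empty set on a fresh run."""
    if not checkpoint_path.exists():
        return set()
    with checkpoint_path.open() as fh:
        return set(json.load(fh)["completed_ids"])


def save_checkpoint(checkpoint_path: Path, completed: set[int]) -> None:
    """Write the checkpoint to a temp file, then rename it over the old one."""
    tmp_path = checkpoint_path.with_suffix(".tmp")
    with tmp_path.open("w") as fh:
        json.dump({"completed_ids": sorted(completed)}, fh)
    os.replace(tmp_path, checkpoint_path)


def run(num_assemblies, checkpoint_path, checkpoint_every, crash_after=None):
    """Generate whatever is still pending; crash_after simulates an interruption."""
    completed = load_completed(checkpoint_path)
    pending = [i for i in range(num_assemblies) if i not in completed]
    done_this_run = 0
    for assembly_id in pending:
        if crash_after is not None and done_this_run >= crash_after:
            return done_this_run  # simulated crash: no final checkpoint is written
        # ... generate and label assembly `assembly_id` here (placeholder) ...
        completed.add(assembly_id)
        done_this_run += 1
        if done_this_run % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, completed)
    save_checkpoint(checkpoint_path, completed)
    return done_this_run


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir) / "checkpoint.json"
    first_run = run(10, ckpt, checkpoint_every=2, crash_after=5)
    after_crash = load_completed(ckpt)
    second_run = run(10, ckpt, checkpoint_every=2)
    final = load_completed(ckpt)

print(first_run, sorted(after_crash), second_run, len(final))
```

In this demo the first run finishes five assemblies before the simulated crash, but only the four covered by the last checkpoint survive, so the resumed run redoes one of them. That is the trade-off behind the choice of N.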

Output format

  • One .pt file per batch (or configurable: single file vs sharded)
  • Each file contains serialized list of labeled assembly dicts
  • Index file mapping assembly IDs to shard files
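
A minimal sketch of the sharded layout and its index, assuming fixed-size shards and zero-padded shard names. The naming scheme, the plan_shards helper, and the asm_ ID format are illustrative only. Writing is left as a comment: the .pt extension suggests torch.save on a list of dicts, but that is an assumption here.

```python
import json


def plan_shards(assembly_ids, shard_size):
    """Group assembly IDs into shards and build the ID -> shard-file index."""
    shards = {}
    index = {}
    for position, assembly_id in enumerate(assembly_ids):
        shard_name = f"shard_{position // shard_size:05d}.pt"
        shards.setdefault(shard_name, []).append(assembly_id)
        index[assembly_id] = shard_name
    return shards, index


ids = [f"asm_{i:06d}" for i in range(5)]
shards, index = plan_shards(ids, shard_size=2)

# For each shard, the script would then do something like
#   torch.save(list_of_labeled_assembly_dicts, output_dir / shard_name)
# and write the index next to the shards, for example as JSON:
index_json = json.dumps(index, indent=2)
print(index_json)
```

With an index like this, a loader can open just the one shard that holds a given assembly instead of scanning every file.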

Dataset statistics report

  • Print on completion:
    • Total assemblies generated
    • Classification distribution (rigid/under/over/mixed)
    • Body count histogram
    • Joint type distribution
    • DOF distribution
    • Geometric degeneracy rate
    • Generation time
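
The report above could be computed in a single pass over the labeled assembly dicts. The sketch below assumes each dict has the keys classification, num_bodies, joint_types, dof, and degenerate. Those key names, the joint type names, and the three sample records are invented for illustration and may not match the labeling pipeline's actual schema.

```python
import json
from collections import Counter


def summarize(assemblies, elapsed_seconds):
    """Aggregate the statistics listed above from labeled assembly dicts."""
    total = len(assemblies)
    joint_counts = Counter()
    for assembly in assemblies:
        joint_counts.update(assembly["joint_types"])
    degenerate = sum(1 for assembly in assemblies if assembly["degenerate"])
    return {
        "total_assemblies": total,
        "classification": dict(Counter(a["classification"] for a in assemblies)),
        "body_count_histogram": dict(Counter(a["num_bodies"] for a in assemblies)),
        "joint_types": dict(joint_counts),
        "dof_distribution": dict(Counter(a["dof"] for a in assemblies)),
        "degeneracy_rate": degenerate / total if total else 0.0,
        "generation_time_s": elapsed_seconds,
    }


# Made-up records, only to exercise the function.
sample = [
    {"classification": "rigid", "num_bodies": 3, "joint_types": ["revolute", "fixed"],
     "dof": 0, "degenerate": False},
    {"classification": "under", "num_bodies": 3, "joint_types": ["revolute"],
     "dof": 2, "degenerate": False},
    {"classification": "over", "num_bodies": 4, "joint_types": ["fixed", "fixed", "revolute"],
     "dof": 0, "degenerate": True},
]

stats = summarize(sample, elapsed_seconds=1.5)
stats_json = json.dumps(stats, indent=2)  # what would be written to stats.json
print(stats_json)
```

One detail to keep in mind: JSON object keys are always strings, so the integer keys of the body count and DOF histograms come back as strings when stats.json is reloaded.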

Requirements

  • scripts/generate_synthetic.py with Hydra @hydra.main
  • Parallel generation via multiprocessing or concurrent.futures
  • Progress bar with ETA
  • Checkpoint/resume support
  • Statistics report printed and saved to {output_dir}/stats.json
  • Output compatible with Phase 3 PyG dataset adapter expectations
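
For the parallel generation requirement, one design worth considering is to derive each assembly's random stream from the base seed and the assembly index, so the output does not depend on num_workers or on scheduling order. The sketch below uses concurrent.futures. generate_one is a stand-in for the real generator (#5), and its fields are hypothetical. ThreadPoolExecutor is used only so the demo runs anywhere; for CPU-bound generation, ProcessPoolExecutor would be the likely choice, called from under an `if __name__ == "__main__":` guard.

```python
import random
from concurrent.futures import ThreadPoolExecutor

TEMPLATES = ("chain", "tree", "loop", "star", "mixed")


def generate_one(task):
    """Stand-in generator: builds a toy record from a per-assembly RNG."""
    base_seed, assembly_id = task
    # A string seed is hashed deterministically, so each (seed, id) pair
    # gets its own reproducible stream regardless of which worker runs it.
    rng = random.Random(f"{base_seed}:{assembly_id}")
    return {
        "id": assembly_id,
        "template": rng.choice(TEMPLATES),
        "num_bodies": rng.randint(2, 10),
    }


def generate_all(num_assemblies, base_seed, num_workers, executor_cls=ThreadPoolExecutor):
    """Generate all assemblies in parallel; map() returns results in task order."""
    tasks = [(base_seed, i) for i in range(num_assemblies)]
    with executor_cls(max_workers=num_workers) as pool:
        return list(pool.map(generate_one, tasks))


serial = generate_all(20, base_seed=42, num_workers=1)
parallel = generate_all(20, base_seed=42, num_workers=4)
other_seed_ids = [a["id"] for a in generate_all(20, base_seed=7, num_workers=4)]
print(serial == parallel)
```

Because pool.map yields results in task order, wrapping it in a progress iterator (for example tqdm with total=num_assemblies) should be enough to get a bar with an ETA.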

Depends on

  • #5 (generator)
  • #7 (assembly templates)
  • #8 (geometric diversity)
  • #9 (labeling pipeline)
forbes added the phase:1 and feature labels 2026-02-02 19:33:53 +00:00
Reference: kindred/solver#10