Dataset generation CLI #10

forbes · 2026-02-02T19:33:53Z

Summary

Implement scripts/generate_synthetic.py as the CLI entry point for synthetic dataset generation, per Phase 1.4 of the repository plan.

Features

Hydra configuration

Uses configs/dataset/synthetic.yaml as base config
Override any parameter via CLI: python scripts/generate_synthetic.py num_assemblies=50000 complexity_distribution.simple=0.6
Config groups: complexity distribution, body count ranges, template selection, geometric diversity
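
To illustrate what a dotted override such as complexity_distribution.simple=0.6 does to the base config, here is a minimal stdlib-only sketch. It is not Hydra's implementation (Hydra and OmegaConf do their own parsing and typing); the helper name apply_overrides and the default values in BASE_CONFIG are assumptions made purely for illustration.

```python
import ast
import copy


def apply_overrides(base: dict, overrides: list[str]) -> dict:
    """Return a copy of `base` with `a.b.c=value` style overrides applied."""
    cfg = copy.deepcopy(base)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            # Interpret numbers, booleans, lists, etc. as Python literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value  # anything else stays a plain string
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return cfg


# Hypothetical defaults standing in for configs/dataset/synthetic.yaml.
BASE_CONFIG = {
    "num_assemblies": 100_000,
    "complexity_distribution": {"simple": 0.5, "medium": 0.3, "complex": 0.2},
}

cfg = apply_overrides(
    BASE_CONFIG,
    ["num_assemblies=50000", "complexity_distribution.simple=0.6"],
)
```

Note that overriding a single ratio leaves the other two untouched, so the distribution in this example no longer sums to 1; the script would probably need to validate or renormalize it.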

Generation parameters

num_assemblies: total assemblies to generate (target: 100k)
complexity_distribution: ratio of simple/medium/complex assemblies
body_count.min, body_count.max: body count range
templates: list of enabled templates (chain, tree, loop, star, mixed)
grounded_ratio: fraction of grounded assemblies
output_dir: where to write output files
num_workers: parallel generation workers
seed: random seed for reproducibility
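
One possible way to hold and check these parameters is sketched below. The field names follow the list above (with body_count.min/max flattened), but the default values, the validate method, and the counts_per_complexity helper are assumptions for illustration, not part of the planned script.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationParams:
    # Default values here are illustrative assumptions only.
    num_assemblies: int = 100_000
    complexity_distribution: dict = field(
        default_factory=lambda: {"simple": 0.5, "medium": 0.3, "complex": 0.2}
    )
    body_count_min: int = 2
    body_count_max: int = 20
    grounded_ratio: float = 0.5
    seed: int = 42

    def validate(self) -> None:
        """Raise ValueError if any parameter is out of range."""
        if self.num_assemblies <= 0:
            raise ValueError("num_assemblies must be positive")
        if not 1 <= self.body_count_min <= self.body_count_max:
            raise ValueError("need 1 <= body_count.min <= body_count.max")
        if not 0.0 <= self.grounded_ratio <= 1.0:
            raise ValueError("grounded_ratio must lie in [0, 1]")
        ratios = self.complexity_distribution.values()
        if any(r < 0 for r in ratios) or sum(ratios) <= 0:
            raise ValueError("complexity ratios must be non-negative, not all zero")


def counts_per_complexity(params: GenerationParams) -> dict:
    """Split num_assemblies across classes using largest-remainder rounding."""
    ratios = params.complexity_distribution
    total_ratio = sum(ratios.values())  # normalizes ratios that do not sum to 1
    exact = {k: params.num_assemblies * r / total_ratio for k, r in ratios.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = params.num_assemblies - sum(counts.values())
    # Hand the remaining assemblies to the classes with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts
```

Normalizing by the ratio sum means a partially overridden distribution still yields exactly num_assemblies in total.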

Progress tracking and resumability

Progress bar (tqdm or rich) showing assemblies generated
Periodic checkpointing: write partial results every N assemblies
Resume from checkpoint: detect existing output and skip completed assemblies
Log file with generation parameters and progress

Output format

One .pt file per batch (configurable: single file or sharded)
Each file contains serialized list of labeled assembly dicts
Index file mapping assembly IDs to shard files
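
A sharded layout with an index file might be organized as in this sketch. To stay runnable without PyTorch it uses pickle as a stand-in for torch.save and torch.load; the shard naming scheme, the index.json file name, and the "id" key on each assembly dict are assumptions.

```python
import json
import pickle
from pathlib import Path


def write_shards(assemblies: list[dict], output_dir: Path, shard_size: int) -> dict:
    """Write assemblies in fixed-size shards and return the ID -> shard index."""
    output_dir.mkdir(parents=True, exist_ok=True)
    index = {}
    for shard_no, start in enumerate(range(0, len(assemblies), shard_size)):
        shard = assemblies[start:start + shard_size]
        shard_name = f"shard_{shard_no:05d}.pt"  # hypothetical naming scheme
        with open(output_dir / shard_name, "wb") as handle:
            pickle.dump(shard, handle)  # stand-in for torch.save(shard, path)
        for assembly in shard:
            index[assembly["id"]] = shard_name
    (output_dir / "index.json").write_text(json.dumps(index, indent=2))
    return index


def load_assembly(output_dir: Path, assembly_id: str) -> dict:
    """Look up one assembly by ID, opening only the shard that contains it."""
    index = json.loads((output_dir / "index.json").read_text())
    with open(output_dir / index[assembly_id], "rb") as handle:
        shard = pickle.load(handle)  # stand-in for torch.load(path)
    return next(a for a in shard if a["id"] == assembly_id)
```

String IDs are used here because JSON object keys are always strings; integer IDs would need converting on load.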

Dataset statistics report

Print on completion:
- Total assemblies generated
- Classification distribution (rigid/under/over/mixed)
- Body count histogram
- Joint type distribution
- DOF distribution
- Geometric degeneracy rate
- Generation time
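
The report items above could be aggregated along these lines. The per-assembly keys (classification, num_bodies, joint_types, dof, degenerate) are hypothetical, since the issue does not specify the labeled assembly dict schema; only the stats.json file name comes from the requirements.

```python
import json
from collections import Counter
from pathlib import Path


def compute_stats(assemblies: list[dict], elapsed_seconds: float) -> dict:
    """Aggregate the report fields from a list of labeled assembly dicts."""
    joint_types = Counter()
    for assembly in assemblies:
        joint_types.update(assembly["joint_types"])
    total = len(assemblies)
    degenerate = sum(1 for a in assemblies if a["degenerate"])
    return {
        "total_assemblies": total,
        "classification": dict(Counter(a["classification"] for a in assemblies)),
        "body_count_histogram": dict(
            sorted(Counter(a["num_bodies"] for a in assemblies).items())
        ),
        "joint_types": dict(joint_types),
        "dof_distribution": dict(sorted(Counter(a["dof"] for a in assemblies).items())),
        "degeneracy_rate": degenerate / total if total else 0.0,
        "generation_time_s": elapsed_seconds,
    }


def save_stats(stats: dict, output_dir: Path) -> Path:
    """Write the report to {output_dir}/stats.json and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "stats.json"
    path.write_text(json.dumps(stats, indent=2))  # int keys become JSON strings
    return path
```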

Requirements

scripts/generate_synthetic.py with Hydra @hydra.main
Parallel generation via multiprocessing or concurrent.futures
Progress bar with ETA
Checkpoint/resume support
Statistics report printed and saved to {output_dir}/stats.json
Output compatible with Phase 3 PyG dataset adapter expectations
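
For the parallel-generation requirement, one reproducible pattern with concurrent.futures is to derive each assembly's random state from the base seed and its index, so results do not depend on num_workers. This is a sketch under that assumption; generate_one is a placeholder rather than the real generator from #5, and the executor_cls parameter exists only so the executor type can be swapped.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def generate_one(task: tuple[int, int]) -> dict:
    """Generate a single placeholder assembly from (base_seed, index)."""
    base_seed, index = task
    # Per-assembly seed; distinct across seeds for indices below 1_000_003.
    rng = random.Random(base_seed * 1_000_003 + index)
    return {"id": index, "num_bodies": rng.randint(2, 20)}


def generate_parallel(num_assemblies: int, seed: int, num_workers: int,
                      executor_cls=ProcessPoolExecutor) -> list[dict]:
    """Generate assemblies in parallel; map() returns results in index order."""
    tasks = [(seed, i) for i in range(num_assemblies)]
    with executor_cls(max_workers=num_workers) as pool:
        return list(pool.map(generate_one, tasks))
```

When run as a script with ProcessPoolExecutor, the call should sit under an `if __name__ == "__main__":` guard. A progress bar with ETA can usually be added by wrapping the result iterator, for example in tqdm with total=num_assemblies.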

Depends on

#5 (generator)
#7 (assembly templates)
#8 (geometric diversity)
#9 (labeling pipeline)

forbes added the phase:1 feature labels 2026-02-02 19:33:53 +00:00

forbes referenced this issue from a commit

2026-02-03 14:44:43 +00:00

feat(datagen): add dataset generation CLI with sharding and checkpointing

forbes closed this issue

2026-02-03 14:44:43 +00:00

forbes referenced this issue from a commit

2026-02-03 16:54:16 +00:00

Merge pull request #10 from Ondsel-Development/change_math.h_name
