Dataset generation CLI #10
Summary
Implement scripts/generate_synthetic.py as the CLI entry point for synthetic dataset generation. Per Phase 1.4 of the repository plan.
Features
Hydra configuration
configs/dataset/synthetic.yaml as base config
python scripts/generate_synthetic.py num_assemblies=50000 complexity_distribution.simple=0.6
Generation parameters
num_assemblies: total assemblies to generate (target: 100k)
complexity_distribution: ratio of simple/medium/complex assemblies
body_count.min, body_count.max: body count range
templates: list of enabled templates (chain, tree, loop, star, mixed)
grounded_ratio: fraction of grounded assemblies
output_dir: where to write output files
num_workers: parallel generation workers
seed: reproducibility
Progress tracking and resumability
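One possible shape for resumable, seeded, parallel generation is sketched below. It is only an illustration under stated assumptions: `generate_batch`, `run`, the `batch_size` argument, and the `batch_XXXXX.json` naming are hypothetical, JSON stands in for the .pt output so the sketch has no PyTorch dependency, and a thread pool from concurrent.futures is used for brevity (a `ProcessPoolExecutor` would be the likely choice for CPU-bound generation). Resuming works by skipping any batch whose output file already exists.

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def generate_batch(start: int, size: int, seed: int) -> list[dict]:
    """Hypothetical stand-in for the real generator: returns placeholder records."""
    rng = random.Random(seed + start)  # seed derived per batch, so reruns match
    return [{"id": start + i, "body_count": rng.randint(2, 10)} for i in range(size)]


def run(output_dir: Path, num_assemblies: int, batch_size: int,
        num_workers: int, seed: int) -> list[int]:
    """Generate only the missing batches; return the batch indices that were written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    num_batches = -(-num_assemblies // batch_size)  # ceiling division

    def batch_path(b: int) -> Path:
        return output_dir / f"batch_{b:05d}.json"  # assumed naming scheme

    # Resumability: a batch counts as done once its final file exists.
    pending = [b for b in range(num_batches) if not batch_path(b).exists()]

    def work(b: int) -> int:
        start = b * batch_size
        size = min(batch_size, num_assemblies - start)  # last batch may be short
        records = generate_batch(start, size, seed)
        tmp = batch_path(b).with_name(batch_path(b).name + ".tmp")
        tmp.write_text(json.dumps(records))
        tmp.replace(batch_path(b))  # rename last, so partial writes are never seen as done
        return b

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sorted(pool.map(work, pending))
```

Calling `run` a second time with the same `output_dir` regenerates only the batches that are missing, which is what makes an interrupted run cheap to restart.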
Output format
.pt file per batch (or configurable: single file vs sharded)
Dataset statistics report
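A statistics report written to {output_dir}/stats.json could look roughly like the sketch below. The record fields (`complexity`, `body_count`, `grounded`), the function name `write_stats`, and the choice of reported statistics are assumptions made for illustration; this issue does not specify the report's contents.

```python
import json
from collections import Counter
from pathlib import Path
from statistics import mean


def write_stats(records: list[dict], output_dir: Path) -> dict:
    """Summarise generated assemblies and write the summary to stats.json."""
    if not records:
        raise ValueError("no records to summarise")

    body_counts = [r["body_count"] for r in records]
    stats = {
        "num_assemblies": len(records),
        # How many assemblies fell into each complexity bucket (assumed field).
        "complexity_counts": dict(Counter(r["complexity"] for r in records)),
        "body_count": {
            "min": min(body_counts),
            "max": max(body_counts),
            "mean": mean(body_counts),
        },
        # Observed fraction of grounded assemblies, to compare with grounded_ratio.
        "grounded_ratio": sum(bool(r["grounded"]) for r in records) / len(records),
    }

    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "stats.json").write_text(json.dumps(stats, indent=2))
    return stats
```

Reporting the observed complexity counts and grounded fraction makes it easy to check a finished run against the configured complexity_distribution and grounded_ratio.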
Requirements
scripts/generate_synthetic.py with Hydra @hydra.main
multiprocessing or concurrent.futures
{output_dir}/stats.json
Depends on