- Add solver/datagen/dataset.py with DatasetConfig, DatasetGenerator, ShardSpec/ShardResult dataclasses, parallel shard generation via ProcessPoolExecutor, checkpoint/resume support, index and stats output
- Add scripts/generate_synthetic.py CLI entry point with Hydra-first and argparse fallback modes
- Add minimal YAML parser (parse_simple_yaml) for config loading without PyYAML dependency
- Add progress display with tqdm fallback to print-based ETA
- Update configs/dataset/synthetic.yaml with shard_size, checkpoint_every
- Update solver/datagen/__init__.py with DatasetConfig, DatasetGenerator exports
- Add tests/datagen/test_dataset.py with 28 tests covering config, YAML parsing, seed derivation, end-to-end generation, resume, stats/index structure, determinism, and CLI integration

Closes #10
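
The commit mentions a minimal YAML parser for loading the config without PyYAML. The real parse_simple_yaml is not shown here, so the following is only a sketch of what such a parser might look like, assuming the config uses top-level scalars plus at most one level of nested mappings or lists. The names `parse_simple_yaml_sketch` and `_parse_scalar` are hypothetical and not taken from the repository.

```python
def _parse_scalar(text):
    """Convert a scalar token to bool, None, int, float, or str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered in ("null", "~", ""):
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text.strip("'\"")


def parse_simple_yaml_sketch(text):
    """Parse flat keys with at most one nested mapping or list per key.

    Hypothetical helper: it does not support '#' inside strings,
    multi-line values, or nesting deeper than one level.
    """
    result = {}
    current_key = None  # most recent key that opened a nested block
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line.strip():
            continue
        indented = line[0].isspace()
        stripped = line.strip()

        if stripped.startswith("- "):
            # List item: attach it to the key that opened the block.
            if current_key is None:
                raise ValueError(f"list item without a parent key: {raw!r}")
            if not isinstance(result[current_key], list):
                result[current_key] = []
            result[current_key].append(_parse_scalar(stripped[2:].strip()))
            continue

        key, sep, value = stripped.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {raw!r}")
        key, value = key.strip(), value.strip()

        if indented:
            # Nested mapping entry under the current block key.
            if current_key is None or not isinstance(result[current_key], dict):
                raise ValueError(f"unexpected indented entry: {raw!r}")
            result[current_key][key] = _parse_scalar(value)
        elif value == "":
            # A bare 'key:' opens a nested block (mapping or list).
            current_key = key
            result[key] = {}
        else:
            current_key = None
            result[key] = _parse_scalar(value)
    return result
```

A parser this small trades generality for zero dependencies, which is why a fixed, shallow config layout is assumed.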
27 lines
396 B
YAML
# Synthetic dataset generation config
name: synthetic
num_assemblies: 100000
output_dir: data/synthetic
shard_size: 1000

complexity_distribution:
  simple: 0.4  # 2-5 bodies
  medium: 0.4  # 6-15 bodies
  complex: 0.2  # 16-50 bodies

body_count:
  min: 2
  max: 50

templates:
  - chain
  - tree
  - loop
  - star
  - mixed

grounded_ratio: 0.5
seed: 42
num_workers: 4
checkpoint_every: 5
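
With num_assemblies set to 100000 and shard_size set to 1000, the config implies 100 shards. The sketch below shows one way the shard layout and per-shard seeds could be derived from these values so that each shard is reproducible independently of worker scheduling. It is an illustration only: `ShardPlan`, `derive_shard_seed`, and `plan_shards` are hypothetical names, and the hash-based seed derivation is an assumption, not necessarily what DatasetGenerator does.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardPlan:
    """Hypothetical stand-in for a shard description (not the commit's ShardSpec)."""
    index: int
    start: int   # index of the first assembly in this shard
    count: int   # number of assemblies in this shard
    seed: int    # seed used only by this shard


def derive_shard_seed(base_seed: int, shard_index: int) -> int:
    """Derive a stable 64-bit seed from the base seed and shard index."""
    digest = hashlib.sha256(f"{base_seed}:{shard_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def plan_shards(num_assemblies: int, shard_size: int, base_seed: int) -> list:
    """Split num_assemblies into shards of at most shard_size items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    num_shards = math.ceil(num_assemblies / shard_size)
    plans = []
    for i in range(num_shards):
        start = i * shard_size
        count = min(shard_size, num_assemblies - start)  # last shard may be short
        plans.append(ShardPlan(i, start, count, derive_shard_seed(base_seed, i)))
    return plans


# Values taken from the config above.
plans = plan_shards(num_assemblies=100_000, shard_size=1000, base_seed=42)
print(len(plans))  # 100
```

Because each seed depends only on the base seed and the shard index, a resumed run can regenerate any missing shard without replaying the ones already written.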