

HPDv3++: A Dual-Dimension Preference Dataset for Text-to-Image Reward Modeling

HPDv3++ is a large-scale human-preference dataset for text-to-image (T2I) generation, built on a frontier generator (Qwen-Image) and annotated along two axes: text-following (TF) and aesthetic quality (Aes). It is the dataset used to train HPSv3++, a capability-aware and RL-iteration-aware reward model.

Each preference pair stores a preferred image (path1) and a non-preferred image (path2) for the same prompt.

Quick start

pip install -U "huggingface_hub[cli]"
hf download Junjun2333/HPDv3-PlusPlus --repo-type dataset --local-dir HPDv3pp
cd HPDv3pp
# Reassemble and extract our image pool (split tar parts -> images/qwen_image, images/rollout, images/thumbs):
cat images.tar.part* | tar -xf -

The split tar contains only the images we generated (images/qwen_image/, images/rollout/, images/thumbs/). The stage1_ref.json reference pairs point to the original HPDv3 images (images/hpdv3/...), which we do not re-host here. If you need them (only required to reproduce HPSv3++ Stage 1 with the original HPDv3 reference set), download the HPDv3 images from the official repo and place them under images/hpdv3/:

# Original HPDv3 images (only needed for stage1_ref.json)
hf download MizzenAI/HPDv3 --repo-type dataset --include "images.tar.gz.*" --local-dir hpdv3_src
cat hpdv3_src/images.tar.gz.* | gunzip | tar -xvf -   # then move/symlink the resulting images into images/hpdv3/

After extraction you get an images/ directory. Every path in the JSON files (path1 / path2 / image_path) is relative and resolves against the repo root, e.g. images/qwen_image/prompt_000000/6.jpg. The four ready-to-use train/test files (train_aes, train_tf, test_aes, test_tf) reference only our own images and need no HPDv3 download.
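Before training, it can help to confirm that every referenced image actually exists on disk. The following is a minimal sketch, assuming the records have already been loaded into a list of dicts; the helper names `resolve_image` and `missing_images` are hypothetical and not part of any HPSv3++ code.

```python
from pathlib import Path

def resolve_image(repo_root, rel_path):
    """Join a relative path from the JSON files onto the extracted repo root."""
    return Path(repo_root) / rel_path

def missing_images(repo_root, records, keys=("path1", "path2", "image_path")):
    """Return the referenced relative paths that are not present on disk.

    Only the keys that occur in a record are checked, so the same helper
    works for preference pairs (path1/path2) and rollout rows (image_path).
    """
    missing = []
    for rec in records:
        for key in keys:
            rel = rec.get(key)
            if rel and not resolve_image(repo_root, rel).is_file():
                missing.append(rel)
    return missing
```

For stage1_ref.json, a non-empty result most likely means the images/hpdv3/ directory has not been populated yet.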

What you can use directly

These four files are ready to use and self-contained: they require no HPSv3++ code or model, only the extracted images and the JSON. Each record is {"path1": <preferred>, "path2": <non-preferred>, "prompt": <text>} (the same format as HPSv3/HPDv3).

File	Pairs	Use
`train/train_aes.json`	100,463	Training -- aesthetic preference
`train/train_tf.json`	90,908	Training -- text-following preference
`test/test_aes.json`	5,720	Evaluation -- aesthetic
`test/test_tf.json`	4,465	Evaluation -- text-following
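A minimal sketch of reading one of these files with only the Python standard library is shown below. The on-disk layout is an assumption: the loader first tries a single top-level JSON array and falls back to JSON Lines (one record per line). The function names `load_records` and `iter_pairs` are hypothetical.

```python
import json
from pathlib import Path

def load_records(json_path):
    """Load records from a JSON array file, falling back to JSON Lines."""
    text = Path(json_path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumed fallback: one JSON object per non-empty line.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]

def iter_pairs(records):
    """Yield (prompt, preferred_path, non_preferred_path) for each record."""
    for rec in records:
        yield rec["prompt"], rec["path1"], rec["path2"]
```

For example, `for prompt, win, lose in iter_pairs(load_records("train/train_aes.json")): ...` would walk the aesthetic training pairs, with `win` and `lose` still relative to the repo root.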

The training and test sets are disjoint (no shared pairs), including across the two axes (aes/tf), so they can be used together without leakage.
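The disjointness claim can be spot-checked locally. The sketch below assumes that a "shared pair" means the same prompt with the same two images regardless of which one is preferred; that definition, and the names `pair_key` and `shared_pairs`, are illustrative assumptions rather than the authors' procedure.

```python
def pair_key(rec):
    """Order-insensitive identity of a pair: the prompt plus the two image paths."""
    return (rec["prompt"], frozenset((rec["path1"], rec["path2"])))

def shared_pairs(records_a, records_b):
    """Return the set of pair keys that occur in both record lists."""
    keys_a = {pair_key(rec) for rec in records_a}
    keys_b = {pair_key(rec) for rec in records_b}
    return keys_a & keys_b
```

An empty result for every train/test combination (including train_aes against test_tf and train_tf against test_aes) is what the statement above would lead one to expect.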

Repository layout

HPDv3-PlusPlus/
|-- images.tar.part00, images.tar.part01, ...   # split tar of OUR images (~268 GB; qwen_image + rollout + thumbs)
|-- train/
|   |-- train_aes.json        # 100,463  ready-to-use aesthetic training pairs
|   |-- train_tf.json         # 90,908   ready-to-use text-following training pairs
|   |-- stage1_labeled.json   # 191,466  labeled pairs (used by HPSv3++ Stage 1)
|   |-- stage1_ref.json       # 284,974  original HPDv3 reference pairs (Stage 1 OGD anti-forgetting)
|   |-- stage2_labeled.json   # 111,650  labeled pairs (used by HPSv3++ Stage 2)
|   |-- rollout.json          # 322,452  unlabeled rollouts, long format, one image per row
|   `-- ogd_std.json          # 58,242   pre-computed per-group std (also embedded in rollout.json)
|-- test/
|   |-- test_aes.json         # 5,720   ready-to-use aesthetic test pairs
|   `-- test_tf.json          # 4,465   ready-to-use text-following test pairs
`-- images/                   # after extraction: qwen_image/, rollout/, thumbs/ (ours);
                              #   hpdv3/ must be downloaded separately from MizzenAI/HPDv3 (only for stage1_ref)

JSON formats

Preference pairs (train_aes, train_tf, stage1_labeled, stage1_ref, stage2_labeled, test_aes, test_tf):

Field	Meaning
`path1` / `path2`	Preferred / non-preferred image (relative `images/...` path)
`prompt`	Text prompt
`choice_dist` / `confidence` / `model1` / `model2`	Vote distribution, confidence, and the two generator names where annotated; `null` otherwise. The ready-to-use `train_aes`/`train_tf` and `test_aes`/`test_tf` files keep only `path1`, `path2`, and `prompt`.

rollout.json (unlabeled rollouts for HPSv3++ Stage 2; long format, one image per row):

Field	Meaning
`group_id`	Group id; rows that share the same prompt, tier, and iter_step form one group
`source`	`capability` or `iteration`
`prompt`	Text prompt
`tier`	Generator tier
`iter_step` / `iter_norm`	Raw / normalized RL iteration
`capability` / `level`	Continuous capability score / discrete level
`image_path`	Relative image path
`ogd_std`	Pre-computed per-group std
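Because rollout.json is in long format, consumers usually need to regroup its rows. The following is a minimal sketch, assuming the rows are already loaded as a list of dicts and that numeric fields such as `ogd_std` may be serialized either as numbers or as numeric strings; the helper names are hypothetical.

```python
from collections import defaultdict

def to_float(value):
    """Convert a number or numeric string to float; pass None through."""
    return None if value is None else float(value)

def group_rollouts(rows):
    """Collect image paths per group_id and keep the group's ogd_std value."""
    groups = defaultdict(lambda: {"images": [], "ogd_std": None})
    for row in rows:
        group = groups[row["group_id"]]
        group["images"].append(row["image_path"])
        # ogd_std is described as per-group, so every row of a group
        # is expected to carry the same value; the last one seen is kept.
        group["ogd_std"] = to_float(row.get("ogd_std"))
    return dict(groups)
```

The result maps each `group_id` to its list of relative image paths and a single float, which is a convenient shape for sampling several images of the same prompt, tier, and iteration step together.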

Notes

The images we host here (qwen_image + rollout + thumbs) are ~268 GB. The original HPDv3 images (hpdv3/, ~60 GB, referenced only by stage1_ref.json) are not re-hosted -- download them from MizzenAI/HPDv3 if needed (see Quick start).
The ready-to-use train/test files reference only our own images, so they work with just the split tar above (no HPDv3 download needed).
For the full two-stage training / evaluation pipeline (which additionally uses rollout.json, stage1_ref.json, etc.), see the HPSv3++ code repository.

Citation

@misc{hpsv3pp,
  title  = {HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities},
  author = {HPSv3++ Team},
  year   = {2026}
}
