HPDv3++: A Dual-Dimension Preference Dataset for Text-to-Image Reward Modeling
HPDv3++ is a large-scale human-preference dataset for text-to-image (T2I) generation, built on a frontier generator (Qwen-Image) and annotated along two axes: text-following (TF) and aesthetic quality (Aes). It is the dataset used to train HPSv3++, a capability-aware and RL-iteration-aware reward model.
Each preference pair stores a preferred image (path1) and a non-preferred image (path2) for the same prompt.
Quick start
pip install -U "huggingface_hub[cli]"
hf download Junjun2333/HPDv3-PlusPlus --repo-type dataset --local-dir HPDv3pp
cd HPDv3pp
# Reassemble and extract our image pool (split tar parts -> images/qwen_image, images/rollout, images/thumbs):
cat images.tar.part* | tar -xf -
The split tar contains only the images we generated (images/qwen_image/,
images/rollout/, images/thumbs/). The stage1_ref.json reference pairs point
to the original HPDv3 images (images/hpdv3/...), which we do not
re-host here. If you need them (only required to reproduce HPSv3++ Stage 1 with
the original HPDv3 reference set), download the HPDv3 images from the official
repo and place them under images/hpdv3/:
# Original HPDv3 images (only needed for stage1_ref.json)
hf download MizzenAI/HPDv3 --repo-type dataset --include "images.tar.gz.*" --local-dir hpdv3_src
cat hpdv3_src/images.tar.gz.* | gunzip | tar -xvf -
# Then move or symlink the resulting images into images/hpdv3/
After extraction you get an images/ directory. Every path in the JSON files
(path1 / path2 / image_path) is relative and resolves against the
repo root, e.g. images/qwen_image/prompt_000000/6.jpg. The four ready-to-use
train/test files (train_aes, train_tf, test_aes, test_tf) reference only
our own images and need no HPDv3 download.
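As a minimal sketch of this path resolution, assuming the repository was downloaded to the `HPDv3pp` directory from Quick start: the helpers `resolve_pair` and `missing_images` below are hypothetical names for illustration, not part of any HPDv3++ tooling, and the second image path in the example record is made up.

```python
from pathlib import Path


# Hypothetical helper, not part of any HPDv3++ tooling.
def resolve_pair(record: dict, root: Path) -> tuple[Path, Path]:
    """Join a record's relative path1/path2 onto the repo root."""
    return root / record["path1"], root / record["path2"]


def missing_images(records: list[dict], root: Path) -> list[Path]:
    """Return every referenced image that is not present on disk."""
    missing = []
    for record in records:
        for path in resolve_pair(record, root):
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    root = Path("HPDv3pp")  # assumed download directory from Quick start
    example = {
        "path1": "images/qwen_image/prompt_000000/6.jpg",
        "path2": "images/qwen_image/prompt_000000/3.jpg",  # illustrative path
        "prompt": "an illustrative prompt",
    }
    print(resolve_pair(example, root))
    print(len(missing_images([example], root)), "missing image(s)")
```

Running `missing_images` over a whole file before training is one way to confirm that the tar extraction completed.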
What you can use directly
These four files are ready-to-use, self-contained, and do not require any
HPSv3++ code or model -- just images + JSON. Each record is
{"path1": <preferred>, "path2": <non-preferred>, "prompt": <text>} (the same
format as HPSv3/HPDv3), with path1 preferred over path2.
| File | Pairs | Use |
|---|---|---|
| train/train_aes.json | 100,463 | Training -- aesthetic preference |
| train/train_tf.json | 90,908 | Training -- text-following preference |
| test/test_aes.json | 5,720 | Evaluation -- aesthetic |
| test/test_tf.json | 4,465 | Evaluation -- text-following |
The training and test sets are disjoint (no shared pairs), including across the two axes (aes/tf), so they can be used together without leakage.
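The disjointness claim can be spot-checked locally. The sketch below rests on two assumptions that this README does not confirm: that each file is either a single JSON array or JSON Lines (the loader tries both), and that a pair is identified by its prompt plus its unordered set of image paths. `load_records`, `pair_key`, and `shared_pairs` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_records(path: Path) -> list[dict]:
    """Load records from a JSON array, falling back to JSON Lines.

    The on-disk layout is an assumption; the README does not state it.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat each non-empty line as a record.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def pair_key(record: dict) -> tuple:
    """Order-independent identity for a pair (an assumed definition)."""
    return (record["prompt"], frozenset((record["path1"], record["path2"])))


def shared_pairs(train: list[dict], test: list[dict]) -> set:
    """Return the pair keys present in both lists; empty means disjoint."""
    return {pair_key(r) for r in train} & {pair_key(r) for r in test}


# Toy in-memory records with made-up paths, just to show the call shape.
toy_train = [{"path1": "a.jpg", "path2": "b.jpg", "prompt": "a cat"}]
toy_test = [{"path1": "c.jpg", "path2": "d.jpg", "prompt": "a cat"}]
print(len(shared_pairs(toy_train, toy_test)), "shared pair(s)")
```

On the real data, the two lists would come from `load_records` applied to, for example, train/train_aes.json and test/test_aes.json.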
Repository layout
HPDv3-PlusPlus/
|-- images.tar.part00, images.tar.part01, ... # split tar of OUR images (~268 GB; qwen_image + rollout + thumbs)
|-- train/
| |-- train_aes.json # 100,463 ready-to-use aesthetic training pairs
| |-- train_tf.json # 90,908 ready-to-use text-following training pairs
| |-- stage1_labeled.json # 191,466 labeled pairs (used by HPSv3++ Stage 1)
| |-- stage1_ref.json # 284,974 original HPDv3 reference pairs (Stage 1 OGD anti-forgetting)
| |-- stage2_labeled.json # 111,650 labeled pairs (used by HPSv3++ Stage 2)
| |-- rollout.json # 322,452 unlabeled rollouts, long format, one image per row
| `-- ogd_std.json # 58,242 pre-computed per-group std (also embedded in rollout.json)
|-- test/
| |-- test_aes.json # 5,720 ready-to-use aesthetic test pairs
| `-- test_tf.json # 4,465 ready-to-use text-following test pairs
`-- images/ # after extraction: qwen_image/, rollout/, thumbs/ (ours);
# hpdv3/ must be downloaded separately from MizzenAI/HPDv3 (only for stage1_ref)
JSON formats
Preference pairs (train_aes, train_tf, stage1_labeled, stage1_ref, stage2_labeled, test_aes, test_tf):
| Field | Meaning |
|---|---|
| path1 / path2 | Preferred / non-preferred image (relative images/... path) |
| prompt | Text prompt |
| choice_dist / confidence / model1 / model2 | (where annotated) vote distribution, confidence, generator names; null otherwise. The ready-to-use train_aes/train_tf and test files keep only path1/path2/prompt. |
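Because the annotation fields may be null or absent, a reader may want to normalize records before use. The following is a sketch only: `PreferencePair` and `parse_pair` are hypothetical names, and the value types of choice_dist and confidence are not specified here, so they are typed loosely.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PreferencePair:
    path1: str  # preferred image (relative path)
    path2: str  # non-preferred image (relative path)
    prompt: str
    choice_dist: Optional[Any] = None  # value type not specified in the README
    confidence: Optional[Any] = None   # value type not specified in the README
    model1: Optional[str] = None       # generator name (assumed to be a string)
    model2: Optional[str] = None       # generator name (assumed to be a string)


def parse_pair(record: dict) -> PreferencePair:
    """Build a PreferencePair, treating absent optional keys as None."""
    return PreferencePair(
        path1=record["path1"],
        path2=record["path2"],
        prompt=record["prompt"],
        choice_dist=record.get("choice_dist"),
        confidence=record.get("confidence"),
        model1=record.get("model1"),
        model2=record.get("model2"),
    )


# A ready-to-use record carries only the three required keys.
minimal = parse_pair({"path1": "a.jpg", "path2": "b.jpg", "prompt": "a cat"})
print(minimal.confidence is None)
```

Required keys are read with `[...]` so that a malformed record fails loudly, while the optional ones use `.get` and default to None.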
rollout.json (unlabeled rollouts for HPSv3++ Stage 2; long format, one image per row):
| Field | Meaning |
|---|---|
| group_id | Group id (same prompt + tier + iter_step form one group) |
| source | capability or iteration |
| prompt | Text prompt |
| tier | Generator tier |
| iter_step / iter_norm | Raw / normalized RL iteration |
| capability / level | Continuous capability score / discrete level |
| image_path | Relative image path |
| ogd_std | Pre-computed per-group std |
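Since rollout.json is in long format, rows usually need to be regrouped by group_id before use. The sketch below uses made-up rows and hypothetical helper names (`group_rollouts`, `per_group_std`). It assumes that every row of a group carries the same ogd_std value, and it converts with `float()` because the stored type (number or numeric string) is not stated in the table above.

```python
from collections import defaultdict


def group_rollouts(rows: list[dict]) -> dict[str, list[dict]]:
    """Collect long-format rows (one image per row) by their group_id."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[row["group_id"]].append(row)
    return dict(groups)


def per_group_std(rows: list[dict]) -> dict[str, float]:
    """Map each group_id to its pre-computed std.

    Assumes all rows of a group share one ogd_std, so the first row is used.
    float() accepts both numbers and numeric strings.
    """
    return {
        group_id: float(members[0]["ogd_std"])
        for group_id, members in group_rollouts(rows).items()
    }


# Made-up rows for illustration; real ids and paths will differ.
toy_rows = [
    {"group_id": "g0", "image_path": "images/rollout/x0.jpg", "ogd_std": "0.5"},
    {"group_id": "g0", "image_path": "images/rollout/x1.jpg", "ogd_std": "0.5"},
    {"group_id": "g1", "image_path": "images/rollout/y0.jpg", "ogd_std": "1.25"},
]
for group_id, members in group_rollouts(toy_rows).items():
    print(group_id, len(members), per_group_std(toy_rows)[group_id])
```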
Notes
- The images we host here (qwen_image + rollout + thumbs) are ~268 GB. The original HPDv3 images (hpdv3/, ~60 GB, referenced only by stage1_ref.json) are not re-hosted -- download them from MizzenAI/HPDv3 if needed (see Quick start).
- The ready-to-use train/test files reference only our own images, so they work with just the split tar above (no HPDv3 download needed).
- For the full two-stage training / evaluation pipeline (which additionally uses rollout.json, stage1_ref.json, etc.), see the HPSv3++ code repository.
Citation
@misc{hpsv3pp,
title = {HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities},
author = {HPSv3++ Team},
year = {2026}
}