From Data Repositories to Production Data Pipelines: Bridging Hugging Face Datasets and Dagster with dagster-hf-datasets
Introduction
The Hugging Face Hub now hosts more than 1M datasets, with domains spanning NLP, synthetic data, evaluation benchmarks and agentic traces. These datasets are also evolving from static artifacts into production-scale data assets.
As the ecosystem continues to scale, datasets are transformed, versioned and re-published as part of iterative ML systems. While Hugging Face Datasets provides an exceptionally powerful interface for dataset access, operationalizing it in production environments introduces an entirely different set of challenges. Dataset lineage, metadata tracking and observability become increasingly important as teams move from experimentation to continuously orchestrated data pipelines.
Figure 1. The Orchestration Gap Beyond Dataset Storage
This orchestration gap becomes especially visible in modern ML systems where datasets tend to behave like evolving production assets. Data pipelines may involve scheduled refreshes, feature extraction or publishing transformed datasets back to the Hugging Face Hub. Managing these workflows reliably requires infrastructure capable of modeling datasets as reproducible and orchestrated assets rather than standalone repositories.
In this article, we introduce dagster-hf-datasets, a Dagster-native library designed to bridge Hugging Face Datasets and Dagster. By combining the flexibility of Hugging Face Datasets with Dagster’s orchestration capabilities, the integration makes it easier to build production-ready data pipelines. To get started, install the integration with:
pip install dagster-hf-datasets
Why Modern ML Datasets Need Orchestration Beyond Storage
A typical pipeline begins by loading datasets locally or directly from the Hugging Face Hub through HuggingFaceResource. The datasets are then passed into the asset layer as native Dagster assets, where transformations such as filtering, feature extraction or text normalization can be applied incrementally across the entire lifecycle.
Figure 2. From Dataset Repositories to Orchestrated Data Assets
To see these abstractions in practice, we have provided a complete end-to-end transformation of the GLUE QQP dataset in this GitHub Gist example. The pipeline incrementally transforms the dataset through multiple asset stages, persists intermediate materializations via the HFParquetIOManager and finally republishes the curated dataset (golden-glue-qpp) back to the Hugging Face Hub.
The hf_dataset_asset decorator automatically loads datasets from the Hugging Face Hub and injects them directly into Dagster assets as Dataset objects. Each downstream asset then performs an isolated transformation stage. Because these transformations are modeled as independent Dagster assets, every stage can be observed and recomputed independently. This becomes especially valuable for larger dataset pipelines where intermediate transformations may be computationally expensive.
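To make the idea of isolated transformation stages concrete, here is a minimal sketch in plain Python. It does not use the dagster-hf-datasets or Dagster APIs; the function names, the sample rows and the `len_diff` feature are all hypothetical and only mirror the shape of QQP-style question pairs. Each function stands in for one asset stage and depends only on the output of the stage before it.

```python
import re

# Hypothetical sample rows shaped like QQP question pairs (illustrative data only).
RAW_ROWS = [
    {"question1": "  How do I learn Python? ", "question2": "How can I learn Python?", "label": 1},
    {"question1": "What is Dagster?", "question2": "", "label": 0},
    {"question1": "WHAT is   an asset?", "question2": "What is an asset?", "label": 1},
]


def normalize_text(rows):
    """Stage 1: lowercase and collapse whitespace in both question fields."""
    def clean(text):
        return re.sub(r"\s+", " ", text).strip().lower()

    return [
        {**row, "question1": clean(row["question1"]), "question2": clean(row["question2"])}
        for row in rows
    ]


def drop_empty_pairs(rows):
    """Stage 2: keep only rows where both questions are non-empty."""
    return [row for row in rows if row["question1"] and row["question2"]]


def add_length_feature(rows):
    """Stage 3: add a simple feature, the absolute word-count difference."""
    return [
        {**row, "len_diff": abs(len(row["question1"].split()) - len(row["question2"].split()))}
        for row in rows
    ]


# Each stage consumes only the previous stage's output and returns new rows,
# so any single stage can be re-run in isolation when its logic changes.
normalized = normalize_text(RAW_ROWS)
filtered = drop_empty_pairs(normalized)
featured = add_length_feature(filtered)
```

In an orchestrated pipeline, each of these functions would become its own asset, so changing the feature logic in the last stage would not require recomputing normalization or filtering.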
An additional advantage of integrating publication directly into the orchestration lifecycle is the ability to automatically generate dataset cards from pipeline metadata.
This includes information such as:
- Source datasets and revisions.
- Transformation and filtering stages.
- Dataset statistics and metadata.
- Pipeline-level provenance.
By generating dataset documentation as part of the execution lifecycle, curated datasets published back to the Hugging Face Hub become more transparent to use. Hugging Face Datasets continues to provide Arrow-backed datasets while Dagster manages execution semantics, orchestration boundaries and metadata tracking between assets.
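As an illustration of how a dataset card could be derived from the metadata categories listed above, the following sketch renders a small Markdown card from a plain dictionary. This is not the card generator shipped with dagster-hf-datasets; the `render_dataset_card` function, the dictionary layout and every value in `EXAMPLE_METADATA` are assumptions made for the example.

```python
def render_dataset_card(metadata):
    """Render a Markdown dataset card from a plain metadata dictionary."""
    lines = [f"# {metadata['name']}", "", "## Sources"]
    for source in metadata["sources"]:
        lines.append(f"- {source['repo']} / {source['config']} (revision: {source['revision']})")

    lines += ["", "## Transformations"]
    for step_number, step in enumerate(metadata["stages"], start=1):
        lines.append(f"{step_number}. {step}")

    lines += ["", "## Statistics"]
    for key, value in metadata["statistics"].items():
        lines.append(f"- {key}: {value}")

    lines += ["", "## Provenance", f"- pipeline run: {metadata['run_id']}"]
    return "\n".join(lines) + "\n"


# Hypothetical metadata that a pipeline run might have collected.
EXAMPLE_METADATA = {
    "name": "golden-glue-qpp",
    "sources": [{"repo": "glue", "config": "qqp", "revision": "main"}],
    "stages": ["normalize_text", "drop_empty_pairs", "add_length_feature"],
    "statistics": {"num_rows": 2, "num_columns": 4},
    "run_id": "example-run",
}

card = render_dataset_card(EXAMPLE_METADATA)
```

Because the card is built from the same metadata the pipeline records during execution, the documentation cannot drift from what was actually run.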
Tracing the Runtime Lifecycle of HF Datasets with Dagster
At the center of the integration is a simple idea: treating Hugging Face datasets as first-class Dagster assets.
The integration introduces a set of abstractions that map naturally onto the Hugging Face Datasets API while remaining aligned with Dagster’s asset-oriented execution model.
Figure 3. Core Components of dagster-hf-datasets
HuggingFaceResource acts as the primary interface for interacting with the Hugging Face Hub and dataset loading, encapsulating authentication cleanly inside Dagster. For asset modeling, the integration provides both hf_dataset_asset and hf_multi_asset. A single Hugging Face Dataset maps naturally to an individual Dagster asset, enabling explicit materialization, metadata tracking and lineage visualization. Meanwhile, Hugging Face DatasetDict objects align closely with Dagster’s multi-asset abstraction, allowing train, validation and test splits to be represented with hf_multi_asset.
The integration also includes HFDatasetPublisher, enabling transformed datasets to be published back to the Hugging Face Hub as part of the orchestration lifecycle. We also introduce HFParquetIOManager to handle local storage serialization via Parquet artifacts. By acting as a stable boundary between your orchestration layer and the file system, it provides seamless persistence without sacrificing compatibility with Hugging Face Dataset or DatasetDict objects.
One of the most important outcomes of this architecture is the separation of responsibilities between the two ecosystems. Hugging Face Datasets continues to provide efficient Arrow-backed datasets, streaming support and interoperability with the broader ML ecosystem. Dagster provides orchestration capabilities such as lineage tracking, scheduling, metadata management, retries, partitioning, observability and execution coordination.
Enabling Observable Workflows for Modern Data Pipelines
As machine learning systems continue to scale, datasets increasingly behave as evolving operational assets rather than static collections of files. One of the central advantages of integrating Hugging Face Datasets with Dagster’s asset-oriented orchestration model is the ability to make dataset workflows observable throughout their entire lifecycle. Every dataset transformation becomes an explicitly materialized orchestration step with associated lineage, persistence boundaries and downstream dependencies.
Rather than managing datasets through opaque pre-processing scripts, teams gain visibility into:
- How datasets were generated.
- Which upstream assets produced them.
- What transformations were applied.
- When pipelines executed.
- Where materializations were persisted.
Dagster’s asset graph provides a natural representation for these relationships. Dataset transformations become lineage-aware execution nodes rather than isolated processing steps, enabling practitioners to reason about workflows at the system level rather than through disconnected scripts.
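The questions above, such as which upstream assets produced a dataset, reduce to queries over a dependency graph. The sketch below models that with the standard library's `graphlib` rather than Dagster itself; the asset names and the `LINEAGE` mapping are hypothetical, and Dagster's own asset graph is considerably richer than this.

```python
from graphlib import TopologicalSorter

# Hypothetical asset names; each key maps to the set of assets it depends on.
LINEAGE = {
    "raw_qqp": set(),
    "normalized_qqp": {"raw_qqp"},
    "filtered_qqp": {"normalized_qqp"},
    "published_qqp": {"filtered_qqp"},
}


def execution_order(lineage):
    """Return asset names in an order that respects every dependency."""
    return list(TopologicalSorter(lineage).static_order())


def upstream_of(asset, lineage):
    """Collect every direct and transitive upstream asset of `asset`."""
    seen = set()
    stack = list(lineage[asset])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage[current])
    return seen


order = execution_order(LINEAGE)
ancestors = upstream_of("published_qqp", LINEAGE)
```

Here `order` lists the assets from the raw dataset to the published one, and `ancestors` answers the provenance question for the final artifact.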
Figure 4. Dagster Asset Lineage of Dataset Transformation Pipeline
In the example pipeline previously introduced, each dataset transformation is represented as an independent Dagster asset. The raw dataset is loaded from the Hugging Face Hub, transformed through multiple stages of curation and ultimately republished as a refined dataset artifact. Because each stage exists as a materialized asset, lineage becomes explicit rather than implicit.
Observability extends beyond lineage into the datasets themselves. During materialization, the integration extracts rich metadata from Hugging Face datasets and surfaces it directly within Dagster. This includes dataset characteristics such as row counts, feature schemas, dataset fingerprints, revisions, Hub statistics, storage locations and execution metadata.
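To give a feel for what such metadata looks like, here is a minimal sketch that computes a row count, a column schema and a content fingerprint for a list of rows. It is a stand-in written with the standard library only: the `summarize_rows` function and the sample rows are hypothetical, and the SHA-256 digest used here is not how Hugging Face Datasets computes its own fingerprints.

```python
import hashlib
import json


def summarize_rows(rows):
    """Compute row count, a column-to-type schema, and a content fingerprint."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            # Record the type name of the first value seen for each column.
            schema.setdefault(column, type(value).__name__)

    # Serialize with sorted keys so the fingerprint is stable across runs.
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return {
        "num_rows": len(rows),
        "schema": schema,
        "fingerprint": hashlib.sha256(payload).hexdigest()[:16],
    }


# Hypothetical rows standing in for a materialized dataset.
SAMPLE_ROWS = [
    {"question1": "what is an asset?", "label": 1},
    {"question1": "what is dagster?", "label": 0},
]

summary = summarize_rows(SAMPLE_ROWS)
```

A summary like this, attached to each materialization, is what lets a reviewer notice that a stage silently dropped rows or changed a column type between runs.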
Figure 5. Hugging Face Dataset Metadata and Observability in Dagster
This metadata transforms datasets from opaque artifacts into inspectable operational assets. Instead of manually inspecting dataset repositories or external dashboards, practitioners can access dataset statistics, schema information, Hub metadata and execution details directly within the orchestration environment. This creates a significantly more transparent publication process. Datasets published back to the Hugging Face Hub become operational artifacts generated from observable orchestration graphs with explicit provenance history.
As the Hugging Face Datasets ecosystem continues to grow beyond 1M datasets, orchestration and observability become increasingly important concerns rather than optional features. Dataset repositories alone are no longer sufficient for modern ML systems that depend on continuously evolving production-grade data workflows.
Conclusion
The rapid growth of HF Datasets beyond 1M datasets reflects a broader shift in modern machine learning systems: datasets are no longer passive storage artifacts, but continuously evolving operational assets that drive model training and production data workflows.
As modern ML systems continue to scale, the operational complexity surrounding datasets will increasingly resemble traditional software infrastructure concerns. Observable dataset pipelines and reproducible transformations will become foundational requirements for machine learning ecosystems rather than optional tooling. dagster-hf-datasets represents an important step toward treating datasets as first-class assets within modern orchestration systems and enabling Hugging Face datasets to evolve from standalone repositories into observable data pipelines. I would also like to thank Colton Padden for the collaboration in bringing dagster-hf-datasets into the Dagster ecosystem.
Resources
- Documentation: Explore the dagster-hf-datasets documentation for API references and usage patterns.
- Example Pipelines: Try the end-to-end dataset examples, including dataset streaming, materialization and publishing data assets.
- CodeWiki: Explore the architecture and implementation walkthrough on CodeWiki.
- GitHub Repository: Browse the source code and share feedback within the community.





