Dataset Viewer
Auto-converted to Parquet Duplicate
id
stringlengths
35
35
conversations
listlengths
2
2
input
stringlengths
25
5.41k
output
stringlengths
600
52.1k
domain
stringclasses
3 values
meta
dict
kimi_k2_6_reasoning_8000x_row_07561
[ { "from": "human", "value": "What strategies can a student use to adequately prepare for a chemistry exam and avoid inadequate exam preparation?" }, { "from": "gpt", "value": "<think>\nTo answer this question effectively, I first decompose what \"chemistry exam preparation\" entails and why stud...
What strategies can a student use to adequately prepare for a chemistry exam and avoid inadequate exam preparation?
<think> To answer this question effectively, I first decompose what "chemistry exam preparation" entails and why students commonly under-prepare. Chemistry is a hybrid discipline: it requires memorization of nomenclature, periodic trends, and reaction types, but also deep conceptual understanding of atomic and molecula...
general
{ "input_tokens": 28, "output_tokens": 2753, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "main", "seed_split": "train", "seed_id": "kimi_seed_main_eef32a29a3931e6d", "source_index": 190417, "bucket":...
kimi_k2_6_reasoning_8000x_row_07183
[ { "from": "human", "value": "Can you provide a Python code that can find the longest common substring between two given strings?" }, { "from": "gpt", "value": "<think>\n**Problem Clarification**\nThe task is to find the *longest common substring* between two strings. A substring is a contiguous ...
Can you provide a Python code that can find the longest common substring between two given strings?
<think> **Problem Clarification** The task is to find the *longest common substring* between two strings. A substring is a contiguous block of characters, which distinguishes this problem from the longest common *subsequence* (where characters need not be contiguous). For example, given "abcdef" and "zcdem", the longes...
general
{ "input_tokens": 24, "output_tokens": 1300, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "main", "seed_split": "train", "seed_id": "kimi_seed_main_edf255acc10c1c76", "source_index": 503796, "bucket":...
kimi_k2_6_reasoning_8000x_row_00164
[ { "from": "human", "value": "In a synthetic microbial consortium engineered for lignocellulose degradation, how does spatial organization coupled with quorum‑sensing regulated metabolic cross‑feeding influence emergent community stability and product yield under fluctuating substrate gradients?" }, { ...
In a synthetic microbial consortium engineered for lignocellulose degradation, how does spatial organization coupled with quorum‑sensing regulated metabolic cross‑feeding influence emergent community stability and product yield under fluctuating substrate gradients?
<think> To answer this question, I need to integrate principles from microbial ecology, synthetic biology, reaction-diffusion physics, and metabolic engineering. The system involves four interacting layers: (1) a recalcitrant, heterogeneous lignocellulosic substrate; (2) a synthetic multi-species consortium with metabo...
science
{ "input_tokens": 66, "output_tokens": 4443, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "PHD-Science", "seed_split": "train", "seed_id": "kimi_seed_phd_science_0bdd2ecad8aa9b53", "source_index": 56012...
kimi_k2_6_reasoning_8000x_row_05392
[ { "from": "human", "value": "You are a wildlife researcher studying a species of bird that migrates from one place to another every year. You want to track their movements during different seasons and understand their behavior patterns. You need to analyze the impact of changing weather patterns and other e...
You are a wildlife researcher studying a species of bird that migrates from one place to another every year. You want to track their movements during different seasons and understand their behavior patterns. You need to analyze the impact of changing weather patterns and other environmental factors on the bird's behavi...
<think> This is a resource-constrained ecological monitoring and conservation planning problem. I need to design a strategy that maximizes scientific inference from exactly two pieces of hardware: a single remote camera that records timestamped hourly images at a fixed location, and a single GPS tracker that can be dep...
general
{ "input_tokens": 336, "output_tokens": 4888, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "main", "seed_split": "train", "seed_id": "kimi_seed_main_7e2a54a085400fa9", "source_index": 315923, "bucket"...
kimi_k2_6_reasoning_8000x_row_04363
[ { "from": "human", "value": "How can multi-dimensional nuclear magnetic resonance spectroscopy combined with machine‑learning deconvolution be used to resolve overlapping metabolite signals in complex environmental samples, and what statistical validation ensures the inferred component structures are chemic...
How can multi-dimensional nuclear magnetic resonance spectroscopy combined with machine‑learning deconvolution be used to resolve overlapping metabolite signals in complex environmental samples, and what statistical validation ensures the inferred component structures are chemically plausible?
<think> **1. Problem definition and the limits of conventional NMR** Complex environmental samples—such as dissolved organic matter (DOM), soil extracts, or microbial biofilms—contain thousands of metabolites spanning vast concentration ranges. In one-dimensional (1D) ¹H NMR, the narrow chemical-shift dispersion (~0–12...
science
{ "input_tokens": 73, "output_tokens": 4701, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "PHD-Science", "seed_split": "train", "seed_id": "kimi_seed_phd_science_65be2c65cac9943c", "source_index": 2142,...
kimi_k2_6_reasoning_8000x_row_06304
[ { "from": "human", "value": "Below you are given code written in javascript that contains the function 'stuff', which manipulates a string, as well as an input to the function. Your job is to predict the output of 'stuff' given the provided input.\\n\\nHere is the code:\\n```javascript\\nclass Rot13Cipher {...
Below you are given code written in javascript that contains the function 'stuff', which manipulates a string, as well as an input to the function. Your job is to predict the output of 'stuff' given the provided input.\n\nHere is the code:\n```javascript\nclass Rot13Cipher {\n constructor() {\n this.rot13Map = this.ini...
<think> The `stuff` function obtains a singleton instance of `Rot13Reverser` and calls `process(input)` on it. The `Rot13Reverser` constructs a `Pipeline` with four steps: 1. Split the input string into an array of characters. 2. Apply ROT13 to each character via `RecursiveProcessor.applyRot13`. 3. Reverse the resulti...
general
{ "input_tokens": 715, "output_tokens": 411, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "main", "seed_split": "train", "seed_id": "kimi_seed_main_02e38bb495a475ae", "source_index": 291648, "bucket":...
kimi_k2_6_reasoning_8000x_row_03674
[ { "from": "human", "value": "How does the incorporation of non‑commutative geometry into loop quantum gravity modify the predicted spectrum of primordial gravitational waves, and what observable signatures could distinguish this framework from standard inflationary models?" }, { "from": "gpt", "...
How does the incorporation of non‑commutative geometry into loop quantum gravity modify the predicted spectrum of primordial gravitational waves, and what observable signatures could distinguish this framework from standard inflationary models?
<think> To answer this question, we must synthesize three distinct layers of physics: (1) the standard inflationary prediction for primordial gravitational waves (tensor perturbations), (2) the modifications introduced by Loop Quantum Cosmology (LQC)—the symmetry-reduced, cosmological sector of Loop Quantum Gravity (LQ...
science
{ "input_tokens": 61, "output_tokens": 3314, "teacher_model": "Kimi-K2.6", "teacher_model_id": "moonshotai/Kimi-K2.6", "seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned", "seed_config": "PHD-Science", "seed_split": "train", "seed_id": "kimi_seed_phd_science_9227ccbda71eae63", "source_index": 75527...
kimi_k2_6_reasoning_8000x_row_05424
[{"from":"human","value":"Both nematic transition and Jahn-Teller distortion distorts the lattice an(...TRUNCATED)
"Both nematic transition and Jahn-Teller distortion distorts the lattice and reduces the symmetry of(...TRUNCATED)
"<think>\nThe user asks why iron-based superconductors (Fe-SC) use the term “nematic transition”(...TRUNCATED)
general
{"input_tokens":130,"output_tokens":2610,"teacher_model":"Kimi-K2.6","teacher_model_id":"moonshotai/(...TRUNCATED)
kimi_k2_6_reasoning_8000x_row_07754
[{"from":"human","value":"Zangwill says (Zangwill, Modern Electrodynamics, 2012 edition, page 40, un(...TRUNCATED)
"Zangwill says (Zangwill, Modern Electrodynamics, 2012 edition, page 40, undersection 2.3.1, Lorentz(...TRUNCATED)
"<think>\nThe question asks whether the macroscopic force formula \n\\[\n\\mathbf F = \\int \\rho_{(...TRUNCATED)
general
{"input_tokens":400,"output_tokens":2254,"teacher_model":"Kimi-K2.6","teacher_model_id":"moonshotai/(...TRUNCATED)
kimi_k2_6_reasoning_8000x_row_03675
[{"from":"human","value":"In a coupled land‑atmosphere–hydrogeologic model, how do stochastic va(...TRUNCATED)
"In a coupled land‑atmosphere–hydrogeologic model, how do stochastic variations in soil hydrauli(...TRUNCATED)
"<think>\n**1. Problem decomposition and scope definition**\n\nThe question asks how two specific so(...TRUNCATED)
science
{"input_tokens":59,"output_tokens":4013,"teacher_model":"Kimi-K2.6","teacher_model_id":"moonshotai/K(...TRUNCATED)
End of preview. Expand in Data Studio

Kimi-K2.6-Reasoning-3300x-WandB

Kimi-K2.6-Reasoning-3300x-WandB is a W&B-only synthetic reasoning dataset generated with Kimi-K2.6 through Weights & Biases Inference.

This dataset is the pure W&B-generated subset from a larger planned 8,000-example Kimi reasoning distillation run. Generation stopped when the W&B quota limit was reached, and the completed accepted rows were audited, cleaned, and exported as a standalone dataset.

This release contains 3,303 accepted W&B-generated rows and uses 3300x as a rounded public release label.

Dataset Summary

Field Value
Public release label 3300x
Actual rows 3,303
Teacher model Kimi-K2.6
Teacher model ID moonshotai/Kimi-K2.6
Generation backend W&B Inference
Output format <think>...</think> + final answer
SHA256 train.jsonl 9d8832e23f7d927a10cfb1b30d9a255a7836debd38003f290bf94622f4cc387f

Seed Data / Prompt Provenance

The seed prompts were sampled from Jackrong/GLM-5.1-Reasoning-1M-Cleaned, a large cleaned reasoning prompt corpus.

For this release, only the seed prompt text was used as the user-facing task. The original source outputs were not included in the public dataset and were not used as public target completions. Kimi-K2.6 generated new reasoning responses for the selected prompts.

The larger planned seed bank targeted 8,000 prompts across three source buckets:

Source bucket Planned seed count
main 3,000
PHD-Science 2,500
Math 2,500

The W&B run completed 3,303 accepted examples before quota exhaustion. The completed subset has the following bucket distribution:

Source bucket Rows
main 1,237
PHD-Science 1,034
Math 1,032

Domain Distribution

Domain Rows
general 1,237
science 1,034
math 1,032

Backend and Release Scope

This is the W&B-only release. Every row in this dataset was generated through W&B Inference and has:

"generation_backend": "wandb_inference"

A separate mixed-backend continuation release may be published later using the completed W&B rows plus additional Ollama Cloud Kimi-K2.6 continuation rows. If released, that dataset will preserve backend metadata per row.

Schema

Each row contains exactly:

{
  "id": "...",
  "conversations": [
    {"from": "human", "value": "..."},
    {"from": "gpt", "value": "..."}
  ],
  "input": "...",
  "output": "<think>...reasoning...</think>\n\nfinal answer",
  "domain": "...",
  "meta": {...}
}

Columns

Column Description
id Unique row identifier
conversations Chat-style SFT format with human/gpt turns
input Original seed prompt / user task
output Kimi-generated reasoning output in <think>...</think> format followed by the final answer
domain Coarse domain label
meta Generation metadata, source bucket info, backend, and settings

Generation Settings

{
  "max_tokens": 65536,
  "temperature": 0.6,
  "top_p": 0.95,
  "concurrency": 18
}

Validation

The exported dataset passed the following checks:

Check Result
Raw rows 3,303
Clean rows 3,303
Dropped rows 0
Raw JSON parse errors 0
Row errors after sanitize 0
Row warnings after sanitize 0
Missing think blocks 0
Empty final answers 0
Raw teacher tags in public output 0
Rows changed during sanitization 1

Short final answers are allowed when the reasoning trace is valid. This matters for math/science examples where the correct final answer may be something like 5, 12, or 27.6122.

What Is Not Included

This dataset does not include:

  • Original source dataset outputs as public completions
  • Weave traces
  • Private debugging logs
  • Failed generations
  • Rows from Ollama Cloud continuation
  • Rows from other teacher models

Only accepted, cleaned, W&B-generated Kimi-K2.6 rows are included.

Intended Use

This dataset is intended for:

  • supervised fine-tuning experiments
  • reasoning-format SFT
  • long-form synthetic reasoning research
  • teacher-model distillation experiments
  • comparison against other reasoning datasets

Limitations

This is a synthetic dataset and should be treated as generated training data, not ground truth. Outputs may contain mistakes, over-explanations, or reasoning paths that should be independently evaluated before high-stakes use.

The dataset was generated from a partial run because the W&B quota limit was reached. The public release label 3300x is rounded, while the exact row count is 3,303.

Reproducibility Notes

The exported train.jsonl file has the following checksum:

SHA256(train.jsonl): 9d8832e23f7d927a10cfb1b30d9a255a7836debd38003f290bf94622f4cc387f

This checksum can be used to verify that the downloaded file matches the audited upload.

Files

File Description
train.jsonl Main training dataset
generation_metadata.md Generation settings and dataset metadata
validation_report.md Audit and validation report
Downloads last month
201

Models trained or fine-tuned on trjxter/Kimi-K2.6-Reasoning-3300x-WandB