id stringlengths 35 35 | conversations listlengths 2 2 | input stringlengths 25 5.41k | output stringlengths 600 52.1k | domain stringclasses 3
values | meta dict |
|---|---|---|---|---|---|
kimi_k2_6_reasoning_8000x_row_07561 | [
{
"from": "human",
"value": "What strategies can a student use to adequately prepare for a chemistry exam and avoid inadequate exam preparation?"
},
{
"from": "gpt",
"value": "<think>\nTo answer this question effectively, I first decompose what \"chemistry exam preparation\" entails and why stud... | What strategies can a student use to adequately prepare for a chemistry exam and avoid inadequate exam preparation? | <think>
To answer this question effectively, I first decompose what "chemistry exam preparation" entails and why students commonly under-prepare. Chemistry is a hybrid discipline: it requires memorization of nomenclature, periodic trends, and reaction types, but also deep conceptual understanding of atomic and molecula... | general | {
"input_tokens": 28,
"output_tokens": 2753,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "main",
"seed_split": "train",
"seed_id": "kimi_seed_main_eef32a29a3931e6d",
"source_index": 190417,
"bucket":... |
kimi_k2_6_reasoning_8000x_row_07183 | [
{
"from": "human",
"value": "Can you provide a Python code that can find the longest common substring between two given strings?"
},
{
"from": "gpt",
"value": "<think>\n**Problem Clarification**\nThe task is to find the *longest common substring* between two strings. A substring is a contiguous ... | Can you provide a Python code that can find the longest common substring between two given strings? | <think>
**Problem Clarification**
The task is to find the *longest common substring* between two strings. A substring is a contiguous block of characters, which distinguishes this problem from the longest common *subsequence* (where characters need not be contiguous). For example, given "abcdef" and "zcdem", the longes... | general | {
"input_tokens": 24,
"output_tokens": 1300,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "main",
"seed_split": "train",
"seed_id": "kimi_seed_main_edf255acc10c1c76",
"source_index": 503796,
"bucket":... |
kimi_k2_6_reasoning_8000x_row_00164 | [
{
"from": "human",
"value": "In a synthetic microbial consortium engineered for lignocellulose degradation, how does spatial organization coupled with quorum‑sensing regulated metabolic cross‑feeding influence emergent community stability and product yield under fluctuating substrate gradients?"
},
{
... | In a synthetic microbial consortium engineered for lignocellulose degradation, how does spatial organization coupled with quorum‑sensing regulated metabolic cross‑feeding influence emergent community stability and product yield under fluctuating substrate gradients? | <think>
To answer this question, I need to integrate principles from microbial ecology, synthetic biology, reaction-diffusion physics, and metabolic engineering. The system involves four interacting layers: (1) a recalcitrant, heterogeneous lignocellulosic substrate; (2) a synthetic multi-species consortium with metabo... | science | {
"input_tokens": 66,
"output_tokens": 4443,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "PHD-Science",
"seed_split": "train",
"seed_id": "kimi_seed_phd_science_0bdd2ecad8aa9b53",
"source_index": 56012... |
kimi_k2_6_reasoning_8000x_row_05392 | [
{
"from": "human",
"value": "You are a wildlife researcher studying a species of bird that migrates from one place to another every year. You want to track their movements during different seasons and understand their behavior patterns. You need to analyze the impact of changing weather patterns and other e... | You are a wildlife researcher studying a species of bird that migrates from one place to another every year. You want to track their movements during different seasons and understand their behavior patterns. You need to analyze the impact of changing weather patterns and other environmental factors on the bird's behavi... | <think>
This is a resource-constrained ecological monitoring and conservation planning problem. I need to design a strategy that maximizes scientific inference from exactly two pieces of hardware: a single remote camera that records timestamped hourly images at a fixed location, and a single GPS tracker that can be dep... | general | {
"input_tokens": 336,
"output_tokens": 4888,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "main",
"seed_split": "train",
"seed_id": "kimi_seed_main_7e2a54a085400fa9",
"source_index": 315923,
"bucket"... |
kimi_k2_6_reasoning_8000x_row_04363 | [
{
"from": "human",
"value": "How can multi-dimensional nuclear magnetic resonance spectroscopy combined with machine‑learning deconvolution be used to resolve overlapping metabolite signals in complex environmental samples, and what statistical validation ensures the inferred component structures are chemic... | How can multi-dimensional nuclear magnetic resonance spectroscopy combined with machine‑learning deconvolution be used to resolve overlapping metabolite signals in complex environmental samples, and what statistical validation ensures the inferred component structures are chemically plausible? | <think>
**1. Problem definition and the limits of conventional NMR**
Complex environmental samples—such as dissolved organic matter (DOM), soil extracts, or microbial biofilms—contain thousands of metabolites spanning vast concentration ranges. In one-dimensional (1D) ¹H NMR, the narrow chemical-shift dispersion (~0–12... | science | {
"input_tokens": 73,
"output_tokens": 4701,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "PHD-Science",
"seed_split": "train",
"seed_id": "kimi_seed_phd_science_65be2c65cac9943c",
"source_index": 2142,... |
kimi_k2_6_reasoning_8000x_row_06304 | [
{
"from": "human",
"value": "Below you are given code written in javascript that contains the function 'stuff', which manipulates a string, as well as an input to the function. Your job is to predict the output of 'stuff' given the provided input.\\n\\nHere is the code:\\n```javascript\\nclass Rot13Cipher {... | Below you are given code written in javascript that contains the function 'stuff', which manipulates a string, as well as an input to the function. Your job is to predict the output of 'stuff' given the provided input.\n\nHere is the code:\n```javascript\nclass Rot13Cipher {\n constructor() {\n this.rot13Map = this.ini... | <think>
The `stuff` function obtains a singleton instance of `Rot13Reverser` and calls `process(input)` on it. The `Rot13Reverser` constructs a `Pipeline` with four steps:
1. Split the input string into an array of characters.
2. Apply ROT13 to each character via `RecursiveProcessor.applyRot13`.
3. Reverse the resulti... | general | {
"input_tokens": 715,
"output_tokens": 411,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "main",
"seed_split": "train",
"seed_id": "kimi_seed_main_02e38bb495a475ae",
"source_index": 291648,
"bucket":... |
kimi_k2_6_reasoning_8000x_row_03674 | [
{
"from": "human",
"value": "How does the incorporation of non‑commutative geometry into loop quantum gravity modify the predicted spectrum of primordial gravitational waves, and what observable signatures could distinguish this framework from standard inflationary models?"
},
{
"from": "gpt",
"... | How does the incorporation of non‑commutative geometry into loop quantum gravity modify the predicted spectrum of primordial gravitational waves, and what observable signatures could distinguish this framework from standard inflationary models? | <think>
To answer this question, we must synthesize three distinct layers of physics: (1) the standard inflationary prediction for primordial gravitational waves (tensor perturbations), (2) the modifications introduced by Loop Quantum Cosmology (LQC)—the symmetry-reduced, cosmological sector of Loop Quantum Gravity (LQ... | science | {
"input_tokens": 61,
"output_tokens": 3314,
"teacher_model": "Kimi-K2.6",
"teacher_model_id": "moonshotai/Kimi-K2.6",
"seed_source": "Jackrong/GLM-5.1-Reasoning-1M-Cleaned",
"seed_config": "PHD-Science",
"seed_split": "train",
"seed_id": "kimi_seed_phd_science_9227ccbda71eae63",
"source_index": 75527... |
kimi_k2_6_reasoning_8000x_row_05424 | [{"from":"human","value":"Both nematic transition and Jahn-Teller distortion distorts the lattice an(...TRUNCATED) | "Both nematic transition and Jahn-Teller distortion distorts the lattice and reduces the symmetry of(...TRUNCATED) | "<think>\nThe user asks why iron-based superconductors (Fe-SC) use the term “nematic transition”(...TRUNCATED) | general | {"input_tokens":130,"output_tokens":2610,"teacher_model":"Kimi-K2.6","teacher_model_id":"moonshotai/(...TRUNCATED) |
kimi_k2_6_reasoning_8000x_row_07754 | [{"from":"human","value":"Zangwill says (Zangwill, Modern Electrodynamics, 2012 edition, page 40, un(...TRUNCATED) | "Zangwill says (Zangwill, Modern Electrodynamics, 2012 edition, page 40, undersection 2.3.1, Lorentz(...TRUNCATED) | "<think>\nThe question asks whether the macroscopic force formula \n\\[\n\\mathbf F = \\int \\rho_{(...TRUNCATED) | general | {"input_tokens":400,"output_tokens":2254,"teacher_model":"Kimi-K2.6","teacher_model_id":"moonshotai/(...TRUNCATED) |
kimi_k2_6_reasoning_8000x_row_03675 | [{"from":"human","value":"In a coupled land‑atmosphere–hydrogeologic model, how do stochastic va(...TRUNCATED) | "In a coupled land‑atmosphere–hydrogeologic model, how do stochastic variations in soil hydrauli(...TRUNCATED) | "<think>\n**1. Problem decomposition and scope definition**\n\nThe question asks how two specific so(...TRUNCATED) | science | {"input_tokens":59,"output_tokens":4013,"teacher_model":"Kimi-K2.6","teacher_model_id":"moonshotai/K(...TRUNCATED) |
Kimi-K2.6-Reasoning-3300x-WandB
Kimi-K2.6-Reasoning-3300x-WandB is a W&B-only synthetic reasoning dataset generated with Kimi-K2.6 through Weights & Biases Inference.
This dataset is the pure W&B-generated subset from a larger planned 8,000-example Kimi reasoning distillation run. Generation stopped when the W&B quota limit was reached, and the completed accepted rows were audited, cleaned, and exported as a standalone dataset.
This release contains 3,303 accepted W&B-generated rows and uses 3300x as a rounded public release label.
Dataset Summary
| Field | Value |
|---|---|
| Public release label | 3300x |
| Actual rows | 3,303 |
| Teacher model | Kimi-K2.6 |
| Teacher model ID | moonshotai/Kimi-K2.6 |
| Generation backend | W&B Inference |
| Output format | <think>...</think> + final answer |
SHA256 train.jsonl |
9d8832e23f7d927a10cfb1b30d9a255a7836debd38003f290bf94622f4cc387f |
Seed Data / Prompt Provenance
The seed prompts were sampled from Jackrong/GLM-5.1-Reasoning-1M-Cleaned, a large cleaned reasoning prompt corpus.
For this release, only the seed prompt text was used as the user-facing task. The original source outputs were not included in the public dataset and were not used as public target completions. Kimi-K2.6 generated new reasoning responses for the selected prompts.
The larger planned seed bank targeted 8,000 prompts across three source buckets:
| Source bucket | Planned seed count |
|---|---|
| main | 3,000 |
| PHD-Science | 2,500 |
| Math | 2,500 |
The W&B run completed 3,303 accepted examples before quota exhaustion. The completed subset has the following bucket distribution:
| Source bucket | Rows |
|---|---|
| main | 1,237 |
| PHD-Science | 1,034 |
| Math | 1,032 |
Domain Distribution
| Domain | Rows |
|---|---|
| general | 1,237 |
| science | 1,034 |
| math | 1,032 |
Backend and Release Scope
This is the W&B-only release. Every row in this dataset was generated through W&B Inference and has:
"generation_backend": "wandb_inference"
A separate mixed-backend continuation release may be published later using the completed W&B rows plus additional Ollama Cloud Kimi-K2.6 continuation rows. If released, that dataset will preserve backend metadata per row.
Schema
Each row contains exactly:
{
"id": "...",
"conversations": [
{"from": "human", "value": "..."},
{"from": "gpt", "value": "..."}
],
"input": "...",
"output": "<think>...reasoning...</think>\n\nfinal answer",
"domain": "...",
"meta": {...}
}
Columns
| Column | Description |
|---|---|
id |
Unique row identifier |
conversations |
Chat-style SFT format with human/gpt turns |
input |
Original seed prompt / user task |
output |
Kimi-generated reasoning output in <think>...</think> format followed by the final answer |
domain |
Coarse domain label |
meta |
Generation metadata, source bucket info, backend, and settings |
Generation Settings
{
"max_tokens": 65536,
"temperature": 0.6,
"top_p": 0.95,
"concurrency": 18
}
Validation
The exported dataset passed the following checks:
| Check | Result |
|---|---|
| Raw rows | 3,303 |
| Clean rows | 3,303 |
| Dropped rows | 0 |
| Raw JSON parse errors | 0 |
| Row errors after sanitize | 0 |
| Row warnings after sanitize | 0 |
| Missing think blocks | 0 |
| Empty final answers | 0 |
| Raw teacher tags in public output | 0 |
| Rows changed during sanitization | 1 |
Short final answers are allowed when the reasoning trace is valid. This matters for math/science examples where the correct final answer may be something like 5, 12, or 27.6122.
What Is Not Included
This dataset does not include:
- Original source dataset outputs as public completions
- Weave traces
- Private debugging logs
- Failed generations
- Rows from Ollama Cloud continuation
- Rows from other teacher models
Only accepted, cleaned, W&B-generated Kimi-K2.6 rows are included.
Intended Use
This dataset is intended for:
- supervised fine-tuning experiments
- reasoning-format SFT
- long-form synthetic reasoning research
- teacher-model distillation experiments
- comparison against other reasoning datasets
Limitations
This is a synthetic dataset and should be treated as generated training data, not ground truth. Outputs may contain mistakes, over-explanations, or reasoning paths that should be independently evaluated before high-stakes use.
The dataset was generated from a partial run because the W&B quota limit was reached. The public release label 3300x is rounded, while the exact row count is 3,303.
Reproducibility Notes
The exported train.jsonl file has the following checksum:
SHA256(train.jsonl): 9d8832e23f7d927a10cfb1b30d9a255a7836debd38003f290bf94622f4cc387f
This checksum can be used to verify that the downloaded file matches the audited upload.
Files
| File | Description |
|---|---|
train.jsonl |
Main training dataset |
generation_metadata.md |
Generation settings and dataset metadata |
validation_report.md |
Audit and validation report |
- Downloads last month
- 201