Redesign Mixture-of-Experts Routers with Manifold Power Iteration Paper • 2606.12397 • Published 1 day ago • 73
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It Paper • 2606.11052 • Published 3 days ago • 13
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention Paper • 2606.09079 • Published 4 days ago • 56
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6, 2025 • 516 • 43
Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding Paper • 2605.29707 • Published 15 days ago • 145
NITP: Next Implicit Token Prediction for LLM Pre-training Paper • 2605.24956 • Published 19 days ago • 35
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws Paper • 2605.23901 • Published 21 days ago • 13
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information Paper • 2605.11609 • Published about 1 month ago • 195
HRM-Text: Efficient Pretraining Beyond Scaling Paper • 2605.20613 • Published 23 days ago • 313
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Paper • 2605.22791 • Published 22 days ago • 31