Checkpoints for the main experiments in "Forgetting Transformer: Softmax Attention with a Forget Gate" (https://arxiv.org/abs/2503.02130).
- zhixuan-lin/fox-pro-760m-longcrawl64-48b
  Text Generation • 0.8B
- zhixuan-lin/transformer-pro-760m-longcrawl64-48b
  Text Generation • 0.8B
- zhixuan-lin/fox-llama-760m-longcrawl64-48b
  Text Generation • 0.8B
- zhixuan-lin/transformer-llama-760m-longcrawl64-48b
  Text Generation • 0.8B