MiniCPM4: Ultra-Efficient LLMs on End Devices
Paper • arXiv:2506.07900
This repository hosts the LiteRT-LM version of MiniCPM5-1B, optimized for fully on-device inference on mobile and edge hardware. (LiteRT was formerly known as TensorFlow Lite.)

minicpm_dynamic_wi8_afp32_gpu_opt.litertlm: This model uses dynamic weight-only INT8 quantization (wi8) with FP32 activations (afp32) and is heavily optimized for GPU execution.

MiniCPM5-1B is the first model in the MiniCPM5 series from OpenBMB. It is a dense 1B-parameter Transformer built specifically for on-device, local, and resource-constrained deployment, and it reaches open-source SOTA in the 1B size class.
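To make the wi8/afp32 naming concrete, here is a minimal Python sketch of weight-only INT8 quantization with floating-point activations. It assumes a simple symmetric scheme with one scale per weight row; the helper names `quantize_row` and `dot_wi8_afp32` are hypothetical, and none of this is taken from LiteRT's actual kernels or file format.

```python
# Illustrative sketch only: symmetric per-row INT8 weight quantization with
# floating-point activations. This is an assumption-based toy example, not
# LiteRT's implementation.

def quantize_row(weights):
    """Map a row of float weights to int8-range integers plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dot_wi8_afp32(q_weights, scale, activations):
    """Dot product that dequantizes weights on the fly; activations stay float."""
    return sum((qw * scale) * a for qw, a in zip(q_weights, activations))


row = [0.503, -1.27, 0.021, 0.9]   # hypothetical weight row
acts = [1.0, 0.5, -2.0, 0.25]      # hypothetical activations

q_row, row_scale = quantize_row(row)
exact = sum(w * a for w, a in zip(row, acts))
approx = dot_wi8_afp32(q_row, row_scale, acts)

print("quantized weights:", q_row)
print("scale:", row_scale)
print("exact vs. quantized dot product:", exact, approx)
```

Only the weights are rounded, so the error in each product is bounded by half a quantization step times the magnitude of the activation it multiplies.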
<think> template (enable_thinking).

| Item | Value |
|---|---|
| Type | Causal Language Model |
| Architecture | Standard LlamaForCausalLM |
| Parameters | 1,080,632,832 (~1B) |
| Non-Embedding Parameters | 679,552,512 |
| Layers | 24 |
| Attention Heads (GQA) | 16 (Q) / 2 (KV) |
| Context Length | 131,072 |
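
A few quantities follow directly from the table and can be checked with a short Python calculation. The embedding parameter count and the query-to-KV head ratio are plain arithmetic on the listed values. The INT8 size figure is only a rough estimate that assumes one byte per parameter for every weight and ignores quantization scales and file metadata, so the actual .litertlm file may differ.

```python
# Figures copied from the table above.
TOTAL_PARAMS = 1_080_632_832
NON_EMBEDDING_PARAMS = 679_552_512
Q_HEADS, KV_HEADS = 16, 2

# Parameters that sit in the embedding tables.
embedding_params = TOTAL_PARAMS - NON_EMBEDDING_PARAMS
embedding_share = embedding_params / TOTAL_PARAMS

# Grouped-query attention: how many query heads share each KV head.
queries_per_kv_head = Q_HEADS // KV_HEADS

# Assumption: 1 byte per parameter if all weights were stored as INT8,
# ignoring scales and metadata. A rough estimate, not a measured file size.
int8_weight_gib = TOTAL_PARAMS / 2**30

print(f"embedding parameters: {embedding_params:,} ({embedding_share:.1%} of total)")
print(f"query heads per KV head: {queries_per_kv_head}")
print(f"approx. INT8 weight size: {int8_weight_gib:.2f} GiB")
```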
To build the demo app from source, please follow the instructions from the GitHub repository.
With uv installed, install the LiteRT-LM CLI and run the model directly from the command line:
uv tool install litert-lm
uvx litert-lm run --from-huggingface-repo=litert-community/MiniCPM5-1B minicpm_dynamic_wi8_afp32_gpu_opt.litertlm --prompt="What is the capital of France?"
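The same command can be driven from a script. The following Python sketch only wraps the invocation shown above with `subprocess`; the helper names `build_command` and `run_prompt` are hypothetical, and it assumes that `uvx` is on the PATH and that the CLI writes the model's reply to standard output.

```python
import shlex
import subprocess

REPO = "litert-community/MiniCPM5-1B"
MODEL_FILE = "minicpm_dynamic_wi8_afp32_gpu_opt.litertlm"


def build_command(prompt):
    """Build the argument list for the CLI call shown above (hypothetical helper)."""
    return [
        "uvx", "litert-lm", "run",
        f"--from-huggingface-repo={REPO}",
        MODEL_FILE,
        f"--prompt={prompt}",
    ]


def run_prompt(prompt):
    """Run the CLI and return its stdout.

    Assumption: the reply is printed to stdout; this is not verified here.
    """
    result = subprocess.run(
        build_command(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout


# Print the shell-quoted command without executing it.
print(shlex.join(build_command("What is the capital of France?")))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes. Calling `run_prompt("...")` executes the command and may download the model on first use.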
Released under the Apache-2.0 License, consistent with the upstream openbmb/MiniCPM5-1B.
@article{minicpm4,
title={MiniCPM4: Ultra-efficient LLMs on end devices},
author={MiniCPM, Team},
journal={arXiv preprint arXiv:2506.07900},
year={2025}
}