Capability
CPU-Based Inference with 6 Instance Tiers
4 artifacts provide this capability.
Top Matches
via “hardware-agnostic model architecture enabling deployment across compute tiers”
1.1B model pre-trained on 3T tokens for edge use.
Unique: Achieves a ~100x throughput range (71.8–7,094.5 tok/sec) across hardware tiers with identical model weights and architecture, so deployment decisions can be driven by latency, cost, and privacy requirements without retraining. This positions it as a single model for heterogeneous infrastructure.
vs others: A smaller memory footprint than Llama 2 7B enables CPU inference (71.8 tok/sec on an M2, where a 7B model is impractical), and it is faster than Phi-2 on GPU (7k+ tok/sec vs ~3k tok/sec) thanks to optimized quantization.
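A back-of-the-envelope sketch of why the 1.1B model fits CPU inference while a 7B model does not: weight memory scales with parameter count times bytes per parameter. The parameter counts and byte widths below are illustrative assumptions, not measured figures from the artifacts above.

```python
# Rough weight-only memory estimates (excludes KV cache and activations).
# Byte widths per parameter are assumptions for common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Approximate weight memory in GiB for a parameter count and precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

for name, params in [("1.1B model", 1.1e9), ("Llama 2 7B", 7.0e9)]:
    for dtype in ("fp16", "int8", "int4"):
        print(f"{name} @ {dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

At int4, the 1.1B model's weights fit in well under 1 GiB, comfortably inside commodity CPU RAM, while a 7B model at fp16 needs roughly 13 GiB before accounting for the KV cache, which is the gap the "impractical for 7B" comparison refers to.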