Open-source inference framework achieving a 3-7x speedup on 1M-token inputs through sparse-attention integration and memory optimizations. Supports the 7B and 14B Qwen2.5 models.
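
The core idea behind sparse attention is that each query attends to only a subset of keys instead of all of them, cutting the quadratic cost of long contexts. A minimal sketch of one common variant (a toy local-window pattern in NumPy; this is illustrative only and not this framework's actual kernel, and the `sparse_attention` name and `window` parameter are assumptions for the example):

```python
import numpy as np

def sparse_attention(q, k, v, window=2):
    """Toy local-window sparse attention: each query position attends
    only to keys within `window` positions, not to all n keys."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        # Restrict attention to a local window around position i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        # Numerically stable softmax over the windowed scores.
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = rng.normal(size=(3, n, d))
out = sparse_attention(q, k, v)
print(out.shape)  # (8, 4)
```

Each query here touches at most `2 * window + 1` keys, so cost scales linearly with sequence length rather than quadratically; production kernels combine such patterns with dynamic sparsity and fused GPU implementations.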