Vortex: ~86.3%+ Across 30 Benchmarks in Pure Rust | No GPU Training Required
- McKale Olson
- Mar 10
- 3 min read
Vortex is a new AI inference framework written entirely in Rust that delivers impressive results on a wide range of machine learning benchmarks. It reaches an average accuracy of 86.3% across 30 standard tasks without relying on GPU-based pretraining. Instead, Vortex uses a unique architecture combining retrieval-augmented Mixture-of-Experts, a custom tokenizer, and a semantic encoder to process thousands of questions quickly on CPU.
This post explores how Vortex works, its benchmark performance, the role of its 21 specialized experts, and its limitations. We also look ahead to planned improvements involving diffusion language models. Vortex’s open-source, MIT-licensed code compiles with a single command, making it accessible for developers interested in efficient, high-quality AI inference without expensive hardware.

How Vortex Works Without GPU Pretraining
Unlike many modern AI systems that depend on billions of pretrained parameters and GPU acceleration, Vortex builds its knowledge base from scratch every time it starts. It ingests 125 datasets from HuggingFace, crawls educational web sources, and trains a lightweight semantic encoder called CALM. This approach avoids the need for costly, time-consuming GPU training.
Vortex’s core is a retrieval-augmented Mixture-of-Experts engine with 21 specialized experts. Each expert focuses on a different aspect of language understanding or reasoning. When a question arrives, it passes through a five-stage pipeline:
Source-based routing: Directs the question to relevant knowledge sources.
Knowledge retrieval: Gathers supporting information from the built knowledge base.
Unified inference: Performs entity tracking and symbolic math to understand the question deeply.
21-expert ensemble scoring: Each expert evaluates the question, resulting in 84 evaluations per question.
Final decision: Combines expert consensus with energy-based predictions to produce the answer.
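The post doesn't publish Vortex's internals, but the five stages above can be sketched as a plain function chain. Everything here (function names, the routing heuristic, the fixed 0.8 expert score) is an illustrative assumption, not Vortex's actual API:

```rust
// Illustrative five-stage pipeline sketch; names and logic are
// assumptions, not Vortex's real implementation.

fn route_sources(question: &str) -> Vec<&'static str> {
    // Stage 1: pick knowledge sources relevant to the question
    // (a toy heuristic here).
    if question.contains('+') { vec!["math"] } else { vec!["general"] }
}

fn retrieve(sources: &[&str], _question: &str) -> Vec<String> {
    // Stage 2: gather supporting passages from the chosen sources.
    sources.iter().map(|s| format!("facts from {s}")).collect()
}

fn unified_inference(question: &str, _context: &[String]) -> String {
    // Stage 3: entity tracking / symbolic math; a passthrough here.
    question.to_string()
}

fn ensemble_score(_repr: &str) -> Vec<f64> {
    // Stage 4: each of the 21 experts scores the candidate answer.
    vec![0.8; 21]
}

fn final_decision(scores: &[f64]) -> f64 {
    // Stage 5: consensus over expert scores; the real system also
    // folds in an energy-based prediction at this point.
    scores.iter().sum::<f64>() / scores.len() as f64
}

fn answer(question: &str) -> f64 {
    let sources = route_sources(question);
    let context = retrieve(&sources, question);
    let repr = unified_inference(question, &context);
    final_decision(&ensemble_score(&repr))
}
```

The point of the sketch is the shape, not the logic: each stage is a pure function, so the whole pipeline is a composition that is cheap to run per-question on CPU.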
This pipeline allows Vortex to process 2,755 questions in 252 seconds on a CPU, roughly 11 questions per second, demonstrating efficient inference without specialized hardware.
Benchmark Performance Highlights
Vortex sets new state-of-the-art results on four key tasks:
GSM8K (math word problems): 94.7% accuracy, 2.7% higher than previous best.
HumanEval (code generation): 100% accuracy.
CommonsenseQA: 98.9% accuracy, 5.4% above prior top scores.
TruthfulQA: 69.5% accuracy, 10.5% better than the previous state of the art.
Across all 30 benchmarks, Vortex’s average gap to state-of-the-art is only -1.2%, showing it competes closely with the best models despite its smaller size and CPU-only design.
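The "average gap" figure is just the mean of the signed per-task deltas against state of the art, in percentage points. A minimal sketch (the delta values below are illustrative placeholders, not the real 30-task results):

```rust
// Mean signed gap to state of the art, in percentage points.
// Positive deltas mean Vortex beat the prior best on that task.
fn mean_gap(deltas: &[f64]) -> f64 {
    deltas.iter().sum::<f64>() / deltas.len() as f64
}

fn main() {
    // Placeholder deltas; a few large wins can offset many small losses.
    let deltas = [2.7, 5.4, 10.5, -3.0, -4.0, -2.5];
    println!("mean gap: {:.1} pts", mean_gap(&deltas));
}
```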
Full Benchmark Table (Selected Tasks)

The table shows Vortex excels in reasoning, code generation, and commonsense tasks but struggles with some reading comprehension and commonsense inference benchmarks, revealing areas for future work.
Architecture Walkthrough
Vortex’s architecture centers on combining retrieval and expert models to maximize accuracy and efficiency:
Custom BPE Tokenizer: Tailored to the datasets Vortex ingests, this tokenizer breaks text into meaningful subwords for better semantic understanding.
CALM Semantic Encoder: A lightweight encoder trained on educational and web data that captures the meaning of input text efficiently.
21 Specialized Experts: Each expert is a smaller model trained to focus on specific domains or reasoning types. Their ensemble scoring provides diverse perspectives on each question.
Energy-Based Final Decision: Instead of simple voting, Vortex uses an energy-based model to weigh expert outputs, improving answer reliability.
This design avoids the need for massive pretrained models and GPUs, instead relying on smart data ingestion and expert collaboration.
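To make the energy-based final decision concrete: the usual convention is that lower energy means a more compatible answer, so weights come from a softmax over negative energies. This is a sketch of that standard formulation, which may differ from Vortex's exact rule; all names here are hypothetical:

```rust
// Softmax over negative energies: lower-energy candidates get
// higher weight. A standard energy-based formulation, assumed here
// rather than taken from Vortex's source.
fn energy_weights(energies: &[f64]) -> Vec<f64> {
    let exps: Vec<f64> = energies.iter().map(|e| (-e).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|x| x / z).collect()
}

/// Pick the candidate with the highest energy-weighted expert support.
/// `scores[c][k]` is expert k's score for candidate answer c.
fn pick_answer(scores: &[Vec<f64>], energies: &[f64]) -> usize {
    let w = energy_weights(energies);
    (0..scores.len())
        .max_by(|&a, &b| {
            let sa = scores[a].iter().sum::<f64>() * w[a];
            let sb = scores[b].iter().sum::<f64>() * w[b];
            sa.partial_cmp(&sb).unwrap()
        })
        .unwrap()
}
```

Compared with simple majority voting, this lets a confident, low-energy candidate override a numerically larger but weakly supported consensus.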
Expert Ablation Analysis
Testing each expert’s contribution revealed that all 21 experts matter: each contributes between 17% and 19% of accuracy on average, so no single expert dominates. Removing any one expert significantly reduces overall performance, showing that the ensemble’s strength comes from diverse, specialized knowledge.
This analysis highlights the importance of carefully designing and training each expert to cover different reasoning and knowledge areas.
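An ablation like this is typically a leave-one-out loop: rerun the benchmark with each expert disabled in turn and record the accuracy drop. A minimal sketch, where `eval_accuracy` is a stand-in for a real benchmark run, not Vortex's evaluator:

```rust
// Leave-one-out ablation sketch. `eval_accuracy` is a hypothetical
// stand-in that just rewards having more active experts; a real run
// would score the full benchmark suite with that expert masked.
fn eval_accuracy(active: &[bool]) -> f64 {
    let n = active.iter().filter(|&&a| a).count();
    n as f64 / active.len() as f64
}

/// For each expert, report (dropped_expert_index, resulting accuracy).
fn ablate(num_experts: usize) -> Vec<(usize, f64)> {
    (0..num_experts)
        .map(|dropped| {
            let active: Vec<bool> =
                (0..num_experts).map(|i| i != dropped).collect();
            (dropped, eval_accuracy(&active))
        })
        .collect()
}
```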
Limitations and Weaknesses
Vortex is not perfect. It scores only 21.1% on SQuAD, a reading comprehension benchmark, and 64.2% on HellaSwag, a commonsense reasoning task. These results show Vortex struggles with some types of natural language understanding that require deep contextual or commonsense knowledge beyond retrieval and symbolic reasoning.
These weaknesses point to areas where Vortex’s architecture and training data could improve, such as better handling of nuanced language and more robust commonsense inference.
Future Directions: Diffusion Language Model Integration
I plan to integrate diffusion language models based on recent research including MDLM, SEDD, DiffuLLaMA, and LLaDA. These models use diffusion processes to generate language, offering potential improvements in fluency and reasoning.
This integration aims to combine Vortex’s efficient retrieval and expert system with the generative power of diffusion models, potentially boosting performance on tasks where Vortex currently lags.
Getting Started with Vortex
Vortex is open source and MIT licensed. It compiles with a single `cargo build` command, making it easy for developers to try out or extend. The pure Rust implementation ensures safety, speed, and portability across platforms without GPU requirements.
Command to benchmark:

```shell
.\target\release\spatialvortex-eval.exe --tasks mmlu,gsm8k,arc-challenge,hellaswag,truthfulqa,humaneval,commonsenseqa,squad,babi1,babi2,babi3,babi4,babi5,babi6,babi7,babi8,babi9,babi10,babi11,babi12,babi13,babi14,babi15,babi16,babi17,babi18,babi19,babi20,winogrande,piqa --limit 100 --eval-only --audit --skip-hf
```
Developers interested in efficient AI inference, especially those working in resource-constrained environments, will find Vortex a compelling option.