How fast is Pyvorin?
We benchmark honestly because we have to - our customers are engineers, not mugs. No cherry-picking, no hidden configs, no synthetic micro-benchmarks rigged to make us look good. Here is what you actually get.
Why native code beats the interpreter
CPython is solid engineering, but it is still an interpreter. Every operation hits the same dispatch loop. Pyvorin skips that entirely.
Type-specialised machine instructions
In CPython, a + b calls PyNumber_Add, which checks the types of both operands,
dispatches to the correct C function, increments reference counts, and handles overflow.
That is ~50–100 CPU instructions.
When we know at compile time that a and b are integers, we emit a single
addq instruction. On x86_64, that is 1 CPU instruction.
The speedup on tight loops is dramatic: often 20–50× for simple arithmetic.
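You can see the generic dispatch on the CPython side for yourself with the standard-library dis module. This is a minimal sketch that only inspects CPython's bytecode; it does not show Pyvorin's output, and the function name add is just an illustration.

```python
import dis

def add(a, b):
    return a + b

# Collect the opcode names CPython generates for the function body.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)

# The addition is one generic bytecode: BINARY_OP on Python 3.11+,
# BINARY_ADD on older versions. Either way, the operand types are only
# discovered when the instruction runs.
add_ops = [name for name in opnames if name.startswith("BINARY_")]
print(add_ops)
```

The bytecode carries no type information, which is why the interpreter has to check the operands on every execution.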
No Global Interpreter Lock
CPython’s GIL ensures only one thread executes Python bytecode at a time. Even on a 64-core machine, CPU-bound Python is effectively single-threaded unless you use multiprocessing.
Compiled functions release the GIL before entering native code. You can call the same compiled function from multiple Python threads and they will execute in parallel on all cores. This alone can provide a 4–16× throughput improvement on multi-core servers.
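The calling pattern looks roughly like the sketch below. Here sum_of_squares is a plain-Python stand-in for a compiled function and run_in_threads is a hypothetical helper, so under stock CPython these threads still take turns holding the GIL; the point is only that the calling code needs nothing beyond an ordinary thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(n):
    # Stand-in for a compiled, CPU-bound function (hypothetical example).
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_in_threads(func, inputs, workers=4):
    # Submit one call per input; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, inputs))

inputs = [10, 100, 1000, 10000]
results = run_in_threads(sum_of_squares, inputs)
print(results)
```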
SIMD vectorisation
Modern CPUs have vector registers that can process 4, 8, or 16 values simultaneously (AVX2, AVX-512, NEON). CPython cannot use these because every operation is individually dispatched.
LLVM’s auto-vectoriser transforms loops like sum(x[i] for i in range(N)) into
SIMD instructions. On AVX2, this processes 4 doubles or 8 floats per instruction -
a theoretical 4–8× speedup on top of the interpreter removal.
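The lane counts follow from simple arithmetic: register width divided by element width. A small sketch, where simd_lanes and the constant names are our own illustrative choices:

```python
def simd_lanes(register_bits, element_bits):
    # Number of values one vector register holds.
    return register_bits // element_bits

NEON_BITS = 128    # Arm NEON registers
AVX2_BITS = 256    # x86 AVX2 registers
AVX512_BITS = 512  # x86 AVX-512 registers

DOUBLE_BITS = 64
FLOAT_BITS = 32

for name, bits in [("NEON", NEON_BITS), ("AVX2", AVX2_BITS), ("AVX-512", AVX512_BITS)]:
    print(name, simd_lanes(bits, DOUBLE_BITS), "doubles,",
          simd_lanes(bits, FLOAT_BITS), "floats")
```

These are upper bounds per instruction. Real speedups depend on memory bandwidth and on whether the loop can be vectorised at all.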
Direct memory access
Python lists are arrays of pointers to PyObject. Accessing my_list[i]
requires pointer indirection, a type check, and a reference count increment.
Pyvorin’s list runtime uses contiguous C arrays. When types are known, my_list[i]
compiles to a single memory access with a bounds check. For typed numeric arrays, the bounds
check can even be hoisted out of the loop by LLVM.
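The standard library's array module gives a feel for the difference between boxed and contiguous storage. This is an analogy only, not Pyvorin's list runtime; the variable names are illustrative.

```python
from array import array

# A list stores pointers to separately allocated float objects.
boxed = [1.0, 2.0, 3.0, 4.0]

# array("d") stores the raw C doubles back to back in one buffer.
packed = array("d", boxed)
view = memoryview(packed)

print(packed.itemsize)   # bytes per element
print(view.nbytes)       # total bytes in the buffer
print(view.contiguous)   # whether the buffer is one contiguous block
print(packed[2])         # element access reads straight from the buffer
```

With a layout like this, element i lives at a fixed offset from the start of the buffer, so an index is a single address calculation plus a load.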
Loop unrolling and inlining
Function calls in Python are expensive: build a frame, push locals, execute the call, pop the frame, return. In tight loops, this dominates runtime.
LLVM inlines small functions and unrolls short loops, eliminating call overhead and
enabling cross-loop optimisations. A loop that calls math.sqrt on each
iteration becomes a series of inlined vsqrtpd SIMD instructions.
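To make the idea concrete, here is the same transformation done by hand in plain Python. It is a sketch of what inlining a small helper means, assuming a hypothetical helper called norm; the compiler does this automatically, and it goes further by replacing the sqrt call itself.

```python
import math

def norm(x, y):
    # Small helper: a prime candidate for inlining.
    return math.sqrt(x * x + y * y)

def total_with_calls(points):
    total = 0.0
    for x, y in points:
        total += norm(x, y)  # one extra Python-level call per iteration
    return total

def total_inlined(points):
    total = 0.0
    sqrt = math.sqrt  # hoist the attribute lookup out of the loop
    for x, y in points:
        total += sqrt(x * x + y * y)  # helper body pasted in place
    return total

points = [(3.0, 4.0), (6.0, 8.0)]
print(total_with_calls(points), total_inlined(points))
```

Both versions compute the same result; the second just avoids building a Python frame for norm on every iteration.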
Infrastructure cost impact
Speed is not just about user experience - it is about not burning money on CPU time you do not need. A 10× speedup means 1/10th the compute. On AWS or GCP, that translates directly to your monthly bill.
Cost estimates are based on 4× c6i.2xlarge instances at roughly £0.13/hr running CPU-bound Python 24/7. Your mileage will vary depending on workload, cloud provider, and how good your finance team is at spotting waste.
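The arithmetic behind an estimate like this is short enough to check yourself. The sketch below assumes the £0.13/hr rate is per instance, a 30-day month, and that compute cost scales inversely with speedup; monthly_cost is a hypothetical helper and none of these assumptions will hold exactly for a real workload.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, running 24/7

def monthly_cost(instances, rate_per_hour, speedup=1.0):
    # Assumes cost falls linearly with speedup (fewer or smaller instances).
    return instances * rate_per_hour * HOURS_PER_MONTH / speedup

baseline = monthly_cost(4, 0.13)
with_10x = monthly_cost(4, 0.13, speedup=10.0)

print(f"baseline: £{baseline:.2f}/month")
print(f"at 10x:   £{with_10x:.2f}/month")
print(f"saving:   £{baseline - with_10x:.2f}/month")
```

In practice you rarely get the full linear saving, since I/O, minimum instance counts, and redundancy requirements put a floor under the bill.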
Example: ETL pipeline
Daily ETL on 50M rows (JSON parsing, validation, aggregation, CSV output), running on AWS c6i.4xlarge spot instances.
Our benchmarking methodology
We publish every detail of how we measure so you can reproduce our results.
Correctness first
Every benchmark result is validated against CPython ground truth. If we produce a different answer, the benchmark is marked as failed - regardless of how fast it ran.
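A check of this kind can be sketched in a few lines. The names validate, reference_add, and buggy_add are hypothetical and this is not our actual harness; it only illustrates the rule that a mismatch fails the benchmark no matter how fast it ran.

```python
def validate(reference_fn, candidate_fn, inputs):
    # Run both implementations on every input and collect mismatches.
    failures = []
    for args in inputs:
        expected = reference_fn(*args)
        actual = candidate_fn(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

def reference_add(a, b):
    # Ground truth: what CPython computes.
    return a + b

def buggy_add(a, b):
    # Deliberately wrong for large values, to show a failed benchmark.
    return (a + b) % 2**32

cases = [(1, 2), (2**31, 2**31), (-5, 5)]
print(validate(reference_add, reference_add, cases))  # no failures
print(validate(reference_add, buggy_add, cases))      # one failure
```

For floating-point workloads an exact != comparison is usually too strict, so a real harness would need a tolerance policy.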
Warm runs only
We report the minimum of multiple warm runs, not the first run. First-run times include compilation overhead, which is tracked separately. On later runs, compiled code is loaded from the cache, and a cache hit loads in under 5 ms.
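The measurement pattern looks roughly like this. It is a minimal sketch, assuming a hypothetical helper called measure and a default of five warm runs; the real harness records more than two numbers.

```python
import time

def measure(func, *args, warm_runs=5):
    # First call: may include one-off costs such as compilation.
    start = time.perf_counter()
    func(*args)
    first_run = time.perf_counter() - start

    # Warm calls: time each one and keep the fastest.
    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        func(*args)
        warm.append(time.perf_counter() - start)
    return first_run, min(warm)

first_run, best_warm = measure(sum, range(100_000))
print(f"first run: {first_run * 1e3:.3f} ms, best warm run: {best_warm * 1e3:.3f} ms")
```

Taking the minimum rather than the mean filters out noise from the scheduler and other processes, which can only make a run slower, never faster.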
Full disclosure
We publish hardware specs, the Python version, the Pyvorin version, CPU features, and the exact benchmark source code. You can run the same benchmarks on your own hardware.
See the numbers for yourself
Browse our public benchmark database with 48 real-world workloads.