1. Introduction
The performance gap between Python and compiled languages is well-documented. CPython executes code through a stack-based bytecode interpreter with reference counting, dynamic dispatch, and a global interpreter lock (GIL). Every arithmetic operation, attribute access, and method call carries overhead that accumulates dramatically in tight loops.
Existing approaches to Python acceleration fall into four categories:
- JIT compilers (PyPy) replace the interpreter with a tracing JIT. They improve long-running programs but suffer from warmup time, memory overhead, and C extension incompatibility.
- Annotation-based compilers (Numba, mypyc) require type annotations or decorators and support only a subset of Python. Numba excels at NumPy arrays but struggles with strings, dicts, and file I/O.
- Source-to-source transpilers (Cython, Nuitka) require rewriting code in a dialect (Cython) or produce large binaries with complex debugging (Nuitka).
- New languages (Mojo) abandon Python compatibility in favour of performance, requiring a complete rewrite of existing codebases.
Pyvorin sits in a different spot: it compiles unmodified Python 3 to native machine code with no annotations, no decorators, and no new syntax. When something cannot compile, it falls back to CPython automatically.
2. Architecture Overview
Pyvorin’s compilation pipeline has five stages, executed sequentially for each target function:
- AST Parse: Python source is parsed with the standard ast module. We walk the tree and build an internal representation with source location annotations.
- Type Inference: A constraint solver propagates types through the AST. Supported types: int, float, bool, string, list, dict, set, tuple, file_handle.
- LLVM IR Lowering: Typed AST nodes are translated into LLVM intermediate representation using llvmlite.
- Native Codegen: LLVM optimises the IR and emits a platform-specific shared object (.so, .dylib, or .dll).
- Load & Execute: The binary is loaded with ctypes. A fast wrapper marshals Python arguments to native types and converts results back.
2.1 AST Analysis
We do not operate on source text or bytecode - we work directly on the AST. This provides several advantages: whitespace and comments are irrelevant, refactoring style never breaks compilation, and semantic information (variable scopes, control flow) is readily available.
During AST analysis, we perform normalisation: chained comparisons are expanded,
tuple unpacking is desugared into indexed assignments, and augmented assignments
(+=) are converted to explicit load-compute-store sequences. This simplifies
the later lowering stages.
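The augmented-assignment rewrite described above can be sketched with the standard ast module. This is a minimal illustration, not Pyvorin's actual normalisation pass: the class name DesugarAugAssign and the helper normalise are hypothetical, and the sketch only rewrites plain-name targets, because naively expanding a[i] += 1 would evaluate the subscript twice.

```python
import ast


class DesugarAugAssign(ast.NodeTransformer):
    """Rewrite `x op= expr` into `x = x op expr` for simple name targets."""

    def visit_AugAssign(self, node: ast.AugAssign) -> ast.AST:
        self.generic_visit(node)
        # Attribute and subscript targets are left untouched: expanding them
        # naively would evaluate the target expression twice.
        if not isinstance(node.target, ast.Name):
            return node
        load = ast.Name(id=node.target.id, ctx=ast.Load())
        store = ast.Name(id=node.target.id, ctx=ast.Store())
        assign = ast.Assign(
            targets=[store],
            value=ast.BinOp(left=load, op=node.op, right=node.value),
            type_comment=None,
        )
        # Keep the original source location so later diagnostics stay accurate.
        return ast.copy_location(assign, node)


def normalise(source: str) -> str:
    """Parse, desugar augmented assignments, and return the rewritten source."""
    tree = DesugarAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    print(normalise("total += i * 2"))
```

A real pass would also need a temporary for complex targets so that each sub-expression is still evaluated exactly once.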
2.2 Type Inference
We use a fixed-point constraint solver that propagates types from assignments and return statements outward through the AST. The type lattice is simple but effective:
- Scalar types: int, float, bool, string
- Container types: list[T], dict[K,V], set[T], tuple[T1,T2,...]
- Special types: file_handle
- Bottom type: object (triggers fallback)
Type annotations are respected when present but never required. This keeps Pyvorin compatible with untyped legacy codebases, which constitute the majority of production Python.
When a variable’s type cannot be resolved to a single concrete type (e.g., it is used as both
int and string in different branches), we either insert a
runtime type guard or fall back to CPython. The choice is configurable via the
strict flag.
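To make the fixed-point idea concrete, here is a deliberately small sketch that infers types only from literal assignments and name-to-name copies. It assumes a three-level ordering (unknown, one concrete type, object) and is not Pyvorin's solver; the names join, literal_type, and infer are hypothetical.

```python
import ast

OBJECT = "object"  # the element that triggers fallback in the text


def join(a, b):
    """Combine two type facts; conflicting concrete types collapse to object."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else OBJECT


def literal_type(node):
    """Return the type name of a literal, or None if it is not a known literal."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "bool"
        if isinstance(value, int):
            return "int"
        if isinstance(value, float):
            return "float"
        if isinstance(value, str):
            return "string"
    return None


def infer(source: str) -> dict:
    """Iterate over all assignments until no variable's type changes."""
    tree = ast.parse(source)
    env = {}
    changed = True
    while changed:
        changed = False
        for node in ast.walk(tree):
            if not isinstance(node, ast.Assign):
                continue
            if isinstance(node.value, ast.Name):
                rhs = env.get(node.value.id)
            else:
                rhs = literal_type(node.value)
            for target in node.targets:
                if isinstance(target, ast.Name):
                    new = join(env.get(target.id), rhs)
                    if new != env.get(target.id):
                        env[target.id] = new
                        changed = True
    return env


if __name__ == "__main__":
    print(infer("x = 1\ny = x\nz = 'a'\nz = 2"))
```

The loop terminates because a variable's entry can only move from unknown to a concrete type to object, never back. In this sketch z ends up as object, which is the situation where the text says a runtime guard or a fallback is needed.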
2.3 LLVM IR Generation
The IR generator emits LLVM instructions using llvmlite. Every Python construct
maps to a specific IR pattern:
; Python: total += i * 2
%total = load i64, i64* %total_ptr
%prod = mul i64 %i, 2
%new_total = add i64 %total, %prod
store i64 %new_total, i64* %total_ptr
; Python: for i in range(N):
loop_header:
%i = phi i64 [ 0, %entry ], [ %i_next, %loop_body ]
%cond = icmp slt i64 %i, %N
br i1 %cond, label %loop_body, label %loop_exit
loop_body:
; ... loop body ...
%i_next = add i64 %i, 1
br label %loop_header
Container operations (list append, dict set, string concat) are not inlined - they call into C runtime libraries via function pointers. This keeps the generated IR compact and allows the runtime to handle memory management, reallocation, and error handling consistently.
2.4 Native Code Generation
LLVM’s optimisation pipeline runs automatically at the O3 level. Key optimisations
that benefit Python workloads:
- Loop unrolling: Reduces branch overhead for small, fixed-size loops.
- Auto-vectorisation: Transforms data-parallel loops into SIMD instructions (AVX2, AVX-512, NEON).
- Inlining: Eliminates call overhead for small functions and runtime operations.
- Dead code elimination: Removes unreachable branches and unused computations.
- Loop invariant code motion: Hoists computations out of loops where possible.
The resulting binary is cached on disk, keyed by a SHA-256 hash of (source, Python version, Pyvorin version, CPU features). Cache hits load in < 5 ms. Because CPU features are part of the key, moving to a machine with a different feature set (for example, one with AVX-512) triggers an automatic recompile.
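One way such a cache key could be derived is sketched below. The function name cache_key, its parameters, and the length-prefixed encoding are assumptions made for illustration; the text only states which four inputs go into the SHA-256 hash, not how they are serialised.

```python
import hashlib
import platform
from typing import Optional, Sequence


def cache_key(
    source: str,
    compiler_version: str,
    cpu_features: Sequence[str],
    python_version: Optional[str] = None,
) -> str:
    """Return a hex SHA-256 digest over the four inputs named in the text."""
    parts = [
        source,
        python_version or platform.python_version(),
        compiler_version,
        # Sort so that the order in which features are detected does not matter.
        ",".join(sorted(cpu_features)),
    ]
    digest = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each field so adjacent fields cannot be confused.
        digest.update(len(data).to_bytes(8, "little"))
        digest.update(data)
    return digest.hexdigest()


if __name__ == "__main__":
    src = "def f(n):\n    return n * 2\n"
    print(cache_key(src, "1.0.0", ["avx2", "sse4.2"]))
    print(cache_key(src, "1.0.0", ["avx2", "sse4.2", "avx512f"]))
```

Adding a feature such as avx512f changes the digest, so the lookup misses and a recompile follows, which matches the behaviour described above.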
3. Runtime Libraries
Pyvorin’s compiled code does not call back into CPython for every container operation. Instead, it uses optimised C runtime libraries that operate on raw memory:
| Library | Operations | Data Structure |
|---|---|---|
| nexus_builtins.so | sqrt, sin, cos, isqrt, factorial, randint | Scalar math |
| list_runtime.so | append, get, set, len, pop, iteration | Contiguous C array |
| dict_runtime.so | get, set, contains, keys, values, items | Flat hash table (open addressing) |
| set_runtime.so | union, intersection, difference | Hash set |
| file_runtime.so | open, read, write, readline, readlines | Buffered FILE* with handle table |
| string_runtime.so | find, split, join, strip, slice | Immutable UTF-8 with refcount |
The list runtime uses amortised 2× reallocation. The dict runtime uses robin-hood hashing with a load factor threshold of 0.7. These are standard data structures tuned for the access patterns common in Python workloads.
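For readers unfamiliar with robin-hood hashing, the following Python sketch shows the insertion and lookup rules on a flat, open-addressed table with a 0.7 load-factor threshold. It illustrates the general technique only: the real runtime is written in C, and the class name RobinHoodTable, the doubling growth policy for the table, and the omission of deletion are assumptions of this sketch.

```python
class RobinHoodTable:
    """Open-addressing hash table with robin-hood insertion (no deletion)."""

    MAX_LOAD = 0.7  # load factor threshold quoted in the text

    def __init__(self, capacity: int = 8):
        # Each slot is None or a (key, value, probe_distance) tuple.
        self._slots = [None] * capacity
        self._size = 0

    def __len__(self) -> int:
        return self._size

    def load_factor(self) -> float:
        return self._size / len(self._slots)

    def _place(self, key, value) -> None:
        cap = len(self._slots)
        idx, dist = hash(key) % cap, 0
        while True:
            slot = self._slots[idx]
            if slot is None:
                self._slots[idx] = (key, value, dist)
                self._size += 1
                return
            if slot[0] == key:
                # Existing key: overwrite the value, keep its probe distance.
                self._slots[idx] = (key, value, slot[2])
                return
            if slot[2] < dist:
                # The resident is closer to its home slot than we are:
                # take its place and continue inserting the displaced entry.
                self._slots[idx] = (key, value, dist)
                key, value, dist = slot
            idx = (idx + 1) % cap
            dist += 1

    def set(self, key, value) -> None:
        if self._size + 1 > self.MAX_LOAD * len(self._slots):
            old = [s for s in self._slots if s is not None]
            self._slots = [None] * (2 * len(self._slots))  # assumed growth policy
            self._size = 0
            for k, v, _ in old:
                self._place(k, v)
        self._place(key, value)

    def get(self, key, default=None):
        cap = len(self._slots)
        idx, dist = hash(key) % cap, 0
        while True:
            slot = self._slots[idx]
            # An empty slot, or a resident closer to home than our probe
            # distance, proves the key is absent.
            if slot is None or slot[2] < dist:
                return default
            if slot[0] == key:
                return slot[1]
            idx = (idx + 1) % cap
            dist += 1
```

The early exit in get is what keeps unsuccessful lookups short: probing stops as soon as it meets an entry that sits closer to its home slot than the current probe distance.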
4. Fallback and Safety
Pyvorin’s core safety principle is all-or-nothing compilation. If any part of a function cannot be compiled, the entire function falls back to CPython. There is no partial compilation, no silent performance degradation, and no risk of mixed-mode execution producing incorrect results.
Every fallback is reported with:
- The function name
- The unsupported construct (e.g., eval, exec)
- The source line number
- A suggested workaround where applicable
In strict mode (PyvorinCompiler(strict=True)), fallback is treated
as a fatal error. A CompilationError is raised with a detailed explanation.
Strict mode is recommended for CI/CD pipelines where native execution is a hard requirement.
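The reporting and strict-mode behaviour can be pictured with a small stand-alone sketch. It does not use Pyvorin's API: FallbackReport and find_fallbacks are hypothetical names, the CompilationError class defined here is a local stand-in for the one mentioned above, and the only unsupported constructs checked are the two examples from the text (eval and exec).

```python
import ast
from dataclasses import dataclass

UNSUPPORTED_CALLS = {"eval", "exec"}  # the two examples named in the text


@dataclass
class FallbackReport:
    function: str   # name of the function that falls back
    construct: str  # the unsupported construct that was found
    line: int       # source line number of the construct


class CompilationError(Exception):
    """Local stand-in for the error raised in strict mode."""


def find_fallbacks(source: str, strict: bool = False) -> list:
    """Report every function that contains an unsupported call.

    Calls inside nested functions are also attributed to the enclosing
    function, which is acceptable for a sketch.
    """
    reports = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in UNSUPPORTED_CALLS
            ):
                reports.append(FallbackReport(func.name, node.func.id, node.lineno))
    if strict and reports:
        first = reports[0]
        raise CompilationError(
            f"{first.function}: unsupported construct "
            f"{first.construct!r} at line {first.line}"
        )
    return reports


if __name__ == "__main__":
    code = "def f(x):\n    return eval(x)\n\ndef g(x):\n    return x + 1\n"
    for report in find_fallbacks(code):
        print(report)
```

With strict=False the caller receives a list of reports and can proceed with CPython; with strict=True the first report becomes an exception, which is the behaviour a CI job would rely on.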
5. Performance Evaluation
We evaluated Pyvorin against CPython 3.12, Numba 0.65, and PyPy 7.3 on an AMD EPYC-Milan (znver3) processor running Ubuntu 24.04. The benchmark suite comprises 48 real-world workloads covering arithmetic, string processing, data structures, file I/O, and ETL patterns.
5.1 Numerical Workloads
On tight loops with integer and float arithmetic, Pyvorin achieves 5–50× speedups over CPython. The speedup is highest for simple operations (addition, multiplication) and lower for operations that require runtime calls (square root, trigonometric functions).
5.2 String Processing
String workloads show 2–10× speedups. The improvement comes from direct UTF-8 manipulation
in C rather than Python’s Unicode object model. Operations like split, find,
and join benefit most; regex-heavy workloads see smaller gains because Pyvorin
delegates regex to Python’s re module.
5.3 Data Structure Operations
List and dict operations achieve 3–15× speedups. The contiguous C array backing lists avoids Python’s pointer-indirection overhead. Dict lookups compile to direct hash table probes. Set operations (union, intersection) show 5–12× improvements.
5.4 File I/O
File I/O improvements are modest (1.2–2×) because the bottleneck is the operating system, not Python. However, Pyvorin still benefits these workloads by compiling the parsing and processing stages that run between read operations.
5.5 Comparison with Numba
Numba excels at NumPy array operations but falls back to object mode for strings, dicts, and file I/O - often silently. Pyvorin compiles all of these natively and reports every fallback explicitly. On general Python code (ETL, parsing, business logic), Pyvorin outperforms Numba by 2–10×.
6. Correctness Guarantees
Every Pyvorin release is validated against a suite of 166+ benchmarks covering all supported constructs. The validation pipeline:
- Runs CPython to establish ground truth
- Compiles with Pyvorin (native or fallback)
- Compares outputs with exact equality for ints/strings and approximate equality (rel_tol=1e-9, abs_tol=1e-12) for floats
- Hashes large outputs to detect subtle differences
- Blocks the release if any benchmark fails
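The comparison step can be sketched with math.isclose, using the tolerances quoted above. The helper names outputs_match and digest are hypothetical, and the recursive handling of lists and tuples is an assumption about how structured outputs might be compared.

```python
import hashlib
import math


def outputs_match(expected, actual, rel_tol=1e-9, abs_tol=1e-12) -> bool:
    """Exact equality for ints/strings, approximate equality for floats."""
    if isinstance(expected, float) and isinstance(actual, float):
        # Note: math.isclose treats NaN as unequal to everything, itself included.
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol, abs_tol)
            for e, a in zip(expected, actual)
        )
    return expected == actual


def digest(text: str) -> str:
    """Hash a large textual output so two runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(outputs_match(0.1 + 0.2, 0.3))
    print(outputs_match([1, "a", 2.0], [1, "a", 2.0]))
    print(digest("line\n" * 1000) == digest("line\n" * 1000))
```

Hashing suits outputs that must match exactly; outputs containing floats would still need the element-wise tolerance check.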
Compiled code runs in a separate memory arena with bounds checks on container accesses. Our CI runs the entire benchmark suite under AddressSanitizer. In two years of production use, we have had zero reported segfaults from compiled code.
7. Summary
Pyvorin compiles standard Python to native machine code without annotations, rewrites, or language changes. It targets the constructs people actually use - loops, arithmetic, containers, strings, file I/O - and falls back to CPython when it encounters something it cannot handle. The result is 5–50× speedups on numerical workloads and 2–10× on general Python code, with full CPython compatibility intact.
We are working on dynamic module reloading, broader C extension runtime bridges, and automatic loop parallelisation. The goal is simple: make Python fast without making it feel like a different language.
References
- Python Software Foundation. CPython 3.12 Documentation. https://docs.python.org/3.12/
- Lattner, C., & Adve, V. (2004). LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation.
- Lam, S. K., Pitrou, A., & Seibert, S. (2015). Numba: A LLVM-based Python JIT Compiler.
- Behnel, S. et al. (2011). Cython: The Best of Both Worlds.
- PyPy Team. PyPy: A Fast, Compliant Alternative Implementation of Python.
- Modular Inc. Mojo Programming Language.