Whitepaper

Pyvorin: Native Compilation for Python

A technical deep-dive into how Pyvorin compiles Python to native machine code, the design decisions behind the compiler, and the performance characteristics of the resulting binaries.

May 2025 · 15 min read · PDF available on request

Abstract

Python is the most popular language for data engineering, scientific computing, and backend development, yet its standard interpreter (CPython) is orders of magnitude slower than natively compiled languages. Existing solutions - Numba, Cython, PyPy, Mojo - each require trade-offs: annotations, rewrites, language changes, or incomplete compatibility. We present Pyvorin, an ahead-of-time Python compiler that translates standard Python 3 source code to native machine code via LLVM, with no annotations, no rewrites, and transparent fallback to CPython for unsupported constructs. We achieve 5–50× speedups on numerical workloads and 2–10× on string-heavy code while maintaining full CPython compatibility and correctness guarantees.

1. Introduction

The performance gap between Python and compiled languages is well-documented. CPython executes code through a stack-based bytecode interpreter with reference counting, dynamic dispatch, and a global interpreter lock (GIL). Every arithmetic operation, attribute access, and method call carries overhead that accumulates dramatically in tight loops.

Existing approaches to Python acceleration fall into four categories:

  1. JIT compilers (PyPy) replace the interpreter with a tracing JIT. They improve long-running programs but suffer from warmup time, memory overhead, and C extension incompatibility.
  2. Annotation-based compilers (Numba, mypyc) require type annotations or decorators and support only a subset of Python. Numba excels at NumPy arrays but struggles with strings, dicts, and file I/O.
  3. Source-to-source transpilers (Cython, Nuitka) require rewriting code in a dialect (Cython) or produce large binaries with complex debugging (Nuitka).
  4. New languages (Mojo) abandon Python compatibility in favour of performance, requiring a complete rewrite of existing codebases.

Pyvorin sits in a different spot: it compiles unmodified Python 3 to native machine code with no annotations, no decorators, and no new syntax. When something cannot compile, it falls back to CPython automatically.

2. Architecture Overview

Pyvorin’s compilation pipeline has five stages, executed sequentially for each target function:

  1. AST Parse: Python source is parsed with the standard ast module. We walk the tree and build an internal representation with source location annotations.
  2. Type Inference: A constraint solver propagates types through the AST. Supported types: int, float, bool, string, list, dict, set, tuple, file_handle.
  3. LLVM IR Lowering: Typed AST nodes are translated into LLVM intermediate representation using llvmlite.
  4. Native Codegen: LLVM optimises the IR and emits a platform-specific shared object (.so, .dylib, or .dll).
  5. Load & Execute: The binary is loaded with ctypes. A fast wrapper marshals Python arguments to native types and converts results back.

2.1 AST Analysis

We do not operate on source text or bytecode - we work directly on the AST. This provides several advantages: whitespace and comments are irrelevant, refactoring style never breaks compilation, and semantic information (variable scopes, control flow) is readily available.

During AST analysis, we perform normalisation: chained comparisons are expanded, tuple unpacking is desugared into indexed assignments, and augmented assignments (+=) are converted to explicit load-compute-store sequences. This simplifies the later lowering stages.
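Two of these normalisation steps can be illustrated with the standard ast module. The following is a minimal sketch, not Pyvorin's actual implementation: the class name Normaliser and the helper normalise are hypothetical. It only rewrites augmented assignments to plain names, and only expands chained comparisons whose operands are names or constants, so that duplicating the middle operand cannot repeat a side effect.

```python
import ast
import copy


class Normaliser(ast.NodeTransformer):
    """Hypothetical normalisation pass: desugars `x += e` and `a < b < c`."""

    def visit_AugAssign(self, node):
        self.generic_visit(node)
        if not isinstance(node.target, ast.Name):
            return node  # attribute/subscript targets need more care; left as-is
        load = ast.Name(id=node.target.id, ctx=ast.Load())
        assign = ast.Assign(
            targets=[node.target],
            value=ast.BinOp(left=load, op=node.op, right=node.value),
        )
        return ast.copy_location(assign, node)

    def visit_Compare(self, node):
        self.generic_visit(node)
        operands = [node.left, *node.comparators]
        simple = all(isinstance(o, (ast.Name, ast.Constant)) for o in operands)
        if len(node.ops) < 2 or not simple:
            return node  # not chained, or operands might have side effects
        pairs = [
            ast.Compare(left=copy.deepcopy(lhs), ops=[op],
                        comparators=[copy.deepcopy(rhs)])
            for lhs, op, rhs in zip(operands, node.ops, operands[1:])
        ]
        return ast.copy_location(ast.BoolOp(op=ast.And(), values=pairs), node)


def normalise(source):
    """Parse, normalise, and return the rewritten source text."""
    tree = Normaliser().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(normalise("total += i * 2"))
print(normalise("ok = 0 <= i < n"))
```

Rewriting x += e as x = x + e is only equivalent when the target is an immutable value such as an int, float, or string; a real pass would have to treat in-place operations on lists differently.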

2.2 Type Inference

We use a fixed-point constraint solver that propagates types from assignments and return statements outward through the AST. The type lattice is simple but effective:

  • Scalar types: int, float, bool, string
  • Container types: list[T], dict[K,V], set[T], tuple[T1,T2,...]
  • Special types: file_handle
  • Top type: object (the least specific type; triggers fallback)

Type annotations are respected when present but never required. This makes us compatible with untyped legacy codebases, which constitute the majority of production Python.

When a variable’s type cannot be resolved to a single concrete type (e.g., it is used as both int and string in different branches), we either insert a runtime type guard or fall back to CPython. The choice is configurable via the strict flag.
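The idea of propagating types to a fixed point can be sketched in a few lines. This is a toy, flow-insensitive illustration under stated assumptions, not the solver described above: it handles only simple assignments to names, the promotion rules in expr_type are invented for the example, and every function name is hypothetical. A variable assigned two different concrete types is widened to object, which is the case that would need a guard or a fallback.

```python
import ast

OBJECT = "object"  # no single concrete type: needs a guard or a fallback
NUMERIC = ("bool", "int", "float")
CONST_TYPES = {bool: "bool", int: "int", float: "float", str: "string"}


def join(a, b):
    """Combine two inferred types; None means 'not known yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else OBJECT


def expr_type(node, env):
    """Infer the type of an expression, or None if it depends on unknowns."""
    if isinstance(node, ast.Constant):
        return CONST_TYPES.get(type(node.value), OBJECT)
    if isinstance(node, ast.Name):
        return env.get(node.id)
    if isinstance(node, ast.BinOp):
        left, right = expr_type(node.left, env), expr_type(node.right, env)
        if left is None or right is None:
            return None
        if left in NUMERIC and right in NUMERIC:
            # Toy promotion rule (an assumption of this sketch).
            if isinstance(node.op, ast.Div) or "float" in (left, right):
                return "float"
            return "int"
        if left == right == "string" and isinstance(node.op, ast.Add):
            return "string"
        return OBJECT
    return OBJECT


def infer_types(source):
    """Iterate over all simple assignments until no type changes."""
    assigns = [
        n for n in ast.walk(ast.parse(source))
        if isinstance(n, ast.Assign)
        and len(n.targets) == 1
        and isinstance(n.targets[0], ast.Name)
    ]
    env = {}
    changed = True
    while changed:
        changed = False
        for node in assigns:
            name = node.targets[0].id
            new = join(env.get(name), expr_type(node.value, env))
            if new is not None and new != env.get(name):
                env[name] = new
                changed = True
    return env


print(infer_types("x = 1\ny = x * 2.5\nw = 1\nw = 'a'"))
```

The loop terminates because each variable can only move from unknown to a concrete type and then to object; it never moves back.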

2.3 LLVM IR Generation

The IR generator emits LLVM instructions using llvmlite. Every Python construct maps to a specific IR pattern:

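; Python: total += i * 2
%total = load i64, i64* %total_ptr        ; read the current value
%prod = mul i64 %i, 2
%new_total = add i64 %total, %prod
store i64 %new_total, i64* %total_ptr     ; write the result back

; Python: for i in range(N):
loop_header:
  %i = phi i64 [ 0, %entry ], [ %i_next, %loop_body ]
  %cond = icmp slt i64 %i, %N
  br i1 %cond, label %loop_body, label %loop_exit
loop_body:
  ; ... loop body ...
  %i_next = add i64 %i, 1
  br label %loop_header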

Container operations (list append, dict set, string concat) are not inlined - they call into C runtime libraries via function pointers. This keeps the generated IR compact and allows the runtime to handle memory management, reallocation, and error handling consistently.

2.4 Native Code Generation

LLVM’s optimisation pipeline runs automatically at O3 level. Key optimisations that benefit Python workloads:

  • Loop unrolling: Reduces branch overhead for small, fixed-size loops.
  • Auto-vectorisation: Transforms data-parallel loops into SIMD instructions (AVX2, AVX-512, NEON).
  • Inlining: Eliminates call overhead for small functions and runtime operations.
  • Dead code elimination: Removes unreachable branches and unused computations.
  • Loop invariant code motion: Hoists computations out of loops where possible.

The resulting binary is cached on disk, keyed by a SHA-256 hash of (source, Python version, Pyvorin version, CPU features). Cache hits load in < 5 ms. Because CPU features are part of the key, moving to a machine with a different feature set (for example, one with AVX-512) triggers an automatic recompile.
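A cache key of this shape can be sketched with hashlib. The function name cache_key, the PYVORIN_VERSION placeholder, and the way CPU features are passed in are all assumptions made for illustration; the text specifies only which inputs go into the hash.

```python
import hashlib
import sys

PYVORIN_VERSION = "0.0.0"  # hypothetical placeholder, not a real release number


def cache_key(source, cpu_features, python_version=None,
              compiler_version=PYVORIN_VERSION):
    """Return a SHA-256 hex digest over the four inputs named in the text."""
    if python_version is None:
        python_version = "%d.%d.%d" % sys.version_info[:3]
    hasher = hashlib.sha256()
    parts = (source, python_version, compiler_version,
             ",".join(sorted(cpu_features)))  # sort so order does not matter
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        hasher.update(len(data).to_bytes(8, "little"))
        hasher.update(data)
    return hasher.hexdigest()


src = "def f(n):\n    return n * 2\n"
print(cache_key(src, {"avx2"}, python_version="3.12.0"))
print(cache_key(src, {"avx2", "avx512f"}, python_version="3.12.0"))
```

The two printed keys differ, which is why a binary built on an AVX2-only machine would not be reused on a machine that also reports AVX-512.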

3. Runtime Libraries

Pyvorin’s compiled code does not call back into CPython for every container operation. Instead, it uses optimised C runtime libraries that operate on raw memory:

Library            Operations                                  Data structure
nexus_builtins.so  sqrt, sin, cos, isqrt, factorial, randint   Scalar math
list_runtime.so    append, get, set, len, pop, iteration       Contiguous C array
dict_runtime.so    get, set, contains, keys, values, items     Flat hash table (open addressing)
set_runtime.so     union, intersection, difference             Hash set
file_runtime.so    open, read, write, readline, readlines      Buffered FILE* with handle table
string_runtime.so  find, split, join, strip, slice             Immutable UTF-8 with refcount

The list runtime doubles its capacity on reallocation, which gives amortised constant-time appends. The dict runtime uses Robin Hood hashing with a load factor threshold of 0.7. These are standard data structures tuned for the access patterns common in Python workloads.
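For readers unfamiliar with Robin Hood hashing, the following Python sketch shows the technique with the 0.7 load factor mentioned above. It is an illustration of the general algorithm only: the class name RobinHoodDict and its layout are hypothetical, deletion is omitted, and the actual runtime is written in C.

```python
class RobinHoodDict:
    """Open-addressing hash table with Robin Hood displacement (no deletion)."""

    MAX_LOAD = 0.7

    def __init__(self, capacity=8):
        # Each slot is None or a (key, value, probe_distance) tuple.
        self._slots = [None] * capacity
        self._size = 0

    def __len__(self):
        return self._size

    def set(self, key, value):
        if self._size + 1 > self.MAX_LOAD * len(self._slots):
            self._grow()
        self._insert(key, value)

    def get(self, key, default=None):
        idx = hash(key) % len(self._slots)
        dist = 0
        while True:
            slot = self._slots[idx]
            # An empty slot, or a resident closer to home than we are,
            # proves the key is absent.
            if slot is None or slot[2] < dist:
                return default
            if slot[0] == key:
                return slot[1]
            idx = (idx + 1) % len(self._slots)
            dist += 1

    def _insert(self, key, value):
        idx = hash(key) % len(self._slots)
        dist = 0
        while True:
            slot = self._slots[idx]
            if slot is None:
                self._slots[idx] = (key, value, dist)
                self._size += 1
                return
            s_key, s_value, s_dist = slot
            if s_key == key:
                self._slots[idx] = (key, value, s_dist)  # overwrite in place
                return
            if s_dist < dist:
                # Take the slot from the "richer" resident and keep
                # probing on its behalf.
                self._slots[idx] = (key, value, dist)
                key, value, dist = s_key, s_value, s_dist
            idx = (idx + 1) % len(self._slots)
            dist += 1

    def _grow(self):
        old = [s for s in self._slots if s is not None]
        self._slots = [None] * (2 * len(self._slots))
        self._size = 0
        for key, value, _ in old:
            self._insert(key, value)


table = RobinHoodDict()
for i in range(20):
    table.set(f"key{i}", i)
print(table.get("key7"), table.get("missing"), len(table))
```

Keeping probe distances balanced bounds the worst-case lookup length, and the early exit in get relies on that invariant.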

4. Fallback and Safety

Pyvorin’s core safety principle is all-or-nothing compilation. If any part of a function cannot be compiled, the entire function falls back to CPython. There is no partial compilation, no silent performance degradation, and no risk of mixed-mode execution producing incorrect results.

Every fallback is reported with:

  • The function name
  • The unsupported construct (e.g., eval, exec)
  • The source line number
  • A suggested workaround where applicable

In strict mode (PyvorinCompiler(strict=True)), fallback is treated as a fatal error. A CompilationError is raised with a detailed explanation. Strict mode is recommended for CI/CD pipelines where native execution is a hard requirement.
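The reporting behaviour described in this section can be sketched as follows. This is a hypothetical illustration, not Pyvorin's API: check_source, FallbackReport, and the UNSUPPORTED_CALLS set are invented for the example, and the CompilationError class here is a local stand-in for the error named above. The sketch scans each function for calls to eval or exec, reports the function name, construct, and line number, and raises in strict mode.

```python
import ast
from dataclasses import dataclass

UNSUPPORTED_CALLS = {"eval", "exec"}  # the two examples named in the text


class CompilationError(Exception):
    """Local stand-in for the error raised in strict mode."""


@dataclass
class FallbackReport:
    function: str
    construct: str
    line: int


def check_source(source, strict=False):
    """Return one FallbackReport per unsupported call; raise if strict."""
    reports = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in UNSUPPORTED_CALLS):
                reports.append(
                    FallbackReport(func.name, node.func.id, node.lineno))
    if strict and reports:
        first = reports[0]
        raise CompilationError(
            f"{first.function}: unsupported construct "
            f"'{first.construct}' at line {first.line}")
    return reports


SOURCE = "def f(x):\n    return eval(x)\n\ndef g(x):\n    return x + 1\n"
for report in check_source(SOURCE):
    print(report)
```

In this sketch, g produces no report and would be compiled, while f as a whole would run under CPython, matching the all-or-nothing rule at function granularity.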

5. Performance Evaluation

We evaluated Pyvorin against CPython 3.12, Numba 0.65, and PyPy 7.3 on an AMD EPYC-Milan (znver3) processor running Ubuntu 24.04. The benchmark suite comprises 48 real-world workloads covering arithmetic, string processing, data structures, file I/O, and ETL patterns.

5.1 Numerical Workloads

On tight loops with integer and float arithmetic, Pyvorin achieves 5–50× speedups over CPython. The speedup is highest for simple operations (addition, multiplication) and lower for operations that require runtime calls (square root, trigonometric functions).

  • Hash sum (1M ints): 48×
  • Array sum (1M floats): 35×
  • Monte Carlo simulation: 22×

5.2 String Processing

String workloads show 2–10× speedups. The improvement comes from direct UTF-8 manipulation in C rather than Python’s Unicode object model. Operations like split, find, and join benefit most; regex-heavy workloads see smaller gains because Pyvorin delegates regex to Python’s re module.

5.3 Data Structure Operations

List and dict operations achieve 3–15× speedups. The contiguous C array backing lists avoids Python’s pointer-indirection overhead. Dict lookups compile to direct hash table probes. Set operations (union, intersection) show 5–12× improvements.

5.4 File I/O

File I/O improvements are modest (1.2–2×) because the bottleneck is the operating system, not Python. However, Pyvorin still benefits these workloads by compiling the parsing and processing stages that run between read operations.

5.5 Comparison with Numba

Numba excels at NumPy array operations but has limited support for strings, dicts, and file I/O: depending on the Numba version and decorator options, such code either fails to compile in nopython mode or runs in the much slower object mode. Pyvorin compiles all of these natively and reports every fallback explicitly. On general Python code (ETL, parsing, business logic), Pyvorin outperforms Numba by 2–10×.

6. Correctness Guarantees

Every Pyvorin release is validated against a suite of 166+ benchmarks covering all supported constructs. The validation pipeline:

  1. Runs CPython to establish ground truth
  2. Compiles with Pyvorin (native or fallback)
  3. Compares outputs with exact equality for ints/strings and approximate equality (rel_tol=1e-9, abs_tol=1e-12) for floats
  4. Hashes large outputs to detect subtle differences
  5. Blocks the release if any benchmark fails
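The comparison rule in step 3 and the hashing in step 4 can be sketched with the standard library. The tolerances come from the text; the function names outputs_match and digest, the recursive handling of containers, and the treatment of NaN are assumptions of this sketch.

```python
import hashlib
import math


def outputs_match(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Exact equality for ints/strings, approximate equality for floats."""
    if type(expected) is not type(actual):
        return False  # e.g. 1 vs 1.0 counts as a mismatch in this sketch
    if isinstance(expected, float):
        if math.isnan(expected) and math.isnan(actual):
            return True  # assumption: two NaNs are treated as matching
        return math.isclose(expected, actual,
                            rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(expected, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol, abs_tol)
            for e, a in zip(expected, actual))
    if isinstance(expected, dict):
        return expected.keys() == actual.keys() and all(
            outputs_match(expected[k], actual[k], rel_tol, abs_tol)
            for k in expected)
    return expected == actual


def digest(value):
    """Hash a large, exactly comparable output (ints, strings) via its repr."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


print(outputs_match([1, "a", 0.1 + 0.2], [1, "a", 0.3]))
print(digest(list(range(1000))) == digest(list(range(1000))))
```

Hashing is only appropriate for outputs that must match exactly; float-bearing outputs need the element-wise comparison because a tolerance cannot be applied to a digest.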

Compiled code runs in a separate memory arena with bounds checks on container accesses. Our CI runs the entire benchmark suite under AddressSanitizer. In two years of production use, we have had zero reported segfaults from compiled code.

7. Summary

Pyvorin compiles standard Python to native machine code without annotations, rewrites, or language changes. It targets the constructs people actually use - loops, arithmetic, containers, strings, file I/O - and falls back to CPython when it encounters something it cannot handle. The result is 5–50× speedups on numerical workloads and 2–10× on string-heavy code, with full CPython compatibility intact.

We are working on dynamic module reloading, broader C extension runtime bridges, and automatic loop parallelisation. The goal is simple: make Python fast without making it feel like a different language.

