Why I Integrated Rust into a Python-Heavy Codebase

I want to be clear upfront: I’m not here to tell you to rewrite your Django app in Rust. That’s almost never the right answer.

What I am going to argue is that there’s a middle path most Python developers ignore — one that gets you Rust-level performance for the 5% of your codebase that actually needs it, while keeping Python for the 95% where developer productivity matters more than nanoseconds.

The Problem That Started This

We had a product indexing pipeline processing batch embeddings for 200M+ product records. Python handled the Django ORM, Celery orchestration, and API layer without issues. But one step — computing vector embeddings for semantic search — was a CPU-bound bottleneck that no amount of async or multiprocessing could fully solve.

The reason: Python’s GIL (Global Interpreter Lock). In the standard CPython build, only one thread executes Python bytecode at a time, so threads can’t run CPU-bound Python code in parallel across cores. You can work around it with multiprocessing, but the overhead of starting processes and serializing data between them adds up when you’re doing it 40,000 times per minute.
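If you want to see the effect on your own machine, a minimal sketch like the one below runs the same CPU-bound jobs sequentially and on a thread pool. The function names and job sizes are made up for illustration. On a standard GIL-enabled CPython build the two timings usually come out close, but the exact numbers depend on your hardware and interpreter.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(n: int) -> int:
    """Pure-Python CPU-bound work; it holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(jobs):
    """Run every job one after another on the calling thread."""
    return [sum_of_squares(n) for n in jobs]


def run_threaded(jobs, workers=4):
    """Run the same jobs on a thread pool (worker count is an arbitrary choice)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sum_of_squares, jobs))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [300_000] * 8  # illustrative workload, not our production numbers
    _, sequential_s = timed(run_sequential, jobs)
    _, threaded_s = timed(run_threaded, jobs)
    print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

Both variants return identical results; only the wall-clock time is worth comparing.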

The options I considered:

1. Pure Python + multiprocessing: works, but process spawning overhead is significant for short-lived tasks
2. C extension: effective but unsafe, manual memory management, painful to maintain
3. Rewrite the whole thing in Rust: massively expensive, abandons the existing working system
4. Rust module callable from Python via PyO3: surgical, safe, maintainable

I went with option 4.

PyO3: Rust Modules That Look Like Python

PyO3 is a Rust library that lets you write Python extension modules in Rust. The Python side calls your Rust function exactly like any other Python function. No FFI juggling, no ctypes, no subprocess calls.

Here’s a minimal example: a function callable from Python that computes cosine similarity between two vectors, implemented in Rust.

Rust side (src/lib.rs):

use pyo3::prelude::*;

#[pyfunction]
fn cosine_similarity(a: Vec<f32>, b: Vec<f32>) -> PyResult<f32> {
    if a.len() != b.len() {
        return Err(pyo3::exceptions::PyValueError::new_err(
            "Vectors must have equal length"
        ));
    }

    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();

    if norm_a == 0.0 || norm_b == 0.0 {
        return Ok(0.0);
    }

    Ok(dot / (norm_a * norm_b))
}

#[pymodule]
fn vector_ops(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(cosine_similarity, m)?)?;
    Ok(())
}

Cargo.toml:

[package]
name = "vector_ops"
version = "0.1.0"
edition = "2021"

[lib]
name = "vector_ops"
crate-type = ["cdylib"]

[dependencies]
pyo3 = { version = "0.20", features = ["extension-module"] }

Python side:

import vector_ops

similarity = vector_ops.cosine_similarity([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])

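That’s it. Build with maturin develop for local development (it installs the module into the active virtualenv; add --release when benchmarking, since the default is a debug build) or maturin build --release to produce a wheel for production. Python then imports the compiled shared library (a .so file on Linux) like any other module.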
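One thing that can help when you first wire this up is a pure-Python reference to check the extension against. The sketch below mirrors the Rust logic above (equal-length check, zero-norm vectors return 0.0). The helper names cosine_similarity_py and matches_reference are my own for illustration, and the tolerances are a guess at what f32 arithmetic needs, not measured values.

```python
import math


def cosine_similarity_py(a, b):
    """Pure-Python reference that mirrors the Rust function's behaviour."""
    if len(a) != len(b):
        raise ValueError("Vectors must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_reference(candidate_fn, a, b, rel_tol=1e-5, abs_tol=1e-6):
    """Check a candidate implementation against the reference.

    The tolerances are assumptions: the Rust side computes in f32,
    Python in f64, so the results will not be bit-identical.
    """
    return math.isclose(
        candidate_fn(a, b),
        cosine_similarity_py(a, b),
        rel_tol=rel_tol,
        abs_tol=abs_tol,
    )
```

With the extension built, a call such as matches_reference(vector_ops.cosine_similarity, [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]) makes a quick parity check in a unit test.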

Batch Processing: Where It Really Matters

Single-vector operations aren’t where you feel the difference. It’s bulk operations. Here’s a real pattern we use — process a batch of vectors in Rust, returning results to Python:

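use pyo3::prelude::*;
// Requires rayon in Cargo.toml under [dependencies], e.g. rayon = "1".
use rayon::prelude::*;  // parallel iterator

// Register this in the #[pymodule] function, as with cosine_similarity:
//   m.add_function(wrap_pyfunction!(batch_normalize, m)?)?;
#[pyfunction]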
fn batch_normalize(py: Python, vectors: Vec<Vec<f32>>) -> PyResult<Vec<Vec<f32>>> {
    // Release GIL for CPU-bound work
    py.allow_threads(|| {
        Ok(vectors
            .par_iter()  // rayon parallel iterator
            .map(|v| {
                let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
                if norm == 0.0 {
                    v.clone()
                } else {
                    v.iter().map(|x| x / norm).collect()
                }
            })
            .collect())
    })
}

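The key line is py.allow_threads(|| { ... }). This releases the Python GIL for the duration of the Rust computation, so other Python threads can run in the meantime. The arguments have already been converted into plain Rust Vecs by the time the closure runs, so nothing inside it touches a Python object. With Rayon’s parallel iterators, the Rust side uses all available CPU cores.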

In our case, this made batch normalization of 5,000 vectors go from ~2.3 seconds (Python + NumPy) to ~0.08 seconds.
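Your numbers will differ, so it’s worth measuring on your own data. Here is a rough sketch of a timing harness with a pure-Python baseline (not the NumPy version we compared against). The names batch_normalize_py, best_time, and make_vectors are illustrative, and the vector dimension of 128 is an arbitrary assumption.

```python
import math
import random
import time


def batch_normalize_py(vectors):
    """Pure-Python baseline: scale each vector to unit length.

    Zero vectors are returned unchanged, matching the Rust version.
    """
    out = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        out.append(list(v) if norm == 0.0 else [x / norm for x in v])
    return out


def best_time(fn, vectors, repeats=5):
    """Best wall-clock time in seconds over several runs of fn(vectors)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(vectors)
        best = min(best, time.perf_counter() - start)
    return best


def make_vectors(count, dim, seed=0):
    """Generate reproducible random test vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    data = make_vectors(5_000, 128)  # dimension is an assumption
    print(f"pure Python: {best_time(batch_normalize_py, data):.4f}s")
    # With the extension built, time it the same way:
    #   best_time(vector_ops.batch_normalize, data)
```

Timing the extension through the same harness also counts the cost of converting Python lists to Rust vectors and back, which is part of what you pay in production.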

Deploying to Production

The tricky part isn’t writing the Rust — it’s distributing the compiled binary as part of your Python package.

For our Django app on AWS, we build the Rust extension as part of the Docker image:

FROM python:3.11-slim AS builder

RUN apt-get update && apt-get install -y curl build-essential
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

RUN pip install maturin

WORKDIR /app/vector_ops
COPY vector_ops/ .
RUN maturin build --release

FROM python:3.11-slim
COPY --from=builder /app/vector_ops/target/wheels/*.whl /tmp/
RUN pip install /tmp/*.whl

The wheel installs like any other Python package. No Rust toolchain needed in production.
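A cheap safeguard is a smoke test that fails the image build if the extension can’t be imported. This is a minimal sketch, not what our pipeline literally runs; the helper name extension_importable is hypothetical, and "vector_ops" is the module name from the examples above.

```python
import importlib
import importlib.util
import sys


def extension_importable(module_name: str) -> bool:
    """Return True if the module can be found and imported here."""
    if importlib.util.find_spec(module_name) is None:
        return False
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Found on disk but failed to load, e.g. a wheel built for
        # a different Python version or platform.
        return False
    return True


if __name__ == "__main__":
    # Exit non-zero so a Docker RUN step or CI job fails loudly.
    sys.exit(0 if extension_importable("vector_ops") else 1)
```

Saved as a script and run in a RUN step of the final stage, it catches a mismatched wheel at build time instead of at the first request.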

When NOT to Do This

I want to be honest about the costs.

Maintenance surface: You now have two languages in your codebase. Every Rust function is a function Python developers can’t easily debug or modify. Keep the boundary narrow.

Compilation time: Rust compiles slowly. Our CI pipeline adds ~4 minutes for the Rust build step. Worth it for us, but annoying when you want fast iteration cycles.

Learning curve: If you’re the only person on your team who knows Rust, you’ve created a bus-factor problem. Make sure someone else can maintain it.

The threshold: I wouldn’t add Rust unless a profiler has shown that the specific function in question is responsible for at least 20–30% of your total processing time, and you’ve already eliminated the obvious Python-level inefficiencies (unbatched queries, avoidable loops, missing indexes).

If you haven’t profiled, profile first. The bottleneck is rarely where you think it is.
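To put a number on that threshold, something like the sketch below uses the standard library’s cProfile and pstats to estimate what share of total profiled time one function accounts for. The functions hot_loop, cheap_step, and pipeline are stand-ins I made up, and the helper cumulative_share is illustrative rather than part of any library.

```python
import cProfile
import pstats


def hot_loop(n: int) -> float:
    """Stand-in for the suspected bottleneck."""
    total = 0.0
    for i in range(n):
        total += (i * i) ** 0.5
    return total


def cheap_step(n: int) -> int:
    """Stand-in for everything else in the pipeline."""
    return sum(range(n))


def pipeline() -> None:
    hot_loop(200_000)
    cheap_step(1_000)


def cumulative_share(target, func_name: str) -> float:
    """Fraction of total profiled time spent in func_name, callees included.

    Returns 0.0 if the function never shows up in the profile.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    target()
    profiler.disable()

    stats = pstats.Stats(profiler)
    total = stats.total_tt  # total time across all profiled functions
    # Keys are (filename, lineno, name); ct is cumulative time.
    for (_file, _line, name), (_cc, _nc, _tt, ct, _callers) in stats.stats.items():
        if name == func_name:
            return ct / total if total else 0.0
    return 0.0


if __name__ == "__main__":
    share = cumulative_share(pipeline, "hot_loop")
    print(f"hot_loop accounts for about {share:.0%} of profiled time")
```

Matching on the bare function name is a simplification; if several functions share a name, compare the filename and line number as well.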

The Right Mental Model

Think of Rust not as a replacement for Python, but as a power tool for specific operations. Python stays responsible for orchestration, business logic, API handling, and anything where developer ergonomics matter. Rust handles the tight loops, the CPU-bound transforms, the memory-sensitive operations.

The boundary between them should be small and explicit. Ours is about 400 lines of Rust for a 40,000-line Python codebase. That’s the right ratio.