Batch Processing and ETL

May 30, 2026 | 5 min read

ETL Pipeline Acceleration

Pyvorin excels at CPU-bound data transformation. Typical ETL stages that benefit:

  • Data cleaning and validation
  • Type conversion and normalisation
  • Aggregation and grouping
  • Deduplication logic
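
As a rough illustration of the last two stages, here is a minimal sketch in plain Python. The helper names dedupe_by_id and total_by_category and the sample rows are assumptions made for this example; no Pyvorin-specific API is shown.

```python
from collections import defaultdict


def dedupe_by_id(rows):
    """Keep the first row seen for each id, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def total_by_category(rows):
    """Sum the amount field per category."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)


# Hypothetical sample data: the third row repeats id 1 and is dropped.
rows = [
    {"id": 1, "amount": 10.0, "category": "BOOKS"},
    {"id": 2, "amount": 5.0, "category": "TOYS"},
    {"id": 1, "amount": 10.0, "category": "BOOKS"},
    {"id": 3, "amount": 2.5, "category": "BOOKS"},
]
totals = total_by_category(dedupe_by_id(rows))
```

Both functions are tight loops over in-memory data with no I/O, which is the CPU-bound shape described above.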

Example: CSV Processing

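def process_csv(rows):
    """Validate, convert and normalise rows (dicts of strings)."""
    results = []
    for row in rows:
        # validate_row is assumed to be defined elsewhere; invalid rows are skipped
        if validate_row(row):
            transformed = {
                'id': int(row['id']),                    # type conversion
                'amount': float(row['amount']) * 1.08,   # scale the amount by 8%
                'category': row['category'].upper(),     # normalisation
            }
            results.append(transformed)
    return results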
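
The example calls validate_row without defining it. One possible version is sketched below, assuming rows are dicts of strings such as those produced by csv.DictReader. The REQUIRED_FIELDS tuple, the validation rules and the sample data are all assumptions for illustration.

```python
import csv
import io

# Hypothetical rule set: every field must be present and non-empty.
REQUIRED_FIELDS = ("id", "amount", "category")


def validate_row(row):
    """Return True if all required fields are present and the numeric ones parse."""
    if any(not row.get(field) for field in REQUIRED_FIELDS):
        return False
    try:
        int(row["id"])
        float(row["amount"])
    except ValueError:
        return False
    return True


# Hypothetical input: row "x" has a non-numeric id, row 3 has an empty amount.
SAMPLE = "id,amount,category\n1,10.00,books\nx,5.00,toys\n3,,games\n4,2.50,food\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
valid_rows = [row for row in rows if validate_row(row)]
```

Checking that the numeric fields parse before the transformation runs means the int() and float() calls in process_csv cannot raise on a row that passed validation.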

Batch Size Tuning

  • Too small: compilation overhead dominates.
  • Too large: memory pressure from buffering.
  • Sweet spot: typically 1,000–10,000 rows per batch; wider rows favour the lower end of that range.
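
To experiment with batch sizes, the input needs to be split into fixed-size chunks. A minimal sketch follows; the name iter_batches and the default of 5,000 rows are assumptions picked from the middle of the range above, not values prescribed by Pyvorin.

```python
from itertools import islice


def iter_batches(rows, batch_size=5_000):
    """Yield lists of at most batch_size rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls only batch_size items, so one batch is buffered at a time
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 12,000 rows split into batches of 5,000: the last batch is partial.
sizes = [len(batch) for batch in iter_batches(range(12_000), batch_size=5_000)]
```

Because only one batch is materialised at a time, peak memory scales with batch_size rather than with the size of the input, which makes the trade-off in the list above easy to measure.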

Streaming with Fallback

For streaming pipelines, compile the transformation function once, then call it on each batch. I/O stays in CPython:

for batch in stream_batches(source):
    transformed = compiled_transform(batch)  # native: CPU-bound transform
    sink.write(transformed)                  # CPython fallback: I/O
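
The loop can be exercised end to end with stand-ins. In the sketch below, transform_batch is a plain Python function standing in for the compiled one (the compile step is not shown, since its API is not covered here), and CsvSink and this stream_batches are hypothetical helpers written for the example.

```python
import csv
import io

FIELDS = ["id", "amount", "category"]


def transform_batch(batch):
    """Stand-in for the compiled function: pure CPU work, no I/O."""
    return [
        {
            "id": int(row["id"]),
            "amount": float(row["amount"]) * 1.08,
            "category": row["category"].upper(),
        }
        for row in batch
    ]


class CsvSink:
    """Hypothetical sink that writes transformed batches as CSV."""

    def __init__(self, handle):
        self._writer = csv.DictWriter(handle, fieldnames=FIELDS)
        self._writer.writeheader()
        self.rows_written = 0

    def write(self, batch):
        self._writer.writerows(batch)
        self.rows_written += len(batch)


def stream_batches(source, batch_size):
    """Yield consecutive slices of an in-memory list."""
    for start in range(0, len(source), batch_size):
        yield source[start:start + batch_size]


source = [{"id": str(i), "amount": "100", "category": "books"} for i in range(5)]
out = io.StringIO()
sink = CsvSink(out)
for batch in stream_batches(source, batch_size=2):
    sink.write(transform_batch(batch))  # transform first, then hand off to I/O
```

Keeping the transformation free of I/O is what lets it be swapped for a compiled version later without touching the sink.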