Pyvorin for NLP Preprocessing

May 30, 2026 | 5 min read

Tokenisation

Byte-Pair Encoding and WordPiece tokenisation in compiled loops.

def bpe_tokenize(text, merges):
    tokens = list(text)
    for merge in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] + tokens[i+1] == merge:
                tokens[i] = merge
                del tokens[i+1]
            else:
                i += 1
    return tokens

Normalisation

Unicode normalisation, lowercasing, and punctuation removal.

Feature Extraction

TF-IDF and n-gram generation.