Pyvorin for NLP Preprocessing
May 30, 2026 | 5 min read
Tokenisation
Byte-Pair Encoding and WordPiece tokenisation in compiled loops.
def bpe_tokenize(text, merges):
tokens = list(text)
for merge in merges:
i = 0
while i < len(tokens) - 1:
if tokens[i] + tokens[i+1] == merge:
tokens[i] = merge
del tokens[i+1]
else:
i += 1
return tokens
Normalisation
Unicode normalisation, lowercasing, and punctuation removal.
Feature Extraction
TF-IDF and n-gram generation.