When working with independent data, spreading computations across multiple cores often provides an easy way to increase a program’s throughput, ideally linearly in the number of cores.
One of the most common forms of parallelism is splitting loops across multiple threads. Unfortunately, Python does not support this kind of parallelism easily, especially when computations are CPU-bound: plain threads are often not an option due to the GIL, and working with processes is tedious. This is particularly true when working with PyTorch tensors, as they have to be passed between processes over queues.
The `multiprocessing` module of PyBlaze aims to make it easy to speed up such computations by providing a high-level interface.
PyBlaze uses the term *vectorization* for the process of parallelizing for-loops of the following form, where `map` is the function applied to each item:

```python
result = []
for item in iterable:
    result.append(map(item))
```
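As a runnable plain-Python sketch (the helper name `run_sequential` is ours, not part of PyBlaze), the pattern amounts to:

```python
def run_sequential(fn, iterable):
    # Apply `fn` to every item in order and collect the results --
    # exactly the loop that vectorization parallelizes.
    result = []
    for item in iterable:
        result.append(fn(item))
    return result

run_sequential(str.upper, ["a", "b", "c"])  # → ['A', 'B', 'C']
```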
The class providing this functionality is `Vectorizer` in the `multiprocessing` module. In the background, the vectorizer takes care of creating processes, ensuring their shutdown, and passing items and results between processes. Due to its simple interface, it can often serve as a drop-in replacement for existing for-loops, which not only reduces runtime but also enhances readability.
Consider, for example, a list `texts` of strings which you want to tokenize according to a complex function `tokenize`. The function takes a single string as input and returns its tokenization.
```python
import numpy as np
import time

texts = [
    ''.join(np.random.choice(['a', 'b', 'c', 'd', 'e', ' '], size=(20,)))
    for _ in range(100)
]

def tokenize(text):
    time.sleep(0.01)
    return text.split()
```
The sequential implementation is straightforward but not particularly efficient:
```python
def sequential(texts):
    return [tokenize(t) for t in texts]
```
The `Vectorizer` class can easily be used to compute the tokenizations of all texts in parallel:
```python
import pyblaze.multiprocessing as xmp

def parallel(texts):
    tokenizer = xmp.Vectorizer(tokenize, num_workers=4)
    return tokenizer.process(texts)
```
We can now compare the runtime of the sequential and the vectorized implementation. Timing `sequential(texts)` yields:

```
1.01 s ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

while `parallel(texts)` yields:

```
280 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
In this case, the vectorized implementation achieves an almost linear speedup by distributing work across processes.
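A quick back-of-the-envelope check (numbers taken from the timings above) shows where the remaining gap to a perfect 4x speedup comes from:

```python
# 100 texts, each taking ~10 ms to tokenize (due to the sleep)
sequential_estimate = 100 * 0.01          # ≈ 1.0 s, matching the measured 1.01 s
ideal_parallel = sequential_estimate / 4  # ≈ 0.25 s with 4 workers
observed_speedup = 1.01 / 0.280           # ≈ 3.6x
```

The ~30 ms difference between the ideal 250 ms and the measured 280 ms is the overhead of managing worker processes and shuttling items and results between them.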
The `Vectorizer` class also provides some more advanced features, such as initializing workers. Consult the class’s documentation for more information.
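For comparison, the same parallel map can be written with the standard library’s `multiprocessing.Pool`; `Vectorizer` wraps this kind of machinery behind a simpler interface. The following is a sketch, not PyBlaze code, and uses a simplified `tokenize` without the artificial delay:

```python
from multiprocessing import Pool

def tokenize(text):
    # Simplified toy tokenizer: split on whitespace.
    return text.split()

def parallel_stdlib(texts):
    # Distribute the per-item calls across four worker processes.
    with Pool(processes=4) as pool:
        return pool.map(tokenize, texts)

if __name__ == "__main__":
    print(parallel_stdlib(["some text", "more text here"]))
```

Note the boilerplate this entails: the worker function must live at module level so it can be pickled, and on platforms that spawn rather than fork, the entry point must be guarded with `if __name__ == "__main__":`.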