Do you need to distribute a heavy Python workload across multiple CPUs or a compute cluster? These seven frameworks are up to the task.
Python is powerful, versatile, and programmer-friendly, but it isn't the fastest programming language around. Some of Python's speed limitations are due to its default implementation, CPython, being effectively single-threaded: CPython's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time.
And while you can use Python's built-in threading module to speed things up, threading only gives you concurrency, not parallelism. It's good for running multiple tasks that aren't CPU-dependent, but does nothing to speed up multiple tasks that each require a full CPU. Python 3.13 introduced a special "free-threading" or "no-GIL" build of the interpreter to allow full parallelism with Python threads, but it's still considered an experimental feature. For now, it's best to assume threading in Python, by default, won't give you parallelism.
Python does include another native way to run a workload across multiple CPUs. The multiprocessing module spins up multiple copies of the Python interpreter, each on a separate core, and provides primitives for splitting tasks across cores. But sometimes even the multiprocessing module isn't enough.
In some cases, the job calls for distributing work not only across multiple cores but also across multiple machines. That's where the Python libraries and frameworks discussed in this article come in. We'll look at seven frameworks you can use to spread an existing Python application and its workload across multiple cores, multiple machines, or both.
The best parallel processing libraries for Python
- Ray: Parallelizes and distributes AI and machine learning workloads across CPUs, machines, and GPUs.
- Dask: Parallelizes Python data science libraries such as NumPy, Pandas, and Scikit-learn.
- Dispy: Executes computations in parallel across multiple processors or machines.
- Pandaral·lel: Parallelizes Pandas across multiple CPUs.
- Ipyparallel: Enables interactive parallel computing with IPython, Jupyter Notebook, and Jupyter Lab.
- Joblib: Executes computations in parallel, with optimizations for NumPy and transparent disk caching of functions and output values.
- Parsl: Supports parallel execution across multiple cores and machines, along with chaining functions together into multi-step workflows.
Ray
Developed by a team of researchers at the University of California, Berkeley, Ray underpins a variety of distributed machine learning libraries. But Ray isn't limited to machine learning tasks alone, even if that was its original intended use. You can break up and distribute any type of Python task across multiple systems with Ray.
Ray's syntax is minimal, so you don't need to rework existing applications extensively to parallelize them. Decorating a function with @ray.remote distributes it across any available nodes in a Ray cluster, with the option to specify parameters for how many CPUs or GPUs to use. The results of each distributed function are returned as Python objects, so they're easy to manage and store, and the amount of copying across or within nodes is minimal. This last feature comes in handy when dealing with NumPy arrays, for instance.
Ray even includes a built-in cluster manager, which can automatically spin up nodes as needed on local hardware or popular cloud computing platforms. Other Ray libraries let you scale common machine learning and data science workloads, so you don't have to manually scaffold them. For instance, Ray Tune lets you perform hyperparameter tuning at scale for most common machine learning systems (PyTorch and TensorFlow, among others). And if all you want to do is scale your use of Python's multiprocessing module, Ray can do that too.
Dask
From the outside, Dask looks a lot like Ray. It, too, is a library for distributed parallel computing in Python, with a built-in task scheduling system, awareness of Python data frameworks like NumPy, and the ability to scale from one machine to many.
One key difference between Dask and Ray is the scheduling mechanism. Dask uses a centralized scheduler that handles all tasks for a cluster. Ray is decentralized, meaning each machine runs its own scheduler, so any issues with a scheduled task are handled at the level of the individual machine, not the whole cluster. Dask's task framework works hand-in-hand with Python's native concurrent.futures interfaces, so for those who've used that library, most of the metaphors for how jobs work should be familiar.
Dask works in two basic ways. The first is by using parallelized data structures: essentially, Dask's own versions of NumPy arrays, lists, or Pandas DataFrames. Swap in the Dask versions of those constructs for their defaults, and Dask will automatically spread their execution across your cluster. This typically involves little more than changing the name of an import, but may sometimes require rewriting to work completely.
The second way is through Dask's low-level parallelization mechanisms, including function decorators that parcel out jobs across nodes and return the results synchronously (in "immediate" mode) or asynchronously ("lazy" mode). You can also mix the modes as needed.
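A quick sketch of the lazy, decorator-based style (the function names here are illustrative):

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Nothing runs yet; the calls only build a task graph
total = add(inc(1), inc(2))
result = total.compute()  # executes the graph, in parallel where possible
print(result)  # 5
```

Because each call just records a node in the task graph, Dask is free to run independent branches (the two inc calls here) side by side when compute() is invoked.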
One Pythonic convenience Dask offers is a memory structure called a bag, essentially a distributed Python list. Bags provide distributed operations (like map and filter) on collections of Python objects, with whatever optimizations can be provided for them. The downside is that any operation that requires a lot of cross-communication between nodes (for example, groupby) won't work as well.
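A short sketch of what working with a bag looks like (the pipeline itself is illustrative):

```python
import dask.bag as db

# A bag behaves like a distributed Python list, split into partitions
b = db.from_sequence(range(10), npartitions=2)

# map/filter/sum build a lazy pipeline; compute() runs it in parallel
result = b.filter(lambda x: x % 2 == 0).map(lambda x: x * 10).sum().compute()
print(result)  # 0 + 20 + 40 + 60 + 80 = 200
```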
Dask also offers a feature called actors. An actor is an object that points to a job on another Dask node. This way, a job that requires a lot of local state can run in-place and be called remotely by other nodes, so the state for the job doesn't have to be replicated. Dask's actor model supports more sophisticated job distribution than Ray can manage. However, Dask's scheduler isn't aware of what actors do, so if an actor runs wild or hangs, the scheduler can't intercede. "High-performing but not resilient" is how the documentation puts it, so actors should be used with care.
Dispy
Dispy lets you distribute whole Python programs or just individual functions across a cluster of machines for parallel execution. It uses platform-native mechanisms for network communication to keep things fast and efficient, so Linux, macOS, and Windows machines work equally well. That makes it a more generic solution than others discussed here, so it's worth a look if you need something that isn't specifically about accelerating machine-learning tasks or a particular data-processing framework.
Dispy syntax somewhat resembles Python's multiprocessing module in that you explicitly create a cluster (where multiprocessing would have you create a process pool), submit work to the cluster, then retrieve the results. Modifying jobs to work with Dispy may require a little extra effort, but you also gain precise control over how those jobs are dispatched and returned. For instance, you can return provisional or partially completed results, transfer files as part of the job distribution process, and use SSL encryption when transferring data.
Pandaral·lel
Pandaral·lel, as the name implies, is a way to parallelize Pandas jobs, in this case across the multiple cores of a single machine. The downside is that Pandaral·lel works only with Pandas. But if Pandas is what you're using, and all you need is a way to accelerate Pandas jobs across multiple cores on a single computer, Pandaral·lel is laser-focused on the task.
Note that while Pandaral·lel does run on Windows, it will run only from Python sessions launched in the Windows Subsystem for Linux. Linux and macOS users can run Pandaral·lel as-is. Also note that Pandaral·lel currently does not have a maintainer; its last formal release was in May 2023.
Ipyparallel
Ipyparallel is another tightly focused multiprocessing and task-distribution system, specifically for parallelizing the execution of Jupyter Notebook code across a cluster. Projects and teams already working in Jupyter can start using Ipyparallel immediately.
Ipyparallel supports many approaches to parallelizing code. On the simple end, thereโs map, which applies any function to a sequence and splits the work evenly across available nodes. For more complex work, you can decorate specific functions to always run remotely or in parallel.
Jupyter notebooks support "magic commands" for actions that are only possible in a notebook environment. Ipyparallel adds a few magic commands of its own. For example, you can prefix any Python statement with %px to automatically parallelize it.
Joblib
Joblib has two major goals: run jobs in parallel, and don't recompute results if nothing has changed. These efficiencies make Joblib well-suited for scientific computing, where reproducible results are sacrosanct. Joblib's documentation provides plenty of examples for how to use all its features.
Joblib syntax for parallelizing work is simple enough: it amounts to a small set of wrappers that split jobs across processors or cache results. Parallel jobs can use threads or processes.
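The basic parallel pattern looks like this minimal sketch (the sqrt workload is illustrative):

```python
from math import sqrt
from joblib import Parallel, delayed

# delayed() captures each call; Parallel runs the batch on two workers
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(8))
print(results)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

Passing prefer="threads" to Parallel switches the backend from processes to threads, which is useful when the work releases the GIL or is I/O-bound.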
Joblib includes a transparent disk cache for Python objects created by compute jobs. This cache not only helps Joblib avoid repeating work, as noted above, but can also be used to suspend and resume long-running jobs, or pick up where a job left off after a crash. The cache is also intelligently optimized for large objects like NumPy arrays. Regions of data can be shared in-memory between processes on the same system by using numpy.memmap. This all makes Joblib highly useful for work that may take a long time to complete, since you can avoid redoing existing work and pause and resume as needed.
One thing Joblib does not offer is a way to distribute jobs across multiple separate computers. In theory, it's possible to use Joblib's pipeline to do this, but it's probably easier to use another framework that supports it natively.
Parsl
Short for "Parallel Scripting Library," Parsl lets you take computing jobs and split them across multiple systems using roughly the same syntax as Python's existing Pool objects. It also lets you stitch together different computing tasks into multi-step workflows, which can run in parallel, in sequence, or via map/reduce operations.
Parsl lets you execute native Python applications, but also run any other external application by way of commands to the shell. Your Python code is written like normal Python code, save for a special function decorator that marks the entry point to your work. The job-submission system also gives you fine-grained control over how things run on the targets: for example, the number of cores per worker, how much memory per worker, CPU affinity controls, how often to poll for timeouts, and so on.
One excellent feature Parsl offers is a set of prebuilt templates to dispatch work to a variety of high-end computing resources. This not only includes staples like AWS or Kubernetes clusters, but supercomputing resources (assuming you have access) like Blue Waters, ASPIRE 1, Frontera, and so on. (Parsl was co-developed with the aid of many of the institutions that built such hardware.)
Conclusions
If you're working with existing, popular machine learning libraries you want to run in a distributed way, Ray is your first-round draft choice. For the same kind of work, but with centralized scheduling (for instance, to have top-down control over the submitted jobs), use Dask. If you want to work specifically with Pandas, use Pandaral·lel; for parallel distribution of work in Jupyter notebooks, use Ipyparallel. And, for more generic Python workloads, Dispy, Joblib, and Parsl can provide parallel task distribution without too many extras.
Python's limitations with threads will continue to evolve, with major changes slated to allow threads to run side-by-side for CPU-bound work. But those updates are years away from being usable. Libraries designed for parallelism can help fill the gap while we wait.


