2053 подписчика

Python multiprocessing pool

9 июня 20259 июн 2025

9 мин

The multiprocessing. Pool class in Python is a powerful tool for parallelizing tasks across multiple CPU cores. It provides a way to distribute work among a pool of worker processes, improving the performance of CPU-bound operations. Here’s a comprehensive overview of how to use multiprocessing. Pool: Key Concepts Process Pool: A collection of worker processes that can execute tasks concurrently. Worker Process: A separate Python process that runs independently and executes tasks submitted to the pool. Task: A unit of work to be performed by a worker process. Typically, this is a function call with specific arguments. Map/Apply/Etc.: Methods to distribute tasks to the workers in the pool. These methods handle the distribution, execution, and collection of results. Synchronization: Necessary to ensure that multiple processes accessing shared resources do so safely (e. g., using locks). Pickling: Python’s mechanism for serializing (converting to a byte stream) and deserializing (converti

Key Concepts

Process Pool: A collection of worker processes that can execute tasks concurrently. Worker Process: A separate Python process that runs independently and executes tasks submitted to the pool. Task: A unit of work to be performed by a worker process. Typically, this is a function call with specific arguments. Map/Apply/Etc.: Methods to distribute tasks to the workers in the pool. These methods handle the distribution, execution, and collection of results. Synchronization: Necessary to ensure that multiple processes accessing shared resources do so safely (e. g., using locks). Pickling: Python’s mechanism for serializing (converting to a byte stream) and deserializing (converting back from a byte stream) objects. This is necessary to pass data between processes. Some objects can’t be pickled (e. g., lambdas or functions defined inside other functions).

Basic Usage

Here’s a simple example demonstrating the basic usage of multiprocessing. Pool:

Import multiprocessing

Import time

Def square(x):

"""Calculates the square of a number, with a slight delay to simulate work."""

time. sleep(1) # Simulate some work

return x * x

If __name__ == ‘__main__’:

# Create a pool of worker processes. The default is the number of CPU cores.

pool = multiprocessing. Pool() # or specify the number: multiprocessing. Pool(processes=4)

# Define the input data (a list of numbers)

numbers = [1, 2, 3, 4, 5]

# Use the `map` function to apply the `square` function to each number in parallel

results = pool. map(square, numbers)

# Close the pool and wait for the worker processes to finish

pool. close()

pool. join()

# Print the results

print(f"Squares: {results}") # Output: Squares: [1, 4, 9, 16, 25]

Explanation

Import Multiprocessing: This imports the necessary module. Define the Task Function: The square(x) function defines the task that each worker process will execute. Importantly, this function must be defined Outside of the if __name__ == ‘__main__’: block (more on that later). Create a Pool Object: multiprocessing. Pool() creates a pool of worker processes. The number of processes defaults to the number of CPU cores on your system. You can specify the number of processes using the processes argument (e. g., multiprocessing. Pool(processes=4)). Use Pool. map(): The pool. map() method applies the square function to each element in the numbers list. It distributes the work among the worker processes in the pool. The first argument is the function to apply, and the second is an iterable (like a list) containing the input data. pool. map waits for all tasks to complete and returns a list containing the results, in the same order as the input. Pool. close(): This prevents any more tasks from being submitted to the pool. It’s crucial to call pool. close() before pool. join(). Pool. join(): This waits for all the worker processes in the pool to finish their tasks before the main program continues. Without pool. join(), the main program might terminate before the worker processes have completed their work. If __name__ == ‘__main__’:: This is a crucial part of multiprocessing on many operating systems (especially Windows). It ensures that the code that creates the pool and submits tasks is only executed when the script is run directly, and not when it’s imported as a module. This prevents infinite recursion when the child processes import the main module.

Important Pool Methods

Map(func, iterable, chunksize=None): Applies func to each element of iterable, collecting the results in a list. The chunksize argument can improve performance by grouping tasks into larger chunks (especially useful for very short tasks). Apply(func, args=(), kwds={}): Applies func to the arguments specified in args (a tuple) and keyword arguments in kwds (a dictionary). This is useful for single tasks. apply Blocks until the result is ready. Apply_async(func, args=(), kwds={}, callback=None): A non-blocking version of apply. It submits the task to the pool and returns an AsyncResult object. The callback argument is a function that will be called with the result when it becomes available. Imap(func, iterable, chunksize=1): Similar to map, but returns an iterator. This can be more memory-efficient if you’re processing a large amount of data, as it doesn’t need to store all the results in memory at once. Imap_unordered(func, iterable, chunksize=1): Like imap, but the results are returned in arbitrary order (as they become available), which can be faster. Starmap(func, iterable, chunksize=None): Like map, but the elements of the iterable are expected to be iterables themselves, and are unpacked as arguments to func. Useful when you have multiple arguments for each task. Close(): Prevents any more tasks from being submitted to the pool. Must be called before join(). Join(): Waits for all worker processes in the pool to complete. Terminate(): Immediately stops all worker processes in the pool without completing pending tasks. Use with caution.

Examples of Other Methods

Import multiprocessing

Import time

Def cube(x, power):

"""Calculates x to the power of power."""

time. sleep(0.5)

return x ** power

Def print_result(result):

"""Callback function to print the result asynchronously."""

print(f"Result: {result}")

If __name__ == ‘__main__’:

pool = multiprocessing. Pool(processes=4)

# apply_async with a callback

async_result = pool. apply_async(cube, args=(2, 3), callback=print_result)

print("Doing other stuff while the cube is calculated…")

time. sleep(1) # Simulate doing other work

# async_result. get() #Uncomment to force wait for result before continuing.

print("Done with other stuff.") #This will print BEFORE the result if async_result. get() is commented out

# imap (returns an iterator)

numbers = [1, 2, 3, 4, 5]

print("Using imap:")

for result in pool. imap(square, numbers):

print(result)

# starmap (for multiple arguments)

data = [(2, 2), (3, 3), (4, 4)] # List of tuples (base, exponent)

print("Using starmap:")

results = pool. starmap(cube, data)

print(results) # Output: [4, 27, 256]

pool. close()

pool. join()

Common Issues and Best Practices

If __name__ == ‘__main__’: Absolutely critical, especially on Windows, to prevent infinite recursion. Global Variables: Avoid using global variables inside the task function, as they can lead to unexpected behavior due to each process having its own memory space. If you need to share data, use multiprocessing. Value, multiprocessing. Array, or multiprocessing. Queue. Pickling Errors: Ensure that the functions and data you’re passing to the pool are picklable. Lambdas and functions defined inside other functions often cannot be pickled. Classes might need special handling (e. g., defining __reduce__). Deadlocks: Be very careful when using locks or other synchronization primitives, as they can lead to deadlocks if not used correctly. Consider using multiprocessing. Lock, multiprocessing. RLock, multiprocessing. Semaphore, or multiprocessing. Condition. Resource Limits: Be mindful of system resource limits (e. g., number of open files) when using a large number of processes. Choosing the Right Method: map is good for simple, independent tasks. starmap is for tasks with multiple arguments. imap and imap_unordered are for memory efficiency. apply is for single, blocking tasks, and apply_async is for single, non-blocking tasks. Error Handling: Implement proper error handling in your task functions to catch exceptions and prevent the entire pool from crashing. Consider using try…except blocks within the task function and logging errors. Chunksize: Experiment with different chunk sizes to optimize performance. A larger chunk size reduces overhead but can increase latency.

Example: Sharing Data (using Multiprocessing. Value)

Import multiprocessing

Import time

Def increment(shared_value, lock):

"""Increments a shared value safely using a lock."""

for _ in range(10000):

with lock: # Acquire the lock before accessing the shared value

shared_value. value += 1

If __name__ == ‘__main__’:

# Create a shared value (initialized to 0)

shared_value = multiprocessing. Value(‘i’, 0) # ‘i’ indicates an integer

# Create a lock to protect the shared value from race conditions

lock = multiprocessing. Lock()

# Create a pool of worker processes

pool = multiprocessing. Pool(processes=4)

# Create a list of processes, each running the increment function

processes = []

for _ in range(4):

p = pool. apply_async(increment, args=(shared_value, lock))

processes. append(p)

# Wait for all processes to complete

pool. close()

pool. join()

# Print the final value of the shared value

print(f"Final value: {shared_value. value}") #Should be 40000

When to Use Multiprocessing. Pool

CPU-bound tasks: Tasks that spend most of their time performing calculations or other CPU-intensive operations. Examples include image processing, scientific simulations, and computationally heavy algorithms. Independent tasks: Tasks that can be executed independently without relying on shared state or communication (or where communication is minimized and carefully synchronized). Large datasets: When processing large datasets, parallelizing the work can significantly reduce the overall processing time.

When Not to Use Multiprocessing. Pool

I/O-bound tasks: Tasks that spend most of their time waiting for I/O operations (e. g., reading from a file, network requests). For I/O-bound tasks, asyncio or threads might be more appropriate. Creating many Processes each blocked on I/O is generally a bad idea. Tasks with significant shared state and communication: If tasks require frequent communication or access to shared state, the overhead of managing multiple processes can outweigh the benefits of parallelism. Consider using threads instead. However, even with threads, be mindful of the Global Interpreter Lock (GIL) in CPython, which limits true parallelism for CPU-bound tasks. Tasks that are already optimized: If your task is already highly optimized and uses all available CPU resources, adding more processes might not improve performance and could even degrade it due to overhead. Simple tasks: For very simple tasks that take very little time, the overhead of creating and managing processes might be greater than the time saved by parallelization.

In summary, multiprocessing. Pool is a valuable tool for parallelizing CPU-bound tasks in Python. By understanding its capabilities and limitations, you can effectively leverage it to improve the performance of your applications. Remember to handle shared state carefully, manage resources effectively, and choose the appropriate method for your specific use case.