Monthly Archives: October 2015

Speeding up an ArcPy script using the Python multiprocessing module

ArcPy is ESRI’s interface to scripted geoprocessing capabilities. While I personally am much more inclined to use the open-source tool stack of OGR/GDAL, I am stuck with ESRI, ArcGIS and ArcPy at my current professional position. One of my more recent projects presented the need of analyzing a number of point features with regard to nearby line features. In detail the line feature with the smallest distance to the point feature had to be retrieved.

A corresponding algorithm could look like this:

  1. Select a point feature.
  2. Create a buffer around the feature using a base size.
  3. Collect all line features that are situated within the selected buffer.
  4. If no line features were found continually increase buffer size and repeat step 3 until one or more line features have been found.
  5. From the collected line features identify the one with the smallest distance to the selected point feature.

The implementation of this process is straightforward.

Problem with this solution is that depending on the number of point objects to analyze it can be painfully inefficient as we linearly process one item at a time. This may not be problematic if the overall count of items is concise, but real-world data tends to be not to. In my case several thousands of objects had to be regarded leading to a processing time of multiple hours. Whilst optimization is something that should not be done prematurely, I regarded this situation definitely as worth looking into in more detail.

A very common approach to optimize a software solution is to split the complete set of tasks into smaller subsets and to process several of those subsets simultaneously. This approach has become particularly viable with the appearance of multi-core hardware that seems predestined to be used in this manner. From a conceptional viewpoint it is necessary that the individual problems do not interfere with each other. This is clearly the case in the described setup as the minimum distance between a point and a number of line feature is obviously independent from another point feature.

Regarding the point features at hand we will create subsets by splitting the complete dataset into parts that are not larger than a certain cutoff size. An original question over here presents a solution that is already feasible for my demands:

This generic function may now be applied to the actual set of input features:

Please note that is not sufficient to simply retrieve minimum and maximum object ids from the presented layer as previous edits or selections may have created gaps between those values. We actually have to collect all available OIDs instead.

Now for Python, the multiprocessing module (among others) comprises several options to implement parallel solutions. One is to setup a pool of worker processes that subsequently are equipped with according tasks. Here we will supply a number of workers the task to retrieve nearest lines to one of the previously created range of point objects. To do so we have to collect all the necessary data first, i.e. sources for point and line layers as well as the range of OIDs:

As you can see we have combined all necessary input data for one chunk into a list object before appending it to an overall list containing the input data for all chunks. This is by design as mapping worker processes to functions only allows one argument. That is also we have to adjust the definition of our worker function:

Finally we create a worker pool and assign each worker a task by mapping it to the function for nearest line retrieval and providing it with the necessary input data:

The Pool constructor allows for several arguments. One may define an initialization function that each worker will carry out after being created, this function may of course be provided with initialization arguments. Both of these arguments are set to None in the case at hand. More interesting is the last argument which is set to 1. This is the maximum number of tasks that will be processed by a single worker process before it is killed and a new one is spawned. As ArcPy functions are notorious for leaking memory it is especially necessary to keep this number low or we may end up with exceptions due to low memory.

Implementing multiprocessing for my use case has decreased processing time to a quarter of the previous duration. Now it is still in the range of hours but at least considerable less than in the linear version of this program.

We will have a look at how analysis tasks like the one described above may be sped up by applying low-level database queries in a later post on the Portolan Blog.