pcb-rnd knowledge pool


The polygon lib on GPU

poly_gpu by Tibor 'Igor2' Palinkas on 2024-06-03

Tags: insight, polygon, poly, clipping, library, GPU


Abstract: Considerations and details on what kinds of algorithms lend themselves to being run on the GPU and whether librnd's polygon clipping lib fits the bill.


How/what to run on GPU

OpenGL and shader basics

The GPU was historically designed mainly for a relatively narrow range of tasks: rendering 3D scenes (textured 3D polygons with lighting onto a 2D frame buffer). The evolution of OpenGL reflects this as it transitioned from a fully hardwired pipeline implementation to a partially programmable one. Programming is done using shaders. Vertex shaders allow programming the 3D-to-2D coordinate transformation while fragment shaders are used for programming texture mapping and color transformation. The purpose and place of these shaders are fixed in the graphics rendering pipeline and their APIs are tied to graphics.

Some GPU implementations also offer compute shaders, which are used for generic computations with a more flexible API. However, this API is rarely available on older or low end GPUs, and even when it is available, the number of compute shader units is typically below the number of fragment shader units. Meanwhile, all GPUs (not older than a decade) offer a large number of vertex and fragment shader units.

How to run anything on the GPU

There are mainly two ways to get the GPU to compute anything:

- use compute shaders, where available, through their generic computation API;
- abuse the ordinary rendering pipeline, running the computation in vertex and fragment shaders.

The more viable path is the latter, as it works on pretty much any GPU that Ringdove users may have. A side project called libgpcogl provides a generic array-to-array computation API using OpenGL and fragment shaders. The library is designed to work with a wide range of OpenGL versions and GPUs; it has been tested with some Nvidia and AMD GPUs produced around 2011 as well as with more modern Intel variants to ensure compatibility.

The fragment shader approach uploads data to the GPU's memory in the form of a texture. Input data comes from 2D arrays, each cell being a data tuple. The "inner loop" of the actual algorithm is then written as a fragment shader (in the OpenGL shading language) and is run on each cell of the output array, with access to all input arrays. The output array is an in-GPU-memory frame buffer, which is again just a 2D array of data tuples. The assumption is that each cell of the output array can be computed independently of the other cells, which also makes the process easy to parallelize: the GPU is free to run multiple parallel instances of the shader program and schedule them to work on different cells of the output array.
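The mental model, expressed in plain C for clarity (a sketch only: the tuple layout and the function names are illustrative, not libgpcogl's actual API):

/* Conceptual model of the fragment shader approach: the GPU in effect
   runs kernel() once per output cell and is free to execute many
   instances in parallel. Names and layout are illustrative only. */
typedef struct { float c[4]; } tuple_t;  /* one RGBA-like data tuple */

/* the "inner loop" body: may read any input cell, writes exactly one
   output cell and has no other side effects */
static tuple_t kernel(const tuple_t *in, int w, int h, int x, int y)
{
	return in[y * w + x]; /* placeholder: the identity computation */
}

static void run_like_a_gpu(const tuple_t *in, tuple_t *out, int w, int h)
{
	int x, y;
	/* on a real GPU these iterations are scheduled on many shader units */
	for(y = 0; y < h; y++)
		for(x = 0; x < w; x++)
			out[y * w + x] = kernel(in, w, h, x, y);
}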

In other words, the library provides a generic, GPU- and OpenGL-independent API for a single task: take one or more input arrays, run a user-supplied per-cell program for every cell of the output array, and deliver the resulting output array.

There are costs associated with preparing/converting input data to the format the GPU can process, then converting the output frame buffer data back from "GPU format". It is also costly to upload input into the GPU memory and to download output from the GPU memory.

What to run on the GPU

Single-step tasks suitable for the GPU are those where:

- each cell of the output array can be computed independently, from the input arrays alone;
- the number of output cells is high;
- the computation performed per cell is heavy enough to amortize the conversion, upload and download costs.

Alternatively, multi-step, iterative algorithms also work fine, since the output array of one step can be reused as an input array of a subsequent step without having to download and re-upload the data (no GPU-CPU memory transfer between steps). An iterative algorithm may work well on the GPU even if a single step of the algorithm is not very GPU-friendly (the number of output cells is low and/or the computation performed per cell is little and simple).
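A minimal sketch of this buffer reuse (ping-pong) pattern in plain C; the arrays stand in for in-GPU frame buffers and step() for one fragment shader run, so the only "upload" and "download" happen outside the loop:

#include <string.h>

#define N 1024

/* placeholder per-cell computation; on the GPU this would be the
   fragment shader of one step */
static void step(const float *in, float *out)
{
	int i;
	for(i = 0; i < N; i++)
		out[i] = in[i] * 0.5f + 1.0f;
}

static void iterate(float *data, int nsteps)
{
	static float buf[N];
	float *a = data, *b = buf, *tmp;
	int s;

	for(s = 0; s < nsteps; s++) {
		step(a, b);              /* stays "in GPU memory" */
		tmp = a; a = b; b = tmp; /* swap roles instead of copying */
	}
	if (a != data)
		memcpy(data, a, sizeof(float) * N); /* one transfer at the end */
}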

A typical simple example of what to run on the GPU is a filter that takes an input 1D or 2D array and computes an output 1D or 2D array where each cell is the average of the adjacent 2 or 4 (or 8) cells. It is suitable because each output cell can be computed independently and depends only on the input.
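As an illustration, the per-cell "inner loop" of such a 4-neighbour averaging filter could look like the following legacy-GLSL fragment shader, stored the usual way as a C string (a sketch; the uniform names and the pre-3.0 built-ins are illustrative, not taken from libgpcogl):

/* 4-neighbour averaging as a fragment shader: each invocation computes
   exactly one output cell, reading only the input texture */
static const char *avg_frag =
	"uniform sampler2D in_tex;\n"
	"uniform vec2 cell; /* size of one cell in texture coords */\n"
	"void main()\n"
	"{\n"
	"	vec2 p = gl_TexCoord[0].st;\n"
	"	gl_FragColor = (texture2D(in_tex, p + vec2(cell.x, 0.0))\n"
	"	              + texture2D(in_tex, p - vec2(cell.x, 0.0))\n"
	"	              + texture2D(in_tex, p + vec2(0.0, cell.y))\n"
	"	              + texture2D(in_tex, p - vec2(0.0, cell.y))) / 4.0;\n"
	"}\n";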

A typical simple example of what the GPU is not good at is taking the same input 1D or 2D array and finding the largest value. It is not suitable because the task is sequential: each input cell needs to be compared against a "largest so far" global state which potentially changes with any of the comparisons, so every step depends on all previous steps.

A workaround is to have a half/quarter size output array where each cell holds the largest value of 2 (or 2*2) adjacent cells of the input array; this reduction is then iterated until a single cell remains. However, this requires approximately 2*N iterations of the comparison on the GPU instead of N iterations with the naive algorithm on the CPU. The GPU can parallelize the task well, but this pays off only with large input, which is expensive to upload.
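One pass of this reduction, sketched in plain C (on the GPU each pass would be one fragment shader run over the half-size output array; for simplicity n is assumed to be a power of two):

/* each output cell is the max of 2 adjacent input cells; the output
   cells are independent, so the GPU can compute them in parallel */
static int reduce_max(const float *in, float *out, int n)
{
	int i;
	for(i = 0; i < n / 2; i++)
		out[i] = (in[2*i] > in[2*i+1]) ? in[2*i] : in[2*i+1];
	return n / 2; /* size of the next pass's input */
}

/* usage: keep calling reduce_max(), ping-ponging the two buffers,
   until the returned size is 1; out[0] is then the largest value */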

How do our polygon clipping libs work

The original algorithm inherited from gEDA/pcb uses a topological approach.

The initial rewrite, polybool, follows the same topological approach. It also adds a few extra steps to handle the problems with finite precision numerics and the self-intersections introduced when rounding to integer output. These parts are also topological.

The second rewrite, polybool2, is going to use a home-grown, very different, but also topological algorithm.

The idea of topological algorithms for polygon clipping is to convert the original two dimensional geometric problem into a fundamentally one dimensional topological problem: coordinates and geometric details are stripped in the very first step, and the rest of the algorithm deals with intersections, the abstract curve segments between those intersections and, especially, their relations. Instead of coordinates and angles, there are orderings and "left-of vs. right-of" relations. This approach makes the algorithm both much simpler and much faster.

Poly clipping vs. GPU

However, such a topological algorithm doesn't map well onto any problem the GPU can solve efficiently, for various reasons:

- the data is a web of relations (orderings, lists, graph-like structures) rather than fixed-size arrays of independent cells;
- the steps are inherently sequential, each depending on the results of the previous one;
- there is very little numeric computation per element that could amortize the conversion and transfer costs.

The only geometrical part is the first step, where all contour-contour intersections are mapped. This is done in approximately O(n*log(n)) time on the CPU using rtrees (n being the number of line segments that make up the input polygon contours). The trivial way to do the same on the GPU is to compare all segments against all other segments with the obvious O(n^2) algorithm. This yields a simple, parallel, GPU-friendly algorithm. However, even with a hundred shader units working in parallel, n^2/100 becomes larger than n*log(n) already around 1000 input segments: at n = 1000, n^2/100 is 10000, roughly the same as n*log2(n), and the gap widens quickly from there. I've run some tests that confirmed this.
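For illustration, the brute force mapping could look like this in plain C (a sketch; on the GPU each (i,j) pair would be one independent output cell and the loop body the fragment shader):

typedef struct { double x, y; } pt_t;
typedef struct { pt_t a, b; } seg_t;

/* twice the signed area of triangle abc; the sign tells which side of
   line ab the point c falls on */
static double orient(pt_t a, pt_t b, pt_t c)
{
	return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* proper (crossing, non-touching) intersection of two segments */
static int seg_isc(seg_t s, seg_t t)
{
	return orient(s.a, s.b, t.a) * orient(s.a, s.b, t.b) < 0
	    && orient(t.a, t.b, s.a) * orient(t.a, t.b, s.b) < 0;
}

/* the trivial O(n^2) all-pairs map: every cell of out is independent,
   so the GPU can parallelize it, but the total work stays n*n/2 tests */
static void map_all_isc(const seg_t *segs, int n, char *out /* n*n */)
{
	int i, j;
	for(i = 0; i < n; i++)
		for(j = i + 1; j < n; j++)
			out[i * n + j] = seg_isc(segs[i], segs[j]);
}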

In theory it would be possible to rewrite the whole algorithm as a series of shader operations (rtree included). This way the same code would execute on the GPU instead of the CPU, and even if no part of it were parallelized, the CPU would be freed up to do something else. Unfortunately, in practice poly clipping is a blocking process: pcb-rnd has to wait until it finishes, so the CPU would only idle the whole time.

Conclusion and future work

My conclusion is that the 2D poly clipping algorithm is not a suitable one for the GPU.

However, the effort put into the research is not lost: the byproduct is libgpcogl, which can be reused in other algorithms that are more GPU-suitable. A long term plan is to develop an electromagnetic simulation, and those algorithms are very much GPU-friendly.

Another thing to explore later on is whether moving our geometric query algorithm (rtree on CPU) to GPU would speed things up.