GPU Computing in C# With Nvidia Alea: When CPU Parallelism Is Not Enough
After the CPU multithreading benchmark beat SQL Server's parallelism by 3.5×, the natural next question was: what's the ceiling? An 8-core CPU at full utilization was fast, but a GPU has hundreds or thousands of cores. For the right computation, the performance difference can be an order of magnitude or more. The question was whether geocoding was the right computation, and whether C# could access GPU compute without requiring a full rewrite to CUDA C.
The answer to both questions was yes. Here's how it worked with the Alea library.
Why GPUs for Data Processing
GPUs are optimized for Single Instruction, Multiple Data (SIMD) style work: the same instruction applied to thousands of data elements at once. (NVIDIA calls its variant SIMT, Single Instruction, Multiple Threads, because each data element gets its own lightweight thread.) Point-in-polygon tests fit this model well: you're running the same geometric containment test against every radar data point, and each test is independent of the others. That's exactly what a GPU is built to do.
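To make that independence concrete, here is a minimal CPU-side sketch of a ray-casting (even-odd) containment test. The names `PointInPolygonSketch` and `Contains` are illustrative, and the flat (lat, lon) vertex layout is an assumption chosen for this example rather than something taken from the production code.

```csharp
using System;

static class PointInPolygonSketch
{
    // Ray-casting (even-odd) containment test on the CPU.
    // vertices holds (lat, lon) pairs: [lat0, lon0, lat1, lon1, ...].
    public static bool Contains(float lat, float lon, float[] vertices)
    {
        bool inside = false;
        int j = vertices.Length - 2; // index of the last vertex
        for (int i = 0; i < vertices.Length; i += 2)
        {
            float latI = vertices[i], lonI = vertices[i + 1];
            float latJ = vertices[j], lonJ = vertices[j + 1];

            // Toggle when edge (j -> i) straddles the point's latitude
            // and crosses that latitude east of the point.
            if (((latI > lat) != (latJ > lat)) &&
                (lon < (lonJ - lonI) * (lat - latI) / (latJ - latI) + lonI))
            {
                inside = !inside;
            }
            j = i;
        }
        return inside;
    }
}
```

Each call reads only its own point plus the shared, read-only polygon, so no call depends on the result of another. That property is what lets the work be handed out as one point per GPU thread.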
CPU parallelism on an 8-core machine gives you 8 worker threads. A high-end NVIDIA GPU in 2015 (the GTX 980 or Quadro K6000 class) had 2,048 to 2,880 CUDA cores. By core count alone that is a 256-360× ratio, though a CUDA core is much simpler and slower-clocked than a CPU core, so practical gains are always far smaller than that number suggests.
Alea: GPU Computing in C#
Alea (originally called Bumblebee, from the QuantAlea project) was a .NET library that compiled C# or F# code to PTX, NVIDIA's intermediate instruction set for CUDA, which the driver then translates into machine code for the installed GPU. It was genuinely elegant: you wrote C# code, decorated it with attributes that marked it as a GPU kernel, and Alea handled the compilation and the memory transfer to and from the GPU.
// NuGet: Alea (install-package Alea)
using Alea;
using Alea.CSharp;

[AOTCompile]
static class GeocodingKernels
{
    // This method runs on the GPU.
    // Each thread processes one data point.
    public static void GeocodeKernel(
        deviceptr<float> latitudes,    // input: latitudes in GPU memory
        deviceptr<float> longitudes,   // input: longitudes in GPU memory
        deviceptr<int> countyFips,     // output: matching county index, or -1
        deviceptr<float> boundaryData, // county boundary vertices (pre-loaded)
        deviceptr<int> boundaryIndex,  // start offset into boundaryData per county;
                                       // needs countyCount + 1 entries
        int countyCount,
        int pointCount)
    {
        var idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= pointCount) return; // the last block may have spare threads

        var lat = latitudes[idx];
        var lon = longitudes[idx];
        countyFips[idx] = -1; // default: no match

        // Point-in-polygon test for each county
        for (int c = 0; c < countyCount; c++)
        {
            var start = boundaryIndex[c];
            var end = boundaryIndex[c + 1];
            if (PointInPolygon(lat, lon, boundaryData, start, end))
            {
                countyFips[idx] = c; // store county index (map to FIPS later)
                break;
            }
        }
    }

    // Runs on GPU: no heap allocation, no virtual dispatch, no exceptions
    static bool PointInPolygon(float lat, float lon,
        deviceptr<float> vertices, int start, int end)
    {
        bool inside = false;
        int j = end - 2; // last vertex (each vertex is 2 floats: lat, lon)
        for (int i = start; i < end; i += 2)
        {
            // Latitude is the first float of each pair, longitude the second.
            float latI = vertices[i], lonI = vertices[i + 1];
            float latJ = vertices[j], lonJ = vertices[j + 1];
            // Toggle when the edge straddles the point's latitude and
            // crosses it east of the point.
            if (((latI > lat) != (latJ > lat)) &&
                (lon < (lonJ - lonI) * (lat - latI) / (latJ - latI) + lonI))
                inside = !inside;
            j = i;
        }
        return inside;
    }
}
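The kernel expects every county boundary in one flat array plus an offset table, so the host has to pack the polygons before uploading them. A minimal sketch of that packing step follows. `BoundaryPacking` and `Pack` are hypothetical names, and it assumes one closed ring per county stored as (lat, lon) pairs; real county geometry with islands or holes would need more bookkeeping.

```csharp
using System;
using System.Collections.Generic;

static class BoundaryPacking
{
    // Flattens per-county vertex arrays into one float array plus an offset table.
    // polygons[c] holds county c's vertices as (lat, lon) pairs.
    // index gets polygons.Count + 1 entries, so county c occupies
    // data[index[c]] up to, but not including, data[index[c + 1]].
    public static void Pack(IList<float[]> polygons, out float[] data, out int[] index)
    {
        index = new int[polygons.Count + 1];
        int total = 0;
        for (int c = 0; c < polygons.Count; c++)
        {
            index[c] = total;
            total += polygons[c].Length;
        }
        index[polygons.Count] = total; // sentinel: end of the last county

        data = new float[total];
        for (int c = 0; c < polygons.Count; c++)
        {
            Array.Copy(polygons[c], 0, data, index[c], polygons[c].Length);
        }
    }
}
```

The extra trailing entry in the offset table is what makes reading `boundaryIndex[c + 1]` safe for the last county.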
Launching the Kernel
// Host code (runs on CPU, launches work on GPU)
var gpu = Gpu.Default;
var pointCount = latitudes.Length;
// boundaryIndex carries one extra trailing entry marking the end of the last county
var countyCount = boundaryIndex.Length - 1;
// Transfer data to GPU memory
using var gpuLatitudes = gpu.AllocateDevice<float>(latitudes);
using var gpuLongitudes = gpu.AllocateDevice<float>(longitudes);
using var gpuFips = gpu.AllocateDevice<int>(pointCount);
using var gpuBoundaryData = gpu.AllocateDevice<float>(boundaryVertices);
using var gpuBoundaryIndex = gpu.AllocateDevice<int>(boundaryIndex);
// Launch: 256 threads per block, enough blocks to cover all points
var blockSize = 256;
var gridSize = (pointCount + blockSize - 1) / blockSize; // ceiling division
var lp = new LaunchParam(gridSize, blockSize);
gpu.Launch(GeocodingKernels.GeocodeKernel, lp,
    gpuLatitudes.Ptr, gpuLongitudes.Ptr, gpuFips.Ptr,
    gpuBoundaryData.Ptr, gpuBoundaryIndex.Ptr,
    countyCount, pointCount);
// Transfer results back to CPU
var results = Gpu.CopyToHost(gpuFips);
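The kernel writes a county index rather than a FIPS code, so one small host-side step remains after the copy back. The sketch below shows one way that lookup could work. `FipsMapping`, `ToFipsCodes`, and the `fipsByCounty` table are hypothetical names, and the table is assumed to be in the same order the counties were packed for the GPU.

```csharp
using System;

static class FipsMapping
{
    // Converts kernel output (a county index, or -1 for no match) into FIPS codes.
    // fipsByCounty[c] is the FIPS code of the county stored at position c.
    public static string[] ToFipsCodes(int[] countyIndexes, string[] fipsByCounty)
    {
        var codes = new string[countyIndexes.Length];
        for (int i = 0; i < countyIndexes.Length; i++)
        {
            int c = countyIndexes[i];
            // Leave unmatched points as null instead of inventing a code.
            codes[i] = (c >= 0 && c < fipsByCounty.Length) ? fipsByCounty[c] : null;
        }
        return codes;
    }
}
```

Keeping string handling on the CPU side means the kernel only ever touches plain `int` and `float` buffers.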
The Results
On a server with an NVIDIA Quadro K4200 (1,344 CUDA cores), the GPU version processed the same geocoding workload at approximately 4.2 million rows per minute — roughly 6.7× faster than the CPU version, and 23× faster than the original SQL Server approach.
The memory transfer overhead (CPU → GPU, GPU → CPU) was the main bottleneck at this scale. For the 400 million row dataset, the effective throughput was lower than the kernel-only throughput because we were processing in batches and paying the transfer cost for each batch. Larger batches amortized the transfer cost better; the sweet spot was around 500,000 rows per GPU launch on the hardware we had.
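The batching arithmetic can be sketched as follows. This is an illustration under assumptions, not the original pipeline: `Batching`, `Batch`, and `Plan` are hypothetical names, and the sketch only plans the row ranges and launch sizes, leaving the actual upload, launch, and download to the caller.

```csharp
using System;
using System.Collections.Generic;

static class Batching
{
    public struct Batch
    {
        public long Offset;  // first row of this batch in the full dataset
        public int Count;    // rows in this batch
        public int GridSize; // blocks needed to cover Count rows
    }

    // Splits totalRows into consecutive batches of at most batchSize rows
    // and computes the grid size for each launch by ceiling division.
    public static List<Batch> Plan(long totalRows, int batchSize, int blockSize)
    {
        var batches = new List<Batch>();
        for (long offset = 0; offset < totalRows; offset += batchSize)
        {
            int count = (int)Math.Min(batchSize, totalRows - offset);
            batches.Add(new Batch
            {
                Offset = offset,
                Count = count,
                GridSize = (count + blockSize - 1) / blockSize
            });
        }
        return batches;
    }
}
```

With 500,000 rows per batch, the 400 million row dataset works out to 800 launches, each paying one upload and one download. A smaller batch size multiplies that fixed cost, which is the trade-off described above.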
When GPU Makes Sense for Data Processing
The GPU approach was worth it for three reasons: the computation was massively parallel (each row is independent), it was computationally intensive (every point loops over counties, and every county test loops over that polygon's vertices), and the dataset was large enough to amortize GPU setup across hundreds of millions of rows. For smaller datasets or simpler computations, the parallel CPU version was faster once setup overhead was counted.
GPU computing in C# is genuinely accessible in 2015 with tools like Alea. If you have a CPU-bound data processing problem that's embarrassingly parallel, it's worth evaluating. The performance ceiling is much higher than CPU threads alone. As always, I'm here to help.