main()
{
for (each unique func1, func3, and inArray1 dataset)
{
"setup" func1, func3, and inArray1...
compute();
}
}
compute()
{
for (each element in inArray1)
{
outArray1[i] = func1(inArray1[i]);
}
// func2 is a well known transformation
// that takes input from outArray1
// and outputs to outArray2
func2(outArray1[], outArray2[]);
for (each element in outArray2)
{
outArray3[i] = func3(outArray2[i]);
}
}
Programmers are given the freedom to implement main, but have absolutely no control over the structure of the compute function other than the ability to implement func1 and func3. The for loops in compute function is executed super fast because it's on dedicated parallel hardware that executes as many elements together at a time.
It's the best thing that happened to parallel programming becuase its structure ensures no data dependency and there's no synchronization primitives needed.
As you can guess by now, func1 is our vertex program, func2 is the GPU rasterizer, and func3 is our fragment program. And it is the way people program real-time graphics right now.