Wednesday, January 28, 2009

The Pipeline

What if I told you that the best thing that happened to parallel programming was the following:

main()
{
    for (each unique func1, func3, and inArray1 dataset)
    {
        "setup" func1, func3, and inArray1...

        compute();
    }
}

compute()
{
    for (each index i in inArray1)
    {
        outArray1[i] = func1(inArray1[i]);
    }

    // func2 is a well known transformation
    // that takes input from outArray1
    // and outputs to outArray2
    func2(outArray1[], outArray2[]);

    for (each index i in outArray2)
    {
        outArray3[i] = func3(outArray2[i]);
    }
}

Programmers are given the freedom to implement main, but have absolutely no control over the structure of the compute function other than the ability to implement func1 and func3. The for loops in the compute function execute super fast because they run on dedicated parallel hardware that processes as many elements at a time as it can.

It's the best thing that happened to parallel programming because its structure ensures there are no data dependencies between elements, so no synchronization primitives are needed.

As you can guess by now, func1 is our vertex program, func2 is the GPU rasterizer, and func3 is our fragment program. And it is the way people program real-time graphics right now.
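
To make the shape of that compute function concrete, here is a minimal CPU-side sketch in C++. It is only an illustration under assumptions of my own: elements are plain floats, Func1 and Func3 are hypothetical type names for the two programmable stages, and the body of func2 is an arbitrary placeholder for the fixed transformation (it is not what a real rasterizer does).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical types for the two stages the programmer gets to write.
using Func1 = std::function<float(float)>;  // plays the role of func1
using Func3 = std::function<float(float)>;  // plays the role of func3

// Placeholder for the fixed stage func2. It emits two outputs per input:
// the input itself and the midpoint to the next element (wrapping around).
// This is an arbitrary stand-in, not real rasterization.
std::vector<float> func2(const std::vector<float>& in)
{
    std::vector<float> out;
    out.reserve(in.size() * 2);
    for (std::size_t i = 0; i < in.size(); ++i) {
        float next = in[(i + 1) % in.size()];
        out.push_back(in[i]);
        out.push_back(0.5f * (in[i] + next));
    }
    return out;
}

// The fixed structure: the caller supplies func1 and func3, nothing else.
std::vector<float> compute(const std::vector<float>& inArray1,
                           const Func1& func1, const Func3& func3)
{
    // Each iteration reads and writes only index i, so no iteration
    // depends on another.
    std::vector<float> outArray1(inArray1.size());
    for (std::size_t i = 0; i < inArray1.size(); ++i)
        outArray1[i] = func1(inArray1[i]);

    std::vector<float> outArray2 = func2(outArray1);

    // Same property here: one independent element per iteration.
    std::vector<float> outArray3(outArray2.size());
    for (std::size_t i = 0; i < outArray2.size(); ++i)
        outArray3[i] = func3(outArray2[i]);
    return outArray3;
}
```

Here the two loops run serially, but since every iteration touches only its own index, they could be spread across as many hardware lanes as are available without any locks, provided func1 and func3 themselves have no side effects.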

Friday, January 16, 2009

The future of real-time rendering

Seems like everyone in the industry is speculating about where real-time rendering is heading these days. People are either really excited about Larrabee or really doubtful that Intel can compete with ATI/NVIDIA. After all, Intel is the same company that brought us the i740 and the beloved Intel IGP (integrated graphics processor). The way I see it, whether Intel succeeds depends on one question: "Do we need more programmability in real-time graphics?" If you believe that the current state of affairs (triangles, maps such as textures, light maps, and normal maps, and vertex and fragment/pixel shaders) is all we need, then it's quite possible that Larrabee will go down in history as another one of Intel's failed attempts. But if DX10 and DX11 level hardware has any appeal to you at all, then what you need to ask is: why am I stuck with the pipeline?

Make no mistake, there was nothing magical about the GPU trumping the CPU in graphics performance back when the GPU was introduced. The GPU is fast for graphics because it's a parallel machine on a chip, and raster graphics is an inherently parallel problem. Everything in the pipeline comes in multiples: vertices, maps, and pixels. In the end, they are all arrays of data. Better yet, they are arrays of data that you perform similar tasks on. GPU architects observed this and rightfully optimized for data and instruction parallelism.

CPUs never had the luxury of building in such parallelism. The CPU's task was to be the conductor, and the trait of a conductor is to branch fast and keep latency low when computing. It never expects to see an array of thousands or millions of elements that it can run the same instructions on. Cache is the CPU's answer to reducing latency: because everything is unpredictable, the more memory it can pack into its cache, the faster it can fetch and thus compute. Given the chance, a CPU architect would rather spend silicon on cache, to improve the odds of reducing latency, than on computation. (The flip side is that the bigger the cache, the slower it is. This is why cache comes in levels.)

The GPU is almost the exact opposite. It needs little cache, because the CPU sends it a stream that is the definition of predictable. The vertex shader sends another predictable stream to the geometry setup unit. The rasterizer sends a stream of quads to the fragment shader. The fragment shader accumulates results in color buffers (tiling these further promotes uniform access). There might be branches along the way through the stages, but you can bet that there's more predictability in each stage than there is randomness.

What's the point of all this? To illustrate that CPU and GPU architects take the same computer architecture classes, and that there isn't some black magic sauce that allows ATI/NVIDIA chips to be fast. It is simply optimizing for the use case.

The post is getting long, so I guess I will rant about being stuck in a pipeline next time...