In my previous post I skimmed over the details of exactly what I meant by translating instructions from CPU to GPU format.
Here is what usually happens: your game calls the various drawing methods, and Direct3D writes those commands into an internal command buffer on the CPU side, without sending anything to the hardware yet. At the end of the frame, when Present is called (the framework does this for you), the driver translates the whole buffer of commands into the native format of your particular GPU and hands it over to the hardware.
Using a CPU profiler, you will see some time spent in your own game code, some time inside the framework drawing methods, and then a suspiciously large chunk of time inside Present.
That last part is where the translation takes place. If your profiler is able to dig below the .NET code to see what is going on in the native layer, this will show up as a mixture of d3d9.dll, kernel32.dll, and the graphics driver for your card (typically nv4_disp.dll or ati2dvag.dll).
You might think measuring how long Present takes would be a good way to see how much translation work your game is causing. Or you may have cottoned on by now that things are rarely so simple :-) There are a couple of other reasons why Present could take a long time: if vsync is enabled, Present will wait for the next vertical retrace before returning, and if the GPU is running behind (ie. you are GPU bound), Present will block until the hardware catches up rather than letting too many frames queue up.
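If you just want a rough number without firing up a full profiler, you can time the call yourself. Here is a minimal sketch, assuming an XNA Game subclass: Game.EndDraw is the framework method that calls Present on your behalf, and the class name is purely for illustration.

    using System.Diagnostics;
    using Microsoft.Xna.Framework;

    public class MyGame : Game
    {
        protected override void EndDraw()
        {
            // base.EndDraw is where the framework calls GraphicsDevice.Present,
            // which in turn is where the buffered commands get translated and
            // handed over to the driver.
            Stopwatch timer = Stopwatch.StartNew();

            base.EndDraw();

            timer.Stop();

            // A big number here might mean translation work, but it might
            // equally mean waiting on vsync or on a busy GPU.
            Debug.WriteLine("Present took " + timer.ElapsedMilliseconds + " ms");
        }
    }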
Because it can mean several different things, profiling the Present call does not tell us anything directly, but it is an important clue which I shall return to in my next post.
People are often surprised to see how long Present can take. They protest: "I would understand if my Draw method was slow (I am drawing a lot of stuff, after all), but surely it is a bug that the framework spends so long in a method I never even directly called?"
This can be confusing because drawing graphics is a play-now, pay-later kind of a deal. The time spent in Present is directly caused by the drawing commands you issued, but the true cost of those commands didn't show up at the time you called them.
There is one case where a drawing command may pay an unduly large cost, and that is if the internal graphics command buffer fills up in the middle of a frame (ie. Charles runs out of room on his piece of paper). If this happens, Direct3D will call into the driver, translating a batch of commands and handing them over to the GPU, without waiting for the final Present call. This will show up in your profile as an arbitrary drawing call taking an unusually long time. If you find yourself wondering why the first 1,000 renderstate changes were almost free, but then the 1,001st took a long time, that call is probably paying a deferred cost for all your previous drawing operations.
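One way to catch this happening, short of a full profiler, is to time individual drawing calls and log any outliers. A rough sketch along those lines, assuming XNA and a DrawIndexedPrimitives-based draw loop; the helper name and the one-millisecond threshold are invented for illustration:

    using System.Diagnostics;
    using Microsoft.Xna.Framework.Graphics;

    static class DrawTimer
    {
        // Wraps a single draw call so that a mid-frame buffer flush shows up
        // as one call taking far longer than its neighbours.
        public static void TimedDraw(GraphicsDevice device, int baseVertex,
                                     int minVertexIndex, int numVertices,
                                     int startIndex, int primitiveCount)
        {
            Stopwatch timer = Stopwatch.StartNew();

            device.DrawIndexedPrimitives(PrimitiveType.TriangleList, baseVertex,
                                         minVertexIndex, numVertices,
                                         startIndex, primitiveCount);

            timer.Stop();

            // Most calls just append to the command buffer and return almost
            // instantly, so anything over a millisecond is worth noting.
            if (timer.ElapsedMilliseconds > 1)
                Debug.WriteLine("Suspiciously slow draw call: " +
                                timer.ElapsedMilliseconds + " ms");
        }
    }

If most calls come back as effectively zero and one suddenly costs several milliseconds, that one is most likely the call that triggered a flush.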
Understanding how this works can teach us some things about graphics drivers. You know how newer drivers often claim to include optimizations that boost overall rendering performance? If you think about it, this only makes sense for games which are CPU bound. If a game is GPU bound, speeding up the translation code in the graphics driver will make no difference, since that CPU code was not the limiting factor in the first place.
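A back-of-envelope model makes this concrete. Treating the CPU and GPU as working in parallel, the frame time is roughly whichever side takes longer; the numbers below are invented purely for illustration.

    static class FrameModel
    {
        // Toy model: CPU and GPU run in parallel, so a frame costs roughly
        // the larger of the two (ignoring details like vsync and sync points).
        public static double FrameTimeMs(double gameCpuMs, double driverCpuMs,
                                         double gpuMs)
        {
            return System.Math.Max(gameCpuMs + driverCpuMs, gpuMs);
        }
    }

    // GPU bound: FrameTimeMs(10, 4, 20) == 20, and a twice-as-fast driver
    // still gives FrameTimeMs(10, 2, 20) == 20. No change at all.
    //
    // CPU bound: FrameTimeMs(16, 6, 15) == 22, and the same driver
    // improvement gives FrameTimeMs(16, 3, 15) == 19. A real win.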
It is also interesting to think about this from the perspective of a GPU hardware designer. One of the big questions faced by silicon designers is how closely to mirror the behavior of the D3D API. If they keep their hardware close to the D3D spec, the translation work will be simple, so their driver won't require much CPU, but this might complicate the silicon and slow down the GPU side of things. On the other hand, if they optimize their silicon purely to be as fast as possible, they are likely to produce a better performing GPU, but at the cost of more complex translation, which will increase the driver CPU load. Benchmarks don't talk about this much (I guess because not many people would understand the distinction), but there are real differences here: the same game can end up CPU bound on one card, yet GPU bound on a different design.