Lost in translation

Originally posted to Shawn Hargreaves Blog on MSDN, Wednesday, April 2, 2008

In my previous post I skimmed over the details of exactly what I meant by translating instructions from CPU to GPU format.

Here is what usually happens:

- Your game calls drawing methods and sets renderstates on the GraphicsDevice
- The XNA Framework validates these calls, then hands them over to native Direct3D
- Direct3D encodes them into a hardware independent command buffer
- When you call Present (or whenever that buffer fills up), Direct3D hands the buffered commands over to the graphics driver
- The driver translates them into whatever proprietary format your particular GPU understands, and sends the results to the card

Using a CPU profiler, you will see:

- Some amount of time in your own Update and Draw code
- A small amount of time in each framework drawing method you call
- A surprisingly large amount of time inside the Present call

That last part is where the translation takes place. If your profiler is able to dig below the .NET code to see what is going on in the native layer, this will show up as a mixture of d3d9.dll, kernel32.dll, and the graphics driver for your card (typically nv4_disp.dll or ati2dvag.dll).
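
In case that pipeline sounds abstract, here is a minimal sketch of the idea in C#. The CommandBufferSketch, Record, and TranslateForGpu names are made up for illustration; the real Direct3D internals are far more involved, but the shape of the cost is the same:

```csharp
// Hypothetical sketch (made-up names, nothing like the real Direct3D internals):
// drawing calls are cheap because they only record commands; the expensive
// translation work is deferred until the frame is presented.
using System.Collections.Generic;

class CommandBufferSketch
{
    List<string> commands = new List<string>();

    // What a drawing call effectively does: append a command and return immediately.
    public void Record(string command)
    {
        commands.Add(command);
    }

    // What Present effectively does: translate every buffered command into the
    // hardware specific format and hand it to the card. This is the work that
    // shows up under d3d9.dll and the driver DLLs in a profiler.
    public void Present()
    {
        foreach (string command in commands)
        {
            TranslateForGpu(command);
        }

        commands.Clear();
    }

    static void TranslateForGpu(string command)
    {
        // Stand-in for the per-command driver work.
    }

    static void Main()
    {
        CommandBufferSketch buffer = new CommandBufferSketch();

        for (int i = 0; i < 1000; i++)
        {
            buffer.Record("SetRenderState");    // cheap
            buffer.Record("DrawPrimitives");    // cheap
        }

        buffer.Present();                       // pays for all of the above
    }
}
```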

You might think measuring how long Present takes would be a good way to see how much translation work your game is causing. Or you may have cottoned on by now that things are rarely so simple :-) There are a couple of other reasons why Present could take a long time:

- If the GPU has fallen behind the CPU, Present will stall until the card catches up enough to accept more commands
- If vsync is enabled, Present will wait for the next vertical retrace before flipping the backbuffer

Because it can mean several different things, profiling the Present call does not tell us anything directly, but it is an important clue which I shall return to in my next post.
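
If you just want a rough number without firing up a profiler, one option is to time the call yourself. This sketch assumes a standard XNA Game subclass, where the virtual EndDraw method is what ends up calling Present; whatever number comes out is still ambiguous for the reasons listed above:

```csharp
// Timing how long the framework spends in Present. A big number here is
// ambiguous: it could be translation work in the driver, waiting for a busy
// GPU to catch up, or waiting for the vertical retrace.
using System.Diagnostics;
using Microsoft.Xna.Framework;

public class Game1 : Game
{
    GraphicsDeviceManager graphics;
    Stopwatch presentTimer = new Stopwatch();

    public Game1()
    {
        graphics = new GraphicsDeviceManager(this);
    }

    protected override void EndDraw()
    {
        presentTimer.Reset();
        presentTimer.Start();

        base.EndDraw();    // this is where the framework calls Present

        presentTimer.Stop();

        Debug.WriteLine("Present took " +
                        presentTimer.Elapsed.TotalMilliseconds + " ms");
    }
}
```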

People are often surprised to see how long Present can take. They protest: "I would understand if my Draw method was slow (I am drawing a lot of stuff, after all), but surely it is a bug that the framework spends so long in this method I never even directly called?"

This can be confusing because drawing graphics is a play-now, pay-later kind of a deal. The time spent in Present is directly caused by the drawing commands you issued, but the true cost of those commands didn't show up at the time you called them.

There is one case where a drawing command may pay an unduly large cost, and this is if the internal graphics command buffer fills up in the middle of a frame (i.e. Charles runs out of room on his piece of paper). If this happens, Direct3D will call into the driver, translating a batch of commands and handing them over to the GPU, without waiting for the final Present call. This will show up in your profile as an arbitrary drawing call taking an unusually long time. If you find yourself wondering why the first 1,000 renderstate changes were almost free, but then the 1,001st took a long time, that call is probably paying a deferred cost for all your previous drawing operations.
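
A quick way to catch this in the act is to time each drawing call individually and log the outliers. This is only a sketch: DrawAll and the drawCalls list are stand-ins for however your game actually issues its drawing, and the 1 ms threshold is arbitrary:

```csharp
// Hypothetical sketch: time each drawing call individually. Most will return
// almost instantly; a call that suddenly takes much longer is probably paying
// the deferred cost of a command buffer flush, not doing anything special itself.
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class DrawCallTimer
{
    // Each Action stands in for whatever your real per-object drawing code does.
    public static void DrawAll(List<Action> drawCalls)
    {
        Stopwatch timer = new Stopwatch();

        for (int i = 0; i < drawCalls.Count; i++)
        {
            timer.Reset();
            timer.Start();

            drawCalls[i]();

            timer.Stop();

            // Flag the outliers: these are usually flushes, not slow draw calls.
            if (timer.Elapsed.TotalMilliseconds > 1)
            {
                Debug.WriteLine("Call " + i + " took " +
                                timer.Elapsed.TotalMilliseconds + " ms");
            }
        }
    }
}
```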

Understanding how this works can teach us some things about graphics drivers. You know how newer drivers often claim to include optimizations that boost overall rendering performance? If you think about it, this only makes sense for games which are CPU bound. If a game is GPU bound, speeding up the translation code in the graphics driver will make no difference, since that CPU code was not the limiting factor in the first place.
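
A back-of-the-envelope example, with made-up numbers, and assuming the CPU and GPU work on different frames in parallel so that frame time is roughly whichever side is slower:

```csharp
using System;

static class FrameTimeExample
{
    static void Main()
    {
        // Made-up numbers: 6 ms of game code plus 4 ms of driver translation on
        // the CPU, versus 18 ms of rendering on the GPU. With the CPU and GPU
        // working in parallel, frame time is roughly whichever side is slower.
        double gameCpu = 6, driverCpu = 4, gpu = 18;

        double before = Math.Max(gameCpu + driverCpu, gpu);       // 18 ms
        double after = Math.Max(gameCpu + driverCpu / 2, gpu);    // still 18 ms

        Console.WriteLine("Before driver optimization: " + before + " ms");
        Console.WriteLine("After halving translation:  " + after + " ms");

        // No change, because this frame is GPU bound. Only a CPU bound game
        // (where gameCpu + driverCpu exceeds gpu) would get any faster.
    }
}
```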

It is also interesting to think about this from the perspective of a GPU hardware designer. One of the big questions faced by silicon designers is how closely to mirror the behavior of the D3D API. If they keep their hardware close to the D3D spec, the translation work will be simple, so their driver won't require much CPU, but this might complicate the silicon and slow down the GPU side of things. On the other hand, if they optimize their silicon purely to be as fast as possible, they are likely to produce a better performing GPU, but at the cost of more complex translation which will increase the driver CPU load. Benchmarks don't talk about this much (I guess because not many people would understand the distinction) but there can actually be differences where one card is more likely to be CPU bound, while a different design tends to be GPU bound.

Blog index   -   Back to my homepage