Santa's production line

I oversimplified when I described the GPU as a single elf named George.

In fact, a modern graphics card has a complex pipeline with hundreds of elves working in parallel. In the same way that the CPU records drawing commands into a buffer, then the GPU processes them while the CPU is free to get on with other work, each of these internal GPU pipeline elves is reading input data from a buffer, doing some computations, then writing output data to another buffer which is consumed by a different elf further down the chain.

This lets us subdivide the concept of being "GPU bound" based on which particular elf is causing the bottleneck. In the same way that optimizing your CPU code makes no difference if you are GPU bound, successfully optimizing GPU rendering depends on knowing which part of the pipeline you are trying to speed up.

So what exactly does happen inside the GPU? The details vary from card to card, but these are the most important stages:

1. The vertex fetch unit reads vertex data from memory
2. The vertex shader processes this data
3. The rasterizer works out which pixels are covered by each triangle
4. The pixel shader calculates the color of each pixel
5. The texture fetch unit looks up any textures that were requested by the pixel shader
6. The depth/stencil unit reads, tests, and updates the depth buffer
7. The framebuffer stores the final output color, and applies alpha blending

Any of these may be your performance bottleneck, and it is tremendously useful to find out which. For instance if we learn our game is limited by vertex shader processing, we know to optimize that rather than wasting time trying to reduce the number of texture fetches. Or if we are limited by pixel shading, we could increase the number of triangles in our models without affecting the framerate!
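
To see why extra triangles can be free when you are pixel bound, here is a toy model. It assumes the stages overlap perfectly, so the slowest one sets the frame time. The per-stage costs are made-up numbers for illustration, not measurements, and real hardware overlaps less cleanly than this.

```python
# Toy model of a GPU pipeline: every stage works in parallel,
# so the frame takes as long as the slowest stage.
# All costs are hypothetical milliseconds per frame.

def frame_time_ms(stage_costs):
    """Idealized frame time: the cost of the slowest stage."""
    return max(stage_costs.values())

def bottleneck(stage_costs):
    """Name of the stage with the highest cost."""
    return max(stage_costs, key=stage_costs.get)

costs = {
    "vertex fetch": 2.0,
    "vertex shader": 3.0,
    "rasterizer": 1.0,
    "pixel shader": 12.0,
    "texture fetch": 6.0,
    "depth/stencil": 2.0,
    "framebuffer": 4.0,
}

before = frame_time_ms(costs)

# Double the triangle count: the vertex stages get twice as expensive,
# while the per-pixel stages are unchanged.
more_triangles = dict(costs)
more_triangles["vertex fetch"] *= 2
more_triangles["vertex shader"] *= 2

after = frame_time_ms(more_triangles)

print(bottleneck(costs), before, after)
```

In this model the pixel shader is still the slowest stage after doubling the vertex work, so the frame time does not move.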

So what factors affect the performance of each pipeline stage?

vertex fetch
- number of vertices
- size of each vertex
- whether vertices are well ordered for cache coherency
vertex shader
- number of vertices
- length of vertex shader program
- whether triangle indices are well ordered for cache coherency
rasterizer
- number of pixels rendered
- number of interpolator values passed from vertex shader to pixel shader
pixel shader
- number of pixels rendered
- length of pixel shader program
texture fetch
- number of pixels rendered
- how many texture lookups per pixel
- amount of texture data read from memory
  - mipmapped textures have way better cache coherency
  - DXT textures are smaller than uncompressed formats
- type of filtering
  - anisotropic is the most expensive
  - trilinear is usually only a little slower than bilinear
  - bilinear and point sampling are often identical
depth/stencil
- number of pixels rendered
- whether multisampling is used
- read/write vs. read-only mode
framebuffer
- number of pixels rendered
- whether multisampling is used
- size of each framebuffer pixel (including MRT)
- read/write (alpha blending) vs. write-only (opaque)

To identify the bottleneck, we need some way of altering just one of these contributing factors without changing our CPU code in any significant way (a change that affected CPU performance as well as GPU performance could invalidate our results).

Try running your game at a tiny resolution, say 100x50. This makes no difference to CPU, vertex fetch, or vertex shader performance. Does the framerate improve?

If reducing the resolution does not affect performance (and assuming you are not CPU bound), your limiting factor must be vertex processing. You can speed up both vertex fetch and vertex shading by using fewer triangles in your models, or you could try to simplify the vertex shader. If you think vertex fetch might be the problem, run your models through a custom processor and use VertexChannelCollection.ConvertChannelContent to compress your vertex data into a PackedVector format. Normalized101010 is good for normals, and you can often get away with HalfVector2 for texture coordinates.
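
As a rough idea of what packing buys you, the arithmetic below compares a hypothetical vertex layout (position, normal, one texture coordinate) before and after compression. The layout is an assumption chosen for illustration; the byte sizes follow from the formats themselves (three or two 32-bit floats, one 32-bit packed value, two 16-bit halves).

```python
# Bytes per vertex element.
ELEMENT_SIZES = {
    "Vector3": 12,          # three 32-bit floats
    "Vector2": 8,           # two 32-bit floats
    "Normalized101010": 4,  # three 10-bit components packed into 32 bits
    "HalfVector2": 4,       # two 16-bit floats
}

def vertex_size(elements):
    """Total size in bytes of one vertex made of the given elements."""
    return sum(ELEMENT_SIZES[name] for name in elements)

# Hypothetical layout: position, normal, texture coordinate.
unpacked = vertex_size(["Vector3", "Vector3", "Vector2"])
packed = vertex_size(["Vector3", "Normalized101010", "HalfVector2"])

saving = 1 - packed / unpacked

print(unpacked, packed, saving)
```

For this layout the vertex shrinks from 32 bytes to 20, so vertex fetch has 37.5% less data to read for the same number of vertices.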

If reducing the resolution speeds things up, you must be limited by one of the pixel processing stages.

Try setting SamplerStates[n].MipMapLevelOfDetailBias to 4 or 5. If you do this right, and assuming you are using mipmaps (if not, add mipmaps straight away and watch your performance improve!) your textures will become blurry. If it boosts performance, you are limited by texture fetch bandwidth, in which case you can speed up your game by enabling DXT compression or using fewer/smaller textures.
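
The reason a bias of 4 or 5 is such a strong test: each mip level halves the width and height, so every level you skip cuts the texel count by four. A small sketch of that arithmetic, assuming a square power-of-two texture with a full mip chain (the 1024x1024 size is just an example):

```python
def biased_mip_size(width, height, bias):
    """Size of the mip level `bias` steps below the given level.

    Each step halves both dimensions, never going below one texel.
    """
    return max(1, width >> bias), max(1, height >> bias)

def texel_reduction(width, height, bias):
    """How many times fewer texels the biased mip level contains."""
    w, h = biased_mip_size(width, height, bias)
    return (width * height) // (w * h)

size_at_bias_4 = biased_mip_size(1024, 1024, 4)
reduction_at_bias_4 = texel_reduction(1024, 1024, 4)
reduction_at_bias_5 = texel_reduction(1024, 1024, 5)

print(size_at_bias_4, reduction_at_bias_4, reduction_at_bias_5)
```

With so little texture data left to read, texture bandwidth is effectively taken out of the picture, so any framerate change can be pinned on it.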

Try changing all your pixel shaders so they just return a constant color. This will affect both pixel shader and texture fetch performance, but since we already tested texture fetching, we can deduce that if this boosts the framerate while the mipmap bias did not, the bottleneck must be pixel shader processing.

Still here? That means your bottleneck must be the rasterizer, the depth/stencil unit, or the framebuffer (stages 3, 6, and 7 in the list above).

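Try enabling multisampling. Going by the list of factors above, this adds work for the depth/stencil and framebuffer stages but not for the rasterizer, so if it makes no difference, you are limited by the rasterizer.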

Try changing the framebuffer to a smaller pixel format such as SurfaceFormat.Bgr565. If this speeds things up, you are limited by framebuffer writes.
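
To put numbers on that, here is the raw color data for one frame in a 32-bit format versus Bgr565. The 1280x720 resolution is an assumed example, and the figures ignore multisampling, the depth buffer, and the extra reads that alpha blending causes.

```python
# Bytes per pixel for two framebuffer formats.
BYTES_PER_PIXEL = {
    "Color": 4,   # 32 bits per pixel
    "Bgr565": 2,  # 16 bits per pixel (5 + 6 + 5)
}

def framebuffer_bytes(width, height, surface_format):
    """Bytes of color data in one full-screen write."""
    return width * height * BYTES_PER_PIXEL[surface_format]

# Assumed resolution, for illustration only.
color_bytes = framebuffer_bytes(1280, 720, "Color")
bgr565_bytes = framebuffer_bytes(1280, 720, "Bgr565")

print(color_bytes, bgr565_bytes)
```

Halving the bytes per pixel halves the data written for every pixel drawn, overdraw included, which is why this test isolates framebuffer bandwidth.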

Otherwise, by process of elimination, it must be the depth/stencil.

Tada!
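
The whole procedure can be written down as a small decision function. This is only a sketch of the steps above: the experiment names are hypothetical labels, you still have to run each experiment and record the result by hand, and it assumes you have already ruled out being CPU bound.

```python
def find_gpu_bottleneck(results):
    """Walk the elimination steps in order.

    `results` maps each experiment (hypothetical names) to True if that
    experiment changed the framerate in the way described.
    """
    # Tiny resolution made no difference: the per-pixel stages are not the limit.
    if not results["tiny_resolution_faster"]:
        return "vertex fetch or vertex shader"
    # A large mipmap bias helped: texture bandwidth is the limit.
    if results["mip_bias_faster"]:
        return "texture fetch"
    # Constant-color shaders helped while the mip bias did not.
    if results["constant_color_shader_faster"]:
        return "pixel shader"
    # Enabling multisampling made no difference.
    if not results["multisampling_changes_framerate"]:
        return "rasterizer"
    # A smaller framebuffer pixel format helped.
    if results["smaller_framebuffer_format_faster"]:
        return "framebuffer"
    # Everything else has been eliminated.
    return "depth/stencil"

example = {
    "tiny_resolution_faster": True,
    "mip_bias_faster": False,
    "constant_color_shader_faster": True,
    "multisampling_changes_framerate": True,
    "smaller_framebuffer_format_faster": False,
}

print(find_gpu_bottleneck(example))
```

The order of the checks matters: the constant-color test only points at the pixel shader because texture fetch was already ruled out by the step before it.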

I was going to write some suggestions about how to optimize for each possible bottleneck, but this post is long enough already. Please ask if you have questions about that...


Originally posted to Shawn Hargreaves Blog on MSDN, Friday, April 11, 2008