I oversimplified when I described the GPU as a single elf named George.
In fact, a modern graphics card has a complex pipeline with hundreds of elves working in parallel. In the same way that the CPU records drawing commands into a buffer, which the GPU then processes while the CPU is free to get on with other work, each of these internal GPU pipeline elves reads input data from a buffer, does some computation, and writes output data to another buffer that is consumed by a different elf further down the chain.
This lets us subdivide the concept of being "GPU bound" based on which particular elf is causing the bottleneck. In the same way that optimizing your CPU code makes no difference if you are GPU bound, successfully optimizing GPU rendering depends on knowing which part of the pipeline you are trying to speed up.
So what exactly happens inside the GPU? The details vary from card to card, but these are the most important stages:

1. Vertex fetch: reads vertex and index data from memory
2. Vertex shader: transforms each vertex
3. Rasterizer: culls, clips, and converts triangles into pixels
4. Texture fetch: reads texel data from memory
5. Pixel shader: computes the color of each pixel
6. Depth/stencil: tests and updates the depth and stencil buffers
7. Framebuffer writes: blends the shaded color into the rendertarget
Any of these may be your performance bottleneck, and it is tremendously useful to find out which. For instance if we learn our game is limited by vertex shader processing, we know to optimize that rather than wasting time trying to reduce the number of texture fetches. Or if we are limited by pixel shading, we could increase the number of triangles in our models without affecting the framerate!
So what factors affect the performance of each pipeline stage? Roughly speaking, vertex fetch and vertex shading scale with the number of vertices, the size of your vertex data, and the complexity of your vertex shader; the rasterizer scales with the number of triangles; texture fetch scales with how many texels you read, so texture size, filtering, compression, and mipmapping all matter; and pixel shading, depth/stencil, and framebuffer writes scale with how many pixels you draw, which is why resolution matters so much to them.

To identify the bottleneck, we need some way of altering just one of these contributing factors at a time, without changing our CPU code in any significant way (if a change affected CPU performance as well as GPU, that could invalidate our results).
Try running your game at a tiny resolution, say 100x50. This makes no difference to the CPU, vertex fetch, or vertex shader performance. Does the framerate improve?
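In XNA you can do this through the GraphicsDeviceManager. A minimal sketch, assuming the standard Game template where graphics is the GraphicsDeviceManager created in your Game constructor:

    // Shrink the back buffer so the pixel-bound stages have almost no work to do.
    graphics.PreferredBackBufferWidth = 100;
    graphics.PreferredBackBufferHeight = 50;
    graphics.ApplyChanges();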
If reducing the resolution does not affect performance (and assuming you are not CPU bound), your limiting factor must be vertex processing. You can speed up both vertex fetch and vertex shading by using fewer triangles in your models, or you could try to simplify the vertex shader. If you think vertex fetch might be the problem, run your models through a custom processor and use VertexChannelCollection.ConvertChannelContent to compress your vertex data into a PackedVector format. Normalized101010 is good for normals, and you can often get away with HalfVector2 for texture coordinates.
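Here is a minimal sketch of such a processor, assuming an XNA content pipeline extension project; the channel names and target formats just follow the suggestions above, so adjust them to match your own vertex data:

    using Microsoft.Xna.Framework.Content.Pipeline;
    using Microsoft.Xna.Framework.Content.Pipeline.Graphics;
    using Microsoft.Xna.Framework.Content.Pipeline.Processors;
    using Microsoft.Xna.Framework.Graphics.PackedVector;

    [ContentProcessor(DisplayName = "Packed Vertex Model Processor")]
    public class PackedVertexModelProcessor : ModelProcessor
    {
        protected override void ProcessVertexChannel(GeometryContent geometry,
                                                     int vertexChannelIndex,
                                                     ContentProcessorContext context)
        {
            string channelName = geometry.Vertices.Channels[vertexChannelIndex].Name;

            if (channelName == VertexChannelNames.Normal())
            {
                // Pack each normal into 32 bits instead of three 32 bit floats.
                geometry.Vertices.Channels.ConvertChannelContent<Normalized101010>(vertexChannelIndex);
            }
            else if (channelName == VertexChannelNames.TextureCoordinate(0))
            {
                // Half precision is usually plenty for texture coordinates.
                geometry.Vertices.Channels.ConvertChannelContent<HalfVector2>(vertexChannelIndex);
            }
            else
            {
                // Leave positions, skinning weights, etc. to the standard processor.
                base.ProcessVertexChannel(geometry, vertexChannelIndex, context);
            }
        }
    }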
If reducing the resolution speeds things up, you must be limited by one of the pixel processing stages.
Try setting SamplerStates[n].MipMapLevelOfDetailBias to 4 or 5. If you do this right, and assuming you are using mipmaps (if not, add mipmaps straight away and watch your performance improve!), your textures will become blurry. If this boosts performance, you are limited by texture fetch bandwidth, in which case you can speed up your game by enabling DXT compression or using fewer/smaller textures.
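For example (a sketch, assuming the XNA 3.x settable sampler state API; repeat for whichever sampler slots your shaders actually use):

    // Bias sampler 0 toward much smaller mip levels, so texture fetches read far less data.
    GraphicsDevice.SamplerStates[0].MipMapLevelOfDetailBias = 4.0f;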
Try changing all your pixel shaders so they just return a constant color. This will affect both pixel shader and texture fetch performance, but since we already tested texture fetching, we can deduce that if this boosts the framerate while the mipmap bias did not, the bottleneck must be pixel shader processing.
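One quick way to run this test, assuming your models allow their effects to be swapped, is to point every mesh part at a trivial debug effect. The "ConstantColor" asset name here is hypothetical, and its pixel shader does nothing but return a solid color:

    // Replace every material with a hypothetical debug effect whose pixel
    // shader just does "return float4(1, 0, 1, 1);".
    Effect constantColor = Content.Load<Effect>("ConstantColor");

    foreach (ModelMesh mesh in myModel.Meshes)
    {
        foreach (ModelMeshPart part in mesh.MeshParts)
        {
            part.Effect = constantColor;
        }
    }

Note this swaps the whole effect, vertex shader included, so for the cleanest test you may prefer to edit just the pixel shaders in your existing .fx files instead.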
Still here? That means your bottleneck must be #3 (the rasterizer), #6 (depth/stencil), or #7 (framebuffer writes).
Try enabling multisampling. This multiplies the per-pixel depth/stencil and framebuffer work by the sample count, while the per-triangle rasterizer cost stays the same, so if it makes no difference to your framerate, you are limited by the rasterizer.
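With the same graphics manager as in the earlier resolution sketch, this is a one-liner (set it before the device is created, for example in your Game constructor):

    // Request a multisampled back buffer.
    graphics.PreferMultiSampling = true;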
Try changing the framebuffer to a smaller pixel format such as SurfaceFormat.Bgr565. If this speeds things up, you are limited by framebuffer writes.
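And likewise for the back buffer format:

    // Ask for a 16 bit back buffer, halving the bandwidth used by framebuffer writes.
    graphics.PreferredBackBufferFormat = SurfaceFormat.Bgr565;
    graphics.ApplyChanges();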
Otherwise, by process of elimination, it must be the depth/stencil.
Tada!
I was going to write some suggestions about how to optimize for each possible bottleneck, but this post is long enough already. Please ask if you have questions about that...