I'm cross-posting this discussion from an internal Microsoft
mailing list, because I'm so awesomely cool that I just can't bear the
thought of everything I ever wrote not being indexed and archived for
posterity :-)
My reply to a question about hardware instancing on Xbox 360:
The 360 doesn’t support vertex stream frequency in the same way that
DX9 SM 3.0 does. It just provides the vfetch instruction, which you can
use to implement all kinds of crazy addressing schemes.
I’m familiar with several good ways to implement instancing using vfetch:
- The technique used in our sample
  - Replicate and offset your index data
  - Use index%freq to index into the vertex buffer
  - Use index/freq to select the instance transform from a shader constant array
- Using multiple vertex streams
  - Replicate and offset your index data
  - Use index%freq to index into vertex buffer #1
  - Use index/freq to select the instance transform from vertex buffer #2
  - Upside: no longer limited by shader constant registers
  - Downside: instance data now needs to be set into a dynamic VB, so you have to deal with the complexity of managing that to avoid stalls (which can be a pain since Xbox doesn’t support the Discard semantic for SetData)
- Without index replication
  - Draw geometry using a non-indexed API call, so the GPU just generates steadily incrementing index values
  - Store your real index values in vertex stream #1
  - Store vertex data in stream #2
  - Store instance data in either constant registers or stream #3
  - Upside: no longer need to replicate any geometry data at all (thus saves memory)
  - Downside: disables post T&L vertex caching (thus increases vertex processing workload)
- Store instance data in a texture
  - This could be combined with any of the above schemes
  - Great for animations: you can encode all the frames for all the bones of a skinned animation, plus the current position of each instance, into a single texture
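
The index arithmetic behind the first three schemes is easier to see with numbers, so here is a rough CPU-side sketch in Python. It is not shader code and not the code from the sample. It assumes that freq means the number of vertices per instance in the replicated-index schemes and the number of indices per instance in the non-indexed one, and it reduces the instance transform to a 2D offset. All function names are made up for illustration.

```python
# CPU-side model only: on the GPU this arithmetic would sit in the vertex
# shader, with the list lookups done by vfetch. All names are hypothetical.

def replicate_indices(indices, vertex_count, instance_count):
    """Copy the index buffer once per instance, offsetting each copy by
    vertex_count so the instance number can be recovered from the index."""
    return [i + inst * vertex_count
            for inst in range(instance_count)
            for i in indices]

def decode(index, freq):
    """Split a replicated index into (instance, vertex)."""
    return index // freq, index % freq

def fetch_indexed(replicated, vertices, instance_offsets):
    """Replicated-index schemes: index % freq picks the vertex,
    index // freq picks the instance data."""
    freq = len(vertices)  # assumption: freq = vertices per instance
    out = []
    for index in replicated:
        inst, vert = decode(index, freq)
        x, y = vertices[vert]
        dx, dy = instance_offsets[inst]  # stand-in for a full transform
        out.append((x + dx, y + dy))
    return out

def fetch_non_indexed(indices, vertices, instance_offsets):
    """Non-indexed scheme: the GPU just counts 0, 1, 2, ... and the real
    index is fetched by hand from its own stream."""
    freq = len(indices)  # assumption: freq = indices per instance here
    out = []
    for counter in range(freq * len(instance_offsets)):
        inst = counter // freq
        vert = indices[counter % freq]  # manual index fetch (stream #1)
        x, y = vertices[vert]           # vertex data (stream #2)
        dx, dy = instance_offsets[inst]
        out.append((x + dx, y + dy))
    return out
```

For a quad with indices [0, 1, 2, 2, 1, 3] and two instances, replicate_indices produces twelve indices where the second copy is offset by four, and both fetch functions produce the same stream of positions. The difference is that the non-indexed version never stores the replicated index buffer, at the cost of processing every vertex once per reference.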
Chris Tector suggested a cunning fifth option:
There is 2a: indirect your transform indices.
Store a transform index vertex buffer which holds 1 DWORD index of
which transform to use on an instance. Then you can avoid the lock
stalls by playing dirty and never locking. You write a modified
transform to a not in use location in the transform vertex buffer. Then
you rewrite the index to point to the newly written transform. You’re
relying on atomic updates of the single DWORD transform index. So:
- Replicate and offset your index data
- Use index%freq to index into vertex buffer #1
- Use index/freq to select the instance transform index from vertex buffer #2
- Use instance transform index to select a transform from vertex buffer #3
- Upside: no longer limited by shader constant registers and no longer stall prone, with less buffer juggling required
- Downside: extra vertex fetch indirection, but same values fetched for every vertex so they should stay nice and warm in the vertex cache
Since I haven’t tried it in GS, my question is a more general
“loose” multi-threading one. Is this possible? Can I play dirty like
that in safe only managed land? I’m guessing no since I don’t ever get
the pointer to the VB memory.
To which I replied:
That should work in GS. You don’t get a raw
pointer to the VB memory, but you can use SetData with the NoOverwrite
semantic to update pieces of a dynamic VB without a stall.
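
The write-then-redirect ordering that Chris describes can be modelled on the CPU as follows. This is a rough Python sketch, not XNA code: the class and its fields are hypothetical, plain lists stand in for vertex buffers #2 and #3, and transforms are reduced to 2D offsets. It assumes the transform buffer has at least one spare slot, and it does not model GPU timing or the actual SetData calls.

```python
class IndirectTransforms:
    """Hypothetical CPU-side model of the transform-index indirection.

    transform_slots stands in for vertex buffer #3 and holds more slots
    than there are instances, so some slot is always unreferenced.
    transform_index stands in for vertex buffer #2: one integer per
    instance, naming the slot that instance currently uses.
    """

    def __init__(self, instance_count, spare_slots=1):
        total = instance_count + spare_slots
        self.transform_slots = [(0.0, 0.0)] * total
        self.transform_index = list(range(instance_count))
        self.free_slots = list(range(instance_count, total))

    def update(self, instance, transform):
        # Step 1: write the new transform into a slot no index points at,
        # so nothing that is being read gets modified.
        new_slot = self.free_slots.pop()
        self.transform_slots[new_slot] = transform
        # Step 2: redirect the instance with one small write. The scheme
        # relies on this single-value update being atomic.
        old_slot = self.transform_index[instance]
        self.transform_index[instance] = new_slot
        # The old slot goes to the back of the queue for reuse. When it is
        # really safe to overwrite depends on GPU timing, which this
        # sketch does not model.
        self.free_slots.insert(0, old_slot)

    def fetch(self, instance):
        # The extra indirection: index buffer first, then the transform.
        return self.transform_slots[self.transform_index[instance]]
```

Each update touches two small regions (one transform slot and one index entry), which is the kind of piecewise write that a no-overwrite update is meant for.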