Generating Shaders From HLSL Fragments

By Shawn Hargreaves

Shaders are cool. You can do all sorts of interesting things with them: this and previous ShaderX books are full of examples. Alongside their power, however, programmable shaders can lead to an explosion of permutations: my last Xbox game contained 89 different pixel shaders, and my current project already has far more. Many of these shaders are variations on a few basic themes, for instance level-of-detail approximations of a material, or the same lighting model both with and without animation skinning. The total number of combinations is huge and is increasing all the time. Typing everything out by hand would be time consuming, error prone, and a maintenance nightmare.

This article will describe how to automatically generate large numbers of shader permutations from a smaller set of handwritten input fragments.

Uber Shaders

A common solution to the permutation problem is to write a single shader that implements a superset of all desired behaviours, and to let the application disable whatever elements are not currently required. This can be achieved in various ways:

It could be as simple as setting shader constants so as to ignore the effects of any unwanted calculations. However this wastes GPU horsepower, as the data is still actually being calculated before it is thrown away.
Static flow control instructions in vs 2.0 and ps 2.x are ideal for this task, and take no time to execute at least on some hardware. Complicated control flows will inevitably limit the ability of the optimiser to understand what is going on, however, and if you try to cram too many features into a single shader, instruction count limits can be a problem.
It can be done entirely as a preprocess, using #ifdef blocks to enable and disable various parts of the code, and compiling the same source multiple times with different preprocessor settings to generate all the different permutations.
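To give a feel for how quickly the preprocessor approach multiplies, here is a minimal sketch that builds the block of #define lines for every on/off combination of a feature list. The function name and the feature names in the usage note are made up for illustration; this is not code from any real engine.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns one block of #define lines for every
// on/off combination of the given feature names (2^N blocks in total).
// Assumes fewer than 32 features so the bitmask fits in an unsigned int.
std::vector<std::string> enumeratePermutations(const std::vector<std::string>& features)
{
    std::vector<std::string> result;
    const unsigned count = 1u << features.size();
    for (unsigned mask = 0; mask < count; ++mask)
    {
        std::string defines;
        for (std::size_t i = 0; i < features.size(); ++i)
            if (mask & (1u << i))
                defines += "#define " + features[i] + "\n";
        result.push_back(defines);
    }
    return result;
}
```

Each returned block would be prepended to the uber shader source before compiling it. Two features (say SKINNING and FOG) give four variants, ten features give 1024.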

The problem with these approaches is that they require all your shader techniques to be merged into a single monolithic program. The core lighting model, one-off special effects, debugging visualisation aids, and things you experimented with six months ago and then discarded, all get tangled up to the point where you dare not change anything for fear of breaking the entire edifice. Not exactly what is generally accepted as good coding practice!

Micro Shaders

An alternative approach is to write many small fragments of shader code, and then concatenate these into various combinations. This could be as simple as performing a strcat() call to combine bits of shader source code. Alternatively, you can use tools like NVLink or the D3DX fragment linker to merge fragments of already-assembled shader microcode.

Back in the days of shader 1.x, at Climax we used the C preprocessor to #include source fragments in a suitable order. For instance, here is the highest quality version of the character vertex shader from MotoGP:

    #define WANT_NORMAL

    #include "animate.vsi"
    #include "transform.vsi"
    #include "light.vsi"
    #include "fog.vsi"
    #include "radiosity.vsi"
    #include "envmap.vsi"

    mov oT0.xy, iTex0

This approach worked reasonably well for simple shaders, but it was hard to keep track of which input values and registers were used by which fragment. To make things more scalable and robust, some kind of automated register allocation was needed. Fortunately, this is exactly what HLSL does for us!

High-level shader languages are the greatest boon imaginable to anyone trying to generate shader code programmatically. When two fragments want to share a piece of data, they just need to refer to it by the same variable name, and then the compiler figures out what register to put it in. When each fragment wants its own private piece of data, a trivial string substitution is enough to mangle the variable names so the compiler sees two different symbols, and will hence allocate two different registers. Perhaps most important of all is that the HLSL compiler does an extremely good job of removing dead or redundant code.

It is common for one shader fragment to calculate several intermediate values, only for a later fragment to overwrite all but one of these with different data. Likewise, several fragments may independently perform the same calculation, such as transforming the input normal to view space. It would be a hassle to manually detect and remove this kind of redundancy, but fortunately there is no need for this. The fragment combiner only has to say what it means, ignoring any duplicate or unused calculations that may result, as the compiler can be trusted to make the details efficient.

In the next section we will describe the main concepts behind our approach to generating shader permutations from HLSL fragments.

HLSL Fragments

We store each shader fragment as a text file, which contains pieces of shader code along with an interface block defining the required usage context.

The following example shows one of the simplest possible fragments, which is a single 2D colormap texture:

    interface()
    {
        $name = base_texture
        $textures = color_map
        $vertex = uv
        $interpolators = uv
        $uv = 2
    }

    ps 1_1

    void main(INPUT input, inout OUTPUT output)
    {
        output.color = tex2D(color_map, input.uv);
    }

We reused an in-house Climax script parser to read the interface block, but this format could just as easily be XML, or a simple "var=value" file.

Pixel and vertex processing is linked together, so each fragment contains both pixel and vertex shader code in a single file. When multiple fragments are concatenated, the final pixel and vertex shaders are generated in parallel. This linkage presents a convenient interface to the outside world, which also removes the potential error of selecting a mismatched shader pair. It also makes it easy to optimise code by moving calculations back and forth between the vertex and pixel units. However it does sometimes produce redundant outputs, because many different pixel shaders often share the same vertex shader. This problem can be handled externally to the generation system by merging duplicate compiled shaders if these have identical token streams.

The above fragment does not include any vertex shader code. In this case our framework will generate a standard pass-through vertex function, which simply copies each input straight across to the output. This is an increasingly common case as more and more processing tends to be done on a per-pixel basis.
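The merging of duplicate compiled shaders described above can be sketched as a small lookup keyed on the token stream. This is only an illustration: the ShaderCache class and its method names are hypothetical, and a real implementation would store the created shader object rather than an integer id.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using TokenStream = std::vector<std::uint32_t>;

// Hypothetical cache: hands back the same id for any two compiled
// shaders whose token streams are identical.
class ShaderCache
{
public:
    int intern(const TokenStream& tokens)
    {
        auto it = ids_.find(tokens);
        if (it != ids_.end())
            return it->second;              // duplicate: reuse the existing entry

        const int id = static_cast<int>(ids_.size());
        ids_.emplace(tokens, id);
        return id;
    }

    std::size_t uniqueCount() const { return ids_.size(); }

private:
    std::map<TokenStream, int> ids_;        // compares streams token by token
};
```

Many generated permutations that differ only in their pixel shader would then resolve to a single shared vertex shader entry.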

Code Generation

During development, shaders are generated and compiled the first time each combination is requested by the engine, but the results can be saved to disk in order to avoid this runtime overhead in the final product. The generation process tries to compile to the lowest possible shader version first, and then tries higher versions if that fails to compile. An example of this is a concatenation of many small ps 1.1 fragments, which could produce a shader too long to work in the 1.1 model. Individual fragments can also label themselves as requiring a specific minimum version, so if fragments use features specific to ps 2.0 or 3.0, we do not have to waste time trying to compile for earlier models.
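The "lowest version first" strategy might look something like the following sketch. The tryCompile callback is a stand-in for whatever compiler interface the engine uses, and the profile list and minimumIndex parameter are assumptions made for the example.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Tries each target profile in ascending order, starting at the minimum
// that the fragments asked for. Returns the first profile that compiles,
// or an empty string if none of them does.
std::string pickLowestProfile(
    const std::vector<std::string>& profiles,   // ordered lowest to highest
    std::size_t minimumIndex,                   // skip anything below this
    const std::function<bool(const std::string&)>& tryCompile)
{
    for (std::size_t i = minimumIndex; i < profiles.size(); ++i)
        if (tryCompile(profiles[i]))
            return profiles[i];

    return std::string();
}
```

The returned profile, together with the compiled result, is what would be written to disk so that the final product never has to repeat the search.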

In the interface block, shader fragments report what resources they require:

The "params" statement lists any constant registers used by the fragment.
The "textures" entry declares what texture samplers it will use.
The "vertex" statement describes the format of the vertex shader input data.
The "interpolators" line declares what data needs to be output from the vertex shader and input to the pixel shader.

Any of these declarations can be annotated with type information, metadata allowing editing tools to handle materials in a sensible way, and conditional tests, in case the fragment wants to adapt itself depending on the context in which it is being used. As a very simple example, the fragment shown above declares that the "uv" vertex input and interpolator channel is a 2-component vector.

Given a list of fragments, we generate a complete shader by performing a number of textual search and replace operations. There is no need to actually parse the syntax of the HLSL code, because our goal is to combine the fragments, rather than to compile them directly ourselves!

We will show how our framework works by considering a simple example. Let's say we want to concatenate the "base_texture" fragment shown earlier with an equally simple "detail_tex" fragment:

    interface()
    {
        $name = detail_tex
        $textures = detail_map
        $vertex = uv
        $interpolators = uv
        $uv = 2
    }

    ps 1_1

    void main(INPUT input, inout OUTPUT output)
    {
        output.color.rgb *= tex2D(detail_map, input.uv) * 2;
    }

The first step is to output all the constants and samplers required by each fragment. For the pixel shader, neither fragment has requested any constant inputs, but they both want one texture sampler. Names must be mangled to avoid conflicts. Attaching the fragment index alone would be enough to make them unique, but the generated code is more readable if the prefix also includes the fragment name. The resulting code is:

    // base_texture0 textures
    sampler base_texture0_color_map;

    // detail_tex1 textures
    sampler detail_tex1_detail_map;
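As a rough illustration of the mangling step, the sketch below prefixes whole-identifier occurrences of a symbol with the fragment name and index. The function names are invented for this example, and it assumes that matching on identifier boundaries is good enough, with no real parsing of the HLSL.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Builds the prefix used in the generated code, e.g. "base_texture0_".
std::string manglePrefix(const std::string& fragment, int index)
{
    return fragment + std::to_string(index) + "_";
}

// Inserts prefix in front of every whole-identifier occurrence of symbol.
// Occurrences embedded in a longer identifier are left untouched.
std::string mangleSymbol(const std::string& code, const std::string& symbol,
                         const std::string& prefix)
{
    if (symbol.empty())
        return code;

    auto isIdent = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };

    std::string out;
    std::size_t pos = 0;
    for (;;)
    {
        const std::size_t hit = code.find(symbol, pos);
        if (hit == std::string::npos)
            break;
        const std::size_t end = hit + symbol.size();
        const bool whole = (hit == 0 || !isIdent(code[hit - 1])) &&
                           (end == code.size() || !isIdent(code[end]));
        out.append(code, pos, hit - pos);
        if (whole)
            out += prefix;
        out += symbol;
        pos = end;
    }
    out.append(code, pos, std::string::npos);
    return out;
}
```

Applied to the body of the base_texture fragment with the symbol color_map, this yields the base_texture0_color_map reference seen in the generated code.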

Next, the input structure is declared. This is built by concatenating the data requested by each fragment, and allocating usage indices to avoid conflicts. Each fragment gets a nested structure declaration, again with mangled names:

    // -------- input structures --------
    struct base_texture0_INPUT
    {
        float2 uv : TEXCOORD0;
    };

    struct detail_tex1_INPUT
    {
        float2 uv : TEXCOORD1;
    };

    struct INPUT
    {
        base_texture0_INPUT base_texture0;
        detail_tex1_INPUT detail_tex1;
    };

    INPUT gInput;

Due to an inconsistency in shader versions prior to 3.0, color and texture interpolators are not interchangeable: the two color interpolators have a limited range and precision. For this reason, fragments should prefer to use color interpolators whenever possible, and leave the more powerful texture interpolators to fragments that really require the extra precision. This becomes a problem when the concatenation of fragments requires more color interpolators than are available. Therefore the allocator needs to be somewhat flexible. It can never assign a texture interpolator request to a color channel because of the limited range, but if the color channels run out, it can use texture interpolators to satisfy any further color requests.
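The allocation rule just described can be sketched as a two-pass allocator. This is an assumed design rather than the actual implementation: texture requests are served first so that spilled color requests never steal a channel from them, and the channel limits are passed in as parameters because they depend on the shader model.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Color, Texture };

// Returns one semantic per request, in request order. An empty string
// means the request could not be satisfied.
std::vector<std::string> allocateInterpolators(const std::vector<Kind>& requests,
                                               int colorLimit, int texLimit)
{
    std::vector<std::string> semantics(requests.size());
    int nextColor = 0;
    int nextTex = 0;

    // Pass 1: texture requests may only ever use texture channels.
    for (std::size_t i = 0; i < requests.size(); ++i)
        if (requests[i] == Kind::Texture && nextTex < texLimit)
            semantics[i] = "TEXCOORD" + std::to_string(nextTex++);

    // Pass 2: color requests prefer color channels, then spill over
    // into whatever texture channels are still free.
    for (std::size_t i = 0; i < requests.size(); ++i)
    {
        if (requests[i] != Kind::Color)
            continue;
        if (nextColor < colorLimit)
            semantics[i] = "COLOR" + std::to_string(nextColor++);
        else if (nextTex < texLimit)
            semantics[i] = "TEXCOORD" + std::to_string(nextTex++);
    }
    return semantics;
}
```

With two color channels available, a third color request ends up in the first texture channel left over after all texture requests have been placed.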

The vertex shader output structure is a duplicate of the pixel shader input, while the pixel shader output (in the absence of any fragments that use multiple rendertargets or oDepth) is very simple:

    // -------- output type --------
    struct OUTPUT
    {
        float4 color : COLOR0;
    };

The core of the shader program is a block copy of the HLSL code for each fragment, with mangled function, structure, and variable names:

    // -------- shader base_texture0 --------
    void base_texture0_main(base_texture0_INPUT input, inout OUTPUT   
                            output)
    {
        output.color = tex2D(base_texture0_color_map, input.uv);
    }

    // -------- shader detail_tex1 --------
    void detail_tex1_main(detail_tex1_INPUT input, inout OUTPUT
                          output)
    {
        output.color.rgb *= tex2D(detail_tex1_detail_map, 
                                  input.uv) * 2;
    }

Finally, the main body of the shader is generated, which simply calls each of the fragments in turn:

    // -------- entrypoint --------
    OUTPUT main(const INPUT i)
    {
        gInput = i;

        OUTPUT output = (OUTPUT)0;

        base_texture0_main(gInput.base_texture0, output);
        detail_tex1_main(gInput.detail_tex1, output);

        return output;
    }

The global gInput structure is unimportant in this example, but it can be useful in a few unusual situations where one fragment wants to access the inputs of another.
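Generating the entrypoint is little more than a loop over the fragment list. The following sketch emits the same text as the example above; the function name is hypothetical, and it assumes pixel shader fragments that all take the (INPUT, inout OUTPUT) signature.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Emits the HLSL entrypoint, calling each fragment's mangled main
// function in chain order.
std::string generateEntryPoint(const std::vector<std::string>& fragments)
{
    std::string code =
        "OUTPUT main(const INPUT i)\n"
        "{\n"
        "    gInput = i;\n"
        "\n"
        "    OUTPUT output = (OUTPUT)0;\n"
        "\n";

    for (std::size_t n = 0; n < fragments.size(); ++n)
    {
        // Fragment name plus chain index, e.g. "detail_tex1".
        const std::string mangled = fragments[n] + std::to_string(n);
        code += "    " + mangled + "_main(gInput." + mangled + ", output);\n";
    }

    code += "\n    return output;\n}\n";
    return code;
}
```

Because the index is part of the mangled name, the same fragment can appear several times in the list without any clash.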

This entire process may seem like a ridiculous amount of work, with an excessive amount of code generated, especially when you consider that it compiles down to something as small as:

      ps_1_1
      tex t0
      tex t1
      mul_x2 r0.xyz, t1, t0
    + mov r0.w, t0.w

But that would be missing the point: the size of the intermediate code is unimportant as long as the input fragments are easy to write, and as long as the eventual compiled code is efficient.

The real advantage of our system is that with no extra effort, we can now generate shaders that apply, for example, two, three, or more detail textures on top of each other, using the same shader fragment for each layer. We can also combine detail textures with whatever other fragments we might write in the future, without ever having to reimplement that particular shader behaviour.

Shade Trees

Offline rendering systems, such as Maya's Hypershade material editor, often describe their shaders as a tree or graph of pluggable components, and allow the user to connect the inputs and outputs in whatever way they desire. Our framework is very basic in comparison, being just a linear chain of operations, which is in many ways reminiscent of the old DX7 texture cascade.

We justify this simple design based on the type of scenarios in which shaders are most commonly employed. There are three typical patterns:

Some shaders are requested purely by code, for drawing a specific graphical effect such as a particle system or explosion.
Some shaders are created by artists, combining different material fragments in an editing tool.
Perhaps most often, the core of a shader is created by artists, but the runtime code may then want to modify this, for instance by adding some lighting, fogging, or animation fragments to the end of the chain.

In the first case, where shader descriptions are constructed by code, linear structures are significantly easier to work with. C++ has powerful grammatical features for declaring lists and arrays, but lacks any direct way of embedding trees into source code. Statements like:

    setShader(ShaderList(ST::base_texture,
                         ST::detail_texture,
                         ST::normalmap,
                         ST::fresnel_envmap,
                         ST::light_specular,
                         ST::fog,
                         ST::depth_of_field));

are easy to write, easy to read, and efficient to execute in a way that would be impossible with a more flexible shade tree.

In the second case, where shaders are built by artists, tree structures are complicated to explain, difficult to visualise, and prone to error. In contrast, a linear layering can be instantly understood by even the least technical of artists, because this mental model is already familiar not only from Photoshop, but also from the most basic processes of working with physical paint. In my opinion, the more we can translate the power of programmable hardware into familiar artistic terms, the better the results we will get out of our existing artists, without having to turn them into programmers first!

Yet, most of the really interesting things just cannot be done using a linear model. Take for example an environment map, where the amount of reflection is controlled by the alpha channel of an earlier texture layer. Or a specular lighting shader, which takes the specular power from a constant register which belongs to the base texture material, except for those pixels where an alpha blended decal fragment has overwritten this with a locally varying material property.

To allow for such things, fragments must have the ability to import and export named control values. The actual plumbing happens automatically: whenever a fragment tries to import a value, this value gets hooked up to any previous exports of that same name, or to the default value if no such export is available. This behaviour resembles the flexibility of a full shade tree, while maintaining the simplicity of linear shader descriptions. It also provides a valuable guarantee that the results will always work. Any fragment can be used (and will function correctly) in isolation, but when several fragments are combined they will automatically communicate to develop more sophisticated abilities.

This level of robustness is particularly desirable when we want to programmatically add new fragments to the end of an artist-constructed material. In our current pipeline artists do not work directly with lighting shaders; however, they do have access to fragments that export control values such as the gloss amount, specular power, Fresnel factor, and the amount of subsurface scattering. The editing tool takes whatever material the artist has constructed, and concatenates a fragment which implements a single directional preview light. This fragment imports the various material parameters to make the preview as accurate as possible.

A game engine, on the other hand, is likely to use more sophisticated lighting techniques. In our case this happens to be deferred shading. We take the exact same materials as used in the art tool, and concatenate a deferred shading fragment, which imports the material attributes, then writes them out to the various channels of multiple render targets, with the actual lighting being evaluated later. Existing material fragments work unchanged despite the rendering being done in such a fundamentally different way, the only requirement being that everyone agree on a standard name for each import/export value.

HLSL Metafunctions

In order to communicate material parameters between fragments, we require two new HLSL keywords: "import" and "export". These are implemented purely by string manipulation, with our preprocessor replacing each call with suitable generated code.

The "export" keyword is very simple, as shown by the following fragment which implements a simple 2D base texture and exports its alpha channel as a "specular_amount" control value:

    ps 1_1

    void main(INPUT input, inout OUTPUT output)
    {
        float4 t = tex2D(color_map, input.uv);

        output.color.rgb = t.rgb;

        export(float, specular_amount, t.a);
    }

The preprocessor recognises the export call, and replaces it with a global variable assignment:

    // -------- shader base_texture0 --------
    float base_texture0_export_specular_amount;

    void base_texture0_main(base_texture0_INPUT input, inout OUTPUT 
                            output)
    {
        float4 t = tex2D(base_texture0_color_map, input.uv);

        output.color.rgb = t.rgb;
    
        // metafunction: export(float, specular_amount, t.a);
        base_texture0_export_specular_amount = t.a;
    }
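A minimal sketch of the export rewrite follows. It assumes the caller has already located the call and passes in the text between the parentheses; the struct and function names are hypothetical. Only the first two commas are treated as separators, so the exported expression may itself contain commas.

```cpp
#include <cstddef>
#include <string>

struct ExportCode
{
    std::string declaration;   // global variable placed before the function
    std::string assignment;    // statement that replaces the export() call
};

static std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return std::string();
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// args is the argument text, e.g. "float, specular_amount, t.a".
// fragmentPrefix is the mangled fragment name, e.g. "base_texture0".
ExportCode expandExport(const std::string& args, const std::string& fragmentPrefix)
{
    ExportCode result;
    const std::size_t c1 = args.find(',');
    if (c1 == std::string::npos)
        return result;                       // malformed call: emit nothing
    const std::size_t c2 = args.find(',', c1 + 1);
    if (c2 == std::string::npos)
        return result;

    const std::string type = trim(args.substr(0, c1));
    const std::string name = trim(args.substr(c1 + 1, c2 - c1 - 1));
    const std::string expr = trim(args.substr(c2 + 1));
    const std::string global = fragmentPrefix + "_export_" + name;

    result.declaration = type + " " + global + ";";
    result.assignment = global + " = " + expr + ";";
    return result;
}
```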

Later on, another fragment might try to import the specular amount value:

    float spec = 0;
    
    import(specular_amount, spec += specular_amount);

Which the preprocessor turns into:

    float spec = 0;
    
    // metafunction: import(specular_amount,spec += specular_amount);
    spec += base_texture0_export_specular_amount;

The content of the import call is expanded once for each matching export. If no other fragment has exported such a value, no such expansion will be available, and the default spec = 0 is used instead. If more than one fragment has exported the value, multiple lines of code are generated (one for each fragment that has exported the value). In the example above this will have the effect of spec being a sum of all the different values. It is up to the caller to decide how the values should be combined: adding is often appropriate, or multiplying, or performing a function call, or perhaps just an assignment that discards all but the most recent value.
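The expansion rule can be sketched as follows, assuming the generator keeps a list of the mangled fragment names that have exported each value so far. Both function names are invented for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Replaces whole-identifier occurrences of name with replacement.
std::string replaceIdentifier(const std::string& text, const std::string& name,
                              const std::string& replacement)
{
    if (name.empty())
        return text;

    auto isIdent = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };

    std::string out;
    std::size_t pos = 0;
    for (;;)
    {
        const std::size_t hit = text.find(name, pos);
        if (hit == std::string::npos)
            break;
        const std::size_t end = hit + name.size();
        const bool whole = (hit == 0 || !isIdent(text[hit - 1])) &&
                           (end == text.size() || !isIdent(text[end]));
        out.append(text, pos, hit - pos);
        out += whole ? replacement : name;
        pos = end;
    }
    out.append(text, pos, std::string::npos);
    return out;
}

// Emits the import body once per exporting fragment, in chain order.
// With no exporters the result is empty, leaving the caller's default.
std::string expandImport(const std::string& name, const std::string& body,
                         const std::vector<std::string>& exporters)
{
    std::string out;
    for (const std::string& prefix : exporters)
        out += replaceIdentifier(body, name, prefix + "_export_" + name) + ";\n";
    return out;
}
```

Calling expandImport with the body "spec += specular_amount" and two exporters produces two accumulation statements, so spec ends up as the sum of both exported values.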

The code that is generated by this construction is often full of redundancies, such as the line that adds to zero in the above example, or a global variable generated by export that will never actually be imported by later fragments. Fortunately, the compiler is smart and will fix such things for us.

Adaptive Fragments

It is often useful for a shader fragment to be able to change its behaviour based on the context in which it is being used. For instance:

A fragment that uses one texture, a set of UV coordinates, and a scalar fade value. This works fine in the ps 1.1 model, but needs two interpolator channels because ps 1.1 does not allow a single interpolator to be used both as a texture lookup and as a direct input. If we were compiling for ps 2.0, however, it would be more efficient to pack our data into a single xyz interpolator. It would be nice if we could still support ps 1.1 whenever possible, and in those cases where another fragment requires us to compile for ps 2.0, switch to a different implementation that takes better advantage of this more powerful hardware model.
A lighting shader. Should we do our lighting per vertex, or per pixel? It depends on the context. Per vertex lighting can make sense for distant or highly tessellated objects. But when using a normal map, it would not make sense to evaluate the lighting anything other than on a per-pixel basis. It would be nice if a single fragment could adapt to both situations.

All fragments provide a list of defines in their interface block. Some of these are used to send requests to the code generator: for instance, asking for input normals to be made available, or declaring the intention to use multiple rendertarget outputs. Another use of defines is for communication with other fragments, declaring facts like "I provide a perturbed normal for each pixel", which can modify the behaviour of any later fragments in the chain.

When generating shader code, the defines set by each input fragment are merged into a single list, along with the target shader version. Any aspect of the shader can be tagged with conditionals that test this list of defines, so that it can select different input constants, interpolators, or blocks of shader code depending on the context in which it is being used. In addition, the combined list is #define'd at the start of the generated HLSL program, so it can be tested using preprocessor conditionals within the code itself.
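The merge-and-test step might be sketched like this. It assumes that conditions are limited to a bare define name or a name negated with "!", as in the examples in this article, and that the target shader version is represented as just another define. All names here are hypothetical.

```cpp
#include <set>
#include <string>
#include <vector>

using DefineSet = std::set<std::string>;

// Merges the defines from every fragment in the chain, and adds the
// target shader version as one more define.
DefineSet mergeDefines(const std::vector<DefineSet>& perFragment,
                       const std::string& targetVersion)
{
    DefineSet merged;
    for (const DefineSet& defines : perFragment)
        merged.insert(defines.begin(), defines.end());
    merged.insert(targetVersion);
    return merged;
}

// Evaluates "name" or "!name". An empty condition always passes.
bool testCondition(const std::string& condition, const DefineSet& defines)
{
    if (condition.empty())
        return true;
    if (condition[0] == '!')
        return defines.count(condition.substr(1)) == 0;
    return defines.count(condition) != 0;
}

// Builds the #define lines that go at the top of the generated program.
std::string definesHeader(const DefineSet& defines)
{
    std::string header;
    for (const std::string& define : defines)
        header += "#define " + define + "\n";
    return header;
}
```

Every conditional annotation in an interface block, and every conditional code block header, would be passed through testCondition before it is allowed to contribute to the generated shader.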

The following is an example of a context sensitive hemisphere lighting fragment that can work either per vertex or per pixel. To decide if the input constants should be made available to either the vertex shader or the pixel shader, we test the ppl (which stands for per pixel lighting) define. The fragment requests a color interpolator channel if it is doing per-vertex lighting, and provides alternative blocks of shader code for each possible situation:

    interface()
    {
        $name = light_hemisphere

        $params = [ ambient, sky, diffuse ]

        $ambient = [ color, vs="!ppl", ps="ppl" ]
        $sky     = [ color, vs="!ppl", ps="ppl" ]
        $diffuse = [ color, vs="!ppl", ps="ppl" ]

        $interpolators = color
    
        $color = [ color, enable="!ppl" ]
    } 


    // the core lighting function might be wanted in the vertex or pixel shader
    vs (!ppl),
    ps (ppl)

    float3 $light(float3 normal)
    {
        float upness = 0.5 + normal.y * 0.5;
        float3 hemisphere = lerp(ambient, sky, upness);
        float d = 0.5 - dot(normal, WorldLightDir) * 0.5;
        return saturate((hemisphere + d * diffuse) * 0.5);
    }


    // per vertex lighting shader
    vs (!ppl)

    void main(out OUTPUT output)
    {
        output.color.rgb = $light(gInput.normal);
        output.color.a = 1;
    }


    // when doing vertex lighting, just modulate each pixel by the vertex color
    ps (!ppl)

    void main(INPUT input, inout OUTPUT output)
    {
        output.color.rgb = saturate(output.color.rgb * 
                                    input.color * 2);
    }


    // per pixel lighting shader
    ps (ppl)

    void main(inout OUTPUT output)
    {
        output.color.rgb = saturate(output.color.rgb * 
                           $light(gInput.normal) * 2);
    }

If we compile this fragment on its own, ppl has not been defined anywhere, so the core light function is used to generate a simple hemisphere lighting vertex shader:

    vs_1_1
    def c8, 0.5, -0.5, 0, 1
    dcl_position v0
    dcl_normal v1
    mad r0.w, v1.y, c8.x, c8.x
    mov r0.xyz, c6
    add r0.xyz, r0, -c5
    dp3 r1.x, v1, c4
    mad r0.xyz, r0.w, r0, c5
    mad r0.w, r1.x, c8.y, c8.x
    mad r0.xyz, r0.w, c7, r0
    mul r0.xyz, r0, c8.x
    max r0.xyz, r0, c8.z
    min oD0.xyz, r0, c8.w
    dp4 oPos.x, v0, c0
    dp4 oPos.y, v0, c1
    dp4 oPos.z, v0, c2
    dp4 oPos.w, v0, c3
    mov oD0.w, c8.w
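For checking values like these by hand, a CPU port of the $light function can be handy. The sketch below mirrors the HLSL line by line; the Float3 type and the function names are made up for the example, and the light direction is passed as a parameter instead of being read from WorldLightDir.

```cpp
#include <algorithm>

struct Float3 { float x, y, z; };

static float dot(const Float3& a, const Float3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static float saturate(float v)
{
    return std::min(1.0f, std::max(0.0f, v));
}

// CPU version of the hemisphere lighting function from the fragment.
Float3 hemisphereLight(const Float3& normal, const Float3& lightDir,
                       const Float3& ambient, const Float3& sky,
                       const Float3& diffuse)
{
    const float upness = 0.5f + normal.y * 0.5f;
    const float d = 0.5f - dot(normal, lightDir) * 0.5f;

    // Evaluates one color channel.
    auto channel = [&](float a, float s, float k) {
        const float hemisphere = a + (s - a) * upness;   // lerp(ambient, sky, upness)
        return saturate((hemisphere + d * k) * 0.5f);
    };

    return { channel(ambient.x, sky.x, diffuse.x),
             channel(ambient.y, sky.y, diffuse.y),
             channel(ambient.z, sky.z, diffuse.z) };
}
```

With a black ambient color, white sky and diffuse colors, and a light direction of (0, -1, 0), an upward-facing normal gives full white, a downward-facing normal gives black, and a sideways normal gives 0.5.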

But now, let's introduce a tangent space normal mapping fragment:

    interface()
    {
        $name = normalmap
        $defines = ppl
        $textures = normalmap
        $vertex = uv
        $interpolators = [ uv, tangent, binormal ]
        $uv = [ 2, want_tangentspace=true ]
    }
    

    vs 1_1
    
    void main(INPUT input, out OUTPUT output)
    {
        output.uv = input.uv;

        output.tangent  = mul(input.uv_tangent,  NormalTrans);
        output.binormal = mul(input.uv_binormal, NormalTrans);
    }


    ps 2_0

    void main(INPUT input, inout OUTPUT output)
    {
        float3 n = tex2D(normalmap, input.uv);

        gInput.normal = normalize(n.x * input.tangent  +
                                  n.y * input.binormal +
                                  n.z * gInput.normal);
    }

Note the annotation on the uv parameter in the interface block, which requests that tangent and binormal vectors be provided along with the texture coordinates. This fragment does not actually do anything on its own, other than modifying the value of gInput.normal, which can be used as input by subsequent fragments.

If we now ask our framework to generate a shader that combines normalmap and light_hemisphere, the adaptive mechanism swings into action. Because the normalmap fragment has defined ppl, different parts of the lighting code are included, resulting in a normal mapped hemisphere lighting pixel shader:

    ps_2_0
    def c4, 0.5, -0.5, 1, 0
    dcl t0.xyz
    dcl t1.xy
    dcl t2.xyz
    dcl t3.xyz
    dcl_2d s0
    texld r0, t1, s0
    mul r1.xyz, r0.y, t3
    mad r1.xyz, r0.x, t2, r1
    mad r1.xyz, r0.z, t0, r1
    nrm r0.xyz, r1
    dp3 r1.x, r0, c0
    mad r0.w, r0.y, c4.x, c4.x
    mov r0.xyz, c2
    add r0.xyz, r0, -c1
    mad r0.xyz, r0.w, r0, c1
    mad r0.w, r1.x, c4.y, c4.x
    mad r0.xyz, r0.w, c3, r0
    mul_sat r0.xyz, r0, c4.x
    mov r0.w, c4.z
    mov oC0, r0

Analysis and Conclusion

Our approach has proven to be successful at encapsulating a wide range of shader behaviours, allowing fragments to be combined in a multitude of ways, entirely automatically.

It is remarkably robust: when I first implemented deferred shading, my existing texture blending, normal mapping, and animation fragments continued to work without a single modification!

It is efficient, too: I've yet to find a single shader that could be manually optimised in ways that were impossible within the fragment combining system.

One disadvantage is the need to describe a precise interface to each fragment, which makes it hard to plug in third-party shader code. In practice it rarely takes more than a few minutes to add the required annotations, but this is still an irritation. Such things could be streamlined if the system were built on top of the D3DX effects framework, but that was not an important goal for me.

Debugging can be awkward, because compiler error messages refer to the generated shader rather than to your input fragments. Decent tool support for viewing the intermediate code is crucial.

Ultimately it all comes down to the numbers. If you have five, ten, or even fifty shaders, this system is probably not for you. If you have thousands, however, automation is your friend.
