On Vulkan:

    push constants can be copied on the CPU
    UBO data must be copied on the GPU, especially when update-after-bind is involved (DX12 games)
    textureSize() needs to happen on the GPU for bindless textures
    some sysvals can be copied on the CPU, but some need to be copied on the GPU

On v10, we can copy data with the CSF more efficiently than with a compute kernel.

In GL:

    uniforms can be copied on the CPU, unless threaded_context is used
    UBO data should be copied on the GPU
    textureSize() can happen anywhere; currently we do it on the CPU
    some sysvals need GPU copies for indirect dispatch, and even more do for indirect draws

To deal with this, we can use

    CPU only munging. This is slow for CPU-bound apps when pushing UBOs and doesn't work for indirect/bindless.

    CPU munging + UBO copy shaders. This improves CPU perf for UBO pushes at the expense of extra GPU overhead. Doesn't help with txs.

    CPU munging + nir_opt_preamble. This improves CPU perf for UBO pushes at the expense of extra GPU overhead. Also slightly improves GPU perf for certain patterns. Makes txs trivial to handle even with bindless. nir_opt_preamble will likely make (much) better decisions on what to push than our existing push heuristic, which may improve GPU perf in some cases.

    CPU munging + nir_opt_preamble + UBO copy shaders. Even more GPU overhead.

    nir_opt_preamble only. GPU overhead is possibly no worse than with CPU munging, and CPU overhead is strictly lower.
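
The trade-offs among the five options above can be tabulated in a small model. This is only a sketch of the reasoning in this note, not driver code: the `Strategy` class and its property names are made up for illustration, and the properties encode only what the list above claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """One combination of the push mechanisms listed above (hypothetical model)."""
    name: str
    cpu_munging: bool
    preamble: bool       # nir_opt_preamble
    copy_shaders: bool   # UBO copy shaders

    @property
    def pushes_ubos_on_gpu(self) -> bool:
        # UBO pushes leave the CPU once either GPU mechanism is in use.
        return self.preamble or self.copy_shaders

    @property
    def handles_bindless_txs(self) -> bool:
        # Per the list above, only preambles make txs work with bindless.
        return self.preamble


STRATEGIES = [
    Strategy("CPU only", True, False, False),
    Strategy("CPU + copy shaders", True, False, True),
    Strategy("CPU + preamble", True, True, False),
    Strategy("CPU + preamble + copy shaders", True, True, True),
    Strategy("preamble only", False, True, False),
]

for s in STRATEGIES:
    print(f"{s.name:32} gpu_ubo_push={s.pushes_ubos_on_gpu!s:5} "
          f"bindless_txs={s.handles_bindless_txs}")
```

The model deliberately leaves out the relative GPU overhead of each option, since the note only describes it qualitatively.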

For Vulkan on v9, relying on nir_opt_preamble is a clear win.

For Vulkan on v10, the same applies, and we can maybe detect copy-only preambles and replace them with equivalent CSF instructions.
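
Detecting a copy-only preamble could look roughly like the check below. It runs over a toy instruction list rather than real NIR: the `Instr` type and the `load`/`alu`/`store` op names are hypothetical stand-ins, and how a matching preamble would be turned into CSF instructions is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    """Toy preamble instruction (hypothetical, not a NIR type)."""
    op: str                      # "load", "alu" or "store"
    dest: Optional[int] = None   # value defined, None for stores
    srcs: Tuple[int, ...] = ()   # values consumed


def is_copy_only(preamble: Sequence[Instr]) -> bool:
    """True if every store writes back an unmodified, constant-address load."""
    loaded = set()
    for instr in preamble:
        if instr.op == "load":
            if instr.srcs:           # address depends on another value
                return False
            loaded.add(instr.dest)
        elif instr.op == "store":
            if not all(src in loaded for src in instr.srcs):
                return False
        else:                        # any ALU (or unknown op) disqualifies
            return False
    return True


plain_copy = [Instr("load", dest=0), Instr("store", srcs=(0,))]
with_alu = [Instr("load", dest=0), Instr("load", dest=1),
            Instr("alu", dest=2, srcs=(0, 1)), Instr("store", srcs=(2,))]
print(is_copy_only(plain_copy), is_copy_only(with_alu))
```

Being conservative here is the safe choice: a false negative just means the preamble runs as a shader, as it would have anyway.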

For GL with threaded_context and no cached UBOs, we need GPU-based push. It's a toss-up between copy shaders and preambles. If we're going to use GPU-based stuff, we may as well go all in and use preambles and convert txs.

For GL without threaded_context and only uniforms, but with uniform-on-uniform math, it's a toss-up. Preambles save some ALU relative to CPU-based copies, but ALU is cheap so IDK if anyone cares. GPU-based push in this case is strictly worse.

For GL without threaded_context, with only uniforms and no uniform-on-uniform math, preamble-only pushing is strictly worse. If we do the uniform copies on the CPU it's fine, and the preamble only makes a difference with txs, which is comparatively rare.
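
The case analysis above can be condensed into one lookup. This is a sketch only: the `Workload` fields and the returned strings are invented for illustration, and combinations this note does not discuss fall back to "not covered" instead of guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload, mirroring the cases above."""
    api: str                          # "vulkan" or "gl"
    threaded_context: bool = False
    cached_ubos: bool = False
    only_uniforms: bool = False       # no UBO pushes, just GL uniforms
    uniform_on_uniform: bool = False  # shader does math on uniforms only


def recommend(w: Workload) -> str:
    if w.api == "vulkan":
        # Same answer on v9 and v10; v10 may additionally use the CSF.
        return "nir_opt_preamble"
    if w.api != "gl":
        return "not covered"
    if w.threaded_context:
        if not w.cached_ubos:
            # GPU-based push is needed, so go all in on preambles.
            return "nir_opt_preamble"
        return "not covered"
    if w.only_uniforms:
        if w.uniform_on_uniform:
            return "toss-up"
        # Preambles would only matter for txs here.
        return "CPU uniform copies"
    return "not covered"


print(recommend(Workload("vulkan")))
print(recommend(Workload("gl", only_uniforms=True)))
```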

My preliminary conclusions are:

    Rely on nir_opt_preamble. It will push UBOs better than my crappy vendored code can push UBOs. It lets us handle txs in a sane way. It will optimize out uniform-on-uniform.
    Possibly copy sysvals and push constants / true uniforms on the CPU, depending on what's convenient.