# Midgard Shaders with the Free NIR Compiler

18 Mar 2018

In my last update on the Panfrost project, I showed an assembler and disassembler pair for Midgard, the shader architecture for Mali Txxx GPUs. Unfortunately, Midgard assembly is an arcane, unwieldly language, understood by Connor Abbott, myself, and that’s about it besides engineers bound by nondisclosure agreements. You can read the low-level details of the ISA if you’re interested.

In any case, what any driver really needs is not just an assembler but a compiler. Ideally, such a compiler would live in Mesa itself, capable of converting programs written in high level GLSL into an architecture-specific binary.

Such a mammoth task ought to be delayed until after we begin moving the driver into Mesa, through the Gallium3D infrastructure. In any event, back in January I had already begun such a compiler, ingesting NIR, an intermediate representation coincidentally designed by Connor himself. The past few weeks were spent improving and debugging this compiler until it produced correct, reasonably efficient code for both fragment and vertex shaders.

As of last night, I have reached this milestone for simple shaders!

As an example, an input fragment shader written in GLSL might look like:

uniform vec4 uni4;

void main() {
gl_FragColor = clamp(
vec4(1.3, 0.2, 0.8, 1.0) - vec4(uni4.z),
0.0, 1.0);
}

Through the fully free compiler stack, passed through the free diaassembler for legibility, this yields:

vadd.fadd.sat r0, r26, -r23.zzzz
br_cond.write +0
fconstants 1.3, 0.2, 0.8, 1

vmul.fmov r0, r24.xxxx, r0
br_cond.write -1

This is the optimal compilation for this particular shader; the majority of that shader is the standard fragment epilogue which writes the output colour to the framebuffer.

For some background on the assembly, Midgard is a Very Long Instruction Word (VLIW) architecture. That is, multiple instructions are grouped together in blocks. In the disassembly, this is represented by spacing. Each line is an instruction, and blank lines delimit blocks.

The first instruction contains the entirety of the shader logic. Reading it off, it means “using the vector addition unit, perform the saturated floating point addition of the attached constants (register 26) and the negation of the z component of the uniform (register 23), storing the result into register 0”. It’s very compact, but comparing with the original GLSL, it should be clear where this is coming from. The constants are loaded at the end of the block with the fconstants meta instruction.

The other four instructions are the standard fragment epilogue. We’re not entirely sure why it’s so strange – framebuffer writes are fixed from the result of register 0, and are accomplished with a special loop using branching instruction. We’re also not sure why the redundant move is necessary; Connor and I suspect there may be a hardware limitation or errata preventing a br_cond.write instruction from standing alone in a block. Thankfully, we do understand more or less what’s going on, and they appear to be fixed. The compiler is able to generate it just fine, including optimising the code to write into register 0.

As for vertex shaders, well, fragment shaders are simpler than vertex shaders. Whereas the former merely has the aforementioned weird instruction sequence, vertex epilogues need to handle perspective division and viewport scaling, operations which are not implemented in hardware on this embedded GPU. When this is fully implemented, it will be quite a bit more difficult-to-optimise code in the output, although even the vendor compiler does not seem to optimise it. (Perhaps in time our vertex shaders could be faster than the vendor’s compiled shader due to a smarter epilogue!)

attribute vec4 vin;
uniform vec4 u;

void main() {
gl_Position = (vin + u.xxxx * vec4(0.01, -0.02, 0.0, 0.0)) * (1.0 / u.x);
}

Through the same stack and a stub vertex epilogue which assumes there is no perspective division needed (that the input is normalised device coordinates) and that the framebuffer happens to be the resolution 400x240, the compiler emits:

vmul.fmov r1, r24.xxxx, r26
fconstants 0, 0, 0, 0

ld_attr_32 r2, 0, 0x1E1E

vmul.fmul r4, r23.xxxx, r26
fconstants 0.01, -0.02, 0, 0

lut.frcp r6.x, r23.xxxx, #2.61731e-39
fconstants 0.01, -0.02, 0, 0

vmul.fmul r7, r5, r6.xxxx

vmul.fmul r9, r7, r26
fconstants 200, 120, 0.5, 0

fconstants 200, 120, 0.5, 1

st_vary_32 r1, 0, 0x1E9E

There is a lot of room for improvement here, but for now, the important part is that it does work! The transformed vertex (after scaling) must be written to the special register 27. Currently, a dummy varying store is emitted to workaround what appears to be yet another hardware quirk. (Are you noticing a trend here? GPUs are funky.). The rest of the code should be more or less intelligible by looking at the ISA notes. In the future, we might improve the disassembler to hide some of the internal encoding peculiarities, such as the dummy r24.xxxx and #0 arguments for fmov and frcp instructions respectively.

All in all, the compiler is progressing nicely. It is currently using a simple SSA-based intermediate representation which maps one-to-one with the hardware, minus details about register allocation and VLIW. This architecture will enable us to optimise our code as needed in the future, once we write a register allocators and instruction scheduler. A number of arithmetic (ALU) operations are supported, and although there is much work left to do – including generating texture instructions, which were only decoded a few weeks ago – the design is sound, clocking in at a mere 1500 lines of code.

The best part, of course, is that this is no standalone compiler; it is already sitting in our fork of mesa, using mesa’s infrastructure. When the driver is written, it’ll be ready from day 1. Woohoo!

Source code is available; get it while it’s hot!

Getting the shader compiler to this point was a bigger time sink than anticipated. Nevertheless, we did do a bit of code cleanup in the meanwhile. On the command stream side, I began passing memory-resident structures by name rather than by address, slowly rolling out a basic watermark allocator. This step is revealing potential issues in the understanding of the command stream, preparing us for proper, non-replay-based driver development. Textures still remain elusive, unfortunately. Aside from that, however, much of – if not most – of the command stream is well-understood now. With the help of the shader compiler, basic 3D tests like test-triangle-msoothed are now almost entirely understood and for the most part devoid of magic.

Lyude Paul has been working on code clean-up specifically regarding the build systems. Her goal is to let new contributors play with GPUs, rather than fight with meson and CMake. We’re hoping to attract some more people with low-level programming knowledge and some spare time to pitch in. (Psst! That might mean you! Join us on IRC!)

On a note of administrivia, the project name has been properly changed to Panfrost. For some history, over the summer two driver projects were formed: chai, by me, for Midgard; and BiOpenly, by Lyude et al, for Bifrost. Thanks to Rob Clark’s matchmaking, we found each other and quickly realised that the two GPU architectures had identical command streams; it was only the shader cores that were totally redesigned and led to the rename. Thus, we merged to join efforts, but the new name was never officially decided.

We finally settled on the name “Panfrost”, and our infrastructure is being changed to reflect this. The IRC channel, still on Freenode, now redirects to #panfrost. Additionally Freedesktop.org rolled out their new GitLab CE instance, of which we are the first users; you can find our repositories at the Panfrost organisation on the fd.o GitLab.

On Monday, our project was discussed in Robert Foss’s talk “Progress in the Embedded GPU Ecosystem”. Foss predicted the drivers would not be ready for another three years.

Somehow, I have a feeling it’ll be much sooner!