Panfrost on the RK3399 (Meow!)

2 Sep 2018

In January 2018, I began using my C201 full-time. In June 2018, its charging port broke, apparently due to physical damage, and I was forced to discontinue my school laptop and Mali T760 development machine.

Rest in peace, Veyron Speedy.

In its place, I ~~purchased~~ adopted Kevin, better known by its brand name, the Samsung Chromebook Plus (v1). Kevin contains a 64-bit Rockchip RK3399 processor, a substantial upgrade over the C201’s Rockchip RK3288, with a Mali T860 graphics processor. The T860, a new version of Mali Midgard in between its predecessor T760 and its successor Bifrost, was not supported in Panfrost, the community-led free software driver for modern Mali GPUs ubiquitous in phones and Chromebooks.

Meanwhile, back in May, TL Lim of Pine64 generously offered to send me a RK3399 development board, a ROCKPRO64. During my university days, the board arrived… and so work begun on T860 support.

The new hardware contains a number of challenges unseen in its predecessors. For instance, the RK3288 is a 32-bit processor (armv7) whereas the RK3399 is 64-bit (armv8). While Midgard is natively 64-bit, there are many memory-saving optimizations specifically for 32-bit machines like the RK3288. For instance, in the 32-bit command stream, the field encoding line width is stuffed in between a pointer at the end of the header and the beginning metadata of the tiler payload, in a spot that normally would contain padding to align everything. On a 64-bit machine, however, the preceding pointer is a full 8 bytes, and there is no room for the line width hack to function; this field is instead moved from the first to the last index in the data structure.

Another small change that took far too long to figure out was the modified kernel API for importing memory on 64-bit systems. Imports are necessary to setup the framebuffer correctly, avoiding expensive fullscreen blits. Once I read through the 64-bit portions of the kernel module, it was clear how to implement the new calls, but this change nevertheless slowed initial progress.

Another more challenging departure was the FBDs. From the kernel sources, we know there are two mysterious acronyms, SFBD and MFBD. We figured out these referred to the Framebuffer Descriptor. Why two? It turns out that older Midgard used the SFBD, and newer Midgard and Bifrost use the MFBD. The change coincided with the introduction of multi-render target (MRT) support in the newer chips. Accordingly, we are guessing these are the Single Framebuffer Descriptor (SFBD) and Multiple Framebuffer Descriptor (MFBD), though we can of course never be sure.

In any event, while the T760 supported MFBDs, it was possible to use a 32-bit SFBD instead. On the newer T860, due to the upgraded development hardware, we had no choice but to support the newer data structure. Luckily, Connor had already decoded the majority of the MFBD as part of his investigation in Bifrost. I was easily able to adapt his changes for the T860-family of processors, et voila, the command stream was making sense.

After decoding the new parts of the command stream, I implemented support in the Gallium driver for the new structures; as structures like the framebuffer descriptor are so fundamental, this implied significant changes. But after some fervent coding, the driver for T860 reached feature parity with Panfrost on the T760.

Turning our attention to shaders, it turns out there are no significant differences in the shader core between these two chips. That is, the minute I could replay a triangle on my laptop, we had free shaders. Woohoo!

Once the new hardware was setup, I turned my focus to cleaning up the codebase, implementing more of OpenGL ES 2.0, and debugging issues in the existing implementations. I squashed bugs (“Ow!”) ranging from incorrect stencil testing to broken colour masking in the absence of blending to missing movs in certain shaders. In terms of new features, I implemented the scissor test and added support for shaders with multiple varyings, a common pattern which did not show up in the simpler 3D tests. Multi-varying support required changes in both the command stream driver and the compiler, although we’re now able to understand why varyings are encoded in the (quirky) way they are.

I also (finally) fixed glReadPixels support, so we can start testing with dEQP and Piglit. The results remain somewhat depressing, but it’s a starting point for further driver debugging as we make the jump from proof-of-concept to production code.

Soup’s up, everypony!

(The soup is code.)

On the Bifrost side, Connor Abbott and Lyude Paul have been busy!

After decoding more of the Bifrost ISA, Connor turned his attention to the Bifrost command-stream, an iterative improvement on the Midgard command-stream. Building on my earlier work with the Midgard command stream, Connor identified analogue structures between the two versions, as well as marking off new structures including the aforementioned MFBD.

After decoding these initial structures, he implemented support in panwrap for generating replay-style traces of Bifrost command streams, again building on the infrastructure originally built for Midgard. Although he did not make direct use of the replay facilities in panwrap, these changes did enable a great deal of the Bifrost command stream to be elucidated.

He then went a step further, beyond OpenGL ES 2.0 vertex shaders and fragment shaders, venturing into the land of… compute shaders. Dun dun dun! (“What, no scary music?” “Sorry, Proselint says you already violated the 30ppm of exclamations rule. We are therefore forbidden from aiding your overexcited shenanigans.” “Fiiine. Do I at least get a singing entourage?” “Definitely not.”)

Incidentally, the initial investigations into compute shaders have shed light on vertex shaders, which are internally structured like compute jobs on both Bifrost and Midgard, with each vertex corresponding to a single invocation. However, Connor had a more immediate purpose in mind with his pursuit of compute shaders: gaining the ability to execute arbitrary shader binaries on a Bifrost board, implementing a “shader runner” to use the trochaic name. Midgard never needed this tool, as I have generally implemented the command stream parts of the driver in advance of the compiler, but Bifrost’s ISA decoding is considerably further along than its graphics pipeline. Through his earlier work with panwrap, he was successfully able to build such a tool, directly submitting simple compute jobs with offline precompiled shaders to the Bifrost GPU!

That does beg the question – where do these shader binaries come from, as work has not yet begun on the Bifrost compiler? As you may know, Lyude has been at work implementing a Bifrost assembler, a large task given the shear complexity of Bifrost, especially in comparison to Midgard – and a task made even harder without the ability to test on neither real hardware nor emulators. Paired with Connor’s shader runner, however, the assembler’s shader binaries can be tested, and her assembler is showing signs of life! Pairing these two components have enabled the first free shaders to run on Bifrost. While there are a few unimplemented opcodes remaining, the assembler is working well.

Returning to the shader running, beyond its immense value as a proof-of-concept, a side benefit of this instrumentation is enabling fine-grained ISA decoding. Whereas Midgard uses high-level arithmetic opcodes, with dedicated instructions for operations like a reciprocal-square root, Bifrost uses a minimalist architecture, dividing complex high-level operations into many small instructions. However, it can be difficult to discern the individual meaning of each micro opcode from simply studying disassembled compiled shaders. Nevertheless, with the new ability to assemble and run shaders, it is possible to test micro operations individually, determining their functions outside the high-level sequence. Through this technique, Connor has been able to understand the fine-grained details of operations like division on Bifrost.

All in all, our understanding of Bifrost is beginning to converge with that of Midgard. The next step, of course, will be to implement the Bifrost compiler and extend the Midgard driver to support Bifrost command streams. But hey, if someone had told me a year ago that we could run code natively, bloblessly, on virtually every Mali device, I don’t think I would have believed them. But between our work on Midgard and Bifrost, not to mention the Lima project’s work on Utgard, this dream has become a reality. In the words of Equestria Girls, what a strange new world…

But let’s not get ahead of ourselves. On the Mali T860 (Midgard), Freedreno’s test-cat and glmark2’s equivalent build test run successfully. The shaded cat screenshot at the beginning of the post is a native blobless render on the T860. Of course, the tests demoed on previous blog posts, like the shiny textured cube, continue to work on the new hardware.

Oh, and save for a funky viewport (corrected here) and some sporadic artefacts, it runs ’gears:

es2gears on the T860 with the blobless Panfrost stack — `es2gears` on the T860 with the blobless Panfrost stack

Meow!

Back to home