# A Panfrostian October

20 Oct 2018

In the past month-and-a-half since the previous update on the free (open source) Panfrost driver for Mali GPUs, progress has been fervent! During the interim, Lyude Paul and I presented about the driver at the X.Org Developer’s Conference. Check out the talk (slides). Benchmarks included!

Since the talk, we’ve been working on Panfrost like mad. Literally – I was pretty angry by that distorted texture bug! The culprit proved to be the compiler emitting imov (integer move) instructions rather than fmov (floating-point move), apparently causing precision degradation while processing texture coordinates.

In any event, implementing a few new features culminated in the above jellyfish demo. This jellyfish scene is a part of glmark2-es2, a set of OpenGL ES programs used to test and benchmark GPU drivers. For Panfrost, I use glmark as a set of milestones to work towards while developing the driver. glmark is comprehensive and can do just about anything I need while developing graphics drivers, except for making me a peanut butter and jellyfish sandwich. Er, never mind, it looks like they just added an OpenGL sandwich scene. Bonus!

Aside from the texturing issue, the next major challenge faced by the jellyfish scene was “uniform spilling”. Many GPUs require a special “load uniform” instruction in the shader to access a uniform; Midgard does not… usually. Instead, up to 16 (32-bit vec4) uniforms can be preloaded into registers, which makes them free to access in shaders. Yay performance!

The problem with this uniform loading scheme is when a shader needs more than 16 uniforms, which is frequent in vertex shaders loading large matrices. A single 4x4 matrix eats up 4 uniforms; 4 such matrices and there are no uniforms left for driver-internal state. Oops.

The solution is surprisingly simple: rather than using this register-based fast path, use the traditional “load uniform” instructions which have no practical limit on the number of uniforms, at the expense of a nominal performance hit. Specifically, the driver apparently switches from using uniforms to internally generating their crazy cousin, uniform buffer objects. This technique correctly handles these complex shaders.

However, Panfrost goes one step further than the blob appears to! Since uniforms and uniform buffer objects can be used simultaneously, we can use the same memory block for both kinds of uniforms – two for the price one! Mapping the uniform memory like this, the compiler is free to use the register fast path for the first 16 uniforms and seamlessly fallback on the general slow path for the rest, shaving off sixteen “load uniform” instructions on a shader that otherwise has to “spill” to dedicated instructions. A win!

Another major change prompted by a bug bringing up jellyfish surrounded register allocation. Previously, Panfrost used an ersatz, home-rolled register allocator, good enough for spinning gears but a little clunky. Debugging jellyfish uncovered a buggy corner case; the naive solution unfortunately increased register pressure, a performance issue. But there’s no need to settle for naive!

“Bayes”ing the work on Mesa’s classy register allocation library, I scrapped the old code and wrote a more robust, sophisticated register allocator. The new allocator, using graph coloring rather than ad hoc liveness checks, already generates better code with lower register pressure, neatly resolving the issue found in jellyfish. Plus, it will permit further register-related progress when we tackle register spilling and parallelism in stores and texture instructions. The new code is considerably longer, but it will pay off as shader complexity creeps up.

A few miscellaneous bugs later, et voila, Jellyfish running on GPU-accelerated using only free software on a Rockchip RK3399 laptop.

## Refract

It’s a few hops away from the jellyfish, yet I carrot believe Panfrost is running glmark’s refract demo!

Like jellyfish, uniform spilling is needed for refract. We have been able to render the bunny itself for quite a while, as demoed at the X.Org Developer’s Conference, linked above. Where’s the issue?

Framebuffer objects.

Remember how I mentioned back in April that framebuffer objects were one of the few major features missing in Panfrost?

We can strike that off the list now!

For those unfamiliar, by default, a 3D application renders to the screen. When the application instead renders off-screen, to be later read as a texture, it’s said to render to a “framebuffer object” in graphics lingo. Since Mali is a “render-only” GPU, independent of the display, we already have the freedom to render to arbitrary buffers. Theoretically, rather than telling the GPU to render on-screen, we simply it to render literally anywhere else but the screen. Pretty simple!

Of course, it’s never that simple. To reduce power consumption by saving memory bandwidth, Mali GPUs feature “ARM Framebuffer Compression”, or AFBC for short. As advertised, AFBC is a lossless compression scheme for compressing the framebuffer in-flight from one part of the chip to another.

Only across parts of the chip? Always interested in reducing power, it turns out the blob is aggressive about enabling AFBC anywhere it can. Previously, we could ignore AFBC entirely, since the X11 blob is apparently unable to use AFBC for display. However, since framebuffer objects are off-screen and stored internal to the driver, it doesn’t matter to the rest of the system what format is used. Accordingly, we have never seen a framebuffer object that doesn’t use AFBC, so I had to implement AFBC support in the driver.

Furthermore, framebuffer objects mean juggling multiple framebuffers at once. Together with AFBC support, this additional complexity required a considerable refactor of the command stream driver code to support the extra bookkeeping. Many coding afternoons and a transatlantic flight later, Panfrost implemented preliminary (color) framebuffer objects.

However, to add to the joy, refract does not only use a color attachment to a framebuffer object, but also a depth attachment. Conceptually, the two modes of operation are identical; in practice, they require a different encoding in the command stream and an additional increase in bookkeeping leading to yet another (small) refactor. Plenty of keyboard mashing and brain freezing later, Panfrost supported depth attachments to framebuffer objects as well as color. And poof, a refraction bunny!

Now, for the piece de resistance, rather than a glmark demo, we have the reference Wayland compositor, implementing a simple GPU-accelerated desktop. With a small workaround added to the driver, Panfrost is able to run …