Clip control on the Apple GPU

Neverball uses legacy “fixed function” OpenGL. Rather than supply programmable shaders like OpenGL 2, old OpenGL 1 applications configure a fixed set of graphics effects like fog and alpha testing. Modern GPUs don’t implement these features in hardware. Instead, the driver synthesizes shaders implementing the desired graphics. This translation is complicated, but we get it for “free” as an open source driver in Mesa. If we implement the modern shader pipeline, Mesa will handle fixed function OpenGL for us transparently. That’s a win for open source drivers, and a win for GPU acceleration on Asahi Linux.

To implement the modern OpenGL features, we rely on reverse-engineering the behaviour of Apple’s Metal driver, as we don’t have hardware documentation. Although Metal uses the same shader pipeline as OpenGL, it doesn’t support all the OpenGL features that the hardware does, which puts us in bind. In the past, I’ve relied on educated guesswork to bridge the gap, but there’s another solution… and it’s a doozy.

For motivation, consider the clip space used in OpenGL. In every other API on the planet, the Z component (depth) of points in the 3D world range from 0 to 1, where 0 is “near” and 1 is “far”. In OpenGL, however, Z ranges from negative 1 to 1. As Metal uses the 0/1 clip space, implementing OpenGL on Metal requires emulating the -1/1 clip space by inserting extra instructions into the vertex shader to transform the Z coordinate. Although this emulation adds overhead, it works for ANGLE’s open source implementation of OpenGL ES on Metal.

Like ANGLE, Apple’s OpenGL driver internally translates to Metal. Because Metal uses the 0 to 1 clip space, it should require this emulation code. Curiously, when we disassemble shaders compiled with their OpenGL implementation, we don’t see any such emulation. That means Apple’s GPU must support -1/1 clip spaces in addition to Metal’s preferred 0/1. The problem is figuring out how to use this other clip space.

We expect that there’s a bit toggling between these clip spaces. The logical place for such a bit is the viewport packet, but there’s no obvious difference between the viewport packets emitted by Metal and OpenGL-on-Metal. Ordinarily, we would identify the bit by toggling the clip space in Metal and comparing memory dumps. However, according to Apple’s documentation, there’s no way to change the clip space in Metal.

That’s an apparently contradiction. There’s no way to use the -1/1 clip space with Metal, but Apple’s OpenGL-on-Metal translator uses uses the -1/1 clip space. What gives?

Here’s a little secret: there are two graphics APIs called “Metal”. There’s the Metal you know, a limited API that Apple documents for App Store developers, an API that lacks useful features supported by OpenGL and Vulkan.

And there’s the Metal that Apple uses themselves, an internal API adding back features that Apple doesn’t want you using. While ANGLE implements OpenGL ES on the documented Metal, Apple can implement OpenGL on the secret Metal.

Apple does not publish documentation or headers for this richer Metal API, but if we’re lucky, we can catch a glimpse behind the curtain. The undocumented classes and methods making up the internal Metal API are still available in the production Metal binaries. To use them, we only need the missing headers. Fortunately, Objective-C symbols contain enough information to reconstruct header files, allowing us to experiment with undocumented methods with “extra” functionality inherited from OpenGL.

Compared to the desktop GPUs found in Intel Macs, Apple’s own GPU implements a slim, modern feature set mapping well to Metal. Most of the “extra” functionality is emulated. It is interesting to know the emulation happens in their Metal driver instead of their OpenGL frontend, but that’s unsurprising, as it allows their Metal drivers for Intel and AMD GPUs to implement the functionality natively. While this information is fascinating for “macOS hermeneutics”, it won’t help us with our Apple GPU mystery.

What will help us are the catch-all mystery methods named setOpenGLModeEnabled, apparently enabling “OpenGL mode”.

The render pipeline descriptor has such a method. That descriptor contains state that can change every draw. In some graphics APIs, like OpenGL with ARB_clip_control and Vulkan with VK_EXT_depth_clip_control, the application can change the clip space every draw. Ideally, the clip space state would be part of this descriptor.

We can test this optimistic guess by augmenting our Metal test bench to call [MTLRenderPipelineDescriptorInternal setOpenGLModeEnabled: YES].

It feels strange to call this hidden method. It’s stranger when the code compiles and runs just fine.

We can then compare traces between OpenGL mode and the normal Metal mode. Seemingly, enabling OpenGL mode toggles a plethora of random unknown bits. Even if one of them is what we want, it’s a bit unsatisfying that the “real” Metal would lack a proper [setClipSpace: MTLMinusOneToOne] method, rather than this blunt hack reconfiguring a pile of loosely related API behaviours.

Alas, for all the random changes in “OpenGL mode”, none seem to affect clipping behaviour.

Hope is not yet lost. There’s another setOpenGLModeEnabled method, this time in the render pass descriptor. Rather than pipeline state that can change every draw, this descriptor’s state can only change in between render passes. Changing that state in between draws would require an expensive flush to main memory, similar to the partial renders seen elsewhere with the Apple GPU. Nevertheless, it’s worth a shot.

Changing our test bench to call [MTLRenderPassDescriptorInternal setOpenGLModeEnabled: YES], we find another collection of random bits changed. Most of them are in hardware packets, and none of those seem to control clip space, either.

In addition to the packets that the userspace driver prepares for the hardware, userspace passes to the kernel a large block of render pass state describing everything from tile size to the depth/stencil buffers. Such a design is unusual. Ordinarily, GPU kernel drivers are only concerned with memory management and scheduling, remaining oblivious of 3D graphics. By contrast, Apple processes this state in the kernel forwarding the state to the GPU’s firmware to configure the actual hardware.

Comparing traces, the render pass “OpenGL mode” sets an unknown bit in this kernel-processed block. If we set the same bit in our OpenGL driver, we find the clip space changes to -1/1. Victory, right?

Almost. Because this bit is render pass state, we can’t use it to change the clip space between draws. That’s okay for baseline OpenGL and Vulkan, but it prevents us from efficiently implementing the ARB_clip_control and VK_EXT_depth_clip_control extensions. There are at least three (inefficient) implementations.

The first is ignoring the hardware support and emulating one of the clip spaces by inserting extra instructions into the vertex shader when the “wrong” clip space is used. In addition to extra overhead, that requires shader variants for the different clip spaces.

In new APIs like Vulkan, Metal, and D3D12, everything needed to compile a shader is known up-front as part of a monolithic pipeline. That means pipelines are compiled when they’re created, not when they’re used, and are never recompiled. By contrast, older APIs like OpenGL and D3D11 allow using the same shader with different API states, requiring some drivers to recompile shaders on the fly. Compiling shaders is slow, so shader variants can cause unpredictable drops in an application’s frame rate, familiar to desktop gamers as stuttering. If we use this approach in our OpenGL driver, switching clip modes could cause stuttering due to recompiling shaders. In bad circumstances, that stutter could even happen long after the mode is switched.

That option is undesirable, so the second approach is always inserting emulation instructions that read the desired clip space at run-time, reserving a uniform (push constant) for the transformation. That way, the same shader is usable with either clip space, eliminating shader variants. However, that has even higher overhead than the first method. If an application frequently changes clip spaces within a render pass, this approach will be the most efficient of the three. If it does not, this approach adds constant overhead to every application. Knowing which approach is better requires the driver to have a magic crystal ball.¹

The final option is using the hardware clip space bit and splitting the render pass when the clip space is changed. Here, the shaders are optimal and do not require variants. However, splitting the render pass wastes tremendous memory bandwidth if the application changes clip spaces frequently. Nevertheless, this approach has some support from the ARB_clip_control specification:

Each approach has trade-offs. For now, the easiest “option” is sticking our head in the sand and giving up on ARB_clip_control altogether. The OpenGL extension is optional until we get to OpenGL 4.5. Apple doesn’t implement it in their OpenGL stack. Because ARB_clip_control is primarily for porting Direct3D games, native OpenGL games are happy without it. Certainly, Neverball doesn’t mind. For now, we can use the hardware bit to use the -1/1 clip space unconditionally in OpenGL and 0/1 unconditionally in Vulkan. That does not require any emulation or flushing, though it prevents us from advertising the extensions.

That’s enough to run Neverball on macOS, using our userspace OpenGL driver in Mesa, and Apple’s proprietary kernel driver. There’s a catch: Neverball has to present with the deprecated X11 server on macOS. Years ago, Apple engineers² contributed Mesa support for X11 on macOS (XQuartz), allowing us to run X11 applications with our Mesa driver. However, there’s no support for Apple’s own Cocoa windowing system, meaning native macOS applications won’t work with our driver. There’s also no easy way to run Linux’s newer Wayland display server on macOS. Nevertheless, Neverball does not use Cocoa directly. Instead, it uses the cross-platform SDL2 library to create its window, which internally uses Cocoa, X11, or Wayland as appropriate for the operating system. With enough sweat and tears, we can build an macOS/X11 version of SDL2 and link Neverball with that.

This Neverball/macOS/X11 port was frustrating, especially when the game is one apt install away on Linux. That’s a job for Asahi Lina, who has been hard at work writing a Linux kernel driver for Apple’s GPU. When our work converges, my userspace Mesa driver will run on Linux with her kernel driver to implement a full open source graphics stack for 3D acceleration on Asahi Linux.

Please temper your expectations: even with hardware documentation, an optimized Vulkan driver stack (with enough features to layer OpenGL 4.6 with Zink) requires many years of full time work. At least for now, nobody is working on this driver full time³. Reverse-engineering slows the process considerably. We won’t be playing AAA games any time soon.

That said, thanks to the tremendous shared code in Mesa, a basic OpenGL driver is doable by a single person. I’m optimistic that we’ll have native OpenGL 2.1 in Asahi Linux by the end of the year. That’s enough to accelerate your desktop environment and browser. It’s also enough to play older games (like Neverball). Even without fancy features, GPU acceleration means smooth animations and better battery life.

This crystal ball is called “Vulkan, Metal, or D3D12”, and it has its own problems.↩︎
Hi Jeremy!↩︎
I work full-time at Collabora on my baby, the open source Panfrost driver for Mali GPUs.↩︎