Dissecting the Apple M1 GPU, part II

22 Jan 2021

Less than a month ago, I began investigating the Apple M1 GPU in hopes of developing a free and open-source driver. This week, I’ve reached a second milestone: drawing a triangle with my own open-source code. The vertex and fragment shaders are handwritten in machine code, and I interface with the hardware via the IOKit kernel driver in an identical fashion to the system’s Metal userspace driver.

A triangle rendered on the M1 with open-source code

The bulk of the new code is responsible for constructing the various command buffers and descriptors resident in shared memory, used to control the GPU’s behaviour. Any state accessible from Metal corresponds to bits in these buffers, so understanding them will be the next major task. So far, I have focused less on the content and more on the connections between them. In particular, the structures contain pointers to one another, sometimes nested multiple layers deep. The bring-up process for the project’s triangle provides a bird’s eye view of how all these disparate pieces in memory fit together.

As an example, the application-provided vertex data are in their own buffers. An internal table in yet another buffer points each of these vertex buffers. That internal table is passed directly as input to the vertex shader, specified in another buffer. That description of the vertex shader, including the address of the code in executable memory, is pointed to by another buffer, itself referenced from the main command buffer, which is referenced by a handle in the IOKit call to submit a command buffer. Whew!

In other words, the demo code is not yet intended to demonstrate an understanding of the fine-grained details of the command buffers, but rather to demonstrate there is “nothing missing”. Since GPU virtual addresses change from run to run, the demo validates that all of the pointers required are identified and can be relocated freely in memory using our own (trivial) allocator. As there is a bit of “magic” around memory and command buffer allocation on macOS, having this code working at an early stage gives peace of mind going forward.

I employed a piecemeal bring-up process. Since my IOKit wrapper exists in the same address space as the Metal application, the wrapper may modify command buffers just before submission to the GPU. As an early “hello world”, I identified the encoding of the render target’s clear colour in memory, and demonstrated that I could modify the colour as I pleased. Similarly, while learning about the instruction set to bring up the disassembler, I replaced shaders with handwritten equivalents and confirmed I could execute code on the GPU, provided I wrote out the machine code. But it’s not necessary to stop at these “leaf nodes” of the system; after modifying the shader code, I tried uploading shader code to a different part of the executable buffer while modifying the command buffer’s pointer to the code to compensate. After that, I could try uploading the commands for the shader myself. Iterating in this fashion, I could build up every structure needed while testing each in isolation.

Despite curveballs, this procedure worked out far better than the alternative of jumping straight to constructing buffers, perhaps via a “replay”. I had used that alternate technique to bring-up Mali a few years back, but it comes with the substantial drawback of fiendishly difficult debugging. If there is a single typo in five hundred lines of magic numbers, there would be no feedback, except an error from the GPU. However, by working one bit at a time, errors could be pinpointed and fixed immediately, providing a faster turn around time and a more pleasant bring-up experience.

But curveballs there were! My momentary elation at modifying the clear colours disappeared when I attempted to allocate a buffer for the colours. Despite encoding the same bits as before, the GPU would fail to clear correctly. Wondering if there was something wrong with the way I modified the pointer, I tried placing the colour in an unused part of memory that was already created by the Metal driver – that worked. The contents were the same, the way I modified the pointers was the same, but somehow the GPU didn’t like my memory allocation. I wondered if there was something wrong with the way I allocated memory, but the arguments I used to invoke the memory allocation IOKit call were bit-identical to those used by Metal, as confirmed by wrap. My last-ditch effort was checking if GPU memory had to be mapped explicitly via some side channel, like the mmap system call. IOKit does feature a device-independent memory map call, but no amount of fortified tracing found any evidence of side-channel system call mappings.

Trouble was brewing. Feeling delirious after so much time chasing an “impossible” bug, I wondered if there wasn’t something “magic” in the system call… but rather in the GPU memory itself. It was a silly theory since it produces a serious chicken-and-egg problem if true: if a GPU allocation has to be blessed by another GPU allocation, who blesses the first allocation?

But feeling silly and perhaps desperate, I pressed forward to test the theory by inserting a memory allocation call in the middle of the application flow, such that every subsequent allocation would be at a different address. Dumping GPU memory before and after this change and checking for differences revealed my first horror: an auxiliary buffer in GPU memory tracked all of the required allocations. In particular, I noticed values in this buffer increasing by one at a predictable offset (every 0x40 bytes), suggesting that the buffer contained an array of handles to allocations. Indeed, these values corresponded exactly to handles returned from the kernel on GPU memory allocation calls.

Putting aside the obvious problems with this theory, I tested it anyway, modifying this table to include an extra entry at the end with the handle of my new allocation, and modifying the header data structure to bump the number of entries by one. Still no dice. Discouraging as it was, that did not sink the theory entirely. In fact, I noticed something peculiar about the entries: contrary to what I thought, not all of them corresponded to valid handles. No, all but the last entry were valid. The handles from the kernel are 1-indexed, yet in each memory dump, the final handle was always 0, nonexistent. Perhaps this acts as a sentinel value, analogous to NULL-terminated strings in C. That explanation begs the question of why? If the header already contains a count of entries, a sentinel value is redundant.

I pressed on. Instead of adding on an extra entry with my handle, I copied the last entry n to the extra entry n + 1 and overwrote the (now second to last) entry n with the new handle.

Suddenly my clear colour showed up.

Is the mystery solved? I got the code working, so in some sense, the answer must be yes. But this is hardly a satisfying explanation; at every step, the unlikely solution only raises more questions. The chicken-and-egg problem is the easiest to resolve: this mapping table, along with the root command buffer, is allocated via a special IOKit selector independent from the general buffer allocation, and the handle to the mapping table is passed along with the submit command buffer selector. Further, the idea of passing required handles with command buffer submission is not unheard of; a similar mechanism is used on mainline Linux drivers. Nevertheless, the rationale for using 64-byte table entries in shared memory, as opposed to a simple CPU-side array, remains totally elusive.

Putting memory allocation woes behind me, the road ahead was not without bumps (and potholes), but with patience, I iterated until I had constructed the entirety of GPU memory myself in parallel to Metal, relying on the proprietary userspace only to initialize the device. Finally, all that remained was a leap of faith to kick off the IOKit handshake myself, and I had my first triangle.

These changes amount to around 1700 lines of code since the last blog post, available on GitHub. I’ve pieced together a simple demo animating a triangle with the GPU on-screen. The window system integration is effectively nonexistent at this point: XQuartz is required and detiling the (64x64 Morton-order interleaved) framebuffer occurs in software with naive scalar code. Nevertheless, the M1’s CPU is more than fast enough to cope.

Now that each part of the userspace driver is bootstrapped, going forward we can iterate on the instruction set and the command buffers in isolation. We can tease apart the little details and bit-by-bit transform the code from hundreds of inexplicable magic constants to a real driver. Onwards!

Back to home