Deconstructing Mac's M2 Architecture: Part 0 - This Isn't Gonna Be Easy
When I said this wasn't gonna be easy, I really meant that this shit is HARD AS FUCK!!!!
Prior to getting to the fun aspects of working with hardware directly, I must give some context on how this project began.
One of the most infuriating things as a programmer is dealing with the layers of abstraction in which functions live. There are plenty of libraries that make it easy to understand how code works; and to be fair, rewriting every single component is some type of Jonathan Blow shit that is incredibly rewarding, but a fucking pain in the ass to achieve. Where I feel his advice most applies is when there's a genuine need to redesign code from the ground up (or at least to the limit of what is possible given modern computer design). Not for the sake of “I can do it better”, but because I want something more readable than the most insufferable of Ayn Rand literature. A better way to phrase what I mean is this: “I hate abstraction that has about 50 layers of bullshit, where I had no say in how it got written”. Nobody knows more about terrible-ass esoteric function naming than Apple. When the Mac's codebase involves two operating systems (macOS and Apple RTK) and decades of accumulated functionality, you must live with some layer of abstraction or else it's a nightmare to work with.
Here's my problem with Apple's lowest-level exposed code being the primary means of doing anything: it becomes more of a nightmare to deal with the more foundational you go. Rendering is done through Metal, Apple's graphics and compute API for its own silicon. Windowing is done through Cocoa or SwiftUI, which make building Mac applications easy. These tend to be much easier to read and communicate with. But if you peel back the layers to see what Cocoa is built upon, you will stumble upon the nightmare code that helped make Cocoa. These are the most barebones, foundational calls you can make to the operating system without directly talking to it, and for good reason. Tools like CoreGraphics (a CoreFoundation-style C API) have a key problem, though: their method of abstraction is even harder to understand and work with. Documentation and discussions surrounding it involve far more digging, usually turning up outdated advice written for C++03.
As cool as using Quartz for rendering can be (skipping Metal because it's too high level), and as useful as AppKit can be, their schema is more frustrating to understand than just writing your own. This takes me back to the problem ThePrimeagen has rightfully pointed out: people will build a whole layer of abstraction over abstraction to accomplish a task that would have been better done by just learning the initial point of abstraction. Most of the libraries that exist now are built upon Cocoa and Metal (or some rendering alternative like OpenGL or Vulkan), where function calls are wrapped in their own abstraction that solves NOTHING.
Now, while it is much easier to take a non-abstract approach when working with vanilla JavaScript, when it comes to C and C++, adopting a Jonathan Blow-like approach is hell unless you are working on Linux or creating your own OS. And this is because Microsoft and Apple are security behemoths that stay on top of their game to prevent malicious people from making use of exploits. Linux has fewer problems with accessing the framebuffer directly and drawing to it, but it is still a hassle.
Apple is very picky about how people access the hardware directly, which is why hardcoding memory addresses is bound to fail. This is due to a system known as ASLR (Address Space Layout Randomization), which randomizes where everything truly lives in memory on every launch, so that a user cannot exploit their way into, say, drawing a window with no ability to close it for all eternity. However, accessing addresses is fine so long as they are relative to a pointer or allocated buffer you are referencing, because then you are using what is known as Position Independent Code. This IS allowed by Apple, and it is one of the only ways you can talk to the hardware directly without sacrificing portability. So you don't need to be concerned about whether your memory addresses stay correct throughout, since whatever offset ASLR applied is absorbed by the base pointer, independent of the code you've just written.
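To make that concrete, here's a minimal C sketch of the difference (the absolute address is made up purely for illustration):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Under ASLR, an absolute address like this lands somewhere different
     * on every launch, so poking it directly is doomed (hypothetical value):
     * volatile uint32_t *reg = (uint32_t *)0x104F2A000;  // don't do this */

    /* Base pointer + fixed offset works everywhere, because the pointer
     * absorbs whatever slide ASLR applied. This is the position-independent
     * style that stays portable. */
    uint32_t *buf = malloc(64 * sizeof *buf);
    if (!buf)
        return 1;
    buf[5] = 0xAA;  /* offset 5 from wherever the buffer actually landed */
    printf("%p holds 0x%X\n", (void *)&buf[5], buf[5]);
    free(buf);
    return 0;
}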
This is a good sign that we can work with hardware directly, given how Apple goes about doing it. With all this background on how things work under the hood, we can now jump into where my own interest comes into play.
My interest in working as low level as possible begins with the work of Alyssa Rosenzweig, who has an amazing repo dedicated to the development of an open-source M1 GPU driver. Her work is essential for a project like this to even start, and without her beautifully designed groundwork, doing this would be near impossible. Much of what was needed were the initial opcodes, because these values are too difficult to just guess. Remember how I told you about ASLR letting memory locations jump around? Attempting to scan for memory addresses isn't feasible, since you'd be playing a fun game of ‘Guess Who’ (Memory Address edition). Having a good source on memory locations saves us from one of the biggest ass-headaches of all: spending a year searching for valid registers. Hopefully you now understand why her project tackled one of the biggest problems related to coding against hardware.
To begin, we start by constructing an opcode table for the sake of testing registers. What we need to do is throw some test code into the register, make sure it is valid data that can be converted to constants via shift operations, and then hope we've chosen the right numbers/data to get a valid constant value back. This involves a decent amount of math, because depending on the data size you are throwing into the register, it needs to be decoded using bitwise shifts and kept within the bounds of what the memory location will recognize (that's the simplest explanation I can give).
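For a sense of shape, here's a hypothetical slice of what such a table looks like; the entry layout and names are my own illustration, not Apple's encoding or the repo's exact structs:

#include <stdint.h>

struct opcode_entry {
    uint8_t opcode;        /* leading byte identifying the instruction */
    unsigned length;       /* total instruction length in bytes */
    const char *mnemonic;  /* what the disassembler prints */
};

static const struct opcode_entry opcode_table[] = {
    /* 0xAA is the leading byte of the 6-byte fadd.32 test dump shown below */
    { 0xAA, 6, "fadd.32" },
};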
I have these function calls in a mini terminal designed explicitly to run these hardware checkpoint tests, where each one gets more and more difficult in terms of task (a minimal sketch of the dispatch loop follows the list):
Command options:
a - Runs agx_disassemble_instr
b - Runs agx_print_fadd_f32
c - Runs agx_print_ld_compute
d - Runs agx_print_src
e - Runs agx_print_float_src
f - Runs agx_instr_bytes
z - Runs agx_disassemble
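Here's a minimal sketch of that dispatch loop; the real terminal lives in the repo linked at the end, and the puts() placeholders stand in for the actual agx_* calls:

#include <stdio.h>

int main(void)
{
    int c;

    printf("checkpoint> ");
    fflush(stdout);
    while ((c = getchar()) != EOF) {
        switch (c) {
        case 'a': puts("running agx_disassemble_instr"); break;
        case 'b': puts("running agx_print_fadd_f32");    break;
        case 'c': puts("running agx_print_ld_compute");  break;
        case 'd': puts("running agx_print_src");         break;
        case 'e': puts("running agx_print_float_src");   break;
        case 'f': puts("running agx_instr_bytes");       break;
        case 'z': puts("running agx_disassemble");       break;
        case '\n': printf("checkpoint> "); fflush(stdout); break;
        default: break;
        }
    }
    return 0;
}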
The first instruction is the simplest request of all time: take one chunk of instructions and do something with it (proper disassembly). The disassemble instruction helps with checking for invalid data, so that if I need to diagnose a particular problem, I just change the first value set used for calculations in my test_code array.
There's quite a lot that has to be done to see exactly how data gets packed and decoded, which means the debugging process is everything. And so began the process of chucking my own test data into the first register and HOPING it translated back into recognized data. It's really easy to tell if the data was processed correctly, since you are looking for an Assembly output that tells us whether or not our approach has given us the numbers we should expect. So if we are looking for a subtraction of 5 from 2 (a function I could write and test), we'd expect a negative number, and our output will tell us if we have a negative number. If we need a constant, our program will tell us if we have a constant. This is where it's a true test of talking to hardware, and hence, here was my first attempt at checking whether or not ‘a’ worked:
# AA 81 50 A0 0A 00
+fadd.32 w0// Raw bytes: AA 81 50 A0 0A 00
// SRC0 Bytes: [2]=0x50, [3]=0xA0
// SRC0 Packed: 0x50A
// SRC0 Decoded: type=1, reg=16, size32=1, abs=0, neg=1, unk=0
, wunk1:16.neg // Value is UNKNOWN. s.type == 1
// SRC1 Bytes: [3]=0xA0, [4]=0x0A
// SRC1 Packed: 0x00A
// SRC1 Decoded: type=0, reg=0, size32=1, abs=0, neg=1, unk=0
Now, this result just means that the blocks of code I attempted to put in my register are not valid for any recognized type, since we require our struct's type field to be set to 2. If I have a 0, this is considered an immediate value (i.e., it's just data from some Assembly opcode instruction, relegated to only that instruction; it isn't data officially passed to the register). If I have a 1, this is an unknown data type. If I have a 3, this is an empty string. Anything beyond that falls outside the scope. Getting something valid requires a careful choice of what values I pass to the hardware itself.
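Here's a sketch of that source decode, with the bit layout inferred purely from the packed/decoded pairs printed above, so treat it as my working guess rather than Apple's documented encoding:

#include <stdbool.h>
#include <stdint.h>

struct src_fields {
    unsigned type;  /* 0 = immediate, 1 = unknown, 2 = constant, 3 = empty */
    unsigned reg;
    bool size32, abs, neg, unk;
};

static struct src_fields decode_src(uint16_t packed)
{
    struct src_fields s = {
        .type   = (packed >> 10) & 0x3,   /* bits [11:10] */
        .reg    = (packed >> 4)  & 0x3F,  /* bits [9:4]   */
        .size32 = (packed >> 3)  & 1,
        .abs    = (packed >> 2)  & 1,
        .neg    = (packed >> 1)  & 1,
        .unk    =  packed        & 1,
    };
    return s;
}

/* decode_src(0x50A) gives type=1, reg=16, size32=1, neg=1 (the failure above);
 * decode_src(0x850) gives type=2, reg=5 -- the hconst_5 we're after. */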
This isn't meant to be an Assembly tutorial on how bit data gets decoded into valid numbers (and why it requires a certain size to avoid being read as opcode instructions); just note that there is an annoying-ass amount of steps involved in working with various bit sizes on Apple hardware. For example, the register I am using accepts 12-bit data, so it's a matter of extracting that data and chopping off the bits we don't need (bottom/top surgery) to make sure we get a valid constant. Failure to do this results in failing to properly decode the data. It's fucking annoying, but this persistence is important to talking to hardware.
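As a concrete example, here's how the 12-bit packed values in the dumps fall out of the raw instruction bytes; this mirrors the byte indices the output prints, not any official spec:

#include <stdint.h>

/* SRC0 straddles bytes [2] and [3]: all of code[2] plus the top nibble
 * of code[3]. For AA 81 85 08 A0: (0x85 << 4) | (0x08 >> 4) = 0x850. */
static uint16_t pack_src0(const uint8_t *code)
{
    return (uint16_t)((code[2] << 4) | (code[3] >> 4));
}

/* SRC1 takes the bottom nibble of code[3] plus all of code[4]:
 * ((0x08 & 0xF) << 8) | 0xA0 = 0x8A0. */
static uint16_t pack_src1(const uint8_t *code)
{
    return (uint16_t)(((code[3] & 0xF) << 8) | code[4]);
}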
Once we DO find the data to shove into the register, it SHOULD decode into constant values, and success looks like this:
+fadd.32 w0// Raw bytes: AA 81 85 08 A0
// SRC0 Bytes: [2]=0x85, [3]=0x08
// SRC0 Packed: 0x850
// SRC0 Decoded: type=2, reg=5, size32=0, abs=0, neg=0, unk=0
, hconst_5 // s.type set to 2: this is ONLY valid
// SRC1 Bytes: [3]=0x08, [4]=0xA0
// SRC1 Packed: 0x8A0
// SRC1 Decoded: type=2, reg=10, size32=0, abs=0, neg=0, unk=0
, hconst_10 // s.type set to 2: this is ONLY valid
Now, this is only ONE of the 7 required operations to make sure we can decode and encode data properly for Apple's registers. Each operation is trickier than the last, meaning it's going to require a lot of debugging to take the code further. The first test is one of the hardest to solve, given that you are trying to figure out what data you need to give the hardware so that everything can talk together.
But since I've completed the first task of deconstructing a small bit of data into valid float numbers (which is where hconst_5 and hconst_10 come from), this is where we will stop for this entry in the series. As you can tell, this stuff is incredibly tricky to deal with. Even if you look at the documentation and commits dedicated to the first open-source M1 graphics driver, Alyssa and the team still had to use SOME of the foundational elements Apple leaves exposed, given how tricky it would be to reverse engineer ALL of what's needed for basic windowing.
If you’d like to follow along as I continue to work on this mess of a project, here is the link to the GitHub repo:
https://github.com/ryanwisemanmusic/TheLilySpark
Til then, happy coding!