Feo3boy
Background
So, I’m not really into GameBoy; I’ve never had a GameBoy and I never played any classic GameBoy games. But you know what I am into? Solving interesting programming problems. So when my friend Ciaran decided to try writing a GameBoy emulator for fun and education, I was interested in helping out.
He originally started the project in Python as pyGB, and I made a couple of contributions to how that emulator handled loading GameBoy cartridge ROM dumps, but we never got much farther than that, certainly not to anything resembling a functional emulator.
A few years later, Ciaran decided to learn some Rust, a language I had a fair amount of experience with, since I’ve been using it for hobby projects since around 2015. The new project is named feo3boy, since Rust devs just like giving things iron-themed names like ferrous.
The CPU
The CPU is the heart of the emulator. You need a lot of things to run a complete emulator with games and all, but to get bare minimum functionality for running any software at all, all you need is a CPU and some memory, and the memory can just be a byte array. A lot of popular systems to emulate, like the GameBoy, have various test suites that have been developed to check whether an emulator behaves correctly; generally, to run these you need a working CPU, and then you can add additional components as you go. For example, for our emulator, we use the Blargg test roms.
The CPU in the GameBoy is a modified Z80, which I’ll refer to as the GBZ80. The GBZ80 has two sets of opcodes: a base set of 244 opcodes, plus one prefix opcode that selects from a second set of 256 additional opcodes. It has seven 8-bit registers and a 4-bit flag register, and these can also be used as four two-register pairs. It supports mostly 8-bit operations, though some opcodes operate on 16 bits; these often take two cycles, since the system actually only has an 8-bit adder (it may only have a 4-bit adder internally, I’m not sure, but the smallest usable operation is 8 bits).
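To make that concrete, here is roughly how such a register file might be represented in Rust. This is just a sketch: the field names match the code snippets later in this post, but the real struct in feo3boy has more to it (and may represent the flags differently):

// Sketch of a GBZ80 register file. Field names follow the snippets later in
// this post; details (especially the flags representation) are illustrative.
#[derive(Debug, Default, Clone, Copy)]
struct Regs {
    acc: u8, // A, the accumulator
    b: u8,
    c: u8,
    d: u8,
    e: u8,
    h: u8,
    l: u8,
    flags: u8, // only the upper 4 bits (Z, N, H, C) are ever used
    pc: u16,
    sp: u16,
}

impl Regs {
    /// H and L together form the 16-bit HL pair, which doubles as a pointer.
    fn hl(&self) -> u16 {
        u16::from_be_bytes([self.h, self.l])
    }
}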
Opcode Decoding
The first thing I did was work on decoding the opcodes. My goal was to make a simple interpreter to execute opcodes as it found them. Now, you could just write every single one of the 245 opcodes with a giant match statement, and that would be the fastest way to do direct interpretation (foreshadowing), but it’s a huge pain to write. It would be much easier to divide opcodes into groups with common behavior that differ only by which registers they operate on, and then write the execution logic once per group.
I ended up dividing the opcodes into 27 groups. Still a lot, but far less than 245, and a much less daunting problem. To actually decode the opcodes, it turns out the bit pattern of the opcode has a roughly consistent set of fields you can divide it into, and there are some good resources on this for the Z80, for example here. Essentially, you treat each opcode as a series of bit fields and then match different fields to determine which opcode it is and which registers it operates on. The opcode decoder lives here.
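To give a feel for that style of decoding, here is a tiny sketch. The x/y/z field names come from the usual Z80 decoding write-ups rather than feo3boy’s actual decoder:

// Sketch of bit-field style opcode decoding (x/y/z names per the common
// Z80 decoding references, not feo3boy's code).
fn fields(opcode: u8) -> (u8, u8, u8) {
    let x = opcode >> 6;           // top two bits: broad operation group
    let y = (opcode >> 3) & 0b111; // often selects the destination register
    let z = opcode & 0b111;        // often selects the source register
    // For example, 0x46 is 0b01_000_110: x=1 is the 8-bit LD block, y=0 is B,
    // and z=6 is (HL), so 0x46 decodes to LD B,(HL).
    (x, y, z)
}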
Timing
Once you have decoded an opcode to figure out its parameters, you can write code to execute each operation type. However, how do you deal with timing? Different operations take different amounts of time, but also different parameters can affect how long an operation takes to run, depending on whether an operand is a register or a memory location.
The GameBoy’s clock runs at 4 MHz, but memory accesses take four clock cycles. Because of this, there’s a convention in GameBoy documentation where each tick of the 4 MHz clock is referred to as a “T-cycle” and every four cycles (so 1 MHz) is referred to as an “M-cycle”, presumably meaning “Memory Cycle”. Because instructions are all stored in memory, every instruction takes, at minimum, 1 M-cycle because of the instruction fetch time. The real CPU may actually have pipelining – for example, the actual “add” operation may happen while the next instruction is fetching, but that turns out not to matter for the behavior of the emulator.
Individual instructions take between 1 and 6 M-cycles, depending on what the operation does and whether its parameters are in registers or memory. Some operations, specifically conditional branches, also take a different amount of time depending on whether or not they branched. Now, if we only needed to model the CPU, it would be pretty trivial to just figure out how many cycles an operation took and step the clock by that much after interpreting the instruction. However, the GameBoy has a bunch of other components, such as the GPU and audio, that run concurrently with the CPU and work on the same memory, so I wanted to make sure that the emulator could perform that background processing whenever the CPU was waiting for an M-cycle for memory access.
Having the emulator pause in the middle of an instruction to do background processing is a bit tricky, especially since we were working entirely single-threaded. We’d like to call something like cpu.tick() and then have that yield control whenever it needs to wait for an M-cycle, but at that point the CPU may be holding intermediate values, e.g. from memory addresses it has already read.
For example, consider JP Z,u16. This instruction reads a 16-bit address from an immediate value (the next two bytes under the program counter, right after the opcode), then checks whether the zero flag is set to decide whether or not to write the address it just read into the program counter. This takes 3 M-cycles just to read in the opcode and then the two bytes of the destination address. After the first M-cycle the opcode is known, after the second M-cycle the lower byte of the address is known, and after the third cycle the upper byte of the address is known. Then, if we branch, we have to delay for yet another cycle. Across each subsequent yield we need the CPU to remember the data it loaded at the previous yield point until we get to the end and update the program counter (or don’t).
What we might like to have for this is a state machine (foreshadowing) which tracks where we are in the opcode and lets us store values as we progress. But to do that, we’d need to generate states for all the opcodes; not just every high level decoded operation but every step of every operation. That’s a lot of states to write, and a bit much to do by hand.
So instead, I came up with a simpler approach. I had the CPU’s tick function take a “context” type (which it also uses for accessing things like the system memory), and had that context include a method called yield1m() which just does all the necessary parallel processing to advance the emulator by 1 M-cycle. The trick is that instead of yielding out of the CPU code and letting the root of the emulator handle the parallel processing, we instead go down a program stack layer when we “yield” and use the normal program stack to store the intermediate state of the CPU. Very easy!
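The context ends up shaped roughly like this. This is a simplified sketch inferred from how the context is used in the snippets later in this post; the names and details of the real trait in feo3boy differ:

// Simplified sketch of the context trait, inferred from later snippets.
// The concrete Cpu and Memory types here are placeholders.
struct Cpu { /* registers, interrupt state, ... */ }
struct Memory { /* the mapped memory bus */ }

trait CpuContext {
    /// Access the CPU state (registers and so on).
    fn cpu(&self) -> &Cpu;
    fn cpu_mut(&mut self) -> &mut Cpu;
    /// Access the memory bus.
    fn mem(&self) -> &Memory;
    fn mem_mut(&mut self) -> &mut Memory;
    /// Run every other component (graphics, audio, timers, ...) forward by
    /// one M-cycle while the CPU waits on a memory access.
    fn yield1m(&mut self);
}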
A downside of this approach is that the timing of cpu.tick() is very inconsistent. It could run for just 1 M-cycle or 6 M-cycles depending on what program is being run. This also makes all other processing dependent on the CPU – from the caller’s perspective, you can’t pause at arbitrary ticks.
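To make that trade-off concrete, here is roughly what a conditional jump looks like under this scheme. This is a simplified sketch with made-up helper names and flag constants, not feo3boy’s actual code:

// Sketch of JP Z,u16 in the "yield goes down the stack" style. The helper
// read_immediate and the ZERO_FLAG constant are illustrative, not real API.
const ZERO_FLAG: u8 = 0x80; // bit 7 of the flags register

/// Fetch one immediate byte from under the program counter (1 M-cycle).
fn read_immediate(ctx: &mut impl CpuContext) -> u8 {
    let pc = ctx.cpu().regs.pc;
    ctx.cpu_mut().regs.pc = pc.wrapping_add(1);
    ctx.yield1m();
    ctx.mem().read(pc.into())
}

fn jp_z_u16(ctx: &mut impl CpuContext) {
    // By the time we get here, the opcode fetch already cost its own M-cycle.
    let low = read_immediate(ctx);
    let high = read_immediate(ctx);
    let target = u16::from_le_bytes([low, high]);
    if ctx.cpu().regs.flags & ZERO_FLAG != 0 {
        // Taken jumps spend one more M-cycle before the PC is updated.
        ctx.yield1m();
        ctx.cpu_mut().regs.pc = target;
    }
    // `low`, `high`, and `target` simply live on the Rust call stack across
    // the yields; no explicit state machine is needed.
}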
Execution
So once I had a strategy for controlling timing, I could build the actual execution logic. With a simple structure to hold the register values, I made a function that loads an instruction, decodes it into my Opcode type, and then does a match over that to figure out what to actually do.
The different operand types, such as load or store a register, load or store memory, etc., all implement their own timing behavior internally, which makes the logic for timing each instruction pretty easy. For example, here’s the 8-bit ld instruction, which can copy a value between a number of possible pairs of sources and destinations, including registers, immediates, and locations pointed to by 16-bit pointers:
/// Implements 8 bit load operations.
fn load8(ctx: &mut impl CpuContext, dest: Operand8, source: Operand8) {
    let val = source.read(ctx);
    trace!("Evaluating LD {},{} (<- {})", dest, source, val);
    dest.write(ctx, val);
}
CpuContext is the context I mentioned before. That’s the type that provides yield1m() in order to implement delays, as well as access to memory and the current register state of the CPU. The source and dest arguments might take a different amount of time to run depending on whether they’re registers (basically 0 time) or memory (takes 1 M-cycle), but the load8 function doesn’t have to care, because that’s handled by Operand8’s read and write methods. Here is what that looks like:
impl Operand8 {
    /// Read this operand from the CPU context, yielding for memory access if needed.
    pub(super) fn read(self, ctx: &mut impl CpuContext) -> u8 {
        trace!("Operand8::read {}", self);
        match self {
            Self::A => ctx.cpu().regs.acc,
            Self::B => ctx.cpu().regs.b,
            Self::C => ctx.cpu().regs.c,
            Self::D => ctx.cpu().regs.d,
            Self::E => ctx.cpu().regs.e,
            Self::H => ctx.cpu().regs.h,
            Self::L => ctx.cpu().regs.l,
            Self::AddrHL => {
                let addr = ctx.cpu().regs.hl();
                ctx.yield1m();
                ctx.mem().read(addr.into())
            }
            // Etc...
Notice that all the register accesses take no extra time, while AddrHL is a pointer dereference using the HL register pair as the pointer, and it includes a yield1m() call before accessing the memory location.
Once I had done this for every operation and operand type, I had a working CPU! And with just some minor fixes for silly mistakes, it passed all of the Blargg CPU tests.
Microcode?
So, I was never really satisfied with the way this executor operates. I really wanted to be able to single-step the CPU by 1 M-cycle at a time, pausing in the middle of instructions as necessary. But I also really didn’t want to have to figure out all of the necessary states for the state machine to make that work.
One of the problems is that it’s much harder to share states between instructions, since the next state generally depends on which instruction it is. For example, ld B,(HL) and ld C,(HL) both start by performing a yielding memory access reading from the address pointed to by HL, but then where the value gets stored depends on which instruction it is, so we can’t quite have a common state for ld *,(HL).
After not working on the emulator for about a year, I came back to it with an interesting new idea. What if, instead of developing unique state-machine states for every single instruction, I just used a stack within each instruction and broke the instructions down into a series of sub-instructions that each did a smaller part of it? I could then just pause between sub-instructions at any time and the state would be stored in the stack. This would work like a stack machine, where things like “read from memory” would always push a value onto the stack, and “write to memory” would read values off of the stack. Basically, I decided to put a lower-level virtual machine inside of my virtual machine. Whoops. Anyway, I called the sub-instructions Microcode.
The Microcode type I came up with has 37 variants, which is slightly more than the number of categories of decoded Opcodes I had before. Partly that’s because a number of opcodes are unique and need unique behavior baked into the microcode, partly because you need a bunch more utility operations such as stack manipulations for the microcode stack and jumps for jumping around within the microcode of a single instruction, and partly because I added extra features so that I could do things like implement the CPU’s interrupt handler in the microcode (which sometimes has to wait for M-cycles just like normal opcodes, and so needs to be accounted for in the microcode state machine), which required a bunch of extra specialty microcodes.
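To give a flavor of what those variants look like, here is an illustrative subset. The variant names mostly match ones that appear in the snippets later in this post, but the exact field types are my guesses:

// Illustrative subset of the Microcode variants (the real enum has 37).
// Reg8/Reg16 are the register-selector types used in the snippets below.
enum Reg8 { Acc, B, C, D, E, H, L }
enum Reg16 { HL /* , ... */ }

enum Microcode {
    /// Push the value of an 8-bit register onto the microcode stack.
    ReadReg8(Reg8),
    /// Push the value of a 16-bit register pair onto the microcode stack.
    ReadReg16(Reg16),
    /// Pop a 16-bit address and push the byte read from that address.
    ReadMem8,
    /// Duplicate the top `n` bytes of the microcode stack.
    Dup(usize),
    /// Swap the byte group on top of the stack with the group below it.
    Swap { top: usize, second: usize },
    /// Pop a u16, increment it, and push it back.
    Inc16,
    /// Pause for one M-cycle so the rest of the emulator can run.
    Yield,
    /// Fetch and decode the next opcode, continuing into its microcode.
    ParseOpcode,
    // ...plus ALU operations, writes, in-instruction jumps, and the specialty
    // operations used by the interrupt handler.
}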
I then wrote some code to go through every possible opcode value and decode it into an Opcode, which I then used to build unique microcode for that opcode. Basically, I was able to generate a unique microcode implementation for every opcode (so essentially with things like which registers it accesses hard-coded) by pattern matching the arguments in the Opcode type the same way I did when writing the previous direct executor. Let’s look at the same 8-bit load operations as before. Here’s how the microcode for them gets built:
Opcode::Load8 { dest, source } => MicrocodeBuilder::read(source).then_write(dest),
The read operation performs the appropriate type of read for the operand and pushes its value on the microcode stack. Then the write operation pops the value to write back off of the microcode stack and stores it at the location specified by that destination. But one trick here is that the two operands, dest and source, aren’t microcode instructions; they’re types that themselves break down into additional microcode instructions, very similar to how in the direct executor they would have a read method that performs the appropriate amount of yielding. Let’s look at the same one as before, Operand8’s behavior when asking it to read, but in microcode this time.
/// As a MicrocodeReadable, `Operand8` will result in a u8 being pushed onto the microcode
/// stack from the appropriate source.
impl MicrocodeReadable for Operand8 {
    fn to_read(self) -> MicrocodeBuilder {
        match self {
            Self::A => MicrocodeBuilder::read(Reg8::Acc),
            Self::B => MicrocodeBuilder::read(Reg8::B),
            Self::C => MicrocodeBuilder::read(Reg8::C),
            Self::D => MicrocodeBuilder::read(Reg8::D),
            Self::E => MicrocodeBuilder::read(Reg8::E),
            Self::H => MicrocodeBuilder::read(Reg8::H),
            Self::L => MicrocodeBuilder::read(Reg8::L),
            Self::AddrHL => MicrocodeBuilder::read(Reg16::HL).then_read(Mem8),
            // etc...
That’s very similar to what we saw with the Operand8.read operation before, but wait, what’s this MicrocodeReadable interface, and what are these Reg8, Reg16, and Mem8 things? And where does this code yield? There’s nowhere that tells the microcode to delay.
Well, MicrocodeReadable is a trait you can implement to make a type work as an argument to MicrocodeBuilder::read, and Reg8, Reg16, and Mem8 are all additional types that implement MicrocodeReadable. We actually have to go down another layer to get to the part that produces actual microcode. Here’s what each of those does when you try to read them:
impl MicrocodeReadable for Reg8 {
    fn to_read(self) -> MicrocodeBuilder {
        Microcode::ReadReg8(self).into()
    }
}

impl MicrocodeReadable for Mem8 {
    fn to_read(self) -> MicrocodeBuilder {
        MicrocodeBuilder::r#yield().then(Microcode::ReadMem8)
    }
}

impl MicrocodeReadable for Reg16 {
    fn to_read(self) -> MicrocodeBuilder {
        Microcode::ReadReg16(self).into()
    }
}
These are all pretty simple, but notice that Mem8 inserts the appropriate yield. The microcode ReadMem8 instruction only reads from memory; it does not have any inherent timing behavior. This MicrocodeBuilder and these helper types make writing out microcode instructions surprisingly easy! The only thing to watch out for is making sure you are very clear on what is in the stack where, which I generally did by spelling it out in comments on every line of code. For example, here’s the much more complicated procedure to read a 16-bit value from an address in memory:
/// When reading a u16 from memory onto the microcode stack, we start with the 16 bit
/// address of the lower byte already on the microcode stack, and want to end with the 16
/// bit value on the microcode stack in its place. We will read in the order lower byte
/// then higher byte.
impl MicrocodeReadable for Mem16 {
    fn to_read(self) -> MicrocodeBuilder {
        // stack: ...|al|ah|
        // Duplicate the address so we don't lose it when we do the first read.
        // stack: ...|al|ah|al|ah|
        MicrocodeBuilder::first(Microcode::Dup(2))
            // This issues the memory yield and grabs the lower order byte.
            // stack: ...|al|ah|vl|
            .then_read(Mem8)
            // Swap the newly read top byte with the original address.
            // stack: ...|vl|al|ah|
            .then(Microcode::Swap { top: 1, second: 2 })
            // Increment the address to get the second byte.
            // stack: ...|vl|AL|AH|
            .then(Microcode::Inc16)
            // This issues the memory yield and grabs the higher order byte.
            // Since the stack always uses lower order byte first, this leaves the two
            // bytes in the correct order on the microcode stack.
            // stack: ...|vl|vh|
            .then_read(Mem8)
    }
}
Everything proceeds like this. For example, reading an immediate value involves a series of steps to read and increment the program counter followed by a memory read from that address. In the end I don’t think the microcode executor ended up being all that much more complicated than the direct executor, as it actually follows much the same patterns (and uses the same Opcode and related types) for figuring out what has to happen for each instruction. And it passes the same Blargg CPU tests.
But it has one big problem compared to the original direct executor, which is that it’s extremely slow. For example, running the Blargg CPU test benchmarked as taking 2× as long on my machine, and for a simple GBZ80 program I wrote to compute all the 8-bit Fibonacci numbers, it takes a horrifying 8× as long. Totally unacceptable.
Code Generation
One thing the microcode does that simply writing the logic for each instruction in Rust doesn’t is provide a fairly high-level view of what each instruction’s behavior is, in a format that’s easy to manipulate in code. Sure, you can use proc macros to read Rust code and try to figure out what it’s supposed to do, but Rust has a lot of things it can do, so writing a macro that, say, reads a Rust function defining the behavior for a GBZ80 opcode and splits it up based on where it yields, saving relevant variables, is quite difficult, and even more difficult if you want to do it in a way that’s robust to arbitrary code changes.
On the other hand, the microcode has a very narrow set of 37 things it can do, so reasoning about it for code generation is much, much easier. So I decided to take the microcode that I had, and build a procedural macro to transform it into an executor.
Now, my goal was to generate a state-machine executor, so I could retain the pause-anywhere behavior of the microcode with much less performance hit, but first I decided to try something a little more straightforward: generating a direct executor that doesn’t have any opcode decoding or extra branches.
The original direct executor is not really that optimized for performance. Opcode decoding has a bunch of branches, and then after you decode the opcode you have to branch again not just based on the decoded Opcode but also on various parameters to it, such as its operands and which ALU operation it uses (add vs sub, etc). For maximum performance, it would be better to have a separate function for every single opcode, which is compiled to optimized code that already knows the exact operands it should use. So instead of having a common load8 that then has to branch on its operands, you would have special code just for e.g. 0x42 which already knows that its operands are B and D.
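In other words, the generated code for 0x42 (LD B,D) should boil down to something like this. This is illustrative of the goal, not the macro’s literal output:

// Roughly the shape of generated code for opcode 0x42 (LD B,D): no decoding,
// no operand branching, and no yield since no memory is touched.
fn opcode_0x42(ctx: &mut impl CpuContext) {
    let val = ctx.cpu().regs.d;
    ctx.cpu_mut().regs.b = val;
}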
To facilitate both my direct executor v2 and the future generated state machine executor, I divided microcodes up into two kinds. Microcode interns are microcode operations defined by a pure function with no side effects. They are defined as taking their arguments off of the microcode stack and pushing their return values onto the microcode stack. Microcode externs are any microcodes that operate on anything outside the microcode stack. This includes memory and registers, as well as special operations like yield.
The microcode interns could literally be defined by simple functions which would be independent of which generated executor they were for, and then have code generation just figure out how to plug in the arguments. In fact, the microcode interns defined the behaviors of things like the ALU so well that I ended up refactoring to re-use them even in the original direct executor, to avoid duplicating logic for e.g. figuring out which flags to set during addition or subtraction. For example, here’s what the original direct executor’s add implementation looks like today:
match *self {
    AluOp::Add => {
        let (res, flags) = microcode::defs::add(ctx.cpu().regs.acc, arg);
        ctx.cpu_mut().regs.acc = res;
        ctx.cpu_mut().regs.flags = flags;
    }
    // ...other ALU operations elided.
It still accesses registers directly just like it did before microcode, but the actual behavior of GameBoy’s add operation is defined by using the microcode interns as a common library.
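For illustration, an intern like that add function is just a pure computation over its inputs. A sketch might look like this, though the real feo3boy version may use a dedicated flags type rather than a bare u8:

// Sketch of a microcode "intern": a pure function from inputs to outputs,
// with the GameBoy's ADD flag rules baked in. Flags as u8 is an assumption.
fn add(a: u8, b: u8) -> (u8, u8) {
    let (res, carry) = a.overflowing_add(b);
    let mut flags = 0u8;
    if res == 0 {
        flags |= 0x80; // Z: result is zero
    }
    // N (0x40) stays clear for additions.
    if (a & 0x0f) + (b & 0x0f) > 0x0f {
        flags |= 0x20; // H: carry out of bit 3
    }
    if carry {
        flags |= 0x10; // C: carry out of bit 7
    }
    (res, flags)
}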
Once I had the microcode interns defined, I had to figure out how to deal with the microcode stack. Part of the optimization I would do in my generated direct executor v2 would actually be to never have an actual microcode stack. Instead, what I did was go through the microcode at compile time, including visiting all branches, and perform variable assignment based on what values would be on the microcode stack at that point.
Fortunately the microcode only has forward jumps, not backward jumps, so it can only branch, not loop. That made the variable assignment relatively easy, as I could essentially do single assignment of a variable for each slot in the microcode stack, but I still had to deal with type conversion, since some microcode instructions would break up 16-bit values into 8-bit values or combine pairs of 8-bit values into 16-bit values. So it took a bit of work to handle calculating the types expected by each microcode instruction and converting the values on the “stack” to the correct types.
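Concretely, for something like LD B,(HL) the generated direct code ends up with each stack slot turned into a plain local variable, roughly like this sketch (the variable names are made up; the real generated code differs):

// Sketch of the kind of code the v2 generator effectively produces for
// LD B,(HL) (opcode 0x46): each microcode stack slot becomes a local variable.
fn opcode_0x46(ctx: &mut impl CpuContext) {
    let v0: u16 = ctx.cpu().regs.hl();       // ReadReg16(HL) pushes the address
    ctx.yield1m();                           // the yield before the memory access
    let v1: u8 = ctx.mem().read(v0.into());  // ReadMem8 pops the address, pushes the byte
    ctx.cpu_mut().regs.b = v1;               // the register write pops the byte
}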
Another complication was dealing with branching, since it was possible for a branch to leave values on the stack. I ended up solving this by just constraining branches. To make it easier to generate conditionals, I required that branches in microcode fully enclose the code in the branch, like a real if statement, so I wouldn’t ever have to deal with interleaved gotos. I then also required that all branches end with the same number of bytes on the microcode stack no matter which branch was taken, so that it was trivial to turn the stack contents into a set of variable assignments with at most a few extra type conversions. Fortunately these restrictions were pretty trivial to add, because I hadn’t needed to write any particularly weird behavior for any of the microcode instructions, and this didn’t need to be a fully general programming language.
After that it was pretty easy to just generate a match over all the opcodes and fill in calls to the microcode interns with the computed variable assignments, and replace the externs with calls to another set of functions that I wrote to define the extern behavior in the new direct executor.
I then went to work on the generated state machine for the second generated executor. I used very similar logic for variable assignment as in the first generated executor; the only difference was that whenever I encountered a yield, I would wrap up all the variables currently on the microcode stack into a state and define what came after as a separate code block. One complication here was that sometimes a yield would occur inside of a branch, and I wanted to be able to generate code that could branch and yield but end up back at a shared state later, rather than having the code diverge permanently whenever a yield happened in a branch. What I ended up doing was having states only for yields, so if the microcode yields in a branch, the code thereafter diverges until the next yield on the common path. For example, consider a code path like this pseudocode:
let x = next_immediate();
if x == 1 {
    yield;
}
x += 1;
yield;
store(x);
This would be broken into 3 states like this (again pseudocode):
state 1:
    let x = next_immediate();
    if x == 1 {
        yield to state 2;
    }
    x += 1;
    yield to state 3;
state 2:
    x += 1;
    yield to state 3;
state 3:
    store(x);
Notice that part of the common path is duplicated between two different states; x += 1 appears in both state 1 and state 2, which is an artifact of only generating new states on a yield.
Interestingly, no state in the state machine is needed for the first step of each instruction, because that part is directly executed by the microcode instruction ParseOpcode, which really cuts down on the number of generated states. Many of the ALU operations end up not having any states at all, since operations involving only registers don’t need to yield, so their code ends up nearly identical to their code in direct executor v2. Because so many opcodes don’t need states, we end up having only 274 unique states for the roughly 500 base and prefix opcodes, plus internal CPU behavior like interrupt handling.
Both of these executors work and pass the Blargg tests, and what’s more, they perform great. The direct executor v2 shaves more than a second off of the Blargg CPU test, and even the state-machine-based executor still matches the performance of the original opcode-decoding direct executor. On the Fibonacci computation, the stepping executor even runs faster than the original direct executor.
fibonacci-direct time: [5.9980 µs 6.0445 µs 6.0984 µs]
fibonacci-microcode time: [48.740 µs 48.786 µs 48.837 µs]
fibonacci-direct-v2 time: [4.1360 µs 4.1448 µs 4.1545 µs]
fibonacci-stepping time: [5.0310 µs 5.0362 µs 5.0418 µs]
cpu_instrs-direct time: [9.0123 s 9.0247 s 9.0385 s]
cpu_instrs-microcode time: [18.893 s 18.965 s 19.051 s]
cpu_instrs-direct-v2 time: [7.9645 s 7.9748 s 7.9855 s]
cpu_instrs-stepping time: [9.1579 s 9.1701 s 9.1811 s]
The Web Debugger
While working on the emulator, I found it was difficult to track where things were going wrong. Emulator bugs were often subtle and wouldn’t cause obvious issues until much later. For example, one bug that made it into the first version of the direct executor was a mistake where jp hl (jump to the address given by register pair HL) would set the stack pointer rather than the program counter, which of course doesn’t crash the emulator, though it obviously makes the emulated program do a wildly incorrect thing.
Because the emulator could keep running even if the program inside it was behaving incorrectly, there wouldn’t be any kind of error or crash at the emulator level to use to identify where the inner program had gone wrong, and this made it relatively difficult to use something like GDB to debug it. There would not be any errors to direct GDB towards, and GDB breakpoints would be set at locations in the emulator code rather than the emulated code, meaning they would generally be hit constantly or never.
What would be more helpful would be the ability to peer into the state of the program at the level of the emulator, and set breakpoints in the emulated code rather than in the emulator code. To allow me to do this, I decided to write an application-specific debugger interface.
My debugger was a web interface using Yew, a React-like library for writing web apps in Rust running on WebAssembly, which I had used previously for my Satisfactory Accounting project. Because Yew is a Rust library, it was pretty easy to just load feo3boy into it as a library to implement everything I needed, such as single stepping or stepping until reaching a particular program counter value. Here’s what that looks like:
Conclusion
Those were my main contributions to the project. Ciaran wrote the graphics, audio, and input processing code for the emulator, and while there are still some significant bugs, especially in graphics, it is in a state where it can run some basic games, which is pretty cool.