State machines all the way down

Introduction

Somewhere in the last 30 years, building a CPU went from being something only an elite few could do to something anyone with a bit of free time could do. These days we are pretty much measuring in "CPUs per square millimetre", people compete for how many CPU cores they can fit in a FPGA (hint: thousands, with the CoreScore benchmark), and there are games taking you from relays to programming languages (inspired by courses and books taking you from NAND to Tetris).

The Zilog Z80 CPU in my first computer seemed magical technology at the time, but last year I implemented (a subset) of a Z80 CPU in Python. As but one part of an extended Rube Goldberg Machine.

The Challenge

One of the great things to come out of the 2020 Covid-19 lockdowns were entirely online conferences like ComfyCon, and PyConlineAU. These looked for ways of having "audience participation" while being entirely remote: ComfyCon had active sharing of pet pictures, home labs, etc; and PyConlineAU 2020 had an entire day -- Sunday -- dedicated to social events.

The PyConlineAU 2020 "Rube Codeberg" competition came out on the Sunday "audience participation" day, when Lilly Ryan challenged attendees to "unnecessarily elaborately" engineer a piece of Python code which:

was written in Python 3, up to 200 lines
ran successful to print out "Hello World!" somehow
used the Beautiful Soup Python library
was "understandable/easy to parse by humans", and yet ridiculously over engineered a la Rube Goldberg machines.

I did not end up writing something on the conference Sunday, as there were many other things to do that day, but since I had a great idea for what I "should have written", the following weekend I sat down for a few hours and actually implemented my solution, in Python 3.

In addition to the conference rules, I also:

wanted the text "Hello World!" to not appear anywhere in the program; and
wanted to use the (at the time not built into Python) Python switch/case implementation I had been fiddling with during PyConlineAU 2020, inspired by the talk and blog post by Christopher Neugebauer.

My solution

My brain decided the best way to combine all of this was to implement the program to print "Hello World!" in another language, and then implement a Python 3 program that could execute that other language implementation. Obviously 200 lines of "easy to parse by humans" Python code is not very much to implement an entire language in, so it was clear the best way to proceed was to implement something which was drive by byte codes. I briefly considered transpiling the language into Python byte codes, but I did not know the Python byte code implementation... which is when I realised the byte code implementation I knew best was the Z80 CPU. (I spent several years writing Z80 assembler, in the late 1980s and early 1990s, including a terminal emulation program.)

So that is how I ended up implementing (part of) a Z80 CPU, in Python 3, as a side quest of another project.

Obviously if you run the program it prints out "Hello World!" (assuming you have the required Beautiful Soup module installed from PyPI):

$ python3 hello
Hello World!
$

but the Python program does not contain that text anywhere within it, and the output takes a non-trivial amount of time to arrive:

$ time python3 hello
Hello World!

real   0m0.242s
user   0m0.199s
sys    0m0.032s
$

A quarter of a second might not seem long by historical standards of the Z80, but it is eons of time for a modern CPU.

One gets a hint at how much additional hidden depth is there by asking for --help:

$ python3 hello --help
usage: hello [-h] [--asm] [--bin BIN] [--html]

optional arguments:
  -h, --help  show this help message and exit
  --asm       Output assembly
  --bin BIN   Output binary to file
  --html      Output HTML
$

The assembly option gives you Z80 source code that can be assembled and run on a Z80 system:

$ python3 hello --asm
.ORG    0x100
        LD HL,text
        LD DE,key
decode: LD A,(HL)
        OR A
        JR Z,print
        LD B,A
        LD A,(DE)
        XOR B
        LD (HL),A
        INC DE
        INC HL
        JR decode
print:  LD HL,text
pc:     LD A,(HL)
        OR A
        JR Z, done
        LD E,A
        LD C,2
        PUSH HL
        CALL 5
        POP HL
        INC HL
        JR pc
done:   RST 0

text:   DB 0x18, 0x1c, 0x2f, 0x03, 0x01, 0x4c, 0x3e

        DB 0x01, 0x17, 0x4c, 0x56, 0x11, 0x3f, 0x3a

        DB 00
key:    DB 0x50, 0x79, 0x43, 0x6f, 0x6e, 0x6c, 0x69

        DB 0x6e, 0x65, 0x20, 0x32, 0x30, 0x32, 0x30

        DB 00
$

in particular a Z80 CP/M system. But you do not actually have to do that because the program can also output the Z80 binary for you, ready to run with the --bin OUTPUT.com option:

$ python3 hello --bin hello.com
$ ls -l hello.com
-rw-rw-r-- 1 ewen ewen 143 Sep 12  2020 hello.com
$ ../tools/cpm/cpm hello


Hello World!
$

(with the extra two newlines due to the CP/M emulator used for testing).

And finally the --html option prints out HTML containing the motivating text, and the assembler output listing. (One might hope there was a webserver providing that HTML, but alas I massively ran out of lines for Python, so the HTML is a little BYO browser.)

The Implementation

The Z80 CP/M "Hello World" program

The first part of the implementation was that I needed a Z80 program that would print "Hello World!", itself in a way that was non-obvious: in particular I wanted it not to contain the literal text as either a raw text string or in hexadecimal bytes. So there is both the "text" string and a "key" string, both zero terminated, that between the encode the bytes to print out. The "decode" section of the assembly simply walks the text and key bytes, and XORs them together, forming a simple one time pad. Then the print section of the assembly walks of the decoded text, and prints it out. In this case using the CP/M BDOS conventions, where the C register contains the function, the E (or DE) register contains the parameter, and you CALL 5 to access the BDOS functions. (Using the CP/M BDOS conventions was an obvious choice both because I was familiar with them, and they are fairly compact, but also because it enabled testing the Z80 program independently, as shown above.)

The text and key were themselves generated with a small Python script:

#! /usr/bin/env python3
# XOR string A with string B, and print result as hex
#
# Written by Ewen McNeill , 2020-09-12
#---------------------------------------------------------------------------

import sys

def xor(text, key):
    return [(ord(t) ^ ord(k)) for t, k in zip(text, key)]

def main():
    if (len(sys.argv) <= 2):
        print("Usage: {0} TEXT KEY".format(sys.argv[0]), file=sys.stderr)
        sys.exit(1)

    text = sys.argv[1]
    hexord = lambda x: hex(ord(x))
    key  = sys.argv[2]

    # Bourne style shells remove newlines in command line parameters during
    # command line substitution, so we may have to restore it to make our
    # plain text have the same  length as the key
    #
    # https://unix.stackexchange.com/questions/383217/shell-keep-trailing-newlines-n-in-command-substitution
    #
    if (len(text) == (len(key) - 1)):
        text += "\n"

    if (len(text) != len(key)):
        print("Plaintext and key must be the same length! ({0} != {1})".
                 format(len(text), len(key)), file=sys.stderr)
        sys.exit(1)

    print(", ".join(map(hex, xor(text, key))))  

if __name__ == "__main__":
    main()

and for this challenge I obviously chose the required "Hello World!" text, and "PyConline 2020" as an equal length string to be the one time pad (since it is xor, the pad itself is visible in the assembly output, but obscured by being specified as hex digits).

$ cat gentext
#! /bin/sh
./xor "$(printf 'Hello World!\r\n')" "PyConline 2020"
$ sh gentext
0x18, 0x1c, 0x2f, 0x3, 0x1, 0x4c, 0x3e, 0x1, 0x17, 0x4c, 0x56, 0x11, 0x3f, 0x3a
$

and then I translated those into Z80 Assembler formatting, and put it with the rest of the program I had written.

I assembled the initial program with the zasm -- Z80 Online Assembler for expedience, to avoid yak shaving a Z80 assembler locally as well. Which gave me the assembly listing included in my solution.

As shown above, that program prints out "Hello World!" when run under a Z80 emulator that understands CP/M BDOS printing conventions.

So I needed a Z80 emulator that understood CP/M BDOS printing conventions, written in Python.

The (partial) Z80 CPU implementation in Python

The Z80 emulator is (slightly hidden) in the hello() function of my implementation. Due to both time, and especially space, constraints it implements just enough of the Z80 CPU to be able to run the example program. The entire implementation is just a "forever" loop, that executes one instruction per clock tick (not 100% realistic, as the Z80 instructions were variable length from 4 clock ticks upwards, but time and space constraints limited the accuracy :-) ).

Each instruction is handled by a Switch/Case implementation in Python, inspired by Christopher Neugebauer's talk and blog post, for which the Switch/Case implementation can be seen near the top of the program, and the individual implementations are compactly implemented in the hello() function. Since the cases all needed distinct names (a limitation of the Switch/Case implementation), they are named after the hexcode of the Z80 opcode they are implementing, as that is both short and unique :-)

The majority of the implementations are trivial implementation of a "Z80 State Machine" (as much as was required to execute the example program), part from hcd() (for CALL, addr) which is special cased just to handle the one CP/M BDOS printing routine required, and hc7() (for RST 0) which I used as a "Z80 flavoured" end of program sentinel, and simply leaves exits the Python program. There are also a couple of helper functions to reduce the length of some of the implementations. The PUSH / POP implementations are skipped for space reasons, as they turn out not to be relevant since our CALL 5 implementation does not corrupt the HL variable, so we do not need special steps to preserve it (but they are present in the original Z80 assembly, as the CP/M BDOS does corrupt HL).

It turned out to be important that the newline after the Python function name is optional, in order to save some of the value 200 lines allowed, so all the short implementations are inline. But it also turned out to be important I could support multi-line case statements, an unstated requirement of the original implementation. Particularly LD C, immediate and JR dest / JR condition dest needed multiple line implementations.

Putting it all together

At this point I had a Z80 (CP/M) program that would print the required text, and a (partial) Z80 emulator written in Python that would execute that Z80 program, and produce the output required. The remaining step was to bring the two together, and ensure that I used the required Beautiful Soup module.

Since I wanted to somewhat hide the Z80 program inside the Python program, to make it stand alone, and not immediately obviously connected, I came up with the idea of hiding the Z80 assembly listing output in a Python docstring at the start of the program. From which the obvious thing to do was to bury it in a (trivial, space constrained) HTML page which made fun of how "difficult" it was to do anything in Z80 assembler, and provided a nice counterpoint to the "understandable/easy to parse by humans" Python program that followed.

Using HTML, gave me a direct reason to use the Python Beautiful Soup module as required, which extracts the HTML, and assembly listing, without which the whole program would not work. From that assembly listing, we can get both the assembly input, and the Z80 opcodes for the program, give certain assumptions about the assembly listing output. (Fortunately it turns out the many 1970s/1980s derived output formats were very column aligned, thanks to a punch card / line printer influence!)

Which just leaves the top level program to parse the arguments and dispatch to the right function which is not in main() method, because I ran out of lines (199 out of 200!). It too uses a Switch/Case implementation largely because it was there, and a fairly compact solution.

In the default run:

The hello() function (Z80 CPU partial implementation) is invoked on the output of get_bin() (which retrieves the Z80 opcodes from the assembly listing extracted from the Python docstring with Beautiful Soup)
The hello() function then steps through the Z80 implementation to decode the string to be printed out in the text section of RAM, and then
The hello() function steps through the Z80 implementation to print out the decoded Hello World! string, character by character

And hey presto, "Hello World!" is printed as required, simple as that. Taking only a quarter of a second, about 25x longer than a trivial implementation in Python:

$ time python -c 'print("Hello World!")'
Hello World!

real   0m0.012s
user   0m0.004s
sys    0m0.008s
$

(To be honest I am surprised it is only 25x slower, considering it has both an inefficient Switch/Case emulation, and an inefficient Z80 emulation to contend with!)

Most of the other run options also use Beautiful Soup, to get the assembly listing, with the exception of --html which simply prints out the whole docstring "as is".

Despite various steps to reclaim lines of text as I was running up against the line limit, the program still feels surprisingly readable over a year later.

The original competition did not actually have a line width limit, and Python does allow stacking multiple operations on a single line separated by ; -- semicolons -- but I tried to avoid that except where required to reach the line limit. And to stick to 80 column lines for readability. Spending 50 lines on the "alternative" implementation in Z80 Assembler, and another 20 lines on the Switch/Case implementation, did not exactly help my line count constraints, but despite that I still managed to retain another 33 entirely blank lines.

You can read my whole implementation on GitHub, and that also has links to all the other Rube Codeberg submissions I could find.

Conclusion

It was a fun program to write, and a nice distraction from the world outside in 2020. I was particularly surprised just how simple it was to implement a usable fraction of a Z80 CPU state machine in just a few hours, in a language entirely unsuited to state machine implementations. And it was a lovely loop back to one of the first open source projects I was involved with, "Libero", a Finite State Machine generator from iMatix.

It really is state machines all the way down :-)