ChocoPy is a programming language designed for classroom use in undergraduate compilers courses. ChocoPy is a restricted subset of Python 3, which can easily be compiled to a target such as RISC-V. The language is fully specified using formal grammar, typing rules, and operational semantics. ChocoPy is used to teach CS 164 at UC Berkeley. ChocoPy has been designed by Rohan Padhye and Koushik Sen, with substantial contributions from Paul Hilfinger.
This is a Rust variant of the ChocoPy compiler created as part of the COMPSCI 164 class project that now targets x86.
To build the main compiler and standard library, run this in the base directory:
cargo build
Then use the compiler on the provided input files in the test/ directory, like so:
# Compile to executable output.exe, this compiler autodetects for your platform.
cargo run input.py output.exe
# Compile to object file output.o.
cargo run input.py output.o --obj
# Output untyped AST JSON to STDOUT.
cargo run input.py --ast
# Output typed AST JSON to STDOUT.
cargo run input.py --typed
- Features a hand written Rust lexer and parser. The parser is a left recursive parser with a look-ahead value of 2 to distinguish between declarations and statements.
- Supports outputting an intermediate AST representation of the code. This can be viewed directly through the CLI.
- Type checks the AST to predict and determine expected types for complex statements and declarations. Throws non-fatal type errors stored in the AST to see type errors in the input program. This can be viewed directly through the CLI.
- Generates x86 assembly code, and handles assembly (converting to an object file) on three different platforms: Windows, Linux, and Mac.
- Handles linking against a separate create to represent a standard library. This library handles built-in function implementation, including object allocation and error reporting. This object allocation also executes the garbage collector.
- Implements the mark-and-sweep garbage collection algorithm. This garbage collector is called whenever a new object is allocated, and the total size of allocated objects reaches a threshold.
- To do this, the system walks through all the objects that are heap allocated (global and local references) and marks which ones are currently being used. The system accomplishes this by storing a linked list of objects that are currently on the heap, to make it easy to track every dynamically allocated object.
- The system also recursively tracks member object references within one particular object to mark non-global objects.
- Then in the sweep phase, we walk through the heap linked list again and free all the un-marked objects.
The compiler uses a hand-written lexer and parser.
- Lexer: Implemented as a generator-like component. Since stable Rust does not support generators, this is simulated using an asynchronous pipe model.
- Parser: A recursive descent parser where each
parse_xxx(...)
function maps to a grammar non-terminal. Left recursion is rewritten into loops. Expression parsing uses precedence levels (parse_exprN(...)
) to manage operator hierarchy. The parser uses at most 2-token lookahead, primarily to differentiate declarations from statements.
The compiler supports intermediate typed and untyped ASTs in JSON format, compliant with the CS 164 spec. Internally:
- AST nodes are implemented with idiomatic Rust
struct
s andenum
s, instead of a class hierarchy. - Pattern matching replaces virtual dispatch for semantic analysis.
Unlike the ChocoPy Implementation Guide, this compiler targets x86 instead of RISC-V and diverges in several implementation details.
Symbol names follow these conventions:
$chocopy_main
: User program entry point$global
: Global variable section- Constructors: Use the class name (
MyClass
) - Methods:
<ClassName>.<MethodName>
- Prototypes:
<ClassName>.$proto
- Nested functions:
<ParentSymbol>.<FuncName>
- Standard library: All functions prefixed with
$
(exceptmain
)
User-defined functions are not prefixed. Variable and attribute names are kept as-is. Hidden/internal attributes are prefixed with $
.
RSP
andRBP
retain their conventional roles (stack and frame pointers).- All other general-purpose registers are used freely.
Objects are 64-bit pointers. 0
denotes None
.
int
→ 4 bytesbool
→ 1 byte
They are stored in 8-byte stack slots. In global variables and object fields, alignment is based on their actual size (packed layout).
- Header (24 bytes):
- 8 bytes: Pointer to
$proto
- 16 bytes: Reserved for GC (
$gc_is_marked
,$gc_next
)
- 8 bytes: Pointer to
- Attributes follow the header
- Array-like types (
str
,[T]
) add:- 8-byte
$len
field - Packed element layout (
int
,bool
,str
use 4, 1, and 1 byte respectively)
- 8-byte
Every type C
(including primitives) has a global C.$proto
symbol pointing to a shared prototype object. Prototypes contain:
$size
: Object size (positive) or per-element size for arrays (negative)$tag
: Type tag0
→ user-defined/built-in object-1
→[int]
or[bool]
(plain list)-2
→ other lists (reference elements)
$map
: Reference bitmap for GC- Method table (starting with
__init__
)
Each class C
has a constructor symbol C
. The constructor:
- Allocates memory
- Initializes fields manually (not from prototype)
- Invokes
__init__
- Arguments pushed in right-to-left order
- Nested functions receive static link in
R10
- Stack aligned to 8 mod 16
- Return values in
RAX
- Caller restores stack
- Volatile registers:
RAX
,RCX
,RDX
,RSI
,RDI
,R8–R11
,R10
Exceptions: $chocopy_main
and standard library functions use the system ABI (System V or Windows).
- Outgoing arguments (for nested calls)
- Alignment padding (if needed)
- Temporaries
- Local variables
- Static link (
R10
) - Saved
RBP
(caller’s frame pointer) - Return address
- Local variables: in declaration order, highest at bottom
- Parameters: leftmost has lowest address (at top)
The final binary is composed of:
program.o
: Compiled user programchocopy_rs_std
: Standard runtime librarylibc
: System C library
Example call chain:
main → chocopy_rs_std $chocopy_main → program.o print → program.o $print → chocopy_rs_std fwrite → libc
chocopy_rs_std
provides:
- Built-in function implementations (
$print
,$len
, etc.) - Memory allocation (
$alloc
) and garbage collection - Error handling
- Entry point
main
→ calls$chocopy_main
To prevent conflicts with user-defined main
, the real entry point is in the standard library.
Implements mark-and-sweep GC triggered by $alloc
when a memory threshold is reached.
- Objects dynamically allocated on the heap are linked together via
$gc_next
$gc_is_marked
is set to1
for reachable objects during mark phase- Unmarked objects are deallocated in the sweep phase
- Live objects reset
$gc_is_marked
to0
GC walks through:
- Global references: Map passed at
$init
- Local references: Maps attached after function calls (via
PREFETCHNTA
) - Object members: Maps in the class prototype (
$reference_bitmap
)
Reference maps are bitstrings indicating which fields are pointers. For locals, each map describes one stack frame. The GC walks the full stack to gather all active maps.