2 CLAMS-Forth Overview
“If you’ve seen one Forth, well, you’ve seen one Forth.” ~ Unknown
2.1 Requirements
Compatibility with the Raspberry Pi Pico C/C++ SDK, drivers, and libraries: The ability to use the massive toolset the SDK and community open source projects provide is absolutely crucial to minimize developer time for complex projects. The USB and WiFi / Bluetooth stacks alone would take many months to duplicate in Forth. The primary references for this are Stephen Smith’s RP2040 Assembly Language Programming (Smith 2021), and, of course, Raspberry Pi Pico C/C++ SDK (Raspberry Pi Ltd Accessed 2023-10-22).
Optimized for speed: CLAMS-Forth will be written in RP2040 assembly and will provide an RP2040 assembler.
Digital signal processing: A high-speed block floating point digital signal processing library will be provided.
2.2 Desiderata
Forth 2012 standard compatibility is a strong desire but not an absolute requirement. Much of the CORE word set and some of the CORE EXT word set will be implemented, but a number of tricky, risky or little-used CORE words will be omitted. Many of the two-cell operations, for example, aren’t really needed in a Forth running on a 32-bit architecture like the RP2040.
Parts of the Search-Order and Programming-Tools word sets will be implemented. I want to have the Block word set using blocks in flash memory, but that’s not going to be in the first release.
Custom word sets will be provided for high-speed digital signal processing and SDK / hardware access. All access to the hardware / system level operations will be performed via C language calls to the SDK.
Cooperative multitasking: The RP2040 has two cores, and each core has a divide coprocessor and two interpolators. In addition, the RP2040 has two programmable input / output (PIO) blocks. Cooperative multitasking is the Forth way to exploit this available concurrency and is well-supported by the SDK.
Portability to other boards with the RP2040 microcontroller is possible, but is not on the roadmap yet. As with CLAMS itself, the initial target is the Pimoroni PicoVision.
Floating point support is desirable, but is a fairly low priority. Floating point arithmetic is convenient, and the RP2040 SDK provides optimized floating point libraries, but there’s no hardware floating point arithmetic on the RP2040. It’s too slow for real-time digital signal processing, so it’s not obvious how useful this capability would be.
2.3 Design / architecture
The top-level design and architecture are based on Dr. Chen-Hanson Ting’s eForth Overview (Pintaske and Ting 2018). In eForth, a small number of primitive words are implemented with a macro assembler, then the rest of the system is built in Forth on top of those words. Brad Rodriguez’ CamelForth (Pintaske and Rodriguez 2018) also provides some implementation guidance.
The original plan for CLAMS-Forth was to use subroutine threading and inline code. There is a partial implementation, but I decided to switch to a more “traditional” but less efficient direct threaded implementation. There are two main reasons:
The branch instructions in the RP2040 are program counter relative, and the available target distances are limited by the instruction formats. In particular, the function call (
BL
) instruction has an available range of plus or minus 16 megabytes. While this is enough within the RP2040’s flash or SRAM, it is not enough to cross the address space distance between SRAM and flash.CLAMS-Forth will have a dictionary split between flash and SRAM. The entire system dictionary - Forth virtual machine, text interpreter and compiler - will be in flash. Only words created by the user at run time will be in SRAM.
Dr. Ting’s eForth for Discovery (Pintaske and Ting 2019) gets around this limitation by copying the system dictionary from flash to RAM at boot time. But the hardware he used has more RAM and a more complete instruction set than the RP2040. The RP2040 only has 264 kilobytes of SRAM and much of that will be used for audio buffers. The more code and tables safely stashed away in the 2 megabytes of flash, the better.
Even with a dictionary entirely in SRAM at run time, constructing a
BL
function call instruction while compiling a user “colon” definition is a tricky process, difficult to document, understand and maintain. If you’re interested in the details, see (Pintaske and Ting 2019, sec. 3.5.5). I decided to switch to direct threading, rather than spend an extra week coding both a flash-to-SRAM relocation and an algorithm to compute a relative branch address that is then split into four separate bit fields of two 16-bitthumb
instructions.
2.4 Status / roadmap
There is a partial implementation of a subroutine-threaded version with inlining. I am leaving that in the repository in a development branch but won’t be working on it unless I find that the direct-threaded version cannot meet speed requirements. The subroutine-threaded version is in branch
forth-stc
.The direct-threaded version is nearing a release. I expect to have the system dictionary, text interpreter and compiler done by January 1, 2024. It is in the branch
forth-dtc
.The next step is to implement synthesis algorithms. The general process will be to prototype them in Forth, converting to assembler when needed. That will create the requirements for the digital signal processing library.