Memory Safety Simplifies Microprocessor Design
Contemporary microprocessor designs are based on pervasively shared mutable state. This introduces numerous vulnerabilities that have been exploited to violate the security of our computing platforms. Mechanisms to mitigate these hazards, such as Memory Management Units (MMUs), increase the complexity and power consumption of CPUs. The move to multicore processors amplifies the problem, adding still more complexity to synchronize shared-memory access across cores (e.g., cache-coherency logic).
By focusing on memory safety from the beginning, we can greatly simplify microprocessor designs, using less logic and consuming less power. The key is moving away from shared mutable state. One way to do that is to prefer immutable data and stateless computation, as is done in pure functional programming. Another way is to exclusively use private mutable state. Both techniques can be used, in a complementary manner, to achieve provable memory safety.
Contemporary Processors
Nearly all contemporary processor designs evolved from the von Neumann architecture, which originally specified a single processing unit with a local memory and connections to external I/O devices. The processor executed instructions sequentially, performing I/O, calculating values, and mutating local storage. This design is fine if you only have one thing to do. However, the design soon evolved to support multitasking, which gave the illusion of multiple independent processes by multiplexing the processor core.
Unfortunately, from the perspective of the processor there is still only one thread of control, focused on the currently executing instruction sequence. The software process abstraction required additional hardware to separate the resources used by each process, preventing them from interfering with each other. However, it was also desirable for processes to communicate with each other, which they usually did through access to shared memory. So a limited form of interference was explicitly permitted.
Increasingly complex hardware was designed to help address the hazards caused by this model of shared mutable state among independent processes. In an attempt to improve performance, multicore systems were introduced. Rather than multiplexing a large number of processes onto a single core, the system could execute processes in parallel on multiple cores [1]. The downside of this approach was an increase in the potential hazards of sharing mutable state, which led to even more complex hardware to maintain memory coherence among the processor cores. All of this complex logic also continued to increase power consumption.
Shared Mutable State Considered Harmful
The data representing the state of a computation can be accessed safely by making it either immutable or non-shared. If data is immutable (read-only) it can be safely shared among multiple processes. Even if it is accessed concurrently, there is no risk of interference because the data does not change. If data is mutable (read-write) then it must not be shared. It must be owned by a single process at any given time, and cannot even be read by another process whenever there is a possibility that the owner may change it.
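For example, even a single shared counter is hazardous under concurrent mutation, because an increment compiles to a read-modify-write sequence that two threads can interleave. A minimal C illustration of the hazard (the names are ours):

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;            /* shared mutable state */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; ++i) {
        counter += 1;               /* read-modify-write: not atomic */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);       /* often less than 2000000: lost updates */
    return 0;
}
```

Making the counter immutable removes the hazard for readers; giving it a single owning process removes the hazard for writers.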
Shared mutable state can be avoided by moving away from the load/store memory model of contemporary processors. Instead, we can design processors that use an asynchronous event-driven model in which all mutable process state is private. By making the event data immutable, it can be safely communicated between processes. Asynchronous events provide a natural transactional boundary for private-state mutation, and allow ample opportunity for concurrent processing. Transactional event handling can be multiplexed onto one or more processing cores without risking interference.
This new model eliminates significant hardware and software complexity. No MMU is required because all memory accesses are safe by design. Cache coherency is greatly simplified because only one process can mutate the data it owns, and that mutation is performed transactionally in response to an asynchronous event. Each core can execute instructions without interference from any other because there is no contention. And there is no need for blocking because event generation is asynchronous. Synchronization between processes, when needed, is coordinated through events.
An Alternative Processor Design
We propose a new processor architecture driven by asynchronous message-events. All processing is initiated by the occurrence of a message-event. An event designates a message-handling target and an immutable message payload. The processor manages a queue of pending (sent, but not delivered) message-events. When an event reaches the head of the queue, if the event target is not busy, an event-handling transaction is initiated. Otherwise the event is recycled to the tail of the queue.
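As a sketch of this dispatch policy, here is a hypothetical software rendering (the processor implements this in hardware; all names and types are illustrative):

```c
#include <stddef.h>

/* hypothetical event-dispatch sketch; all names are illustrative */
typedef struct actor { int busy; } actor_t;
typedef struct event {
    actor_t *target;                 /* message-handling target */
    void *message;                   /* immutable message payload */
    struct event *next;              /* link in the pending-event queue */
} event_t;
typedef struct { event_t *head, *tail; } queue_t;

static event_t *take_head(queue_t *q) {
    event_t *ep = q->head;
    if (ep != NULL) {
        q->head = ep->next;
        if (q->head == NULL) q->tail = NULL;
    }
    return ep;
}

static void put_tail(queue_t *q, event_t *ep) {
    ep->next = NULL;
    if (q->tail != NULL) q->tail->next = ep; else q->head = ep;
    q->tail = ep;
}

extern void begin_transaction(event_t *ep);   /* start event handling */

/* deliver (or recycle) the event at the head of the queue */
void dispatch_one(queue_t *q) {
    event_t *ep = take_head(q);
    if (ep == NULL) return;          /* no pending events */
    if (ep->target->busy) {
        put_tail(q, ep);             /* target busy: recycle to the tail */
    } else {
        ep->target->busy = 1;        /* claim the target */
        begin_transaction(ep);       /* initiate event-handling transaction */
    }
}
```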
Message-handling targets are reactive objects, called actors [2]. Each actor maintains a behavior, which defines the current message-handling procedure and any private mutable state owned by the actor. During an event-handling transaction, the actor may send new messages, create new actors, and designate a new behavior for subsequent messages. An actor’s behavior ends with either a commit (applying all transactional effects) or an abort (acting as if it never saw the event, and discarding it). An actor is busy when it is handling a message-event, and becomes idle after a commit or abort. Note that an actor’s private state is updated transactionally, reverting to the previous state if the transaction is aborted.
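The transactional discipline can be pictured as an effects record that accumulates during event handling and is applied, or discarded, all at once. A hypothetical sketch (again, the names and types are ours):

```c
/* hypothetical transactional effects; names are illustrative */
typedef struct behavior behavior_t;           /* code + private state */
typedef struct actor { int busy; behavior_t *beh; } actor_t;
typedef struct event {
    actor_t *target;
    void *message;
    struct event *next;
} event_t;

typedef struct effects {
    event_t *sent;        /* messages sent during this transaction */
    behavior_t *become;   /* replacement behavior/state, or NULL to keep */
} effects_t;

extern void enqueue_event(event_t *ep);       /* append to the event queue */

/* commit: make all accumulated effects visible atomically */
void commit(effects_t *fx, actor_t *self) {
    for (event_t *ep = fx->sent; ep != NULL; ) {
        event_t *next = ep->next;
        enqueue_event(ep);                    /* release pending messages */
        ep = next;
    }
    if (fx->become != NULL) self->beh = fx->become;  /* adopt new behavior/state */
    self->busy = 0;                           /* actor becomes idle */
}

/* abort: discard all effects; the event is simply dropped */
void abort_txn(effects_t *fx, actor_t *self) {
    (void)fx;             /* sent messages and new behavior are never applied */
    self->busy = 0;       /* actor idle again, prior state intact */
}
```

Created actors would also be recorded in the effects (omitted here for brevity); until a commit, nothing an actor does is observable.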
Most messages are sent as part of an event-handling transaction. However, I/O devices may also send messages. These messages take the place of device interrupts, but may carry arbitrary additional data. Messages may also be sent to I/O devices, to initiate activity or administer device settings. From the perspective of the sender, an I/O device is just another actor.
Object-Capability Security
The main idea behind object-capability security can be summarized by the phrase, “If you don’t have a reference, you can’t use it.” Conversely, designation implies authorization. In other words, having a reference is de facto authorization to use the object. This makes reasoning about access control straightforward; the reference graph is the potential access graph [3].
Having a reference to an actor means you are authorized to send asynchronous messages to that actor, but not to access its internal structure. There are only three ways to obtain an actor reference: create a new actor, receive a reference in a message, or be given a reference as part of your initial private state.
Object capabilities (ocaps) are well-supported by this processor architecture. Actor references are capabilities. The inherent safety and security of ocaps allows significant simplifications in systems built on this platform. There is no need for the typical split between kernel and user-space (and the cost of calls between the two). Fine-grained security is enforced directly by the processor, allowing secure services (in the form of actors) to be used safely by ordinary untrusted components. There is also no need for process-isolation mechanisms (like an MMU), since fine-grained isolation is already provided by actors. Even temporal dependencies are loosened by asynchronous messaging. Attenuated authority is easy to implement with lightweight proxy actors, making it easy to apply the Principle of Least Authority (POLA) [4].
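For instance, a read-only view of a storage actor can be granted by handing out a proxy that forwards only read requests. A hypothetical sketch, with the behavior written as a C function and illustrative message shapes:

```c
/* hypothetical attenuating proxy; names and message shapes are illustrative */
typedef struct actor actor_t;
typedef enum { MSG_READ, MSG_WRITE } op_t;
typedef struct { op_t op; actor_t *reply_to; int value; } msg_t;

extern void send(actor_t *target, msg_t *m);          /* queue a message-event */

typedef struct { actor_t *subject; } proxy_state_t;   /* private state */

/* behavior: forward reads to the subject, silently drop writes */
void read_only_proxy_beh(proxy_state_t *state, msg_t *m) {
    if (m->op == MSG_READ) {
        send(state->subject, m);   /* delegate only the permitted request */
    }
    /* MSG_WRITE is not forwarded: that authority has been attenuated */
}
```

A client holding only the proxy's reference can never write, because it has no reference to the subject itself.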
Continuation Processing
Machine instruction semantics are defined in terms of an abstract stack machine. Instruction parameters come either directly from the instruction (immediate) or are popped from the stack. Results are pushed onto the stack. Instructions are always executed as part of a message-event transaction. The context of such execution is called a continuation [5].
A continuation contains all of the information needed to execute an instruction. This includes a reference to the current instruction (IP, or instruction pointer), a reference to the top of the stack (SP, or stack pointer), and a reference to the message-event (EP, or event pointer). The processor manages a queue of continuations. When a continuation reaches the head of the queue, the designated instruction is executed in the context of the continuation. As part of that execution, most instructions update the continuation, designating the next instruction to execute and the updated context. The updated continuation is then recycled to the tail of the queue. This approach gives us instruction-level concurrency, interleaving multiple independent computations without the risk of interference. Essentially, there is a zero-cost context switch before each instruction.
A new continuation is created for each message-event dispatched, with EP (the event pointer) designating the event. The target actor’s behavior provides the initial IP (first instruction to execute) and initial SP (the actor’s private state). The continuation lives for the duration of the message-event transaction. A commit/abort instruction does not produce an updated continuation, so the continuation is not re-queued and dies. No effects caused by instruction execution are observable by any other computation until a commit occurs. A commit moves pending message-events to the processor event-queue, and updates the behavior/state of the current actor (if a new behavior/state has been specified). Unreferenced actors will eventually be reclaimed by the garbage collector.
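Putting these pieces together, a continuation is just three references plus a queue link, and the instruction engine is a round-robin loop. A hypothetical sketch (the hardware equivalent needs no explicit loop):

```c
/* hypothetical continuation sketch; names are illustrative */
typedef struct quad quad_t;      /* storage cell (described below) */
typedef struct cont {
    quad_t *ip;                  /* current instruction */
    quad_t *sp;                  /* top of the private stack */
    quad_t *ep;                  /* message-event being handled */
    struct cont *next;           /* link in the continuation queue */
} cont_t;
typedef struct { cont_t *head, *tail; } cqueue_t;

extern int execute(cont_t *k);   /* run one instruction; 0 on commit/abort */

static void put_tail(cqueue_t *q, cont_t *k) {
    k->next = NULL;
    if (q->tail != NULL) q->tail->next = k; else q->head = k;
    q->tail = k;
}

/* one instruction per turn: a zero-cost context switch between computations */
void step(cqueue_t *q) {
    cont_t *k = q->head;
    if (k == NULL) return;       /* nothing to run */
    q->head = k->next;
    if (q->head == NULL) q->tail = NULL;
    if (execute(k)) {
        put_tail(q, k);          /* recycle: this computation resumes later */
    }                            /* otherwise commit/abort ended the transaction */
}
```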
Machine Instructions
Data is represented by fixed-size groups of bits called words. Each word is tagged as either a direct or an indirect value. If the value is direct, the rest of the bits in the word define the literal value. If the value is indirect, it designates a memory location that contains the value. Indirect references are further tagged as either transparent or opaque. Ordinary data structures are transparent, and may be traversed. Capabilities (actor references) are opaque. Tagging allows us to safely handle different kinds of values in the processor, mitigating the potential for inappropriate use [6].
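One possible encoding, purely illustrative, reserves the low-order bits of each word for the tags:

```c
#include <stdint.h>
#include <stdbool.h>

/* hypothetical tag encoding: 2 low-order bits; the real layout may differ */
typedef uintptr_t word_t;

#define TAG_MASK    ((word_t)0x3)
#define TAG_DIRECT  ((word_t)0x0)   /* literal value in the remaining bits */
#define TAG_TRANSP  ((word_t)0x1)   /* transparent reference: traversable data */
#define TAG_OPAQUE  ((word_t)0x2)   /* opaque reference: capability (actor) */

static bool is_direct(word_t w) { return (w & TAG_MASK) == TAG_DIRECT; }
static bool is_opaque(word_t w) { return (w & TAG_MASK) == TAG_OPAQUE; }

/* direct values carry their literal in the untagged bits
   (assumes an arithmetic right-shift for signed values) */
static intptr_t direct_value(word_t w) { return (intptr_t)w >> 2; }

/* capabilities may only be compared for identity, never dereferenced */
static bool cap_equal(word_t a, word_t b) {
    return is_opaque(a) && is_opaque(b) && (a == b);
}
```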
Machine instructions are executed in the context of a continuation, which represents the current state of a message-event transaction. Information accessible to an instruction is limited to that which is reachable through the continuation. For example, the current (target) actor and the message payload are reachable via the event pointer. The stack may contain both direct and indirect values (transparent or opaque). There are no instructions for pointer math or load/store of arbitrary memory locations. This is the primary means by which memory safety is maintained. Capabilities may only be compared for identity or used to designate message targets.
Instructions fall into several general categories. There are computation instructions, which generally pop values from the stack and push the result of a calculation. There is a branch instruction, which chooses a continuation based on a boolean value. There are stack-manipulation instructions, which rearrange values on the stack. There are instructions for managing (immutable) data-structures, such as lists composed of value-pairs. There are event-access instructions, which extract information from the message-event. There are actor-primitive instructions, which create new actors, send new messages, and designate new behavior/state for the current actor. And finally, there are administrative instructions that perform meta-operations such as commit/abort.
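A hypothetical grouping of these categories as C constants (the actual opcode set differs):

```c
/* illustrative instruction categories; the real opcode set differs */
typedef enum {
    OP_ALU,     /* computation: pop operands, push the result */
    OP_IF,      /* branch: choose a continuation based on a boolean */
    OP_STACK,   /* stack manipulation: rearrange values on the stack */
    OP_PAIR,    /* immutable data: construct/traverse value-pairs */
    OP_MSG,     /* event access: extract parts of the message-event */
    OP_ACTOR,   /* actor primitives: create, send, become */
    OP_META     /* administrative: commit, abort, and other meta-operations */
} op_category_t;
```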
Memory Management
Main memory is organized as quad cells, each containing four word values (direct or indirect). Using quads exclusively avoids memory fragmentation, which greatly simplifies the memory-management hardware. In an implementation with 64-bit words, a quad contains 256 bits of data. Smaller devices, perhaps suited to IoT, could use 16-bit words and 64-bit quads. Since references address quad units, a considerable address space is available even when using 16-bit words [7].
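A quad can be modeled as a fixed four-word record; the field names below are illustrative:

```c
#include <stdint.h>

/* hypothetical quad cell for 64-bit words: 4 x 64 = 256 bits */
typedef uint64_t word_t;        /* direct or indirect tagged value */

typedef struct quad {
    word_t t;   /* type-field: how to interpret the other fields */
    word_t x;   /* first field (e.g., the head of a pair) */
    word_t y;   /* second field (e.g., the tail of a pair) */
    word_t z;   /* third field (e.g., a link to the next queue entry) */
} quad_t;
```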
An unusual feature of this processor architecture is hardware support for automatic memory management. Many instructions implicitly allocate memory, so allocation must be a very fast operation. Since all allocations are quads, this can be very efficient. There is also a concurrent garbage collector implemented in hardware that runs in parallel with event dispatch and instruction execution. Garbage is identified in terms of reachability from a root-set of references. If a location is not reachable, it is garbage and will be reclaimed for future allocation [8].
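Since every cell is the same size, the allocator can be a constant-time free-list pop, and the heart of any tracing collector is a reachability walk over transparent references. A simplified sketch, reusing the quad_t above (the actual collector runs concurrently in hardware, not recursively in software):

```c
#include <stdint.h>

/* hypothetical free-list allocator; free cells linked through the z field */
static quad_t *free_list;

quad_t *alloc_quad(void) {
    quad_t *q = free_list;
    if (q != NULL) free_list = (quad_t *)(uintptr_t)q->z;  /* constant-time pop */
    return q;      /* NULL: no free cells until the collector reclaims some */
}

extern int  is_marked(quad_t *q);   /* GC mark-bit accessors (illustrative) */
extern void set_mark(quad_t *q);
extern quad_t *as_quad(word_t w);   /* NULL unless w is a transparent reference */

/* mark phase: everything reachable from the root set is live */
void mark(quad_t *q) {
    if (q == NULL || is_marked(q)) return;
    set_mark(q);
    mark(as_quad(q->t));
    mark(as_quad(q->x));
    mark(as_quad(q->y));
    mark(as_quad(q->z));
}
```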
The primary data-structures used by the processor fit nicely into quads. Continuation-queue entries consist of IP, SP, EP, and a reference to the next entry in the queue. Event-queue entries consist of a sponsor, target actor, message, and a reference to the next entry in the queue. Instructions typically consist of a type-field, opcode, immediate value, and next IP value. Pairs, from which lists (and other structures) are formed, consist of a type-field, head value, and tail value. As you can see, most indirect values contain a type-field, indicating how the rest of the quad should be interpreted. Processor-managed queue-entries contain a field for linking them into the queue, so there is effectively no limit on queue size. The same applies to the stacks used for private actor state and temporary storage during computation. By constructing the stacks from pairs, it is easy to manage many independent stack segments, each identified by a single reference to the top of the stack.
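Building stacks from pairs means a push is just one quad allocation, and the old stack is never modified, so any number of stack segments can share structure. A hypothetical sketch in terms of the quads above:

```c
/* hypothetical pair-based stack; TYPE_PAIR is an illustrative type value */
#define TYPE_PAIR ((word_t)1)

extern quad_t *alloc_quad(void);
extern word_t  quad_ref(quad_t *q);   /* make a transparent reference */
extern quad_t *as_quad(word_t w);     /* follow a transparent reference */

/* push: allocate one pair; the old stack is shared, never modified */
word_t push(word_t value, word_t stack) {
    quad_t *p = alloc_quad();
    p->t = TYPE_PAIR;   /* type-field: this quad is a pair */
    p->x = value;       /* head: the pushed value */
    p->y = stack;       /* tail: the rest of the stack */
    p->z = 0;           /* unused */
    return quad_ref(p); /* a single reference designates the whole stack */
}

/* pop: read the head; the pair becomes garbage once unreferenced */
word_t pop(word_t *stack) {
    quad_t *p = as_quad(*stack);
    *stack = p->y;      /* caller's reference now designates the tail */
    return p->x;
}
```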
Larger data structures, such as network packets or images, can be accessed the same way you would access an external storage device: through an actor interface. The interface could provide random access, a stream abstraction, intelligent bulk operations, or all of the above. The opportunities to provide powerful, high-performance, smart-data abstractions are endless.
Summary
We’ve presented a novel microprocessor architecture with a simple memory-safe design. It features instruction-level concurrency, object-capability security, and automatic memory management. The design is based on dispatching asynchronous events to reactive objects which manage private mutable state. All of the complexity involved in resolving contention for shared mutable storage is eliminated. Strong process and memory isolation removes the need for a split between kernel and user-space, while providing much more fine-grained access control. Implementations are expected to use significantly less power due to the simplicity of the required logic.
A prototype simulator implementing this architecture has been created. The uFork virtual machine is implemented in C and can be compiled for 16-, 32-, or 64-bit words. It features a gdb-style debugger capable of single-stepping virtual-machine instructions. Implemented on this platform are a PEG-parser toolkit [9] and a concurrent Scheme dialect [10] with Actor extensions. An assembler for virtual-machine instructions is also available from Scheme. The simulator demonstrates the viability of this architectural approach.
References
- [1] G. Heiser and K. Elphinstone. L4 Microkernels: The Lessons from 20 Years of Research and Deployment. ACM Transactions on Computer Systems, 34(1). ACM, April 2016.
- [2] G. Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, Mass., 1986.
- [3] M. Miller. From Objects To Capabilities. Online at http://www.erights.org/elib/capability/ode/ode-capabilities.html
- [4] M. Miller, K. Yee, and J. Shapiro. Capability Myths Demolished. Technical Report SRL2003-02, Systems Research Laboratory, Johns Hopkins University, 2003.
- [5] J. Reynolds. The Discoveries of Continuations. Lisp and Symbolic Computation, 6(3–4). Kluwer Academic Publishers, 1993.
- [6] H. Levy. Capability-Based Computer Systems. Digital Press, Bedford, Massachusetts, 1984.
- [7] S. Yvon and M. Feeley. A Small Scheme VM, Compiler and REPL in 4K. Proceedings of the 13th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL ’21). ACM, October 2021.
- [8] G. Steele Jr. and G. Sussman. Design of LISP-based Processors, or SCHEME: A Dielectric LISP, or Finite Memories Considered Harmful, or LAMBDA: The Ultimate Opcode. Memo 514, MIT Artificial Intelligence Laboratory, March 1979.
- [9] B. Ford. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 2004.
- [10] G. Sussman and G. Steele Jr. Scheme: An Interpreter for Extended Lambda Calculus. Memo 349, MIT Artificial Intelligence Laboratory, December 1975.
Special thanks to Alan Karp for his thorough review and feedback improving the quality of this post.
Tags: actor, capability, concurrency, CPU, microprocessor, ocaps, POLA, power, processor, safety, security, synchronization