Debugging Actor Systems
Software development is a defect injection process. With every line of code we write, we have a chance of introducing unintended behavior into the system. This chance increases with conceptual complexity. The more difficult a system is to understand, the greater our chance of introducing defects.
In his 1980 Turing Award lecture, Tony Hoare said [1]:
I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.
The first method is far more difficult.
For decades our industry has taken the “easy” route, resulting in the hopelessly complex systems that surround us today. The unreliability that we experience is a natural consequence of our approach, as Tony observed. Actor programming has the potential to support the first, more “difficult”, route.
Some claim that asynchronous messaging, an inherent aspect of actor systems, greatly increases conceptual complexity. Although actor systems exhibit unbounded non-determinism in message reception, the ability to apply local reasoning actually reduces conceptual complexity. Actors can be designed, implemented and understood individually. Then they are composed into interacting systems that retain the properties we reasoned about (dare I say “proved”) during design. Composite actor-based components interact just like low-level actors. They form a fractal-like self-similar structure that supports reasoning about the system at each and every scale of composition.
Inevitably, we will make mistakes. We will write code that doesn’t say what we mean. We will fail to anticipate some unwanted interaction. So what do we do then? Our mental model of the system diverges from its actual behavior. We need to correct our mental model in order to understand the system we built, rather than the system we imagined. Two strategies for doing this are observation and reasoning.
Observation
Observation focuses on what the system is actually doing. Take, for example, a vending machine. We insert a coin, then push a button, and expect a product to be dispensed. If this doesn’t happen, we would like to observe that machine’s operation in order to determine where the process is failing. Did the machine accept our coin? Is there product to dispense? Is something blocking the mechanism? Answering these questions requires visibility into details of the machine that are hidden in normal operation.
Instrumentation
One way to provide visibility into the system’s operation is to add instrumentation. These are the dials and read-outs you might see on a monitoring dashboard. They allow observation of specific data representing a selected subset of the machine’s internal state. In software systems, this instrumentation comes in the form of logs and debugging inspectors. Logs provide a trace of “interesting” activities, produced as the system operates, ideally not interfering with normal operation. Debugging inspectors, on the other hand, usually require stopping (and single-stepping) some part of the system. Various aspects of system state are examined while operation is suspended. Note that, in general, this violates encapsulation of the objects observed.
In an actor system, logging can be introduced by injecting proxies between any two communicating parties. The proxy records the message (to a log) and forwards it to the original destination. This logging is essentially invisible to both the sender and the receiver because sending a message already involves an arbitrary delay and potential message re-ordering. A real-time monitoring instrument can also be attached to the log, providing a continuous read-out of system status.
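As a minimal sketch (not part of the original text), such a proxy can be expressed directly in Humus. Here log is assumed to be any actor that records whatever it receives (println would do), and service stands for the actor being observed:

    LET logging_proxy_beh(log, delegate) = \msg.[
        SEND (#logged, delegate, msg) TO log
        SEND msg TO delegate
    ]
    CREATE logged-service WITH logging_proxy_beh(println, service)

Handing out logged-service wherever service was expected leaves both sender and receiver unaware of the interception.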
An even more powerful form of logging can be provided directly by the actor run-time. By definition, the processing of each delivered message is limited to three kinds of effects: creating actors, sending messages, and becoming a new behavior. The run-time environment can record the provenance of each effect, allowing us to trace back through the causal chain of message deliveries leading up to an event of interest. We can examine these chains to determine why something happened, or note what break in the chain prevented it. In the Humus Simulator/Debugger, the DEBUG button displays this kind of provenance log.
Testing
Another kind of visibility is provided by testing. Testing is the creation of a controlled environment (the test harness or fixture) in which the operation of a component can be observed. I’m referring to “black-box” testing, where the encapsulation of the objects-under-test is not violated. Instead, they are surrounded by mocks and/or stubs that play the part of collaborators to the objects-under-test. Expected interactions can be scripted and compared with actual interactions. As with logging proxies, this capability can be provided strictly within the actor system and does not require any special access or control of the run-time environment.
LET expect_inactive_beh(admin) = \msg.[
    SEND (#Unexpected, SELF, msg) TO admin
]
The most basic test fixture is a behavior that expects no interaction. If an actor with this behavior receives any message at all, the administrator admin is notified with an #Unexpected message.
LET expect_verify_beh(admin) = \msg.[
    CASE msg OF
    ($admin, #verify) : [ SEND TRUE TO admin ]
    _ : [ SEND (#Unexpected, SELF, msg) TO admin ]
    END
    BECOME expect_inactive_beh(admin)
]
A more interesting fixture is one that verifies that no interaction has occurred. After the system-under-test has been stimulated in some way, a #verify message from the administrator produces positive confirmation (a TRUE message) that no prior messages have been received by the fixture. If a prior message is received, the administrator is notified. In either case, the fixture’s behavior becomes expect_inactive_beh, because no further interaction is expected.
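As a quick usage sketch (not from the original), creating such a fixture and then asking it to verify should print TRUE, since nothing else has sent it a message:

    CREATE quiet-fixture WITH expect_verify_beh(println)
    SEND (println, #verify) TO quiet-fixture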
LET expect_request_beh(admin, request, reply, next_beh) = \msg.[
    CASE msg OF
    (cust, $request) : [
        SEND reply TO cust
        BECOME next_beh
    ]
    ($admin, #verify) : [
        SEND (#Missing, SELF, request) TO admin
        BECOME expect_inactive_beh(admin)
    ]
    _ : [
        SEND (#Unexpected, SELF, msg) TO admin
        BECOME expect_inactive_beh(admin)
    ]
    END
]
The most common fixture is one that represents a scripted interaction consisting of a request and a reply. When the expected request arrives, the scripted reply is sent to the customer cust, and the fixture moves to the state represented by next_beh. If a #verify message from the administrator arrives, the administrator admin is notified (with a #Missing message) that the expected request did not occur. Any other message triggers an #Unexpected notification to the administrator. All failures move the fixture to expect_inactive_beh, trapping further interaction.
CREATE system-under-test WITH \service.[
    SEND (sink, #ping) TO service
]
CREATE test-fixture WITH expect_request_beh(
    println, #ping, #pong,
    expect_verify_beh(println))
SEND test-fixture TO system-under-test
AFTER 1000 SEND (println, #verify) TO test-fixture
In this example, the system-under-test simply generates a #ping request for a hypothetical service. Our test-fixture represents a mock service implementation that responds to #ping with #pong. A delayed #verify message causes TRUE to appear on the console (via println), indicating that one-and-only-one #ping request was processed. With a little support from the run-time environment, the verification message could instead be triggered when the run-time became idle (had no pending messages). This kind of controlled execution environment is very helpful for reliable testing.
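To see the fixture catch a failure, consider a hypothetical faulty variant (not in the original) that issues the request twice:

    CREATE strict-fixture WITH expect_request_beh(
        println, #ping, #pong,
        expect_verify_beh(println))
    CREATE faulty-system WITH \service.[
        SEND (sink, #ping) TO service
        SEND (sink, #ping) TO service
    ]
    SEND strict-fixture TO faulty-system
    AFTER 1000 SEND (println, #verify) TO strict-fixture

The first #ping receives its scripted #pong, but the second drives the fixture to report #Unexpected instead of TRUE, and the delayed #verify is then trapped by expect_inactive_beh.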
Reasoning
So, we need instrumentation to provide visibility into the operation of a system. Then we can construct tests that exercise our system under controlled conditions. We can even automate checking the results of our tests. But, this will never prove that our system has no bugs. As Dijkstra noted [2]:
Program testing can be used to show the presence of bugs, but never to show their absence!
We need a different approach to establish confidence in our software. Programs are finite representations of potentially-infinite processes. We need a way to reason about the infinite possibilities and assure ourselves (and our customers) that the system’s behavior remains within desirable limits. As our reasoning becomes progressively more formal, we may eventually consider it a “proof” of some kind.
Of course, there are limits to what we can “prove” about any formal system, beginning with the infamous “halting problem”. When we have multiple threads-of-execution (or “cores”) concurrently modifying a shared memory-space, the factorial growth of the interleaving of potentially interfering operations makes exploration of the state-space infeasible. We need to keep the complexity of individual elements small enough to understand them completely, and combine them in ways that avoid an explosion of potential interactions.
Consider the problem of confronting a Neanderthal armed with a club. Although there are an infinite number of relative positions we could be in, it is sufficient to partition them into two classes: outside club-range (safe), and within club-range (dangerous). Now I can express a condition I care about (safety) in terms of a binary space-partition among infinite possibilities. Let’s say I surround myself with an impassable barrier through which the club may still reach. I can assure my safety by simply remaining more than a club-length away from the barrier. Thus I’ve added another element to the system, but I have not increased the complexity of determining my safety.
In actor systems, unbounded non-determinism is often thought to introduce intractable complexity. This is no more true than the infinity of relative positions makes the Neanderthal-safety problem intractable. In fact, there is no globally-observable state in an actor system whose permutations we could consider in the first place. Instead we can (and must) apply local reasoning to individual actors as well as to configurations of collaborating actors.
Each actor is a state-machine. Each message received is an “event”. An actor’s response to an event is finite, consisting of creating actors, sending messages, and (possibly) becoming a new behavior. Becoming a new behavior is how an actor changes state. It is the only way an actor’s state can change. We can effectively reason about an actor by considering its individual (usually quite small) state-space.
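For example (a sketch in the style of the fixtures above, not taken from the original), a one-shot forwarder has exactly two reachable states, armed and spent, with a single transition between them:

    LET sink_beh = \_.[]
    LET once_beh(delegate) = \msg.[
        SEND msg TO delegate
        BECOME sink_beh
    ]

Reasoning about an actor with once_beh means considering just these two behaviors; nothing outside the actor can put it into any other state.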
We can reason about groups of collaborating actors by considering the possible interactions between their individual state-machines. This does not have to lead to a combinatorial explosion. Part of good actor-oriented design is composing actors in such a way that their aggregate complexity does not get out-of-control. Careful protocol design facilitates low coupling between distinct groups of actors. By limiting the “surface area” of interactions between groups, we can keep complexity manageable. We can consider the interactions within each group, and the interactions inherent in their connecting protocol, separately with no loss of generality. The Serializer is a simple example of this technique.
If this approach seems unfamiliar, consider how we “debug” distributed internet applications. We don’t try to observe or control (not to mention single-step) all of the interacting systems simultaneously. We reason about each system in isolation, then consider the protocols in which they participate. Each step along the way we protect ourselves from needing to know about the internals of another system. This is like constructing our Neanderthal-protection barrier. We no longer have to worry about bad behavior outside of our own system because we’ve established systemic safety at our boundaries.
References
[1] C. A. R. Hoare. “The Emperor’s Old Clothes”. Communications of the ACM 24(2), February 1981.
[2] E. W. Dijkstra. “Notes On Structured Programming” (“On the reliability of mechanisms”), April 1970.
Tags: actor, composition, consistency, debugging, instrumentation, logging, observation, proof, provenance, reasoning, testing
Reader comment
In the paragraph before the Observation section, you write:
“Our mental model of the system diverges from its actual behavior. We need to correct our mental model in order to understand the system we built, rather than the system we imagined.”
This gap appears to be the crux for modern computing: it applies equally to software and all the way down to the hardware.
Indeed, you are correct that we must usually adjust our mental models to maintain consistency with the implementation-specific behaviors that we observe from our work. But in order for us to build an intuition rather than a slow reasoning process, we must have immediate and unequivocal feedback (e.g. see Daniel Kahneman on cognitive psychology’s System 1 & 2 for reasoning). To this end, the programming environment and its interface to the user must adapt to us, the potential programmer, as much as, if not more than, our mind’s mental model must adapt to the formalisms of the underlying computing architecture.
Cheers,
Joe Gorse