From: Mark Miller <>
Replying To: Mark Miller <>
Date: Tue, 26 Nov 2002 00:22:57 -0800
Subject: [e-lang] Distributed Space-Time Debugging (was: MAC calculation)

At 08:42 PM 11/25/2002 Monday, Mark Miller wrote:
>At 06:21 PM 11/26/2002 Tuesday, Rob Withers wrote:
>>It will be interesting to analyze, and a remote debugging capablity will be
>Funny you should mention that.  We just had some interesting victories for 
>distributed debugging of E code.  I hope to explain more about this soon...

But first some history. 


As Chris Hibbert once pointed out to me, the stack-list view is central to 
the traditional debuggers we like, with one list entry per activation 
frame.  Below is an example of a traditional (Smalltalk-80-like) debugger 
organized in three major panes, where the "^^" marks indicate which item in 
each pane is selected (view the following in a fixed-width font):

|   factorial(1)                +-+ # Stack pane 
| ^^factorial(2)^^              +-+ # (with scroll bar)
|   factorial(3)                | |
| def factorial(n :int) :int {    | # Source pane
|     if (n <= 0) {               |
|         1                       |
|     } else {                    |
|         n * factorial(n - 1)    |
|     }       ^^^^^^^^^^^^^^^^    |
| }                               |
|   factorial |   3               | # Variable pane
| ^^n^^       |                   | # (value of selected variable
+-------------+-------------------+ #  on right)

Back when Dean & I were working together on Joule at Agorics, I became 
worried about how one would debug Joule computation,  since Joule is a purely 
asynch-message-oriented model of computation with no call-return stack.  In 
answer, Dean had yet another brilliant idea (when does the guy sleep), which 
he calls "space-time debugging":

In a space-time debugger, you start by replacing the stack pane 
by a causality pane.  Rather than a list view, the causality pane uses an 
outline-editor-like tree view.  Each entry still represents an activation 
(what happens when an object receives a message). But in the Joule 
computational model (as with its ancestors Actors and Concurrent Prolog) 
these activation frames are not suspended waiting on each other. Rather, 
these are simply the atomic steps by which computation proceeds. Each 
activation causes other activations by sending messages. The causality 
graph, considered purely in these terms, therefore forms the tree 
visualized by the causality pane.
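To make that model concrete, here is a minimal sketch in Python (with invented names; this is not Joule itself) of how activations sending messages form the causality tree such a pane would display:

```python
# A sketch of the activation model (invented names; not Joule itself).
# Each activation is an atomic step; the messages it sends become child
# activations, so causality forms a tree -- what a causality pane shows.

class Activation:
    def __init__(self, target, message, parent=None):
        self.target = target      # name of the receiving object
        self.message = message    # the message being delivered
        self.parent = parent      # causality-parent activation, if any
        self.children = []        # activations this one caused

    def send(self, target, message):
        # Record an eventual send; its delivery is a child activation.
        child = Activation(target, message, parent=self)
        self.children.append(child)
        return child

def outline(act, depth=0):
    # Render the causality tree as an outline, one entry per activation.
    lines = [("  " * depth) + act.target + " <- " + act.message]
    for child in act.children:
        lines.extend(outline(child, depth + 1))
    return lines

root = Activation("foo", "bar(3)")
zap = root.send("zip", "zap(7)")
root.send("fnord", "glorp()")
zap.send("quux", "baz()")
print("\n".join(outline(root)))
```

Note there is no stack here at all: an activation's parent is the activation that sent it the message, nothing more.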
In Dean's conception, computation is advanced by a sort of visual pun.  As 
the outline is expanded on a particular branch for the first time, the 
debugger advances computation along that path.  When you are at the frontier 
of the outline, you are looking at live-but-suspended computation, and so 
the debugger can use actual variable values for the variable display.  When 
you select any prior event on the outline, some magic would have to occur to 
calculate what the values of those variables were at that previous time.  
There were a 
variety of paper designs for how to do this, but they were all hard. (How 
many ways can you spell "determinism"?)
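A toy model of that visual pun, under the assumption (mine, not Dean's actual design) that each frontier node holds a thunk which runs only when the node is first expanded:

```python
# A toy model of the "visual pun" (my assumption, not the real debugger):
# each node holds a suspended activation, and expanding a node for the
# first time is what actually advances the computation along that branch.

class LazyNode:
    def __init__(self, label, run):
        self.label = label
        self._run = run           # thunk performing this branch's next step
        self._children = None     # None means: still on the frontier

    def expand(self):
        # First expansion advances computation along this branch.
        if self._children is None:
            self._children = self._run()
        return self._children

def make_factorial_node(n):
    # Mirrors the factorial example: each activation causes one more.
    def run():
        return [] if n <= 0 else [make_factorial_node(n - 1)]
    return LazyNode("factorial(%d)" % n, run)

root = make_factorial_node(3)
frontier = root.expand()          # computation proceeds only on expansion
print([node.label for node in frontier])
```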

Doug Barnes and the Electric Communities gang brought immediate call-return 
locally-sequential programming together with Joule-like event-send  
programming for the first time in creating Original-E.  Monica Anderson of 
EC built, for Original-E, the first implementation of a space-time 
debugger.  This happened before Original-E specified partially-ordered 
message delivery.  In Monica's 
debugger, the user could advance computation along any frontier of the 
outline, and could thereby try to make those non-deterministic choices most 
likely to cause a bug to manifest.  Data on the frontier was inspectable as 
in Dean's design, but past activations could not be selected, and so no 
restoration of prior data states needed to happen.


Even though Original-E combines the two computational models, Monica's 
debugger didn't have a stack view, since stack-oriented debuggers for Java 
were hard to build and the existing ones were not pluggable components.  
Nevertheless, what I've dreamt of for E is something like: 

| - foo <- bar(3)               +-+ # Causality pane 
| ^^+ zip <- zap(7)^^           +-+ # (with scroll bar
|   - fnord <- glorp()          | | #  and +/- for outline)
| ^^zip <- zap(7)^^             +-+ # Stack pane
|   zippity.dodah()             +-+ # (with scroll bar)
|                               | |
| def zippity {                   |
|     to dodah() {                |
|         zip <- zap(x)           |
|         ^^^^^^^^^^^^^           |
|         blah.blah()             |
|         fnord <- glorp()        |
|     }                           |
| }                               |
|   fnord     |   7               | # Variable pane
| ^^x^^       |                   | # (value of selected variable
+-------------+-------------------+ #  on right)

The top pane shows that the event started by "foo" receiving the message 
"bar(3)" in turn sent the message "zap(7)" to "zip" and sent "glorp()" to  
"fnord".  The events in which these messages are received are children of 
the "foo <- bar(3)" event.  The "+" before "zip <- zap(7)" indicates that 
this part of the outline may be expanded, and that therefore this event 
caused yet other events.

On selecting "zip <- zap(7)", the Stack pane shows the call stack as it 
was when this event was spawned, i.e., when the "zap(7)" message was 
eventually-sent to "zip".  "zip <- zap(7)" is therefore at the top of this 
stack, and its causality-parent, "foo <- bar(3)", appears as an immediate 
delivery at the bottom of this stack.

Unfortunately, restoring past data states remains as hard as it ever was,  
and support for deterministic replay seems almost as far off as when I first 
started fantasizing about it. Fortunately, we could get much of the bang for 
very little of the buck.

In rebuilding the event-loop mechanisms for the next release, I paid a lot 
more attention to regaining another piece of technology we had at EC  -- 
causality tracing.  Much as Java's Throwable includes a stack trace for 
helping to figure out why something bad happened, EC's causality tracing 
allows one to additionally trace backward through the chain of prior events. 
Instrumentation that allows such looking backwards also allows one to look 
forward, and log the causal chain of events.
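As a rough illustration (invented names here, not EC's actual API), each event can carry the stack captured at send time plus a link to its causing event; following the links walks the chain backward, and logging the same records as they are created captures it forward:

```python
# A rough sketch of causality tracing (invented names, not EC's API).
# Each event records the stack as of the moment its message was sent,
# plus a link to the sending event.  Following the links walks backward
# through the chain of prior events; logging each record as it is made
# captures the causal chain of events going forward.
import traceback

class Event:
    counter = 0

    def __init__(self, description, sent_by=None):
        Event.counter += 1
        self.serial = Event.counter
        self.description = description
        self.sent_by = sent_by    # the causing event, if known
        # Like the stack trace in a Throwable, but captured at send time.
        self.send_stack = traceback.format_stack()

    def causality_trace(self):
        # Walk backward through the chain of causing events.
        event, chain = self, []
        while event is not None:
            chain.append("event #%d: %s" % (event.serial, event.description))
            event = event.sent_by
        return chain

e1 = Event("start <- run()")
e2 = Event("repl <- loop()", sent_by=e1)
e3 = Event("parser <- parse()", sent_by=e2)
for line in e3.causality_trace():
    print(line)
```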

For the next release I also rebuilt the interp-loop mechanism, replacing 
old crufty Java-written thread code with new somewhat-less-crufty 
E-written event-loop code.  (If you don't know what the interp-loop 
mechanism is, don't worry about it.)  Unfortunately, the new code 
had a subtle but fatal bug that I couldn't find using IDEA's excellent 
traditional thread-and-stack-oriented debugger, nor could Terry and I find 
the bug in a code review.

Instead, we realized that the causality tracing information could be logged 
and used to produce a post-mortem version of the space-time debugger,  but 
without any data state.  I modified the causality tracing mechanism to 
capture both Java and E stack traces (the latter thanks to a suggestion by 
Dean), and to log the E stacks as of the time each message is sent.  Here's 
a sample entry:

=== 2002-11-24T05:05:12.729Z ( DBG
causality:  :  
<Eventual ref> <- run()
--- event: <Vat parsing -.e in <runs in parsing -.e>>:3
--- sent by: <Vat start in <runs in start>>:2
---  @ run/0: <file:/C:/e/src/esrc/org/erights/e/elang/cmd/makeRepl.emaker#:span::126:29::126:30>
---  @ org.erights.e.elang.cmd.makeRepl$makeRepl__C$repl__C$loop_3__C#run/0
---  @ run/0: <file:/C:/e/src/esrc/org/erights/e/elang/cmd/makeRepl.emaker#:span::142:16::142:16>
---  @ org.erights.e.elang.cmd.makeRepl$makeRepl__C$repl__C#run/0
---  @ run/0: <file:/C:/e/src/esrc/org/erights/e/elang/launcher/eLauncherAuthor.emaker#:span::167:20::167:20>
---  @ org.erights.e.elang.launcher.eLauncherAuthor$eLauncherAuthor__C$eLauncher__C$2#run/0

The stack trace was captured at the moment that event #2 in Vat "start" sent 
the inter-vat message that, on delivery, will be event #3 in Vat "parsing".  
The log captures both E source position spans and fully qualified names.
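For illustration, here is a sketch (using a simplified, hypothetical entry format, not the actual log syntax above) of how such event/sent-by pairs are enough to rebuild the causality outline post-mortem:

```python
# A sketch of rebuilding the causality outline from a log (the entry
# format below is simplified and hypothetical, not the real log syntax).
# Each entry names the delivered event and the event that sent it, which
# is exactly the parent/child relation of the space-time outline.
import re
from collections import defaultdict

LOG = """\
event: <Vat start>:1 sent by: -
event: <Vat start>:2 sent by: <Vat start>:1
event: <Vat parsing>:3 sent by: <Vat start>:2
event: <Vat parsing>:4 sent by: <Vat start>:2
"""

children = defaultdict(list)
roots = []
for entry in LOG.splitlines():
    m = re.match(r"event: (.+) sent by: (.+)", entry)
    event, parent = m.group(1), m.group(2)
    if parent == "-":
        roots.append(event)       # an event with no recorded cause
    else:
        children[parent].append(event)

def render(event, depth=0):
    # One outline entry per event, indented under its causality-parent.
    lines = [("  " * depth) + event]
    for child in children[event]:
        lines.extend(render(child, depth + 1))
    return lines

for top in roots:
    print("\n".join(render(top)))
```

Colors per vat or thread, as Terry and I used on paper, would layer on top of exactly this structure.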

Terry & I then sat down with the logs and a big sheet of paper and hand 
simulated what our fantasy post-mortem debugger would show for this log.  The 
computation we were examining involved three vats and one non-vat thread, all in 
one JVM. We used the same visualization as above, but used colors to 
distinguish these four contexts. In the process of hand simulating the 
debugger ui, we came to understand the code better, and we found the bug.

Even though all four contexts were in one JVM in this case, I still consider 
this a triumph for distributed debugging.  It will be some work to instrument 
CapTP to enable this kind of post-mortem for inter-vat computation, and it 
will take some care to ensure we don't blow security while we do so. But 
given that, this ui design and everything Terry and I experienced while 
simulating it would apply equally well if the vats had been distributed 
among different machines. I am unaware of anything else in the universe that 
gives this level of practical support for building distributed systems.

Besides helping debug, Terry points out that this space-time ui should also 
make it much easier for people to learn E's computational model. 

Text by me above is hereby placed in the public domain.