From: "Rob Withers" <rwithers12@attbi.com>
Date: Wed, 27 Nov 2002 08:53:27 -0500
Subject: Re: [e-lang] Distributed Space-Time Debugging (was: MAC calculation)

Mark, 

This is very interesting.  I need to run to work, but I wanted to briefly mention that running in 'simulated' mode would imply this instrumentation and could be defined as a distributed simulation.  If you encounter a security boundary, where what happens inside a capability is opaque to your observation, you could at least capture what happens across that boundary. 

In Smalltalk, an object referenced from an activation frame may indeed have had its state changed by operations later in the same stack, due to side effects.  Example: aCollection removeAll: anotherColl.  Since this can occur within one thread of immediate call/return, it should be no great surprise that it happens with multiple 'threads'.  An interesting idea comes to mind: since we can gain access to the resolvers of promises, we could freeze them from resolving and step them. 
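
For instance, here's a minimal sketch of that idea in E, assuming the usual Ref.promise() promise/resolver pair and a println in scope.  Whoever holds the resolver decides when the pending reaction gets to run:

    # Obtain a promise and the resolver that controls it.
    def [result, resolver] := Ref.promise()

    # This reaction stays queued until the promise is resolved.
    when (result) -> done(value) {
        println(`got $value`)
    } catch problem {
        println(`oops: $problem`)
    }

    # A debugger holding the resolver can freeze the computation simply by not
    # resolving yet, and step it by choosing the moment at which to resolve:
    resolver.resolve(42)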

Though it is sequential, the Opentalk framework in VisualWorks does a very nice job of constructing a distributed stack, where each frame is marked by the host and thread that own that particular activation record.  If I set a remote breakpoint, the remote stack pops up locally, including any remote calls made (possibly back to the local host :)

cheers,
robert 
  ----- Original Message ----- 
  From: Mark Miller 
  To: e-lang@mail.eros-os.org 
  Sent: Tuesday, November 26, 2002 3:22 AM
  Subject: [e-lang] Distributed Space-Time Debugging (was: MAC calculation)


  At 08:42 PM 11/25/2002 Monday, Mark Miller wrote: 

    At 06:21 PM 11/26/2002 Tuesday, Rob Withers wrote: 
    >It will be interesting to analyze, and a remote debugging capability will be
    >critical.[...]

    Funny you should mention that.  We just had some interesting victories for 
    distributed debugging of E code.  I hope to explain more about this soon... 

  But first some history. 

  <history> 

  As Chris Hibbert once pointed out to me, the stack-list view is central to 
  those traditional debuggers we like, with one list entry per activation 
  frame.  The example below is a traditional (Smalltalk-80-like) debugger 
  organized into three major panes, where the "^" marks which item in each 
  pane is selected (view the following in a fixed-width font):

  +-------------------------------+-+
  |   factorial(1)                +-+ # Stack pane 
  | ^^factorial(2)^^              +-+ # (with scroll bar)
  |   factorial(3)                | |
  +-------------------------------+-+
  | def factorial(n :int) :int {    | # Source pane
  |     if (n <= 0) {               |
  |         1                       |
  |     } else {                    |
  |         n * factorial(n - 1)    |
  |     }       ^^^^^^^^^^^^^^^^    |
  | }                               |
  +-------------+-------------------+
  |   factorial |   3               | # Variable pane
  | ^^n^^       |                   | # (value of selected variable
  +-------------+-------------------+ #  on right)

  Back when Dean & I were working together on Joule at Agorics, I became 
  worried about how one would debug Joule computation, since Joule is a purely 
  asynch-message-oriented model of computation with no call-return stack.  In 
  answer, Dean had yet another brilliant idea (when does the guy sleep?), which 
  he calls "space-time debugging":

  In a space-time debugger, you start by replacing the stack pane with a 
  causality pane.  Rather than a list view, the causality pane uses an 
  outline-editor-like tree view.  Each entry still represents an activation 
  (what happens when an object receives a message).  But in the Joule 
  computational model (as with its ancestors, Actors and Concurrent Prolog), 
  these activation frames are not suspended waiting on each other.  Rather, 
  they are simply the atomic steps by which computation proceeds.  Each 
  activation causes other activations by sending messages.  The causality 
  graph, considered purely in these terms, therefore forms the tree visualized 
  by the tree view.

  In Dean's conception, computation is advanced by a sort of visual pun.  As 
  the outline is expanded along a particular branch for the first time, the 
  debugger advances computation along that path.  When you are at the frontier 
  of the outline, you are looking at live-but-suspended computation, and so 
  can use actual variable values for the variable display.  When you select 
  any prior event in the outline, some magic has to occur to calculate what 
  the values of those variables were at that previous time.  There were a 
  variety of paper designs for how to do this, but they were all hard.  (How 
  many ways can you spell "determinism"?)

  Doug Barnes and the Electric Communities gang brought immediate call-return, 
  locally-sequential programming together with Joule-like event-send 
  programming for the first time in creating Original-E.  Monica Anderson of 
  EC built the first implementation of a space-time debugger, for Original-E.  
  This happened before Original-E specified partially-ordered message delivery 
  http://www.erights.org/elib/concurrency/partial-order.html .  In Monica's 
  debugger, the user could advance computation along any frontier of the 
  outline, and could thereby try to make the non-deterministic choices most 
  likely to cause a bug to manifest.  Data on the frontier was inspectable as 
  in Dean's design, but past activations could not be selected, so no 
  restoration of prior data states needed to happen.

  </history> 

  Even though Original-E combines the two computational models, Monica's 
  debugger didn't have a stack view, as stack-oriented debuggers for Java were 
  hard to build, and the existing ones were not pluggable components.  
  Nevertheless, what I've dreamt of for E is something like:

  +-------------------------------+-+
  | - foo <- bar(3)               +-+ # Causality pane 
  | ^^+ zip <- zap(7)^^           +-+ # (with scroll bar
  |   - fnord <- glorp()          | | #  and +/- for outline)
  +-------------------------------+-+
  | ^^zip <- zap(7)^^             +-+ # Stack pane
  |   zippity.dodah()             +-+ # (with scroll bar)
  |   foo.bar(3)                  | |
  +-------------------------------+-+
  | def zippity {                   |
  |     to dodah() {                |
  |         zip <- zap(x)           |
  |         ^^^^^^^^^^^^^           |
  |         blah.blah()             |
  |         fnord <- glorp()        |
  |     }                           |
  | }                               |
  +-------------+-------------------+
  |   fnord     |   7               | # Variable pane
  | ^^x^^       |                   | # (value of selected variable
  +-------------+-------------------+ #  on right)

  The top pane shows that the event started by "foo" receiving the message 
  "bar(3)" in turn sent the message "zap(7)" to "zip" and sent "glorp()" to 
  "fnord".  The events in which these messages are received are children of 
  the "foo <- bar(3)" event.  The "+" before "zip <- zap(7)" indicates that 
  this part of the outline may be expanded, and therefore that this event 
  caused yet other events.

  On selecting "zip <- zap(7)", the Stack pane shows the call stack as it 
  was when this event was spawned, i.e., when the "zap(7)" message was 
  eventually-sent to "zip".  "zip <- zap(7)" is therefore at the top of this 
  stack, and its causality-parent, "foo <- bar(3)", appears as an immediate 
  delivery at the bottom of this stack.
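
  To make the mapping concrete, here is a minimal E sketch that would produce 
  exactly the causality tree and stack shown above.  The object and message 
  names come from the mock-up; the stub definitions of "zip", "fnord", "blah", 
  and "x" are just assumptions so the sketch stands alone:

    def x := 7

    # Stub collaborators, standing in for whatever "zippity" really talks to.
    def zip   { to zap(n)  { n } }
    def fnord { to glorp() { null } }
    def blah  { to blah()  { null } }

    def zippity {
        to dodah() {
            zip <- zap(x)        # eventual send: spawns the child event "zip <- zap(7)"
            blah.blah()          # immediate call: stays within the current event
            fnord <- glorp()     # eventual send: spawns the child event "fnord <- glorp()"
        }
    }

    def foo {
        to bar(n) {
            zippity.dodah()      # immediate call: sits below "zip <- zap(7)" on the stack
        }
    }

    foo <- bar(3)                # its delivery is the root event "foo <- bar(3)"

  Delivering "foo <- bar(3)" runs foo.bar(3), which immediately calls 
  zippity.dodah(); the two eventual sends inside dodah() become the two child 
  events in the causality pane, while blah.blah() remains part of the parent 
  event.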

  Unfortunately, restoring past data states remains as hard as it ever was,  
  and support for deterministic replay seems almost as far off as when I first 
  started fantasizing about it. Fortunately, we could get much of the bang for 
  very little of the buck.

  In rebuilding the event-loop mechanisms for the next release, I paid a lot 
  more attention to regaining another piece of technology we had at EC -- 
  causality tracing.  Much as Java's Throwable includes a stack trace to help 
  figure out why something bad happened, EC's causality tracing additionally 
  lets one trace backward through the chain of prior events.  Instrumentation 
  that allows such looking backward also allows one to look forward, and to 
  log the causal chain of events.
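
  In other words, the bookkeeping only has to note, at each eventual send, 
  which event did the sending.  Here is a minimal sketch of that idea in E -- 
  the names are hypothetical, not ELib's actual SendingContext API -- where 
  each event carries a pointer to the context of its causality-parent:

    def makeEventContext(vatName, serial, parentContext) {
        def eventContext {
            to getVatName() { vatName }
            to getSerial()  { serial }
            to getParent()  { parentContext }    # null for a root event

            # Walk backward through the chain of prior events, reporting one
            # line per causality-parent, much like a stack trace.  "out" is
            # whatever println-capable object the caller supplies.
            to printCausalityTrace(out) {
                var context := eventContext
                while (context != null) {
                    out.println(`--- event: <Vat ${context.getVatName()}>:${context.getSerial()}`)
                    context := context.getParent()
                }
            }
        }
        eventContext
    }

  Logging those same contexts in the forward direction, as each message is 
  sent, is what produces entries like the sample shown further below.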

  For the next release I also rebuilt the interp-loop mechanism, replacing old 
  crufty Java-written thread code (Interp.java) with new somewhat-less-crufty 
  E-written event-loop code.  (If you don't know what the interp-loop 
  mechanism is, don't worry about it.)  Unfortunately, the new code had a 
  subtle but fatal bug that I couldn't find using IDEA's excellent traditional 
  thread-and-stack-oriented debugger, nor could Terry and I find the bug in a 
  code review.

  Instead, we realized that the causality tracing information could be logged 
  and used to produce a post-mortem version of the space-time debugger,  but 
  without any data state.  I modified the causality tracing mechanism to 
  capture both Java and E stack traces (the latter thanks to a suggestion by 
  Dean), and to log the E stacks as of the time each message is sent.  Here's 
  a sample entry:

  === 2002-11-24T05:05:12.729Z (SendingContext.trace:SendingContext.java:87) DBG
  causality:  :  
  <Eventual ref> <- run()
  --- event: <Vat parsing -.e in <runs in parsing -.e>>:3
  --- sent by: <Vat start in <runs in start>>:2
  ---  @ run/0: <file:/C:/e/src/esrc/org/erights/e/elang/cmd/makeRepl.emaker#:span::126:29::126:30>
  ---  @ org.erights.e.elang.cmd.makeRepl$makeRepl__C$repl__C$loop_3__C#run/0
  ---  @ run/0: <file:/C:/e/src/esrc/org/erights/e/elang/cmd/makeRepl.emaker#:span::142:16::142:16>
  ---  @ org.erights.e.elang.cmd.makeRepl$makeRepl__C$repl__C#run/0
  ---  @ run/0: <file:/C:/e/src/esrc/org/erights/e/elang/launcher/eLauncherAuthor.emaker#:span::167:20::167:20>
  ---  @ org.erights.e.elang.launcher.eLauncherAuthor$eLauncherAuthor__C$eLauncher__C$2#run/0

  The stack trace was captured at the moment that event #2 in Vat "start" sent 
  the inter-vat message that, on delivery, will be event #3 in Vat "parsing".  
  The log captures both E source position spans and fully qualified names.

  Terry & I then sat down with the logs and a big sheet of paper and 
  hand-simulated what our fantasy post-mortem debugger would show for this 
  log.  The computation we were examining involved three vats and one non-vat 
  thread, all in one JVM.  We used the same visualization as above, but used 
  colors to distinguish these four contexts.  In the process of hand-simulating 
  the debugger UI, we came to understand the code better, and we found the bug.

  Even though all four contexts were in one JVM in this case, I still consider 
  this a triumph for distributed debugging.  It will be some work to instrument 
  CapTP to enable this kind of post-mortem for inter-vat computation, and it 
  will take some care to ensure we don't blow security while we do so. But 
  given that, this UI design and everything Terry and I experienced while 
  simulating it would apply equally well if the vats had been distributed 
  among different machines. I am unaware of anything else in the universe that 
  gives this level of practical support for building distributed systems.

  Terry points out that, besides helping with debugging, this space-time UI 
  should also make it much easier for people to learn E's computational model. 


  ----------------------------------------
  Text by me above is hereby placed in the public domain 

          Cheers, 
          --MarkM