
How does event scheduling and rollback work under PRC?

Under PRC, as shown in Figure 2, event scheduling and rollback are essentially duals of one another. In the forward execution case, event scheduling occurs by having each processor remove the smallest unprocessed event from its local priority queue and process it. Rollback is then realized by normal event scheduling, except that timestamps now take negative values under the above scheme. Here, the timestamp of the last event processed by the LP being undone is negated and re-processed along the reverse code path. Each LP's random number generator seed values are contained within the event, so the send time of the event can be regenerated by a reverse call to the random number generator, as described by Property 3 in Section 3. The reverse event handler re-exchanges the seed and message data items with the LP state, which undoes any state changes made by processing the event. Moreover, the reverse event handler recreates and schedules other events into the past of this LP as well as the pasts of other LPs. If the destination LPs are moving forward, they too will be switched to rollback or reverse execution on receipt of a reverse event. This scheme may sound quite straightforward on the surface, but the implications of this approach to rollback raise a number of issues.
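To illustrate the duality, consider the following C++ sketch of a forward/reverse event-handler pair. It is only a minimal sketch: the Event and LP fields, the rng_next and delay_from helpers, and the deliver call are hypothetical names chosen for illustration, not drawn from an existing simulator.

#include <cstdint>
#include <utility>
#include <vector>

struct Event {
    double        ts;      // negative values mark reverse ("rollback") events
    std::uint64_t seed;    // RNG seed carried in the event (Property 3)
    double        payload; // message data exchanged with LP state
    int           dst_lp;
};

struct LP {
    double        state = 0.0;
    std::uint64_t seed  = 1u;   // current seed of the LP's generator
};

static std::uint64_t rng_next(std::uint64_t s) {   // simple LCG step
    return s * 6364136223846793005ULL + 1442695040888963407ULL;
}
static double delay_from(std::uint64_t s) {        // send-time offset drawn from the seed
    return 1.0 + double(s % 1000) / 1000.0;
}

void deliver(std::vector<Event>& net, Event e) { net.push_back(e); }  // stand-in for the scheduler

// Forward handler: exchange message data with LP state, advance the RNG, and
// schedule a new event into the logical future.  The seed used for the draw
// stays inside the event so the reverse path can regenerate the send time.
void forward_handler(LP& lp, Event& e, std::vector<Event>& net) {
    e.seed  = lp.seed;                  // remember the pre-event seed in the event
    lp.seed = rng_next(lp.seed);        // advance the generator
    std::swap(lp.state, e.payload);     // state change realized as a reversible exchange
    deliver(net, Event{e.ts + delay_from(lp.seed), 0, lp.state, e.dst_lp + 1}); // seed filled when processed
}

// Reverse handler: the same steps undone in the opposite order.  The send time
// is regenerated from the saved seed, the seed and message data are
// re-exchanged with LP state, and an event with a negated timestamp is
// scheduled so the destination LP is driven backward as well.
void reverse_handler(LP& lp, Event& e, std::vector<Event>& net) {
    double send_ts = e.ts + delay_from(rng_next(e.seed));  // regenerate the original send time
    std::swap(lp.state, e.payload);     // undo the state exchange
    lp.seed = e.seed;                   // roll the generator back to its pre-event value
    deliver(net, Event{-send_ts, 0, lp.state, e.dst_lp + 1});
}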

First, we observe that because PRC by definition uses only the current state of an LP, together with the last event (or last set of events) it processed and the forward events it most recently sent, to undo its forward computation, events can be committed locally without knowledge of the global virtual time (GVT) of the system. Consequently, there is no need for GVT calculations or fossil collection in optimistic PRC parallel simulation systems. Based on our previous experience with distributed Time Warp systems, we believe the removal of GVT and fossil collection computations will allow parallel PRC models to scale when run on large, distributed processor configurations [17,18,22].

While the removal of GVT and fossil collection overheads may be the silver lining for optimistic parallel PRC simulations, rollback processing does have a potential dark side. Because rollback is realized as the reverse of forward processing, rollback itself is an optimistic computation and is subject to being rolled back as well. Unlike Time Warp, in PRC rolling back an LP results in the scheduling of reverse events to potentially other LPs. Thus, to complete a rollback an LP may have to wait on a reverse event from another LP. Because such a causality chain is not known a priori, the LP has the option to continue processing its local set of reverse events. If a reverse event arrives in the LP's rollback past, then the LP must roll back the rollback, which is equivalent to executing events in the forward direction (i.e., with increasing timestamps). Consequently, an oscillation between forward and reverse execution may ensue in order to complete a single rollback. If this oscillation spreads to all processors in the system, will the system livelock?
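The switching rule can be made concrete with a small sketch. This is purely illustrative: the Mode type, the on_reverse_event function, and the sign convention are assumptions layered on the negative-timestamp scheme above, not part of any existing implementation.

enum class Mode { Forward, Reverse };

// Decide how an LP reacts to a newly arrived reverse event.  "now" is the
// LP's current virtual time, which decreases while the LP is rolling back.
Mode on_reverse_event(double now, double reverse_ts, Mode mode) {
    // A reverse event carries a negated timestamp; |reverse_ts| is the point
    // in virtual time the sender wants this LP driven back through.
    double target = -reverse_ts;
    if (mode == Mode::Reverse && target > now) {
        // The event lands in the LP's rollback past: the LP has already
        // reversed beyond the requested point, so it must ``rollback the
        // rollback'', i.e., re-execute events forward up to target.
        return Mode::Forward;
    }
    return Mode::Reverse;   // otherwise begin (or continue) reverse execution
}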

To avoid this problem, we observe that it might be possible to exploit the lookahead discovered along the forward execution path to perform the rollback in a conservative, non-risky way. That is, PRC would process events optimistically in the forward direction yet roll back using a conservative synchronization approach, so as to avoid rollback thrashing. Because the lookahead of the system is determined ``on the fly'' as the forward path is explored, it need not be known a priori.

Because support for a bi-directional processing capability is required, the event scheduler must have two priority queues. The first contains events that are scheduled into the logical future of the simulator, either forwards or backwards depending on the start and end time values specified at runtime. The second queue contains all events scheduled into the logical past (i.e., events that ``roll back'' an LP). The scheduling discipline processes all events from the logical-past queue before processing any events from the logical-future queue. We believe this approach will ensure that LPs rolling back get first priority, thus stamping out the spread of incorrect computation faster and ideally making the system more stable.
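A minimal sketch of this two-queue discipline follows, assuming one std::priority_queue per direction; the Ev type, the comparator, and the next_event helper are hypothetical names, not the scheduler's actual interface.

#include <queue>
#include <vector>

struct Ev { double ts; int dst_lp; };

struct EarliestFirst {
    // std::priority_queue is a max-heap, so invert the comparison to pop the
    // smallest timestamp first.  Because reverse events carry negated
    // timestamps, ``smallest first'' on the logical-past queue pops the event
    // with the largest forward time first, exactly the order a rollback needs.
    bool operator()(const Ev& a, const Ev& b) const { return a.ts > b.ts; }
};

using EvQueue = std::priority_queue<Ev, std::vector<Ev>, EarliestFirst>;

// Drain the logical-past queue before touching the logical-future queue, so
// that LPs rolling back always get first priority.
bool next_event(EvQueue& past, EvQueue& future, Ev& out) {
    EvQueue& q = !past.empty() ? past : future;
    if (q.empty()) return false;   // nothing left to process
    out = q.top();
    q.pop();
    return true;
}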


Christopher D. Carothers 2002-03-07