DSIM: A Distributed Optimistic Parallel Discrete Event Simulator

Gilbert Chen and Boleslaw Szymanski, Center for Pervasive Computing and Networking, Rensselaer Polytechnic Institute

Introduction

Clusters have steadily and gradually become the main stream of high performance computing platforms -- the number of supercomputers among the Top500 list that are cluster-based jumped from 208 in November 2003 to 291 in June 2004, an increase of 40% in merely 7 months.  Without doubt, this trend will continue in the near feature, as clusters can be readily built from off-the-shelf processors and fast networks.

The PDES community is trying hard to keep up with this trend.  Conservative simulations on more than a cluster with more than 1000 processors have been reported.  However, we hardly see any large-scale distributed simulation based on optimistic protocols (the largest one was on 64 processors), which offer a greater prospect than the conservative protocols do, because they are less dependent on lookahead.

DSIM is for clusters. At its heart is several novel features which enable the simulator to run at an aggregate speed of  218 million events per second on 1033 processors (the lemieux cluster at the Pittsburgh Supercomputing Center, ranked 34th as of November 2004).  The speedup is 296, compared to the sequential execution of the same model but of a much smaller size (1/1024). The key features of DSIM include:

Time Quantum GVT Algorithm This new GVT algorithm needs no message acknowledgment, relies on short messages whose length is constant regardless how many processors are being used, and can deliver accurate GVT values continuously with minimum latency and overhead.
Local Fossil Collection In DSIM, each LP maintains its own processed event list which can be easily kept sorted.  An LP attempts to collect fossil events only before it is about to process a new event. It is very likely that the memory released by the reclaimed events will be immediately reused during the processing of the new event, resulting in better locality.
Shared Memory / Distributed A DSIM simulation can run on both shared-memory and distributed-memory machines, without changing the LP models.  The only change that needs to be made to support both type of machines is to define an LP instantiation function differently. 
Irrevocable Actions Unlike GTW, which provides special I/O events for actions that cannot be undone, DSIM takes a more general approach.  Besides the Process() and Undo() methods, which are called to process and undo an event respectively, each LP can define a Commit() method which is to be called when the event is being fossil-collected.  Any irrevocable actions must be performed inside this Commit() function.
Event Memory Management DSIM allows events of different sizes to be defined within one simulation. Events are allocated and freed by a specially designed memory allocator class called CorsaAllocator.  Moreover, events of the same size will be managed by the same CorsaAllocator instance, making event memory manipulation extremely efficient.  All this is hidden from the users, as event types do not need to be explicit declared.  Any data type can be directly used as an event type.
State Saving / Reverse Computation DSIM users can freely choose either state saving or reverse computation, or a mix of both, to make the events undoable.
Artificial Rollbacks The main difficulty with debugging PDES programs is that errors are hard to reproduced.  Even adding printf() functions or merely compiling the program with different options may dramatically change the behaviors of these programs.  DSIM is equipped with an unreliable event queue, which, with a specified probability, can retrieve a random event that is not necessarily the earliest.  Thus, rollbacks are now artificially introduced into the sequential execution, and the sequence of rollbacks will always remain the same every time you run the program.

Getting, Installing, and Testing DSIM

Examples: