

Scalability and Performance of Applications in HPC

Speaker: Dr. Tanzima Islam
Lawrence Livermore National Laboratory

April 14, 2016 - 4:00 p.m. to 5:00 p.m.
Location: Center for Biotechnology and Interdisciplinary Studies, Bruggeman Room
Hosted By: Dr. Christopher Carothers (x2930)


In this talk, I present large-scale system solutions for scalable checkpointing in two different execution environments. First, I present falcon, a distributed solution for storing checkpoint files on available shared resources in cycle-sharing systems. Experiments on DiaGrid, a production grid spanning multiple institutions in the Midwest with 50K processes, show that falcon improves the performance of benchmark applications by between 11% and 44% compared to Condor’s checkpointing solution, depending on the size of the checkpoints and the location of storage. Second, I present mcrEngine, a large-scale system for HPC environments that I designed and developed. It leverages data semantics to aggregate and compress checkpoint files more effectively before storing them on stable storage, such as a parallel file system. On a teraflop supercomputing cluster at LLNL, mcrEngine achieves approximately 4x compression on hard-to-compress scientific datasets and improves application performance by reducing recovery overhead by more than 62%.


Tanzima Islam is a postdoctoral research staff member in the scalability group at Lawrence Livermore National Laboratory. She earned her Ph.D. in Computer Engineering from Purdue University in 2013. Her research covers several areas of building large-scale, complex systems solutions for improving the resilience of both high-throughput and high-performance computing environments. Her research interests lie in tackling challenges at the intersection of the three major thrust areas in High Performance Computing (HPC) research: resilience, performance, and power. She is currently leading a project on developing methodologies for evaluating proxy applications, which are vehicles of DOE’s exascale co-design efforts. In addition to close collaborations within LLNL, her collaborators include the Atomic Weapons Establishment (AWE), University of Florida, University of Hamburg, University of Illinois at Urbana-Champaign, Purdue University, and University of Arizona. Results from her research have been published in leading computer systems and HPC venues, including SC and IPDPS. Her research papers on building reliable and scalable checkpointing solutions received Best Paper nominations at SC’09 and SC’12. She is also the recipient of the 2014 Director’s Science & Technology award at LLNL for excellence in publication. Islam earned her bachelor’s degree from Bangladesh University of Engineering and Technology in 2006.

Last updated: April 8, 2016