A fault-tolerance protocol for parallel applications with communication imbalance
2018-04-06T14:33:39Z
Proceedings - Symposium on Computer Architecture and High Performance Computing. Vol 2016 - January p. 162-169
Article
The predicted failure rates of future supercomputers
threaten the groundbreaking research that large machines are
expected to foster. Therefore, resilient extreme-scale applications
are an absolute necessity to effectively use the new generation
of supercomputers. Rollback-recovery techniques have been
traditionally used in HPC to provide resilience. Among those
techniques, message logging provides the appealing features of
saving energy, accelerating recovery, and incurring a low performance
penalty. Its increased memory consumption is, however, an
important downside. This paper introduces memory-constrained
message logging (MCML), a general framework for decreasing the
memory footprint of message-logging protocols. In particular, we
demonstrate the effectiveness of MCML in keeping message
logging feasible for applications with substantial communication
imbalance. This type of application appears in many scientific
fields. We present experimental results with several parallel codes
running on up to 4,096 cores. Using those results and an analytical
model, we predict MCML can reduce execution time by up to 25%
and energy consumption by up to 15% at extreme scale.
Instituto Tecnológico de Costa Rica
Lidia Gómez
Cartago - 300 m east of Estadio Fello Meza. P.O. Box 159-7050.
2550-2263, 2550-2365