Pooria Namyar and Arvin Ghavidel, University of Southern California; Daniel Crankshaw, Daniel S. Berger, Kevin Hsieh, and Srikanth Kandula, Microsoft; Ramesh Govindan, University of Southern California; Behnaz Arzani, Microsoft
Cloud providers install mitigations to reduce the impact of network failures within their datacenters. Existing network mitigation systems rely on simple local criteria or global proxy metrics to determine the best action. In this paper, we show that we can support a broader range of actions and select more effective mitigations by directly optimizing end-to-end flow-level metrics and analyzing actions holistically. To achieve this, we develop novel techniques to quickly estimate the impact of different mitigations and rank them with high fidelity. Our results on incidents from a large cloud provider show orders of magnitude improvements in flow completion time and throughput. We also show our approach scales to large datacenters.
NSDI '25 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)
author = {Pooria Namyar and Arvin Ghavidel and Daniel Crankshaw and Daniel S. Berger and Kevin Hsieh and Srikanth Kandula and Ramesh Govindan and Behnaz Arzani},
title = {Enhancing Network Failure Mitigation with {Performance-Aware} Ranking},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {335--357},
url = {https://www.usenix.org/conference/nsdi25/presentation/namyar},
publisher = {USENIX Association},
month = apr
}
