Failure Recovery: When the Cure Is Worse Than the Disease

Authors:

Zhenyu Guo, Sean McDirmid, Mao Yang, and Li Zhuang, Microsoft Research Asia; Pu Zhang, Microsoft Research Asia and Peking University; Yingwei Luo, Peking University; Tom Bergan, Microsoft Research and University of Washington; Madan Musuvathi, Zheng Zhang, and Lidong Zhou, Microsoft Research Asia

Abstract: 

Cloud services inevitably fail: machines lose power, networks become disconnected, pesky software bugs cause sporadic crashes, and so on. Unfortunately, failure recovery itself is often faulty; e.g. recovery can accidentally recursively replicate small failures to other machines until the entire cloud service fails in a catastrophic outage, amplifying a small cold into a contagious deadly plague! We propose that failure recovery should be engineered foremost according to the maxim of primum non nocere, that it “does no harm.” Accordingly, we must consider the system holistically when failure occurs and recover only when observed activity safely allows for it.

BibTeX
@inproceedings {181943,
author = {Zhenyu Guo and Sean McDirmid and Mao Yang and Li Zhuang and Pu Zhang and Yingwei Luo and Tom Bergan and Madan Musuvathi and Zheng Zhang and Lidong Zhou},
title = {Failure Recovery: When the Cure Is Worse Than the Disease},
booktitle = {14th Workshop on Hot Topics in Operating Systems (HotOS XIV)},
year = {2013},
address = {Santa Ana Pueblo, NM},
url = {https://www.usenix.org/conference/hotos13/session/guo},
publisher = {USENIX Association},
month = may
}