Venue:
Hyatt Regency Santa Clara
5101 Great America Pkwy
Santa Clara, CA 95054

Monday, March 16, 2015 - 1:30pm-2:30pm

 Coburn Watson, Netflix, Inc.

Abstract: 

The Netflix architecture is based on hundreds of microservices running in the cloud at massive scale across numerous AWS regions. Achieving excellent availability of such a complex system requires a capable operations methodology. At Netflix we have a shared services team which seeks to lower operational barriers for individual service teams in order to improve both aggregate and microservice-level reliability. The challenge lies in finding the right balance of responsibility between a shared service support team and the devops engineers on the microservice team itself. We have taken an approach in which tooling and associated methodologies developed by our Operations Engineering organization tackle the following subset of operational activities at a platform-level:

  • Continuous integration and deployment
    • automated staggered deployment of microservice code across cloud regions
    • automated analysis of canary versus baseline code
  • Tuning of circuits in the system which respond to localized failures
  • Improved observability for both macro and micro performance dimensions
  • Identification and termination of server instances which are outliers
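
As an illustration of the last item above, outlier identification can be sketched with a median-based test. This is only a minimal sketch and not the method described in the talk: the function name `find_outlier_instances`, the latency metric, the instance ids, and the 3.5 threshold are all assumptions chosen for the example.

```python
from statistics import median


def find_outlier_instances(metric_by_instance, threshold=3.5):
    """Return ids of instances whose metric deviates strongly from the fleet.

    Hypothetical helper: uses the modified z-score, which is based on the
    median absolute deviation (MAD), so that one bad instance does not skew
    the baseline it is compared against.
    """
    values = list(metric_by_instance.values())
    fleet_median = median(values)
    mad = median(abs(v - fleet_median) for v in values)
    if mad == 0:
        # At least half the fleet reports the same value; nothing to rank against.
        return []
    outliers = []
    for instance_id, value in metric_by_instance.items():
        # 0.6745 scales MAD so the score is comparable to a standard z-score.
        modified_z = 0.6745 * (value - fleet_median) / mad
        if abs(modified_z) > threshold:
            outliers.append(instance_id)
    return outliers


# Assumed sample data: mean request latency in milliseconds per instance.
fleet = {"i-a": 101.0, "i-b": 98.0, "i-c": 103.0, "i-d": 99.0, "i-e": 240.0}
print(find_outlier_instances(fleet))  # prints ['i-e']
```

In a real platform the flagged instances would presumably be handed to a separate step that decides whether to terminate them; that step is left out here because the abstract does not describe it.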

By eliminating such undifferentiated heavy lifting, teams can focus on product development rather than being mired in operational complexity. The key benefit is improved engineering velocity alongside reliability. Any organization must decide where to draw the line for operational responsibilities, and the Netflix "Freedom and Responsibility" culture is no exception.

This presentation will cover the operational complexities we have abstracted away from our microservice engineering teams, the associated decision factors, and future direction of the program.

Coburn leads the Cloud Performance and Reliability Engineering team at Netflix. His team works to optimize the use of massive cloud resources with a keen focus on system performance and reliability. Prior to Netflix, he was at Rearden Commerce, HP, and numerous other companies, working to improve the performance of large-scale distributed systems.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@conference {208894,
author = {Coburn Watson},
title = {Netflix {RaaS}: Reliability as a Service},
year = {2015},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}

© USENIX