What is DevOps? - Mike Loukides, 2012-06-07
Adrian Cockcroft's article about NoOps at Netflix ignited a controversy that has been smouldering for some months. John Allspaw's detailed response to Adrian's article makes a key point: What Adrian described as "NoOps" isn't really. Operations doesn't go away. Responsibilities can, and do, shift over time, and as they shift, so do job descriptions. But no matter how you slice it, the same jobs need to be done, and one of those jobs is operations. What Adrian is calling NoOps at Netflix isn't all that different from Operations at Etsy. But that just begs the question: What do we mean by "operations" in the 21st century? If NoOps is a movement for replacing operations with something that looks suspiciously like operations, there's clearly confusion. Now that some of the passion has died down, it's time to get to a better understanding of what we mean by operations and how it's changed over the years.
At a recent lunch, John noted that back in the dawn of the computer age, there was no distinction between dev and ops. If you developed, you operated. You mounted the tapes, you flipped the switches on the front panel, you rebooted when things crashed, and possibly even replaced the burned-out vacuum tubes. And you got to wear a geeky white lab coat. Dev and ops started to separate in the '60s, when programmer/analysts dumped boxes of punch cards into readers, and "computer operators" behind a glass wall scurried around mounting tapes in response to IBM JCL. The operators also pulled printouts from line printers and shoved them in labeled cubbyholes, where you got your output filed under your last name.
The arrival of minicomputers in the 1970s and PCs in the '80s broke down the wall between mainframe operators and users, leading to the system and network administrators of the 1980s and '90s. That was the birth of modern "IT operations" culture. Minicomputer users tended to be computing professionals with just enough knowledge to be dangerous. (I remember when a new director was given the root password and told to "create an account for yourself" ... and promptly crashed the VAX, which was shared by about 30 users). PC users required networks; they required support; they required shared resources, such as file servers and mail servers. And yes, BOFH ("Bastard Operator from Hell") serves as a reminder of those days. I remember being told that "no one else" was having the problem I was having — and not getting beyond it until, at a company meeting, we found that everyone was having the exact same problem, in slightly different ways. No wonder we want ops to disappear. No wonder we wanted a wall between the developers and the sysadmins, particularly since, in theory, the advent of the personal computer and desktop workstation meant that we could all be responsible for our own machines.
But somebody has to keep the infrastructure running, including the increasingly important websites. As companies and computing facilities grew larger, the fire-fighting mentality of many system administrators didn't scale. When the whole company runs on one 386 box (like O'Reilly in 1990), mumbling obscure command-line incantations is an appropriate way to fix problems. But that doesn't work when you're talking hundreds or thousands of nodes at Rackspace or Amazon. From an operations standpoint, the big story of the web isn't the evolution toward full-fledged applications that run in the browser; it's the growth from single servers to tens of servers to hundreds, to thousands, to (in the case of Google or Facebook) millions. When you're running at that scale, fixing problems on the command line just isn't an option. You can't afford to let machines get out of sync through ad-hoc fixes and patches. Being told "We need 125 servers online ASAP, and there's no time to automate it" (as Sascha Bates encountered) is a recipe for disaster.
The response of the operations community to the problem of scale isn't surprising. One of the themes of O'Reilly's Velocity Conference is "Infrastructure as Code." If you're going to do operations reliably, you need to make it reproducible and programmatic. Hence virtual machines to shield software from configuration issues. Hence Puppet and Chef to automate configuration, so you know every machine has an identical software configuration and is running the right services. Hence Vagrant to ensure that all your virtual machines are constructed identically from the start. Hence automated monitoring tools to ensure that your clusters are running properly. It doesn't matter whether the nodes are in your own data center, in a hosting facility, or in a public cloud. If you're not writing software to manage them, you're not surviving.
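To make "infrastructure as code" concrete, here's a deliberately tiny sketch in Python. It is not Puppet's or Chef's actual syntax, and the package and service names are placeholders; the point is only that the desired configuration is data, and a small, re-runnable program converges each machine toward it.

```python
import subprocess

# Hypothetical desired state for a web node. Real tools (Puppet, Chef) use
# richer declarative languages; this sketch only illustrates the idea that
# configuration is data applied by code, identically on every machine.
DESIRED_STATE = {
    "packages": ["nginx", "ntp"],   # placeholder package names
    "services": ["nginx"],          # services that must be running
}

def converge(state):
    """Bring this machine toward the desired state; safe to re-run."""
    for pkg in state["packages"]:
        # Re-installing an already-present package is a no-op, so the same
        # script can run on every node, every time, without drift.
        subprocess.run(["apt-get", "install", "-y", pkg], check=True)
    for svc in state["services"]:
        subprocess.run(["service", svc, "start"], check=True)

if __name__ == "__main__":
    converge(DESIRED_STATE)
```

Run the same script on a dozen nodes or a thousand and they end up configured the same way; that reproducibility is what hand-typed incantations can't give you.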
Furthermore, as we move further and further away from traditional hardware servers and networks, and into a world that's virtualized on every level, old-style system administration ceases to work. Physical machines in a physical machine room won't disappear, but they're no longer the only thing a system administrator has to worry about. Where's the root disk drive on a virtual instance running at some colocation facility? Where's a network port on a virtual switch? Sure, system administrators of the '90s managed these resources with software; no sysadmin worth his salt came without a portfolio of Perl scripts. The difference is that now the resources themselves may be physical, or they may just be software; a network port, a disk drive, or a CPU may have nothing to do with a physical entity you can point at or unplug. The only effective way to manage this layered reality is through software.
So infrastructure had to become code. All those Perl scripts show that it was already becoming code as early as the late '80s; indeed, Perl was designed as a programming language for automating system administration. It didn't take long for leading-edge sysadmins to realize that handcrafted configurations and non-reproducible incantations were a bad way to run their shops. It's possible that this trend means the end of traditional system administrators, whose jobs are reduced to racking up systems for Amazon or Rackspace. But that's only likely to be the fate of those sysadmins who refuse to grow and adapt as the computing industry evolves. (And I suspect that sysadmins who refuse to adapt swell the ranks of the BOFH fraternity, and most of us would be happy to see them leave.) Good sysadmins have always realized that automation was a significant component of their job and will adapt as automation becomes even more important. The new sysadmin won't power down a machine, replace a failing disk drive, reboot, and restore from backup; he'll write software to detect a misbehaving EC2 instance automatically, destroy the bad instance, spin up a new one, and configure it, all without interrupting service. With automation at this level, the new "ops guy" won't care if he's responsible for a dozen systems or 10,000. And the modern BOFH is, more often than not, an old-school sysadmin who has chosen not to adapt.
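What might that automation look like? A minimal sketch, using Python and the AWS boto3 library; the health-check URL, the role=web tag, and the AMI ID are all assumptions made for illustration, not anything Amazon or Netflix prescribes.

```python
import urllib.request
import boto3  # AWS SDK for Python

AMI_ID = "ami-12345678"   # hypothetical image with the application baked in
ec2 = boto3.client("ec2", region_name="us-east-1")

def healthy(public_dns):
    """Assumed health check: does the instance answer HTTP on /health?"""
    try:
        with urllib.request.urlopen(f"http://{public_dns}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def replace_if_misbehaving(instance):
    """Terminate a failing instance and launch an identically sized one."""
    if healthy(instance.get("PublicDnsName", "")):
        return
    ec2.terminate_instances(InstanceIds=[instance["InstanceId"]])
    ec2.run_instances(ImageId=AMI_ID, InstanceType=instance["InstanceType"],
                      MinCount=1, MaxCount=1)

def sweep():
    """Check every running instance tagged role=web (the tag is an assumption)."""
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:role", "Values": ["web"]},
                 {"Name": "instance-state-name", "Values": ["running"]}])
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            replace_if_misbehaving(inst)
```

Run something like sweep() from cron or a small daemon and nobody has to notice the bad node, let alone drive to a data center.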
James Urquhart nails it when he describes how modern applications, running in the cloud, still need to be resilient and fault tolerant, still need monitoring, still need to adapt to huge swings in load, etc. But he notes that those features, formerly provided by the IT/operations infrastructures, now need to be part of the application, particularly in "platform as a service" environments. Operations doesn't go away, it becomes part of the development. And rather than envision some sort of uber developer, who understands big data, web performance optimization, application middleware, and fault tolerance in a massively distributed environment, we need operations specialists on the development teams. The infrastructure doesn't go away — it moves into the code; and the people responsible for the infrastructure, the system administrators and corporate IT groups, evolve so that they can write the code that maintains the infrastructure. Rather than being isolated, they need to cooperate and collaborate with the developers who create the applications. This is the movement informally known as "DevOps."
Amazon's EBS outage last year demonstrates how the nature of "operations" has changed. There was a marked distinction between companies that suffered and lost money, and companies that rode through the outage just fine. What was the difference? The companies that didn't suffer, including Netflix, knew how to design for reliability; they understood resilience, spreading data across zones, and a whole lot of reliability engineering. Furthermore, they understood that resilience was a property of the application, and they worked with the development teams to ensure that the applications could survive when parts of the network went down. More important than the flames about Amazon's services are the testimonials of how intelligent and careful design kept applications running while EBS was down. Netflix's Chaos Monkey is an excellent, if extreme, example of a tool to ensure that a complex distributed application can survive outages; Chaos Monkey randomly kills instances and services within the application. The development and operations teams collaborate to ensure that the application is sufficiently robust to withstand constant random (and self-inflicted!) outages without degrading.
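The real Chaos Monkey is far more careful than this (opt-in groups, schedules, safety rails), but the core idea fits in a few lines. A rough sketch with Python and boto3, where the role=web tag is an assumption:

```python
import random
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-east-1")

def kill_random_instance(tag_value="web"):
    """Terminate one randomly chosen running instance from the fleet.

    If the application only stays up when no instance ever dies, this
    function will tell you so long before a real outage does.
    """
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:role", "Values": [tag_value]},
                 {"Name": "instance-state-name", "Values": ["running"]}])
    instance_ids = [i["InstanceId"]
                    for r in resp["Reservations"] for i in r["Instances"]]
    if not instance_ids:
        return None
    victim = random.choice(instance_ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```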
On the other hand, during the EBS outage, nobody who wasn't an Amazon employee touched a single piece of hardware. At the time, JD Long tweeted that the best thing about the EBS outage was that his guys weren't running around like crazy trying to fix things. That's how it should be. It's important, though, to notice how this differs from operations practices 20, even 10 years ago. It was all over before the outage even occurred: The sites that dealt with it successfully had written software that was robust, and carefully managed their data so that it wasn't reliant on a single zone. And similarly, the sites that scrambled to recover from the outage were those that hadn't built resilience into their applications and hadn't replicated their data across different zones.
In addition to this redistribution of responsibility, from the lower layers of the stack to the application itself, we're also seeing a redistribution of costs. It's a mistake to think that the cost of operations goes away. Capital expense for new servers may be replaced by monthly bills from Amazon, but it's still cost. There may be fewer traditional IT staff, and there will certainly be a higher ratio of servers to staff, but that's because some IT functions have disappeared into the development groups. The boundaries are fluid, but that's precisely the point. The task — providing a solid, stable application for customers — is the same. The locations of the servers on which that application runs, and how they're managed, are all that changes.
One important task of operations is understanding the cost trade-offs between public clouds like Amazon's, private clouds, traditional colocation, and building their own infrastructure. It's hard to beat Amazon if you're a startup trying to conserve cash and need to allocate or deallocate hardware to respond to fluctuations in load. You don't want to own a huge cluster to handle your peak capacity but leave it idle most of the time. But Amazon isn't inexpensive, and a larger company can probably get a better deal taking its infrastructure to a colocation facility. A few of the largest companies will build their own data centers. Cost versus flexibility is an important trade-off; scaling is inherently slow when you own physical hardware, and when you build your data centers to handle peak loads, your facility is underutilized most of the time. Smaller companies will develop hybrid strategies, with parts of the infrastructure hosted on public clouds like AWS or Rackspace, part running on private hosting services, and part running in-house. Optimizing how tasks are distributed between these facilities isn't simple; that is the province of operations groups. Developing applications that can run effectively in a hybrid environment: that's the responsibility of developers, with healthy cooperation with an operations team.
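The arithmetic behind that trade-off is easy to sketch. Every number below is a made-up placeholder, not a real price; the shape of the comparison is the point, not the figures.

```python
# Hypothetical prices and utilization, for illustration only.
CLOUD_HOURLY = 0.50     # on-demand rate per server ($/hour)
COLO_MONTHLY = 200.00   # amortized colocation cost per server ($/month)
PEAK_SERVERS = 100      # servers needed to handle peak load
AVG_UTILIZATION = 0.30  # fraction of peak capacity in use on average

def monthly_cloud_cost():
    # In the cloud you pay only for the capacity you actually run.
    return CLOUD_HOURLY * 24 * 30 * PEAK_SERVERS * AVG_UTILIZATION

def monthly_colo_cost():
    # Owned or colocated hardware has to be sized for the peak, idle or not.
    return COLO_MONTHLY * PEAK_SERVERS

if __name__ == "__main__":
    print(f"cloud: ${monthly_cloud_cost():,.0f}/month")
    print(f"colo:  ${monthly_colo_cost():,.0f}/month")
```

With bursty load and low average utilization the cloud wins; as utilization climbs toward the peak, owning the hardware starts to look better. Working out where those lines cross, with real prices, is exactly the kind of analysis an operations group owns.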
The use of metrics to monitor system performance is another respect in which system administration has evolved. In the early '80s or early '90s, you knew when a machine crashed because you started getting phone calls. Early system monitoring tools like HP's OpenView provided limited visibility into system and network behavior but didn't give much more information than simple heartbeats or reachability tests. Modern tools like DTrace provide insight into almost every aspect of system behavior; one of the biggest challenges facing modern operations groups is developing analytic tools and metrics that can take advantage of the data that's available to predict problems before they become outages. We now have access to the data we need; we just don't know how to use it. And the more we rely on distributed systems, the more important monitoring becomes. As with so much else, monitoring needs to become part of the application itself. Operations is crucial to success, but operations can only succeed to the extent that it collaborates with developers and participates in the development of applications that can monitor and heal themselves.
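A toy example of what "predicting problems before they become outages" can mean in code: track a rolling average of a metric and flag samples that blow past it. The window size and threshold below are arbitrary; a real system would feed this from actual instrumentation and page someone (or heal itself) on an alert.

```python
from collections import deque

class LatencyWatcher:
    """Sketch of metric-driven alerting on request latency (milliseconds)."""

    def __init__(self, window=100, threshold=3.0):
        self.samples = deque(maxlen=window)  # recent latency samples
        self.threshold = threshold           # alert at N times the rolling mean

    def observe(self, latency_ms):
        """Record one sample; return True if it looks like trouble brewing."""
        alert = False
        if len(self.samples) == self.samples.maxlen:
            mean = sum(self.samples) / len(self.samples)
            alert = latency_ms > self.threshold * mean
        self.samples.append(latency_ms)
        return alert
```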
Success isn't based entirely on integrating operations into development. It's naive to think that even the best development groups, aware of the challenges of high-performance, distributed applications, can write software that won't fail. On this two-way street, do developers wear the beepers, or IT staff? As Allspaw points out, it's important not to divorce developers from the consequences of their work, since the fires are frequently set by their code. So, both developers and operations carry the beepers. Sharing responsibilities has another benefit. Rather than finger-pointing post-mortems that try to figure out whether an outage was caused by bad code or operational errors, when operations and development teams work together to solve outages, a post-mortem can focus less on assigning blame than on making systems more resilient in the future. Although we used to practice "root cause analysis" after failures, we're recognizing that finding out the single cause is unhelpful. Almost every outage is the result of a "perfect storm" of normal, everyday mishaps. Instead of figuring out what went wrong and building procedures to ensure that something bad can never happen again (a process that almost always introduces inefficiencies and unanticipated vulnerabilities), modern operations designs systems that are resilient in the face of everyday errors, even when they occur in unpredictable combinations.
In the past decade, we've seen major changes in software development practice. We've moved from various versions of the "waterfall" method, with interminable up-front planning, to "minimum viable product," continuous integration, and continuous deployment. It's important to understand that the waterfall methodologies of the '80s aren't "bad ideas" or mistakes. They were perfectly adapted to an age of shrink-wrapped software. When you produce a "gold disk" and manufacture thousands (or millions) of copies, the penalties for getting something wrong are huge. If there's a bug, you can't fix it until the next release. In this environment, a software release is a huge event. But in this age of web and mobile applications, deployment isn't such a big thing. We can release early, and release often; we've moved from continuous integration to continuous deployment. We've developed techniques for quick resolution in case a new release has serious problems; we've mastered A/B testing to test releases on a small subset of the user base.
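One of those techniques is simple enough to show. A sketch of deterministic bucketing for an A/B test or gradual rollout, where the 5% fraction is an arbitrary example:

```python
import hashlib

def sees_new_release(user_id: str, rollout_fraction: float = 0.05) -> bool:
    """Route a small, stable fraction of users to the new release.

    Hashing the user id (rather than flipping a coin per request) keeps a
    given user on the same side of the test for its whole duration.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return bucket < rollout_fraction
```

If the new release misbehaves, drop rollout_fraction back to zero and only a small slice of users ever saw the problem.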
All of these changes require cooperation and collaboration between developers and operations staff. Operations groups are adopting, and in many cases leading, the effort to implement these changes. They're the specialists in resilience, in monitoring, in deploying changes and rolling them back. And the many attendees, hallway discussions, talks, and keynotes at O'Reilly's Velocity conference show us that they are adapting. They're learning about adopting approaches to resilience that are completely new to software engineering; they're learning about monitoring and diagnosing distributed systems, doing large-scale automation, and debugging under pressure. At a recent meeting, Jesse Robbins described scheduling EMT training sessions for operations staff so that they understood how to handle themselves and communicate with each other in an emergency. It's an interesting and provocative idea, and one of many things that modern operations staff bring to the mix when they work with developers.
What does the future hold for operations? System and network monitoring used to be exotic and bleeding-edge; now, it's expected. But we haven't taken it far enough. We're still learning how to monitor systems, how to analyze the data generated by modern monitoring tools, and how to build dashboards that let us see and use the results effectively. I've joked about "using a Hadoop cluster to monitor the Hadoop cluster," but that may not be far from reality. The amount of information we can capture is tremendous, and far beyond what humans can analyze without techniques like machine learning.
Likewise, operations groups are playing a huge role in the deployment of new, more efficient protocols for the web, like SPDY. Operations is involved, more than ever, in tuning the performance of operating systems and servers (even ones that aren't under our physical control); a lot of our "best practices" for TCP tuning were developed in the days of ISDN and 56 Kbps analog modems, and haven't been adapted to the reality of Gigabit Ethernet, OC-48 fiber, and their descendants. Operations groups are responsible for figuring out how to use these technologies (and their successors) effectively. We're only beginning to digest IPv6 and the changes it implies for network infrastructure. And, while I've written a lot about building resilience into applications, so far we've only taken baby steps. There's a lot there that we still don't know. Operations groups have been leaders in taking best practices from older disciplines (control systems theory, manufacturing, medicine) and integrating them into software development.
And what about NoOps? Ultimately, it's a bad name, but the name doesn't really matter. A group practicing "NoOps" successfully hasn't banished operations. It's just moved operations elsewhere and called it something else. Whether a poorly chosen name helps or hinders progress remains to be seen, but operations won't go away; it will evolve to meet the challenges of delivering effective, reliable software to customers. Old-style system administrators may indeed be disappearing. But if so, they are being replaced by more sophisticated operations experts who work closely with development teams to get continuous deployment right; to build highly distributed systems that are resilient; and yes, to answer the pagers in the middle of the night when EBS goes down. DevOps.