Testing in the Cloud

Taking Back the Cloud

When I first started to hear the very buzzworthy term "the Cloud" (meaning all disrespect), I did not see the point. You've got servers, servers talk to each other, they may or may not be on premises, they may or may not be rented, etc. "Oh no!" they said, "The Cloud is a whole different thing!" I just scratched my head.

Ignoring all the other things that may or may not be part of "the Cloud", there is one function that is vital: provisioning! (And another function that is just as vital: deprovisioning!) Just to pick on AWS (because I'm familiar with them and because they're big enough to take it), a cloud service provider often adds other functionality, such as SQS and S3, to name just two among many. In my experience, these are great services that work as advertised (even if there is some fine print in the advertisement).

But if I step back and put on my very pointy tester hat, I may become concerned at this point that now I've thrown some variety of simian wrench in my testing cogs. Assuming that we use these functions for vital business needs (and graciously assuming that they work perfectly), how do we ensure that we are using these services correctly? Is data making it through? Is the integrated service performing as desired?

There are many solutions to the problem. There is monitoring. As a last resort, there's always manual testing. There is mocking the service (e.g. for AWS, moto may be of use). One can use separate environments for testing. In a way, "the Cloud" hasn't changed testing at all: all the challenges still remain and many of the methodologies carry over. What does matter is the attitude and approach: be careful! Don't hear "the Cloud" and return to the bad old days where testing is a second-class citizen, things break (in production, of course), and then one is left with the horrible task of back-fitting testing on a system held together with spit and chicken wire where you can't even effectively do black box testing.
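As a minimal sketch of the mocking option, the standard library's `unittest.mock` can stand in for a cloud client so that a test checks how the service is called rather than whether the service itself works. The `publish_order` function, the queue URL, and the message shape are all hypothetical; the `send_message(QueueUrl=..., MessageBody=...)` keywords mirror the boto3 SQS client, but no AWS library is imported here.

```python
import json
from unittest.mock import Mock


def publish_order(queue_client, queue_url, order):
    """Hypothetical code under test: serialize an order and enqueue it."""
    body = json.dumps(order, sort_keys=True)
    # The client is injected so a test can pass a stand-in instead of a
    # real connection. Keyword names follow boto3's SQS send_message.
    queue_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return body


def test_publish_order_sends_expected_message():
    fake_client = Mock()  # stand-in for a real queue client
    url = "https://example.invalid/queue/orders"  # placeholder URL
    publish_order(fake_client, url, {"id": 7, "sku": "widget"})
    # Is data making it through in the shape we intended?
    fake_client.send_message.assert_called_once_with(
        QueueUrl=url,
        MessageBody='{"id": 7, "sku": "widget"}',
    )
```

A mock like this only verifies your side of the conversation. It says nothing about whether the real service accepts the message, which is why monitoring and separate environments stay on the list.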

I've decided to take back the term the Cloud to mean "It doesn't matter where I run it." The Cloud that I care about includes the ability to deploy agnostically. The Cloud might be AWS. The Cloud might be packet.net or another bare-metal provider. The Cloud might be a set of machines or virtual machines in your own data center. The Cloud might be your laptop. If you care about testing and a fast turnaround development cycle, it shouldn't matter.

Foundations of Testing

If you're going to test in the Cloud, or anywhere else, the first step is to make sure you can test in an automated fashion. Nothing I'm going to talk about here is particularly unique: you can read a lot of good literature on testing and find the same lessons.

I'm not going to talk about manual testing: a human looking at a screen, going through workflows, checking items off a checklist, and filing bugs where things go awry. When testing an application, there is no substitute for human eyes on a screen with an inquiring mind behind them. But you can't code it. The approach I am concerned with is the approach you can code: automated tests, test harnesses, tools (which may be useful to manual testers if done with some consideration), continuous integration and deployment. In a healthy shop, you will have all of these (and more!).

So what is a test? People (including myself) talk a lot about unit tests but don't really write them. That's because while something like computing the highest prime less than N is easy to write a unit test for, something like account activation may require many units to perform. And of course, you'll need to activate an account in order to log in and get on with doing things. Etc. So while I try to use terms like "unit test" and "integration test" correctly, in practice they're a bit ill-defined. What is your unit? What is your integration? I think understanding these things is much more interesting and important than debating what is a unit test versus an integration test, etc. Like design patterns, terminology exists to be a help, not just a fun debate topic.

Tests are building blocks

The right way to compose tests is to make them building blocks. You have a test that does something small and asserts it is done correctly. If, for example, this is a login test, you may have another test that requires login as a precondition to its state to do something that only a logged-in user can do (and then assert). If you make tests boxes that may be run if their preconditions are satisfied (either through fulfillment of a dependency tree or by other means), then you have the ability to test the system as a whole. This is harder to do than it sounds, particularly if you are retrofitting a system that doesn't have good test coverage.
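One way the dependency-tree idea might look in code is sketched below. The runner, the `SUITE` layout, and the two example tests are hypothetical illustrations rather than any particular framework's API: each test names its prerequisites, and a test is skipped when something it depends on did not pass.

```python
def run_tests(tests):
    """Run tests in dependency order.

    `tests` maps a name to (callable, [names of prerequisite tests]).
    A test runs only if every prerequisite passed; otherwise it is skipped.
    """
    results = {}

    def run(name, stack=()):
        if name in results:
            return results[name]
        if name in stack:
            raise ValueError(f"dependency cycle at {name!r}")
        func, requires = tests[name]
        if any(run(dep, stack + (name,)) != "pass" for dep in requires):
            results[name] = "skip"  # precondition not satisfied
        else:
            try:
                func()
                results[name] = "pass"
            except AssertionError:
                results[name] = "fail"
        return results[name]

    for name in tests:
        run(name)
    return results


session = {}  # shared state that the building blocks act on


def test_login():
    session["user"] = "alice"
    assert session["user"] == "alice"


def test_update_profile():
    assert "user" in session  # only a logged-in user can do this
    session["profile"] = {"name": "Alice"}


SUITE = {
    "update_profile": (test_update_profile, ["login"]),
    "login": (test_login, []),
}
```

Calling `run_tests(SUITE)` runs the login block first even though it is listed second, because the profile test declares it as a precondition. If login fails, the profile test is reported as skipped rather than as a second, misleading failure.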

If it's hard to test, it's probably wrong

I'm not the first to point out that if it is hard to test something, the system under test (SUT; what a great term) might be in need of a little rework. In essence, testing is mostly substituting for what a human being is trying to do with the SUT; whether that be API calls or interacting with a web site, if testing it requires a lot of complications or mocking or testing side-effects, you are likely to run into trouble of your own when deploying the system, or automating the system, or troubleshooting the system. This thought certainly deserves an article of its own and not a hand-wavy paragraph under cloud testing, and there are several resources available that fill in philosophical gaps here.

Good Practices First!

If you're extremely lucky as an automation and testing engineer, you've gotten to the software before too many folks have made your life too difficult. I've, sadly, rarely had this luxury except for software where I am the sole author. But no matter what, if a piece of software or a whole system composed of multiple pieces of software is going to have long-term great success, you will need these pieces (and practices!).

This is all tangential, of course, to all the aspects that make a great product (ignoring human interest entirely). Clear communication, clear goals, active elimination of technical debt, good design ... those are all necessary, of course, and ultimately highly symbiotic and synergistic with the mere testing and deployment practices I'll discuss below. (Think first! will be taken as a given.) However, as an automation and testing lead, I only have a few things under my control and even they are a handful.

Automation First!

So you want to make an awesome GUI that does everything? Probably not. What you probably want is library code that is structured to abstract your understanding of the problem in a useful way and then interfaces around it. If you have a decent underlying API, it is pretty straightforward to turn this API into (say) a command line program, a RESTful web service, and then into a user-facing fancy AJAXY web site that consumes this RESTful web service.

If your only portal to, say, delete a record is to use a GUI (HTML or otherwise) or touch the database directly, you have a couple of problems. For testing, you've pretty much named your black-box approach: Selenium (which is great software, by the way; I'm just pointing out that for backend software testing it shouldn't be your only resort). Or you can futz with the database directly, and hope that what your tests expect from the database is on par with what the software innards expect.

But wait a minute...are you writing a database or a model? If you're just front-ending a database, might as well write your website in MySQL. Data will be closely integrated. You don't have data abstraction. I'm not saying this is a negative, I'm just saying that the database is often not the whole story of the model.

Imagine instead of deleting one record you now have to delete, oh, I don't know, say 30000 of them based on specific criteria. You have a couple of choices: futz with the database directly, or give someone a hand-written piece of paper (or worse, a spreadsheet) with the criteria for selecting the desired records and have him or her go through each of the 30000 screens and click DELETE (and, of course, OK). You didn't design your system to be automated, which means that it's not really testable and not really deployable in different configurations. Instead of doing their jobs, your testing engineers will first have to put on their platform hats and work on making the system testable.
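For contrast, here is a rough sketch of what the automated path could look like when library code comes first and interfaces wrap it. Everything named here (`RecordStore`, `delete_where`, the `--status` flag, the record layout) is a hypothetical illustration; a dict stands in for whatever the real back end would be.

```python
import argparse


class RecordStore:
    """Hypothetical model layer; a dict stands in for the real back end."""

    def __init__(self, records=None):
        self._records = dict(records or {})

    def delete(self, record_id):
        del self._records[record_id]

    def delete_where(self, predicate):
        """Delete every record matching predicate; return the deleted ids."""
        doomed = [rid for rid, rec in self._records.items() if predicate(rec)]
        for rid in doomed:
            self.delete(rid)
        return doomed

    def __len__(self):
        return len(self._records)


def main(argv, store):
    """Thin command-line interface over the same library call."""
    parser = argparse.ArgumentParser(description="Delete records by status.")
    parser.add_argument("--status", required=True)
    args = parser.parse_args(argv)
    deleted = store.delete_where(lambda rec: rec["status"] == args.status)
    print(f"deleted {len(deleted)} records")
    return 0
```

With this shape, deleting 30000 records is one call to `delete_where`, and the same call is what a test, a command line program, or a web service would use. Nobody clicks DELETE 30000 times.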

Thinking about how your software (and I don't care how end-user it is) may be controlled through automation is critical to the design process. It is critical for testing, it is critical for maintenance, but perhaps most importantly, it is critical for deciding what your software actually is. Thinking of a database example, want to migrate back-ends? You had better have an API layer that presents what your software is independent of the back end. Thinking of the Cloud, want to migrate providers? Your system (except the provisioning component, ideally) should not have to change.

Deployment First!

Are you writing software or an implementation?

As an engineer, I am mostly interested in finding general solutions to specific classes of problems. Solving a problem is great, but when you solve two mostly similar problems, are you maintaining two solution sets or a single solution with different configuration (that is, input parameters)? The thing about software is the "soft" part: programmability. It is usually no more difficult (in terms of system complexity or lines of code) to have software that is independent of the particular implementation desired.

By considering deployment as a prerequisite to the implementation of a solution, one can often separate the question of "what is software?" from "how do I solve this problem?". More topically, considering the deployment story lends itself to testable software.

Continuous Integration First!

There was and is much ado about test driven development (TDD). I have been lucky enough to practice TDD, mostly in projects of my own. I'm not a purist: about the only thing I'm dogmatic about is antidogmatism. To me, TDD isn't necessarily about writing the test first (e.g. test as spec), making it fail, and then fixing it. When it works, it's great. More often, I'm trying to figure out what the spec is and decide on that for an indefinite period before refactoring. So I'll write some test, write some code, and rinse + repeat until I have working tests and code (works great for small teams).

One problem that TDD is often erroneously said to solve is actually getting the programmers to run the tests. Never once have I seen the oft-touted policy of "developers must run the tests prior to check-in" work if there was not a continuous integration (CI) system in place. I would rather have a project that reliably on every push builds the software on a blank slate and launches an empty set of tests that I can add to than have a full suite of tests that developers don't run prior to pushing.

Agnosticism First!

This is an opinionated one, and also a choice as to the style of project that I am interested in. Of course you can write a complex system that is built only on AWS and is so tightly integrated thereto that you'll never need to migrate.

Except that you might.

All of this points to software being intent made manifest. There are solutions that are only enabled by performance leaps. But in general, if you want a queue, or a database, or a notification, my advice is to mold your software to a form that follows function and then enable interfaces for supporting particular brands of technology, as required for your work.
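As a small sketch of form following function, the application below asks only for "a queue" and leaves the brand to an adapter. The `MessageQueue` interface, `InMemoryQueue`, and `drain_uppercased` are hypothetical names; an adapter for a hosted queue service would implement the same two methods and is not shown.

```python
from collections import deque
from typing import List, Optional, Protocol


class MessageQueue(Protocol):
    """The function the software needs, independent of any provider."""

    def put(self, message: str) -> None: ...

    def get(self) -> Optional[str]: ...


class InMemoryQueue:
    """Back end for laptops and CI; a hosted queue would get its own adapter."""

    def __init__(self) -> None:
        self._items = deque()

    def put(self, message: str) -> None:
        self._items.append(message)

    def get(self) -> Optional[str]:
        # None signals an empty queue.
        return self._items.popleft() if self._items else None


def drain_uppercased(queue: MessageQueue) -> List[str]:
    """Application logic written against the interface, not a vendor."""
    out = []
    while (message := queue.get()) is not None:
        out.append(message.upper())
    return out
```

The trade-off is that an interface this narrow hides provider-specific features. If you need those, widen the interface deliberately instead of letting vendor calls leak through the rest of the code.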

Monitoring your Cloud

The objective of monitoring is to obtain an instantaneous snapshot of system status: information from multiple nodes and subsystems, from low-level metrics like free disk to end-to-end black-box validation. As such, monitors are tests that yield results quickly. With typical Linux architecture, this means either a script with a short run time, or a daemon service that dispatches results continuously to amortize startup cost and/or to keep state.
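A short-run-time check of the free-disk variety might look like the sketch below. The thresholds and function names are assumptions chosen for illustration; the 0/1/2 status values follow the exit-code convention used by Nagios-style plugins.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style plugin status codes


def classify_free_fraction(free_fraction, warn=0.20, crit=0.10):
    """Map a free-space fraction (0.0 to 1.0) to a status code."""
    if free_fraction < crit:
        return CRITICAL
    if free_fraction < warn:
        return WARNING
    return OK


def check_disk(path="/", warn=0.20, crit=0.10):
    """Return (status, message) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total
    status = classify_free_fraction(free_fraction, warn, crit)
    return status, f"{path}: {free_fraction:.1%} free"
```

Keeping the classification separate from the measurement means the decision logic can be tested without filling up a disk, which is the same building-block idea applied to monitors.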

For correctness checks, of paramount importance is to ensure that alerts are true and actionable. There's no point in having wide coverage but low fidelity. This may seem obvious, but the big difference between correctness testing and correctness monitoring is that for monitors you are trying to catch edge conditions (errors) that you don't want to happen, that you want to eliminate as quickly as possible, and that may not occur in a sterile lab. However many nines you're working towards in uptime, expect to work towards at least one more for monitoring fidelity.

There are many process strategies to work towards eliminating false positives and making true positives actionable. Have runbooks (checklists should be a basic currency for DevOps). Turn these runbooks into scripts. When the script is mature and flexible enough to cover edge cases, hook it to an event reactor for self-healing.
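The last step of that progression, hooking a scripted runbook to an event reactor, can be sketched in a few lines. The `Reactor` class, the alert name, and the remediation below are hypothetical; a real handler would do actual work such as rotating logs or restarting a service.

```python
class Reactor:
    """Maps alert names to remediation steps (scripted runbook entries)."""

    def __init__(self):
        self._handlers = {}
        self.escalated = []

    def on(self, alert_name):
        """Decorator that registers a remediation for an alert name."""
        def register(func):
            self._handlers[alert_name] = func
            return func
        return register

    def handle(self, alert_name, context):
        handler = self._handlers.get(alert_name)
        if handler is None:
            self.escalated.append(alert_name)  # no runbook yet: page a human
            return "escalated"
        handler(context)
        return "healed"


reactor = Reactor()


@reactor.on("disk_full")
def rotate_logs(context):
    # Hypothetical remediation; here it only records the intended effect.
    context["free_fraction"] = 0.5
```

Alerts without a registered handler fall through to a human, which keeps the reactor honest about which runbooks have actually matured into scripts.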

Sounds simple, right? Then why do so few cloud systems have working self-healing for even simple cases? In order to get there, the monitoring checks must be useful to developers to gain system insight. At the basic correctness check level, use the same best practices you would for any non-trivial test harness. A monitor should be usable in isolation: the insight gleaned from monitoring should be runnable against local subsystems and directly comparable to deployed systems. Monitoring checks should be run as part of continuous integration checks. The logic needed for correctness checks should be available to other consumers. For example, if you're checking a REST API, you shouldn't write a separate API consumer for your check script. Write an API consumer (a client) that may be utilized for correctness tests, service health checks, as well as other DevOps needs. Of course that itself needs to be tested: it's turtles all the way down.
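To illustrate the shared-consumer point, here is a sketch in which one client serves both a correctness test and a health check. The `StatusClient` class, the `/health` path, and the response shape are assumptions made for the example; the transport is injected so the same client can run against a local stand-in or a deployed system.

```python
class StatusClient:
    """Hypothetical REST client shared by tests, health checks, and tooling."""

    def __init__(self, transport):
        # transport(method, path) -> (status_code, parsed_json_body)
        self._transport = transport

    def health(self):
        code, body = self._transport("GET", "/health")
        return code == 200 and body.get("status") == "ok"


def health_check(client):
    """Monitor entry point: 0 when healthy, 2 when not."""
    return 0 if client.health() else 2


def fake_transport(method, path):
    """Local stand-in for HTTP so the same client runs in CI."""
    if (method, path) == ("GET", "/health"):
        return 200, {"status": "ok"}
    return 404, {}
```

In production the transport would wrap a real HTTP library. The check script and the test suite never need to know the difference, and there is only one API consumer to keep correct.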

Ensuring you have the means to reuse software lets you get stuff done and limits the rate of expansion of the ever-growing testing front.

It is important for all development to start from the mindset of the user. Cloud monitoring is no exception. Your test's execution will likely first be noticed when it wakes up someone who understands the SUT even less than you do and who has to deal with whatever your check does. Give them the breadcrumbs you'd want to find to help yourself out. Make sure the script is self-evident to the point that if they need to run their own tests, they don't have to climb a mountain.

Metrics checks are a pattern commonly seen in cloud monitoring: data is "continuously" fed (usually) from multiple systems into a single aggregation point (perhaps with a queue or two and some processing in between). It is, for example, probably not worth alerting that memory spiked for a single measured instant if conditions return to normal (though it is probably worth noting). If trends turn unfavorable, data should be present and analyzable so that the problem can be understood and solved before the whole cloud topples. Cloud intelligence is an exciting and growing business. Gathering data without the ability to react to it is of no utility. I'd rather have no data than not enough, no doubt. Feeding logs to a central, queryable location is a win. It is probably worth ensuring all of your monitoring results are recorded. Developing monitoring whose primary effect is recorded data for later analysis, however, should be query-driven. If you define what you're looking for first, you might have the resources to find it. Otherwise, you have a fancy system that will likely be shelved.
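The "note it, but don't alert on a single spike" idea can be sketched as a check that fires only when a condition is sustained. The class name, the window size, and the limit are assumptions for illustration; every sample is still recorded so the data is there for later queries.

```python
from collections import deque


class SustainedThreshold:
    """Alert only when the last `window` samples all exceed `limit`."""

    def __init__(self, limit, window):
        self.limit = limit
        self.samples = deque(maxlen=window)  # most recent samples only
        self.history = []  # every sample is recorded, alerting or not

    def observe(self, value):
        """Record a sample; return True if an alert should fire."""
        self.history.append(value)
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.limit for v in self.samples)
```

With a limit of 0.9 and a window of 3, the sequence 0.5, 0.95, 0.5, 0.95, 0.96, 0.97 produces an alert only on the final sample. The isolated spike at the second sample is noted in `history` and nothing more.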

In the same way that what a unit test checks should be what is intended, a monitor that does not yield insight, whose usage is not evident, or that is silenced observes nothing...but probably does sink time.

The Myth of the Development Environment

A cloud deployment (a system), by its very nature, is multi-node with multiple roles. Cloud architecture at scale is enabled by loose coupling but high integration. A typical environment, for example, may have a few different varieties of webservers, multiple backend stores (key-value, RDB, NAS, etc.), a queuing system, a service discovery system, a monitoring system, an escalation system, a data pipeline, and so on. Gone are the dark ages where one debated whether a system should be written in one language or another: so long as there is a sane protocol and the subsystem is reliable and performant, we can now intelligently debate the merits and drawbacks of subsystem software itself without bringing language into it (or if you can't, maybe it's not worthwhile). Configuration management, in whatever form, may be used to intelligently manage mutual dependencies and interactions across a swath of software.

The level of integration enabled is also required: in order to test a particular service as one would in production, you must be able to rely on all the services it relies on, and so on, as well as the ability to drive your configuration management and deployment tools.

Since decoupling a coupled system is hard, this leads to the notion of a developer environment: if one wishes to test a change, one does so in a dedicated developer environment that has all the nodes (or at least all the services) that production has. Functionally, this may work fine for a small number of developers. There is considerable cost in terms of computing resources: because one has made a system that isn't accurately testable locally (or at least so inconvenient to test locally that business pressures will prohibit the investment), a whole environment is maintained for testing a subset of services.

What is actually desired is the ability to test reliably on the set of subsystems affected, configured in a manner functionally equivalent to the production state. In the case of a highly integrated system, this is the star topology. But consider actual computer resources and node placement: for the desired testing traffic (which you might consider a fuzz of parameter space), could the component subsystems required for integration fit on your laptop? How many physical boxes would it take? If the number is high, you might be doing it wrong. The ability to accurately measure and know this information is critical to cloud operations.

Your partial cloud will not be exactly the same as your production target. But I have never seen a dev environment that was the same as a production environment and remained that way for any amount of time. There are worthwhile experiments, such as holistic load testing, that do require an entire set of services. While individual components can and should be load tested with realistic load, this is no substitute for end to end load testing.

Let me also distinguish between a dev environment and an integration environment. Depending on business needs and constraints, it may make sense to have an integration environment to which builds and packages are promoted for a bake-in period before being pushed out to production. It's cheap (in terms of human time expenditure, not cost) but not ideal. You'll want the ability to route traffic adjustably between the new systems and the old, which also allows A/B testing and data comparison in real time to measure the effect of the changes. You'll want the ability to roll back to previous versions.
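Adjustable routing between old and new can be as simple as a stable hash and a percentage dial. This is a sketch under assumptions: the function name, the `"new"`/`"old"` labels, and the choice of a per-request key (a user id, say) are all hypothetical, and a real deployment would more likely do this in a load balancer or service mesh.

```python
import hashlib


def choose_backend(request_key, new_percent):
    """Route a stable `new_percent` share of keys to the new system."""
    if not 0 <= new_percent <= 100:
        raise ValueError("new_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99, stable per key
    return "new" if bucket < new_percent else "old"
```

Because the bucket depends only on the key, a given user sees a consistent version while the dial stays put, and turning the dial up only ever moves keys from old to new. Setting it back to 0 is the rollback.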

A dev environment, at best, is a kludge for the ability to meaningfully run tests and experiments on integrated systems. A dev environment requires roughly the cost and maintenance of your production environment, requires careful and oft-neglected communication about what is being developed so as to avoid stepping on another's toes, and can usually only support one development project at a time. If you've come to accept DRY as a necessary software practice, perhaps Don't Repeat Your Cloud may be a new guiding principle. What is desired is the ability to replicate what you actually need. While this requires the understanding of concerns and consideration towards their separation, the investment matures to an architecture that is malleable and agnostic to deployment target and that empowers developers and operators to meaningfully assemble building blocks as wished. While no two sets of test environments are truly identical, if you give up on the ability to understand and isolate concerns, what is the point of testing or continuous integration at all (outside of a gradient-shaded rounded-corner button with the word "build" on it)?

The Beauty of a Blank Slate

Probably the single easiest part of traditional continuous integration to screw up is teardown. Short of spinning up a VM from a golden image, running tests, and decommissioning it, it is very hard to ensure that there are not artifacts polluting your environment 100% of the time. 95%? Easy. 100%? Not so much. This has usually not been the process for efficiency reasons, and perhaps, historically speaking, the extra manpower spent making production simulators production-like and manually intervening when tests intermittently fail was sometimes justified.
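At a much smaller scale than a VM from a golden image, the same teardown discipline can be sketched with the standard library: give each test run a fresh working directory and remove it unconditionally. The `blank_slate` helper is a hypothetical name, and it only isolates files written under the working directory; it does nothing about processes, ports, or databases.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def blank_slate():
    """Yield a fresh working directory and remove it afterwards, pass or fail."""
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        os.chdir(workdir)
        try:
            yield workdir
        finally:
            os.chdir(previous)  # leave before the directory is deleted
```

Anything a test writes relative to its working directory disappears when the block exits, even if the test raises. That gets you part of the way to 100%; the remaining artifacts are the ones that motivate images and containers.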

But now we're entering a renaissance of virtualization. Automatic provisioning and configuration of environments is no longer a fantasy, but a necessity in the age of cloud. Images and containers may be shipped around freely and spun up quickly, and may (and should!) be built in a continuous manner. Transitioning to this world, where build artifacts contain your software rather than just package it, may be leveraged not only to give you a rapid deployment model; when properly done, it may also provide most of the features described herein transparently. The blank slate is free. Software and how we think about software just has to catch up.

When I'm looking for a piece of software, I usually want it to do one thing or an integrated set of things specifically. I don't want it to tell me how I want to do them (which isn't to say I haven't learned lessons from some pieces of highly opinionated software). In the pre-cloud world, where getting "the next production" online took perhaps an exasperating week of ops time, if a piece of software did too much, it was often just annoying. For the loose coupling demanded by the cloud, the ability to integrate is key: if software does too much (or prescribes too closely how things are to be done), deployment and maintenance can easily descend to the nightmare of constant babysitting.

Writing software is easy; maintaining software is hard.

When composing cloud architecture, one usually starts with a communication graph of subsystems (notably, not necessarily of compute nodes). This model gives you all you need for knowing, for a particular problem, what services are needed and what configuration is necessary to deploy them. Malleability, then, is the accounting for this in one's deployment model, which may be extended to any applicable environment, from a laptop to public cloud offerings, if the principles of virtualization and orchestration are observed. If you want to lock in to offering-specific services, such as SQS or DDB, you may have good business reasons to do so. But please give some thought to how engineers may go about end to end testing. I hope you'll come up with a better solution than one environment per.
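Given such a communication graph, working out which services a particular problem needs is a reachability question. The sketch below assumes the graph is a plain mapping from each subsystem to the subsystems it talks to; the service names in `GRAPH` are invented for the example.

```python
def required_services(graph, targets):
    """Return every subsystem reachable from `targets`.

    `graph` maps a subsystem to the subsystems it talks to.
    """
    needed = set()
    pending = list(targets)
    while pending:
        service = pending.pop()
        if service in needed:
            continue  # already accounted for; also guards against cycles
        needed.add(service)
        pending.extend(graph.get(service, ()))
    return needed


# Hypothetical communication graph of subsystems (not compute nodes).
GRAPH = {
    "web": ["api"],
    "api": ["queue", "database"],
    "worker": ["queue", "database", "object_store"],
    "reports": ["database"],
}
```

Here `required_services(GRAPH, ["web"])` yields the web tier, the API, the queue, and the database, and leaves out the worker, reports, and object store. That smaller set is the partial cloud you would need to spin up to test a change to the web tier.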

Giving developers, operators, and other technical stakeholders the ability to simulate your cloud offering is essential. If you have time-series data stores reflecting cloud performance and health, a developer should be able to spin up an equivalent data pipeline in order to do comparative analytics and study patterns. The visible end product, of course, is the production data and its event reactors, and having the ability to push data throughput that is as close to real world as possible is invaluable to this end. Another benefit, however, is exposure to the data through its generation and processing.

There's no substitute for real world load. You probably won't come up with every single edge case in parameter space. The world might come pretty close though. Recording and replay of real load through configurations of sets of subsystems is a cloud holy grail. If something cryptic goes awry today, can you troubleshoot it tomorrow? If you can't, how do you plan on finding that needle before the next time the haystack blows over? A new class of integration regression tests can become the spin up of subsystems, replay of fault-causing traffic, and their monitoring (it also gives cute ways to monitor the monitors, since you've learned what you expect to happen). A new class of load fuzz testing may be a selection of traffic played through with a volume dial. If you're gathering these metrics, finally you can say when your system is going to fall over. For real.
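A toy version of record and replay with a volume dial is sketched below. The JSON-lines log format, the function names, and the idea of a `handler` callable standing in for the subsystem under test are assumptions for illustration; real traffic capture involves far more (timing, ordering, state) than this shows.

```python
import json


def record(log, request):
    """Append one request to a log as a JSON line."""
    log.append(json.dumps(request, sort_keys=True))


def replay(log, handler, volume=1):
    """Replay recorded requests through `handler`, `volume` times each.

    Returns the list of (request, exception) pairs that failed.
    """
    failures = []
    for line in log:
        request = json.loads(line)
        for _ in range(volume):
            try:
                handler(request)
            except Exception as exc:  # collect the failure, keep replaying
                failures.append((request, exc))
    return failures
```

Once a fault-causing request is in the log, replaying it against a freshly spun-up subsystem becomes a regression test, and raising `volume` turns the same log into a crude load test.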

The point of testing and monitoring is to gain insight into the functionality of a system. For cloud testing, creating a system that is testable in an automated fashion enables the decoupling of intent from implementation that is necessary for keeping your cloud maintainable.