Embracing Failure in Distributed Cloud Applications

03/22/2020 — 5 Min Read — In Distributed Systems

Building a reliable application in the cloud is different than building a reliable application in an enterprise setting. While historically you might have purchased higher-end hardware to scale up, in a cloud environment you must scale out instead. Costs for cloud environments are kept low through the use of commodity hardware. Instead of focusing on preventing failures and optimizing mean time between failures, in this new environment the focus shifts to mean time to restore.

The goal is to minimize the effect of a failure. Embrace the fact that failures will happen, and design to handle them.

Reasons for failures

Developer: Unhandled exception, caused by something the developer was not expecting and did not handle. The natural order of events is that the service logs the exception and terminates. The developer analyzes the logs, and takes corrective action in the code either to avoid the exception in the future, or to handle it more gracefully. This is an iterative process.
DevOps: Scaling down the number of service instances. When the orchestrator takes the instances down, it is possible that the instance being stopped cannot shut down in a graceful way, and the request it was processing might fail.
DevOps: Updating service code to a new version also might result in the instance not shutting down gracefully. A service instance might be processing a request at the time it is taken down for upgrade, resulting in that instance being processed again.
Orchestrator: Moving service code from one machine to another. The orchestrator's job is to make sure the service is up and running, and in doing so, it might decide to shut down an instance and move it to another piece of hardware.
Force majeure: Hardware failure, such as with the power supply, overheating fans, hard disk, network controller, router, or bad network cable, among others.
Force majeure: Datacenter outages due to natural disasters or attacks.

It's rare that an entire service or region will experience a disruption, but even those events must be planned for. When architecting applications for the cloud, you should:

Assume failures will happen and design for resiliency.
Avoid single points of failure through redundancy.

Assuming that failures will happen and that machines will go down, applications should not depend on a single machine to continue operating. A popular analogy used when describing how we should think about servers is the “pets vs. cattle” analogy—the notion of treating servers like cattle, not pets.

They are part of a herd, almost identical, and when they get sick, we replace them with another one instead of nursing them back to health. If any server in the organization is known by name and it routinely causes pain when it's down, then it's likely being treated like a pet.

During the design phase, you should perform a failure mode analysis (FMA). The goal of the FMA is to identify possible points of failure, and define how the application will respond to those failures.

How will the application detect this type of failure?
How will the application respond to this type of failure?
How will you log and monitor this type of failure?

Design self-healing applications

Design an application to be self-healing when failures occur. This requires a three-pronged approach:

Detect failures.
Respond to failures gracefully.
Log and monitor failures to give operational insight.

How you respond to a particular type of failure may depend on your application's availability requirements.

Recommendations

Retry failed operations: Transient failures might occur due to momentary loss of network connectivity, a dropped database connection, or a timeout when a service is busy. Build retry logic into your application to handle transient failures.
Protect failing remote services (circuit breaker design pattern): It's advisable to retry after a transient failure, but if the failure persists, you can end up with too many callers hitting a failing service. This can lead to cascading failures as requests back up. Use the circuit breaker design pattern to fail fast (without making the remote call) when an operation is likely to fail.
Isolate critical resources (bulkhead pattern): Failures in one subsystem can sometimes cascade. This can happen if a failure causes some resources, such as threads or sockets, from being freed in a timely manner, leading to resource exhaustion. To avoid this, partition a system into isolated groups, so that a failure in one partition does not bring down the entire system.
Perform load leveling: Applications may experience sudden spikes in traffic that can overwhelm services on the backend. To avoid this, use the queue-based load leveling pattern to queue work items to run asynchronously. The queue acts as a buffer that evens out peaks in the load.
Fail over: If an instance can't be reached, fail over to another instance. For things that are stateless, like a web server, put several instances behind a load balancer or traffic manager. For things that store state, like a database, use replicas and fail over. Depending on the data store and how it replicates, this might require the application to deal with eventual consistency.
Compensate for failed transactions: In general, avoid distributed transactions because they require coordination across services and resources. Instead, use compensating transactions to undo any step that already completed.
Use checkpoints on long-running transactions: Checkpoints can provide resiliency if a long-running operation fails. When the operation restarts (for example, it is picked up by another virtual machine), it can be resumed from the last checkpoint.
Degrade gracefully: Sometimes you can't work around a problem, but you can provide reduced functionality that is still useful. Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail image for the cover, it might show a placeholder image. Entire subsystems might be noncritical for the application. For example, in an e-commerce site, showing product recommendations is probably less critical than processing orders.
Throttle clients: Sometimes a small number of users create excessive load, which can reduce your application's availability for other users. In this situation, throttle the client for a certain period of time. See the throttling pattern for more information.
Block bad actors: Just because you throttle a client, it doesn't mean the client was acting maliciously. It just means that the client exceeded its service quota. But if a client consistently exceeds their quota or otherwise behaves badly, you might block them. Define an out-of-band process for the user to request getting unblocked.
Use leader election: When you need to coordinate a task, use leader election to select a coordinator. That way, the coordinator is not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement a leader election algorithm from scratch, consider an off-the-shelf solution such as Apache ZooKeeper.
Test with fault injection: All too often, the success path is well tested but not the failure path. A system could run in production for a long time before a failure path is exercised. Use fault injection to test the resiliency of the system to failures, either by triggering actual failures or by simulating them.
Embrace chaos engineering: Chaos engineering extends the notion of fault injection by randomly injecting failures or abnormal conditions into production instances.

Wrapping Up

Even today, the breakdown or ‘outage’ of a cloud service happens surprisingly frequently. When you are planning or developing a distributed application, it is a bad idea to assume attributes and qualities in your network that aren’t necessarily there: far better to plan on the assumption that your network will be costly, and will occasionally be unreliable and insecure.
Remember that (successful) applications evolve and grow so even if things look Ok for a while if you don't pay attention to the failures they will rear their ugly head and bite you.

Hope you find this article useful. Please share your thoughts in the comment section.

I’d be happy to talk! If you liked this post, please share, cheers. See you next time.

PreviousUnderstanding Why Leader Matters in Distributed Systems

NextStuff you should know about Indexing in Databases

Keep me updated about software engineering stuff