To quote Michael Dell, “the cloud isn’t a place, it’s a way of doing IT.” As IT becomes more and more central to what every company does, understanding cloud native best practices is key not only for the IT department but for every part of a business.
This post is the fourth of a seven-part series examining how cloud native can help businesses deliver on their promise of better, faster, cheaper. This part explains how automatic backup and disaster recovery help provide a better, more resilient service to customers, with a fast time to recovery that reduces lost revenue when disaster strikes.
Cloud Native Best Practices: Automatic Backup and Disaster Recovery
Gone are the days when backup meant pressing hard to get the carbon copy triplicate. Today, most business information, and most business value, lies on computer servers rather than paper. In this digital age, the business impact of IT failures and outages can hardly be overstated. The average cost of IT downtime is $336,000 per hour, and 93% of companies that lose a data center for 10 days or more will go out of business in the next year. As IT departments become the core around which every other department operates, business continuity depends on IT having effective plans in place to automatically back up systems and recover when disaster strikes.
Taking a cloud native approach to IT can reduce costs and operational overhead. However, without proper disaster recovery in place, all of these gains can be quickly erased. Setting up an effective recovery process for when outages occur takes three steps. First, each part of the IT system must be interchangeable and replaceable (see part two of this series, Pets vs. Cattle, for a full explanation). Second, there need to be automatic snapshots and backups of systems so a replacement copy is ready to go whenever a problem occurs. Finally, a recovery process from these backups needs to be established and practiced.
Modern IT environments have many moving parts, making it extremely easy to miss a piece during backup and recovery. Automation and a practiced plan are key to a successful process that ensures business continuity. Automation can help avoid manual mistakes, ensure important steps are not overlooked, and speed up the whole process. When something does go wrong, a backup copy alone is not enough: there also needs to be a practiced plan to fix or replace the broken system. Even among companies that do have backups, 75% were not able to restore all of their lost data, and 23% were unable to recover any data at all. Disaster recovery planning and testing are critical to verify that recovery systems actually work.
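One way to make "testing that recovery actually works" concrete is to record a checksum manifest at backup time and compare it against whatever a restore drill brings back. The following is a minimal sketch of that idea, not a production tool: the function names `build_manifest` and `verify_restore` are made up for illustration, and a local directory copy stands in for a real backup and restore.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify_restore(expected: dict[str, str], restored_root: Path) -> list[str]:
    """Return a sorted list of problems; an empty list means the restore matches."""
    actual = build_manifest(restored_root)
    problems = []
    for name, digest in expected.items():
        if name not in actual:
            problems.append(f"missing: {name}")
        elif actual[name] != digest:
            problems.append(f"corrupted: {name}")
    return sorted(problems)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "live"
        restored = Path(tmp) / "restored"
        live.mkdir()
        (live / "orders.db").write_text("order-1\norder-2\n")

        manifest = build_manifest(live)      # recorded at backup time
        shutil.copytree(live, restored)      # stand-in for a real restore
        (restored / "orders.db").write_text("order-1\n")  # simulate partial data loss

        # A non-empty result means the drill failed and the plan needs fixing.
        print(verify_restore(manifest, restored))
```

Running a check like this on a schedule, rather than only after an incident, is what turns a backup into a practiced recovery plan.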
For cloud native businesses, building backup and recovery of Kubernetes clusters is key. However, Kubernetes itself comes with no out-of-the-box mechanisms for this critical business need. Project Velero (originally Heptio Ark) was created as an open source tool to safely back up, restore, and perform disaster recovery for Kubernetes cluster resources and persistent volumes, filling this serious business gap. Project Velero allows users to automatically schedule Kubernetes cluster backups, replicate and test them across cloud providers, and restore them in case of loss. Velero has seen widespread adoption across the Kubernetes ecosystem, with support for all the major cloud providers; DigitalOcean even provides its own documentation. When we designed Kubermatic Kubernetes Platform, we knew every business we work with needs backups and disaster recovery built into their Kubernetes infrastructure. We chose Project Velero because it is the best-in-class open source tool for managing backup and disaster recovery of Kubernetes cluster resources and persistent volumes.
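To give a feel for what scheduling and restoring with Velero looks like, here is a small sketch that assembles the corresponding `velero` CLI invocations. It assumes the `velero` command-line client is installed and pointed at a cluster. The helper names, the schedule name `nightly`, the namespace `shop`, the cron expression, and the backup name are all hypothetical, and this is not how Kubermatic Kubernetes Platform wires up its integration. The script only prints the commands; it does not run them.

```python
import shlex


def schedule_command(
    name: str,
    cron: str,
    ttl: str = "720h0m0s",
    namespaces: tuple[str, ...] = (),
) -> list[str]:
    """Build a `velero schedule create` command for recurring backups."""
    cmd = ["velero", "schedule", "create", name, "--schedule", cron, "--ttl", ttl]
    if namespaces:
        # Limit the backup to specific namespaces instead of the whole cluster.
        cmd += ["--include-namespaces", ",".join(namespaces)]
    return cmd


def restore_command(backup_name: str) -> list[str]:
    """Build a `velero restore create` command that restores from one backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


if __name__ == "__main__":
    # Hypothetical example: back up the "shop" namespace every night at 02:00.
    print(shlex.join(schedule_command("nightly", "0 2 * * *", namespaces=("shop",))))
    # Hypothetical backup name; list the real ones with `velero backup get`.
    print(shlex.join(restore_command("nightly-20240101020000")))
```

Building the commands as argument lists keeps quoting of the cron expression explicit, and the same lists could be passed to `subprocess.run` once the output has been reviewed.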
We built Kubermatic Kubernetes Platform with an out-of-the-box integration of Project Velero to give our customers the best tools to minimize the impact of a disaster. This lets them run a more reliable service with a faster time to recovery, so they don’t lose revenue to extended system outages. Check out part five, API-driven architecture, to understand how to design a best-in-class production system.