As you know, the last few days have been a significant challenge. Now that we have recovered, I wanted to take the time to share the details of what happened and our plan moving forward to ensure you receive the great service you know and deserve from us.
On Sunday, December 11 at 1PM PT, during a scheduled routine off-site backup procedure, a critical error caused an unprecedented complete outage of our backend systems. Our investigation and root-cause analysis confirmed that simple human error was responsible. Specifically, while following the procedure on our primary backup server, we inadvertently ran commands intended for that backup instance against our primary operational database. Those commands dropped our primary database, and the deletion cascaded to all redundant systems.
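One common safeguard against exactly this class of mistake is a host check at the top of any destructive maintenance script, so a command meant for a backup instance simply refuses to run anywhere else. The sketch below is illustrative only; the host names and command are hypothetical and not part of our actual tooling:

```python
def guard_destructive_step(current_host: str, expected_host: str, command: str) -> str:
    """Refuse to run a destructive maintenance command unless the machine
    it is executing on is the intended target.

    Host names and commands here are illustrative examples only.
    """
    if current_host != expected_host:
        raise RuntimeError(
            f"refusing {command!r}: running on {current_host}, expected {expected_host}"
        )
    return f"approved: {command} on {expected_host}"

# On the intended backup host, the step proceeds.
print(guard_destructive_step("backup01", "backup01", "purge-stale-tables"))

# On any other host (such as a primary database server), it aborts.
try:
    guard_destructive_step("db-primary01", "backup01", "purge-stale-tables")
except RuntimeError as err:
    print(err)
```

A check like this turns a wrong-terminal mistake into a harmless error message instead of a dropped database.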
Why did it take so long to recover? Within one hour of the failure, the entire team convened to address the service outage and begin recovery efforts. Unfortunately, because the primary failure affected all replicated instances, we had to conduct a full transactional rebuild of the databases used to run the service. This transaction-record recovery was a massive data import on the order of billions of transactions. On Monday, December 12th at 10:50PM PT, the main backup restore was complete. However, that backup was missing the 24 hours immediately prior to the system failure. To avoid potential fatal system failures and to preserve data integrity, we decided to enable only DNS records at that time and proceed with recovering the transactional logs. That process took much longer; it completed on Wednesday at 11AM PT, and we were able to bring the systems completely back online with no data loss.
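The two-phase recovery described above — restore the last full backup, then replay the transaction log recorded after that backup to reach the moment of failure — can be sketched as follows. The data shapes are purely illustrative and not our actual formats:

```python
def point_in_time_recover(base_snapshot: dict, txn_log: list) -> dict:
    """Illustrative sketch of backup-plus-log recovery.

    Phase 1 restores the last full backup in bulk; phase 2 replays the
    transactions committed after that backup, in order, so the final
    state reflects the moment of failure with no data loss.
    """
    state = dict(base_snapshot)      # phase 1: bulk restore of the backup
    for key, value in txn_log:       # phase 2: replay newer transactions in order
        state[key] = value
    return state

# A record updated after the backup ends up at its latest value,
# and a record created after the backup is recovered too.
backup = {"example.com": "198.51.100.1"}
log = [("example.com", "198.51.100.2"), ("example.net", "203.0.113.5")]
print(point_in_time_recover(backup, log))
```

The bulk restore is fast per record but enormous in volume; the log replay is what extends the timeline, which is why the final window took until Wednesday.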
What did we learn? When we started providing this service to the world, we made design and data-layout decisions that made sense at the time but no longer do. Over the years we have been fortunate to grow our service in ways we never expected, and that success created data needs we had not yet considered. We will be working diligently to increase our systems' resilience and availability.
How will we ensure this doesn't happen again? We have developed a new database design that addresses our need for faster recovery should we ever need to perform a transactional rebuild again. Additionally, we have implemented a fault-tolerant replication model that gives us time to stop replication before a failure on the primary database instances can cascade into a complete outage.
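The idea behind a replication model like this is that replicas apply changes only after a fixed delay, so a destructive statement on the primary does not reach them immediately and operators have a window in which to stop replication. A minimal sketch of that holding window, with illustrative numbers rather than our actual configuration:

```python
from dataclasses import dataclass

@dataclass
class Event:
    committed_at: float  # seconds since epoch when the primary committed it
    statement: str

def events_ready_to_apply(events: list, now: float, delay_seconds: float) -> list:
    """Return the events a delayed replica would apply at time `now`.

    Anything committed within the last `delay_seconds` stays queued,
    giving operators time to halt replication before a destructive
    statement (such as an accidental DROP) propagates.
    Delay value and statements below are illustrative only.
    """
    return [e for e in events if now - e.committed_at >= delay_seconds]

# With a one-hour delay, an UPDATE from two hours ago is applied,
# but a DROP committed ten minutes ago is still held back.
events = [Event(0.0, "UPDATE records ..."), Event(7000.0, "DROP DATABASE prod")]
ready = events_ready_to_apply(events, now=7600.0, delay_seconds=3600.0)
print([e.statement for e in ready])
```

The delay trades a little replica freshness for a recovery window: the replica is never the absolute latest copy, but it also never instantly mirrors a catastrophic mistake.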
We at ChangeIP are proud to be your DNS provider, and we couldn't do any of this without your support. Letting you down has felt absolutely horrible. We have identified, and are taking, the steps to ensure this never happens again, and we want to express both our apology and our gratitude for your business. We will be issuing a free month of service to all of our paying customers. Please accept this credit as we work to regain your trust. To take advantage of the free month of service, please open a ticket through our helpdesk.