Today, we upgraded the Fuga platform to the Liberty release of OpenStack. The project had been delayed for some time: our testing and staging environments turned up more difficulties than we expected, and these had to be resolved before we put anything into production. With all of those issues resolved, we successfully upgraded the live platform today. This, too, took a little longer than we had anticipated, however.
A (slightly) extended maintenance window
We started at 07:30 CEST and had planned to be done around 09:00. It was 15:40, however, before we could finally give the ‘all clear’ sign. There were two main reasons for this. Firstly, we needed to upgrade the firmware of the switches in the platform before the platform update itself. This process (especially the resulting reboots) took far longer than we had anticipated. Because we had to upgrade the switches one at a time to avoid downtime, we could only start the actual OpenStack updates just before the end of the initially planned maintenance window.
The second reason was the change of plan that followed from this delay. With the update now running into office hours, we had to adjust the rest of the procedure to prevent unannounced and undesired downtime for our customers. We had to be really sure that we didn’t temporarily break things, and therefore chose to upgrade most components individually instead of all at the same time. After each upgrade, we made sure everything worked before we started on the next component. The result: a significantly extended time frame.
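The adjusted procedure (upgrade one component, verify it, then move on) can be sketched roughly as follows. This is only an illustration and not the tooling we actually used: the function name `upgrade_sequentially`, the `Step` type and the component names are all hypothetical placeholders.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical step description: (component name, upgrade action, health check).
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def upgrade_sequentially(steps: Iterable[Step]) -> List[str]:
    """Upgrade components one at a time, stopping at the first failed check."""
    completed: List[str] = []
    for name, upgrade, is_healthy in steps:
        upgrade()
        # Verify the component before touching the next one.
        if not is_healthy():
            raise RuntimeError(
                f"{name} failed its health check; completed so far: {completed}"
            )
        completed.append(name)
    return completed


# Placeholder components; real upgrade actions and checks would go here.
log: List[str] = []
steps: List[Step] = [
    ("component-a", lambda: log.append("a"), lambda: True),
    ("component-b", lambda: log.append("b"), lambda: True),
]
print(upgrade_sequentially(steps))  # ['component-a', 'component-b']
```

The trade-off is the one described above: every component waits for the previous one to be verified, so the total duration grows, but a failure surfaces before it can affect the components that follow.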
Minimal impact on availability
The modified procedure resulted in a much longer maintenance window, but one with minimal impact on availability. We initially even thought we had zero downtime, but it turned out we did have two short networking outages for some of our users (from 12:42 to 12:46 and from 13:22 to 13:24). Around 14:00, all components had been updated and all services were online once more. One persistent issue remained, concerning volumes attached to instances, but that, fortunately, didn’t have an impact on availability. We were finally able to resolve this last remaining problem at 15:40.
The Horizon dashboard was also unreliable during maintenance. In the future, we will probably simply take it offline during this type of maintenance in order to avoid confusion and frustration for our users.
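To put the times reported in this post into perspective, the short sketch below works out the planned and actual window lengths and the combined duration of the two networking outages. The helper `minutes_between` is a hypothetical name, and because the timestamps are given to the minute, the results are approximate.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


planned = minutes_between("07:30", "09:00")
actual = minutes_between("07:30", "15:40")
outages = [("12:42", "12:46"), ("13:22", "13:24")]
outage_total = sum(minutes_between(s, e) for s, e in outages)

print(planned)       # 90
print(actual)        # 490
print(outage_total)  # 6
```

In other words, the window ran for 8 hours and 10 minutes instead of the planned 90 minutes, while the networking outages added up to roughly 6 minutes for the affected users.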
So, what’s next?
The result is that we’re now running on a new version of OpenStack, the updated dashboard is live and we’re all set to implement the new features we’ve been working on for the past few months. We’ll inform you as soon as we’ve scheduled the dates and times at which we will make these available to you.