The following has been provided by the DataCite Tech Team:
After everything had been tested for a long time, we switched to the new machine on May 11th. The switch was absolutely smooth. There was no service disruption at all!
But four days later some connection timeouts for MDS occurred, and shortly afterwards MDS became completely unreachable. Unfortunately, due to a configuration error, our monitoring system did not notice this. (This is of course fixed now!) This was the main reason for the long duration of the outage.
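As an illustration of the kind of external check that distinguishes a timeout from a completely unreachable service, here is a minimal sketch in Python. It is not the actual DataCite monitoring configuration: the function name `probe`, the timeout value, and the four result labels are assumptions made for this example.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Classify one HTTP check as 'ok', 'error', 'timeout' or 'unreachable'.

    `opener` is injectable so the logic can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as exc:
        # The server answered. A 4xx (e.g. 401 on a protected API) still
        # means the service is up; a 5xx usually points at the backend.
        return "error" if exc.code >= 500 else "ok"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except (urllib.error.URLError, OSError) as exc:
        # urllib wraps timeouts raised while connecting in a URLError.
        reason = getattr(exc, "reason", None)
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
```

A scheduler would call `probe` every minute or so and alert on anything other than "ok". Running the check from a host outside the monitored machine matters, because a probe on the same machine can keep passing while the public endpoint is down.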
When we noticed the problem the next day, we immediately switched back to the old machine. Everything was back to normal and we had time to investigate.
So what caused the outage? The connection between two key server components (Tomcat and Apache with proxy_ajp) broke down. The reason for this is unclear. Unfortunately, we were also unable to reproduce the error, no matter how hard we hit MDS. Without a reproducible failure it is very hard to find a fix that would make us confident enough to try this setup again.
So after some discussion we decided to sidestep any potential root causes of the problem. We migrated to a more modern and scalable web server (nginx). This took us a while, but the setup is now in place and we switched to it on Sunday. We are very confident that we now have a modern and reliable system.
However, this switch was not as smooth as the first one. Two problems occurred:
1. We had to install a new SSL certificate because the old one was expiring. However, we failed to include the intermediate certificate. This might have broken your API clients. Due to browser caching, probably only a minority of UI users were affected. This was fixed immediately after we learned of it on Tuesday.
2. We also enabled HTTPS on schema.datacite.org and http://www.datacite.org. This caused a problem in MDS when reading the schema needed for validation: MDS was rejecting all metadata uploads. This is also fixed now.
Both problems were hard to detect at the time of the switch or beforehand because, due to caching, neither occurred immediately.
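For the first problem, a check that does not benefit from any cache can be sketched as follows. Python's `ssl` module verifies only the certificates the server actually sends against the system trust store, so a missing intermediate certificate typically shows up as a verification error on a fresh handshake. This is an illustrative sketch, not a tool we used; the function names `strict_context` and `certificate_verifies` are assumptions, and a verification failure can also have other causes (an expired certificate or a host name mismatch).

```python
import socket
import ssl


def strict_context():
    """TLS context that verifies the chain and host name against system roots."""
    context = ssl.create_default_context()
    # create_default_context() already enables both settings;
    # they are repeated here to make the intent explicit.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def certificate_verifies(host, port=443, timeout=10.0):
    """Return True if a fresh handshake verifies, False on a certificate error.

    Network failures (DNS, refused connection, timeout) are not caught here
    and propagate to the caller as OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with strict_context().wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

Calling `certificate_verifies` with the service's host name right after installing a new certificate behaves like an API client with no cached state, which is the situation a browser-based test tends to hide.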
We are very sorry for all the inconvenience. We have learned from these issues and, for example, improved our monitoring system. We are very confident that MDS is now stable again, and that all future server migrations will be smooth.
DataCite Tech Team