
No Outage (not at 0650Z anyway!)



All -

I wanted to take a moment and post a final message regarding our recent outage, to let everyone know how things concluded and what we did about it.

As you recall, the problem began shortly after we "upgraded" the system to include TMDA processing for all of the IETF lists. Two nights after applying that upgrade, we had a significant outage; then we had an additional shorter outage several days later.

Having spent quite a bit of time researching and discussing our findings, I've drawn the conclusion that the server crashed because it ran out of kernel page table memory. The server itself is quite huge - with 8 processor cores running at 3GHz, 8GB of RAM, dual 300GB mirrored drive arrays, and so forth. But a comparatively small portion of memory is used to manage all the rest of the memory in the system, and that small portion became exhausted as a result of the increased process load brought in by TMDA.

TMDA was not in itself the direct cause of the failure; TMDA was the "straw", so to speak. The direct cause of the failure was that a key section of kernel memory became exhausted, and the server attempted to shut down processes to recover that memory. This caused most public services (web, email, FTP) to be shut down, in essence causing the outage. The server itself was still up and running... but no public processes were running... so the end result was the same.

Once we realized what had happened, the solution was quite easy - we simply increased (dramatically) the size of the memory area used by the kernel, to more than twice the maximum size needed to operate all of the existing services. Observation of the server under regular load after this adjustment seems to have confirmed that this adjustment did in fact resolve the problem, and we have made these adjustments permanent on all IETF servers.
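As an illustration only, here is a minimal sketch of how page table usage could be watched under load. It assumes a Linux host that exposes /proc/meminfo with a PageTables field; the message above does not name the operating system or the exact kernel parameter that was raised, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts with no unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def page_table_kb(text):
    """Return the PageTables value in kB, or None if the field is absent."""
    return parse_meminfo(text).get("PageTables")
```

On a Linux machine, passing the contents of /proc/meminfo to page_table_kb() at intervals would show whether page table memory keeps growing as process load rises.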

Another issue relating to the outage was the overall time of the outage. In order to reduce the time of the outage, I've already implemented additional steps, such as the installation of remote power control, and remote server console access, on all IETF servers. This, obviously, allows us to immediately interact with and restore a failed server, without the need to send on-call staff to the site in the middle of the night. We had already planned these additions; however, because of the large volume of work pushed from "during cutover" to "post cutover", those items had not yet been implemented. In consultation with AMS owners, I took those items to the top of the list and implemented them personally.

In addition, again thanks to commentary from the community, I've implemented additional kernel changes on the IETF servers. Going forward, the server will no longer be permitted to selectively kill processes in an out-of-memory condition. Instead, the server will execute an immediate reboot. Moreover, any software-based failure of this type (an "oops" or panic) will also now result in an immediate reboot.
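On Linux, behavior of this kind is commonly controlled through sysctl settings. The message does not list the settings that were actually changed, so the three names used below (vm.panic_on_oom, kernel.panic_on_oops, kernel.panic) are an assumption about one common way to get it, and the helper names are hypothetical. The sketch only audits values; it changes nothing.

```python
# Assumed policy (not taken from the message): panic instead of letting the
# OOM killer pick processes, treat an oops as a panic, and reboot on panic.
def check_policy(values):
    """Return a list of problems found in a dict of sysctl name -> int."""
    problems = []
    if values.get("vm.panic_on_oom", 0) == 0:
        problems.append("vm.panic_on_oom is 0: the OOM killer may kill processes")
    if values.get("kernel.panic_on_oops", 0) == 0:
        problems.append("kernel.panic_on_oops is 0: an oops will not panic")
    if values.get("kernel.panic", 0) == 0:
        # 0 means the kernel stays halted after a panic; nonzero means reboot.
        problems.append("kernel.panic is 0: a panic will not reboot")
    return problems


def read_sysctl(name):
    """Read one integer sysctl from /proc/sys (Linux only)."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return int(f.read().split()[0])
```

An empty list from check_policy() means all three assumed settings are in place; each string in a non-empty list names a setting that still permits the old behavior.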

Another project on our "to-do" list was the implementation of server and system monitoring software. Again, this was planned for the transition period, but timing constraints forced us to delay this important component. We have now implemented the popular NAGIOS system across our Fremont data center, and are configuring its quite complex test set and extending the system to our Orlando data center this coming week.
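For readers unfamiliar with how such tests plug in: a Nagios check is an external program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below shows that convention with a simple threshold test. The check name, thresholds, and function names are hypothetical and do not describe the checks actually deployed.

```python
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement to a plugin status; higher values are worse."""
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, status_line) for one check."""
    status = classify(value, warn, crit)
    return status, "%s %s - value=%s" % (name, LABELS[status], value)


def run_check(name, value, warn, crit):
    """Print the status line and exit the way a Nagios plugin would."""
    code, line = report(name, value, warn, crit)
    print(line)
    sys.exit(code)
```

A real plugin would take its measurement from the system, call run_check() once, and let Nagios interpret the exit code.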

Obviously, all of these changes together will dramatically reduce the potential length (and possibility!) of any future outages.

Finally, the topic was raised about additional servers.

We currently operate four servers dedicated to the IETF. Each is configured and built as I outlined above. One server currently functions as the primary operations server, and handles all of the IETF's traffic without ever really breathing hard. The recent failure, of course, is an exception, but that was a kernel utilization issue, which was a rather unique thing, and not an overload of the server's hardware resources. Each one of the servers we've assigned to the IETF is the processing and power equivalent of three servers used by the former provider. I was confident that a single server could handle all of the operations of the IETF, and I still am.

Indeed, I agree strongly with the comment someone made that "throwing another server at this won't solve the problem." That turned out to be correct. OTOH, I also agree strongly with the comment about not "putting all of our eggs in one basket."

So as I was saying, we have four servers, two in each of two data centers, and are currently running in such a way that an instant failover can be initiated to any of the other servers as desired, in the event of a longer-term or hardware failure of the production server. So we have four identical eggs in two identical baskets, as it were.

In addition, we are planning to implement a fifth offsite backup server at ISOC, which will again be identical in function to the other four.

But I am more than open to the concept of putting in additional servers should the need arise. One person commented today about adding additional mail exchange servers. This is another comment I agree with; therefore, in the coming weeks I'll be setting up an additional server in Fremont to handle just incoming mail processing. Obviously, that will not happen overnight, but it is something we will do going forward.

So, this outage, while unexpected, was not particularly catastrophic. No data was lost, the outage cause was identified and corrected, and we've taken appropriate steps to make future outages less likely and to keep any that do occur much shorter and more manageable.

Of course, my personal vote would be to never have an outage again.

I hope this information is useful and informative, and brings to closure any open questions regarding this outage. TMDA remains in place, thanks primarily to the efforts of Henrik Levkowetz, and the servers are running smoothly. In addition, I wish to thank everyone who provided support and commentary during this highly stressful time. Thank you all.

I'm now "closing this ticket," but please feel free to contact me at any time if you have further questions or concerns. Please be reminded that I am not on the WGChairs list; therefore, you'll need to CC me if you want me to see something.

Best regards,
Glen Barney