No Outage (not at 0650Z anyway!)
- To: Henrik Levkowetz <henrik at levkowetz.com>, Russ Housley <housley at vigilsec.com>, Ray Pelletier <rpelletier at isoc.org>, Working Group Chairs <wgchairs at ietf.org>, Karen Moreland <kmoreland at amsl.com>, Alexa Morris <amorris at amsl.com>, Matt Larson <mlarson at amsl.com>, jari.arkko at piuha.net, ekr at networkresonance.com, danny at tcb.net, Henrik Levkowetz <henrik at levkowetz.com>, Bill Fenner <fenner at fenron.com>, joelja at bogus.com, James M Galvin <galvin at elistx.com>, Lisa Winkler <lwinkler at amsl.com>, testlist at mail.ietf.org
- Subject: No Outage (not at 0650Z anyway!)
- From: Glen <glen at amsl.com>
- Date: Thu, 15 May 2008 20:42:40 -0700
- Delivered-to: ietfarch-wgchairs-archive at core3.amsl.com
- Delivered-to: wgchairs at core3.amsl.com
- In-reply-to: <48289AEB.6000706@levkowetz.com>
- List-archive: <https://www.ietf.org/mailman/private/wgchairs>
- List-help: <mailto:wgchairs-request@ietf.org?subject=help>
- List-id: Working Group Chairs <wgchairs.ietf.org>
- List-post: <mailto:wgchairs@ietf.org>
- List-subscribe: <https://www.ietf.org/mailman/listinfo/wgchairs>, <mailto:wgchairs-request@ietf.org?subject=subscribe>
- List-unsubscribe: <https://www.ietf.org/mailman/listinfo/wgchairs>, <mailto:wgchairs-request@ietf.org?subject=unsubscribe>
- References: <4828704D.5010809@amsl.com> <48289AEB.6000706@levkowetz.com>
- Sender: wgchairs-bounces at ietf.org
- User-agent: Thunderbird 2.0.0.14 (Windows/20080421)
All -
I wanted to take a moment and post a final message regarding our recent
outage, to let everyone know how things concluded and what we did about
this.
As you recall, the problem began shortly after we "upgraded" the system
to include TMDA processing for all of the IETF lists. Two nights after
applying that upgrade, we had a significant outage; then we had an
additional shorter outage several days later.
Having spent quite a bit of time researching and discussing our
findings, I've drawn the conclusion that the server crashed because it
ran out of kernel page table memory. The server itself is quite huge -
with 8 processor cores running at 3GHz, 8GB of RAM, dual 300GB mirrored
drive arrays, and so forth. But a comparatively small portion of memory
is used to manage all the rest of the memory in the system, and that
small portion became exhausted as a result of the increased process load
brought in by TMDA.
TMDA was not in itself the direct cause of the failure; TMDA was the
"straw", so to speak. The direct cause of the failure was that a key
section of kernel memory became exhausted, and the server attempted to
shut down processes to recover that memory. This caused most public
services (web, email, FTP) to be shut down, in essence causing the
outage. The server itself was still up and running... but no public
processes were running... so the end result was the same.
Once we realized what had happened, the solution was quite easy - we
simply increased (dramatically) the size of the memory area used by the
kernel, to more than twice the maximum size needed to operate all of the
existing services. Observation of the server under regular load after
this adjustment seems to have confirmed that this adjustment did in fact
resolve the problem, and we have made these adjustments permanent on all
IETF servers.
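As a rough illustration of how page table usage might be watched on a Linux
host: the message does not say which kernel setting was changed or how usage
was measured, so the sketch below is only an assumption. It parses
/proc/meminfo-style text and reports the PageTables field as a share of
MemTotal. The function names and the sample figures are hypothetical.

```python
# Illustrative sketch only; assumes a Linux host exposing /proc/meminfo.
# The sample figures below are made up and are not from the IETF servers.

SAMPLE = """\
MemTotal:        8000000 kB
MemFree:         1200000 kB
PageTables:        40000 kB
HugePages_Total:       0
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def page_table_share(meminfo):
    """Return PageTables / MemTotal, or None if either field is missing."""
    total = meminfo.get("MemTotal")
    tables = meminfo.get("PageTables")
    if not total or tables is None:
        return None
    return tables / total


if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    share = page_table_share(info)
    print("page tables use {:.2%} of total memory".format(share))
```

On a live system the same functions could be fed the contents of
/proc/meminfo instead of the sample text.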
Another issue relating to the outage was its overall duration.
To reduce the duration of any future outage, I've already implemented
additional steps, such as the installation of remote power control, and
remote server console access, on all IETF servers. This, obviously,
allows us to immediately interact with and restore a failed server,
without the need to send on-call staff to the site in the middle of the
night. We had already planned these additions; however, because of the
large volume of work pushed from "during cutover" to "post cutover",
those items had not yet been implemented. In consultation with AMS
owners, I took those items to the top of the list and implemented them
personally.
In addition, again thanks to commentary from the community, I've
implemented additional kernel changes on the IETF servers. Going
forward, the server will no longer be permitted to selectively kill
processes in an out-of-memory condition. Instead, the server will
execute an immediate reboot. Moreover, any type of software-based
failure of this type (an "oops" or panic) will also now result in an
immediate reboot.
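The message does not name the kernel settings involved. Assuming a Linux
kernel, behavior like this is commonly configured with the sysctls
vm.panic_on_oom, kernel.panic_on_oops, and kernel.panic. The sketch below
only reads those values and reports whether they match that assumed policy;
it changes nothing, and the function names are hypothetical.

```python
import os

# Assumed policy (not taken from the message): panic rather than OOM-kill,
# panic on an oops, and reboot after a panic instead of hanging.
WANTED = {
    "vm/panic_on_oom": lambda v: v >= 1,      # 0 means use the OOM killer
    "kernel/panic_on_oops": lambda v: v == 1,  # 0 means try to continue
    "kernel/panic": lambda v: v != 0,          # 0 means never reboot on panic
}


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl from /proc/sys; return None if unavailable."""
    try:
        with open(os.path.join(root, name)) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def audit(read=read_sysctl):
    """Return {sysctl name: (current value, matches assumed policy)}."""
    report = {}
    for name, accept in WANTED.items():
        value = read(name)
        report[name] = (value, value is not None and accept(value))
    return report


if __name__ == "__main__":
    for name, (value, ok) in audit().items():
        print("{:<24} {!s:<6} {}".format(name, value, "ok" if ok else "CHECK"))
```

Passing a different `read` function makes it possible to check the logic
against canned values without touching a real host.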
Another project on our "to-do" list was the implementation of server and
system monitoring software. Again, this was planned for the transition
period, but timing constraints forced us to delay this important
component. We have now implemented the popular NAGIOS system across our
Fremont data center, and are configuring its quite complex test set and
extending the system to our Orlando data center this coming week.
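For readers unfamiliar with how such tests work: a Nagios check is typically
a small program that prints one status line and exits with 0 (OK), 1
(WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below shows that
threshold logic in isolation. The service name and thresholds are invented
for illustration and do not describe the checks actually deployed.

```python
# Nagios plugin exit codes by convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a percentage reading to a Nagios-style status code."""
    if value is None:
        return UNKNOWN  # no reading could be obtained
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def status_line(service, value, warn, crit):
    """Return (exit code, one-line message) for a hypothetical check."""
    code = evaluate(value, warn, crit)
    if value is None:
        detail = "no reading"
    else:
        detail = "{:.1f}% used".format(value)
    return code, "{} {}: {}".format(service, LABELS[code], detail)
```

A real plugin would print the message and then exit with the returned code,
for example via `sys.exit(code)`.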
Obviously, all of these changes together will dramatically reduce the
potential length (and possibility!) of any future outages.
Finally, the topic was raised about additional servers.
We currently operate four servers dedicated to the IETF. Each is
configured and built as I outlined above. One server currently
functions as the primary operations server, and handles all of the
IETF's traffic without ever really breathing hard. The recent failure,
of course, is an exception, but that was a kernel utilization issue,
which was a rather unique thing, and not an overload of the server's
hardware resources. Each one of the servers we've assigned to the IETF
is the processing and power equivalent of three servers used by the
former provider. I was confident that a single server could handle all
of the operations of the IETF, and I still am.
Indeed, I agree strongly with the comment someone made that "throwing
another server at this won't solve the problem." That turned out to be
correct. OTOH, I also agree strongly with the comment about not
"putting all of our eggs in one basket."
So as I was saying, we have four servers, two in each of two data
centers, and are currently running in such a way that an instant
failover can be initiated to any of the other servers as desired, in the
event of a longer-term or hardware failure of the production server. So
we have four identical eggs in two identical baskets, as it were.
In addition, we are planning to implement a fifth offsite backup server
at ISOC, which will again be identical in function to the other four.
But I am more than open to the concept of putting in additional servers
should the need arise. One person commented today about adding
additional mail exchange servers. This is another comment I agree with;
therefore, in the coming weeks I'll be setting up an additional server
in Fremont to handle just incoming mail processing. Obviously, that
will not happen overnight, but it is something we will do going forward.
So, this outage, while unexpected, was not particularly catastrophic.
No data was lost, the outage cause was identified and corrected, and
we've taken appropriate steps to mitigate future outages and keep them
to much more manageable levels.
Of course, my personal vote would be to never have an outage again.
I hope this information is useful and informative, and brings to closure
any open questions regarding this outage. TMDA remains in place, thanks
primarily to the efforts of Henrik Levkowetz, and the servers are running
smoothly. In addition, I wish to thank everyone who provided support
and commentary during this highly stressful time. Thank you all.
I'm now "closing this ticket," but please feel free to contact me at any
time if you have further questions or concerns. Please be reminded that
I am not on the WGChairs list; therefore, you'll need to CC me if you
want me to see something.
Best regards,
Glen Barney