No Outage (not at 0650Z anyway!)
- To: Henrik Levkowetz <henrik at levkowetz.com>, Russ Housley <housley at vigilsec.com>, Ray Pelletier <rpelletier at isoc.org>, Working Group Chairs <wgchairs at ietf.org>, Karen Moreland <kmoreland at amsl.com>, Alexa Morris <amorris at amsl.com>, Matt Larson <mlarson at amsl.com>, jari.arkko at piuha.net, ekr at networkresonance.com, danny at tcb.net, Henrik Levkowetz <henrik at levkowetz.com>, Bill Fenner <fenner at fenron.com>, joelja at bogus.com, James M Galvin <galvin at elistx.com>, Lisa Winkler <lwinkler at amsl.com>, testlist at mail.ietf.org
- Subject: No Outage (not at 0650Z anyway!)
- From: Glen <glen at amsl.com>
- Date: Tue, 13 May 2008 00:09:46 -0700
- Delivered-to: ietfarch-wgchairs-web-archive at core3.amsl.com
- Delivered-to: wgchairs at core3.amsl.com
- In-reply-to: <48289AEB.6000706@levkowetz.com>
- List-archive: <https://www.ietf.org/mailman/private/wgchairs>
- List-id: Working Group Chairs <wgchairs.ietf.org>
- References: <4828704D.5010809@amsl.com> <48289AEB.6000706@levkowetz.com>
- Sender: wgchairs-bounces at ietf.org
- User-agent: Thunderbird 2.0.0.14 (Windows/20080421)
All -
Good morning.
At 06:55 GMT - five minutes past the time of our two outages this past
week - the server is running quite smoothly and unencumbered, and has
not (at least so far) gone down or shown any signs of doing so.
Progress was definitely made today. Things we did include such elements
as: (Suppressing the desire to work "Fear" and "Surprise" into this
message)...
* Did a bunch of digging through the kernel. Found some strangeness in
the way memory was being handled generally, found a possible way of
adding some sanity to the oom-killer process, re-tuned the kernel to
implement and fix that.
* In concert with Henrik, installed a modified version of the TMDA
handler to fine-tune how TMDA is dispatched, thus reducing significantly
the added load from TMDA (with great success, thank you Henrik!)
* Installed a logging system to capture snapshots of all kinds of server
data and log them; in the event that the server -does- go down this will
hopefully give us something more to look at.
* Found a way to exempt certain processes from the oom-killer, applied
those exemptions to the syslog and cron and sshd processes. I -hope- I
never get to test this, but I applied it nonetheless.
And of course,
* Sat up with the machine until now, to watch firsthand what is happening.
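The oom-killer exemption in the list above can be illustrated roughly as follows. This is only a sketch under assumptions, not the procedure actually used on the server: the message does not say which mechanism was found. On kernels of that period the per-process knob was /proc/<pid>/oom_adj (where -17 disables OOM killing); newer kernels use /proc/<pid>/oom_score_adj (where -1000 does the same) and also provide /proc/<pid>/comm, which this sketch relies on to find processes by name. The function names and the proc_root parameter are hypothetical.

```python
from pathlib import Path

# Assumption: try the newer knob first, then fall back to the older one.
# -1000 (oom_score_adj) and -17 (oom_adj) both mean "never pick this process".
OOM_KNOBS = (("oom_score_adj", "-1000"), ("oom_adj", "-17"))


def pids_by_name(names, proc_root="/proc"):
    """Return {pid: command_name} for processes whose name is in `names`."""
    root = Path(proc_root)
    found = {}
    if not root.is_dir():
        return found
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited, or we lack permission
        if comm in names:
            found[int(entry.name)] = comm
    return found


def exempt_from_oom_killer(pid, proc_root="/proc"):
    """Write the exemption value to the first knob that exists.

    Returns the knob name that was written, or None if neither exists.
    Writing a negative value normally requires root.
    """
    for knob, value in OOM_KNOBS:
        path = Path(proc_root) / str(pid) / knob
        if path.exists():
            path.write_text(value)
            return knob
    return None
```

Run as root, something like `for pid in pids_by_name({"sshd", "crond", "rsyslogd"}): exempt_from_oom_killer(pid)` would cover the three services named above; the exact daemon names vary by distribution and are assumptions here. The setting applies to a running process, so it has to be reapplied whenever a daemon restarts.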
I did see a lot of Python activity, and a lot of datatracker activity,
and that activity increased dramatically after 06:30Z. TMDA activity
also increased dramatically; the new handler performed very well and our
overall load remained low. The usual other things all ran correctly and
as they should, and the server performed perfectly, averaging less than
2% CPU, peaking at 17% at one point, with memory staying well below the
50% mark throughout the night.
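For readers wondering what a snapshot logger of the kind mentioned above might look like, here is a minimal sketch. It is not the logging system that was installed (the message gives no details of it). It assumes a Linux /proc filesystem, records only the one-minute load average and an approximate memory-used percentage, and every function name in it is hypothetical.

```python
import os
import time


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into {field_name: value_in_kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_used_percent(info):
    """Percent of RAM in use, treating buffers and page cache as reclaimable."""
    total = info["MemTotal"]
    reclaimable = (info.get("MemFree", 0) + info.get("Buffers", 0)
                   + info.get("Cached", 0))
    return 100.0 * (total - reclaimable) / total


def snapshot_line(meminfo_text, loadavg_text, now=None):
    """Format one log line; `now` is seconds since the epoch (default: current time)."""
    info = parse_meminfo(meminfo_text)
    load1 = float(loadavg_text.split()[0])
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return f"{stamp} load1={load1:.2f} mem_used={memory_used_percent(info):.1f}%"


def append_snapshot(logfile, meminfo_path="/proc/meminfo",
                    loadavg_path="/proc/loadavg"):
    """Append one snapshot and force it to disk so it survives a crash."""
    with open(meminfo_path) as mem, open(loadavg_path) as load:
        line = snapshot_line(mem.read(), load.read())
    with open(logfile, "a") as out:
        out.write(line + "\n")
        out.flush()
        os.fsync(out.fileno())
```

Calling append_snapshot from cron once a minute would be one way to drive it. The file is opened, synced, and closed on every call because the data is most valuable in the moments before a crash, when buffered writes are most likely to be lost.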
So did these changes solve the problem? The re-tuning and the TMDA
stuff -may- have. I won't stay up every night, but I *will* stay up
Wednesday-into-Thursday again, since that's when the last crash
occurred, just to see what happens... and we'll go from there.
I also found a way to have the oom-killer reboot the server if it gets
into that much trouble. I haven't turned that on yet; however, when
this little episode is all done, I will turn it on, so that any failures
of this type in the future are hopefully limited to a few minutes of
automatic recovery, rather than outages like this.
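One common way to get this reboot-on-OOM behaviour on Linux, sketched here as an assumption since the message does not name the mechanism, is to pair two sysctl settings: vm.panic_on_oom makes the kernel panic on an out-of-memory condition instead of killing a process, and kernel.panic makes it reboot a given number of seconds after any panic. The 30-second delay, the function name, and the sys_root parameter below are illustrative only.

```python
from pathlib import Path

# Assumed settings, expressed as paths relative to /proc/sys.
PANIC_SETTINGS = {
    "vm/panic_on_oom": "1",  # panic on OOM rather than killing a process
    "kernel/panic": "30",    # reboot 30 seconds after a panic (illustrative)
}


def enable_reboot_on_oom(sys_root="/proc/sys", settings=PANIC_SETTINGS):
    """Write each setting and return the values read back afterwards.

    Requires root when pointed at the real /proc/sys. The change lasts
    only until the next boot unless it is also added to /etc/sysctl.conf.
    """
    applied = {}
    for rel_path, value in settings.items():
        path = Path(sys_root) / rel_path
        path.write_text(value)
        applied[rel_path] = path.read_text().strip()
    return applied
```

The trade-off is that a panic takes down every service at once, so this suits a machine where a few minutes of automatic recovery is preferable to a hung server waiting for manual intervention.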
Have a good night/day as appropriate, and, if nothing happens between
now and then, I'll post another email Thursday.
Glen Barney