Outage
Message-ID: <4828704D.5010809 at amsl.com>
Date: Mon, 12 May 2008 09:29:01 -0700
From: Glen <glen at amsl.com>
To: Russ Housley <housley at vigilsec.com>, Ray Pelletier <rpelletier at isoc.org>,
Working Group Chairs <wgchairs at ietf.org>,
Karen Moreland <kmoreland at amsl.com>,
Alexa Morris <amorris at amsl.com>, Matt Larson <mlarson at amsl.com>
Subject: Outage
All -
Last night we had another outage. This outage was identical in every
respect to the outage last week. It occurred with exactly the same log
entries, at exactly the same time - to the minute - that the outage last
Wednesday night/Thursday morning happened.
On the good side, we know that the hardware - which is brand new - is
not failing. There are detailed log entries that show that the server
is overloading and oom-killing processes. The server itself stays up,
but kills off so many processes that public things, such as web and
email and so forth, simply stop working. And, because I installed remote
power control Thursday, Matt was able to power-cycle the server remotely
and get us back online much more quickly.
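For anyone who wants to line up the log evidence between the two nights, a short script can tally the kills by minute and by process name. This is only a sketch: it assumes classic syslog timestamps and an "Out of memory: Kill(ed) process" message, whose exact wording differs between kernel versions, and the name tally_oom_kills is made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape; the exact wording varies between kernel versions.
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")
# Assumed classic syslog prefix such as "May 11 23:50:01"; seconds are dropped.
STAMP_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} ")


def tally_oom_kills(lines):
    """Count OOM kills per minute and per process name.

    `lines` is any iterable of log lines, for example an open log file.
    Lines that do not look like an OOM kill are ignored.
    """
    by_minute = Counter()
    by_process = Counter()
    for line in lines:
        hit = OOM_RE.search(line)
        if not hit:
            continue
        stamp = STAMP_RE.match(line)
        by_minute[stamp.group(1) if stamp else "unknown"] += 1
        by_process[hit.group(2)] += 1
    return by_minute, by_process
```

Running it over the kernel log from each night and comparing the two results would show whether the same processes are being killed in the same order.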
On the bad side, this is the second time this has happened, and the only
thing we've changed recently - and the thing that the evidence points to
- is the addition of TMDA to all of the mailing lists.
The problem is that, although this has happened at exactly the same time
each night, that is the only pattern so far. After Thursday's outage, I
have personally sat up with the machine each night and watched it run
with no problems well past midnight... except for (you guessed it!) last
night, when, after several nights with no problem, I figured that the
re-tuning of processing rates and limits had solved the problem, and
went to bed at a slightly more reasonable time.
*sigh*
Of course, I've gone through all the obvious things. Nothing
specifically abnormal is running at that particular time on our end - no
cron jobs or anything like that. We run backups, syncs, updates, etc.,
every hour throughout the day and night, so nothing special is happening
on our end at 23:50 local time to cause this. But clearly, something is
going on at that time to cause us grief - at least on Sundays and
Wednesdays. (Note, of course, that these are local times here - they
translate into Mondays and Thursdays at 06:50 GMT, if I'm doing that
conversion correctly.)
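The conversion can be double-checked with a few lines of Python. This sketch assumes the server's local time uses the same -0700 offset that appears in this message's Date header, and it assumes the two outage nights were Wednesday 7 May and Sunday 11 May 2008, which is inferred from the date of this message rather than stated in it.

```python
from datetime import datetime, timedelta, timezone

# Offset taken from the -0700 in the Date header (assumed to match the server).
LOCAL = timezone(timedelta(hours=-7))


def to_utc(year, month, day, hour, minute):
    """Convert a local wall-clock time at UTC-7 to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=LOCAL)
    return local.astimezone(timezone.utc)


# Assumed calendar dates for the two outage nights (see the note above).
for y, m, d in [(2008, 5, 7), (2008, 5, 11)]:
    print(to_utc(y, m, d, 23, 50).strftime("%A %H:%M UTC"))
# prints "Thursday 06:50 UTC" then "Monday 06:50 UTC"
```

So 23:50 local on Wednesday and Sunday does land at 06:50 GMT on Thursday and Monday.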
I'm pretty confident that by taking TMDA back offline, this problem
would go away. However, I don't want to do this, and I know most of YOU
don't want that either. What I'd really like to do is catch this thing
in the act, and find out what's going on and why. Unfortunately, that,
too, may be unreasonable.
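One possible way to catch it in the act without sitting up with the machine is to record the largest memory consumers every few seconds around 23:50. The sketch below is an illustration only, not something that is running on the server: it assumes a Linux /proc filesystem, and the names parse_status, snapshot and watch are hypothetical.

```python
import os
import time


def parse_status(text):
    """Return (process name, resident memory in kB) from /proc/<pid>/status text."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])  # kernel threads have no VmRSS line
    return name, rss_kb


def snapshot(top=5, proc="/proc"):
    """List the `top` largest processes by resident memory right now."""
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        rows.append((rss_kb, int(entry), name))
    return sorted(rows, reverse=True)[:top]


def watch(interval_s=10, rounds=6):
    """Print a timestamped snapshot every `interval_s` seconds."""
    for _ in range(rounds):
        print(time.strftime("%H:%M:%S"), snapshot())
        time.sleep(interval_s)
```

Calling watch() from a job started a few minutes before 23:50, with the output redirected to a file, would leave a record of what was growing just before the kills began.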
So, I'll be spending the rest of the day trying to find other clues as
to what's happening. And we're then going to deploy a second server,
and assign all of the mail processing functions to it: Incoming mail
handling, virus checking, spam scanning/tagging, and TMDA. This is
going to take a couple of days to accomplish correctly, of course, but
this is the plan we have right now. Between now and then, I will stay
up with the server each night at least until after 23:50, to try to
ensure that no more outages occur (at least not at that time.)
Heretofore, the server running the IETF has barely been breathing hard
at all. I cannot believe that the addition of TMDA is causing this much
trouble... although TMDA is by its own author's admission not really
designed for this kind of volume. It grates on me to think that this
can somehow push this huge server over the edge. However, as I watch
TMDA run, it does seem to consume quite a bit of time and resources, and
does seem to be rather a dog.
Anyway, as I said, we're going to split these functions off onto another
server. We'll make this transition as quick and painless as possible,
and, hopefully, that will be the end of it.
Please remember that I am not on the wgchairs list, so if anyone has
anything to offer, please be sure to reply-all or at least include me.
Thanks for your patience during this time of adjustment and growth.
Glen Barney
IT Director
Association Management Solutions