Re: Outage



Thanks to all of you for your helpful commentary today. My apologies for not replying individually, but please bear with me as I cover a few things.

Jim Galvin suggested concurrency limits. Excellent idea. Alas, we already have that in place:

smtpd_client_connection_count_limit = 1
smtpd_client_connection_rate_limit = 4
smtpd_client_event_limit_exceptions = ${smtpd_client_connection_limit_exceptions:$mynetworks}

In a private message, I was told:

Ordinarily I would just let you handle this, but you did ask if people
had anything to offer, so...

Absolutely. My job is just to keep things running as smoothly as possible and take care of requests from you, the community representatives, as efficiently as I can. So I'm always appreciative of thoughts and ideas when things like this happen. I've got several Linux engineers here, but we are always interested in other ideas.

Same sender:
- I assume that "oom" is out of memory. Unix systems don't typically
  kill processes when the box starts running low on memory. They just
  thrash a lot as more and more of the system starts to reside outside
  of RAM. So, if you're seeing processes get killed, it's
  probably because those processes overflowed the maximum memory
  size, in which case you've got a pretty good guess that something
  is severely wrong with those processes. So, what's failing?

Yes, I remember those days. Actually, our system (openSUSE 10.3) and, apparently, newer systems in general have something called "oom_killer"

http://linux-mm.org/OOM_Killer

which is what started running at 23:50 each of those two nights and killing things.

May 11 23:50:28 core3 tmda-wrapper: User: nobody
May 11 23:50:28 core3 tmda-wrapper: USER     : magma
May 11 23:50:28 core3 tmda-wrapper: USER     : magma-request
May 11 23:50:28 core3 tmda-wrapper: SENDER   : atsalmonella at cites.in
May 11 23:50:28 core3 tmda-wrapper: SENDER   : atsalmonella at cites.in
May 11 23:50:28 core3 tmda-wrapper: RECIPIENT: magma at core3.amsl.com
May 11 23:50:28 core3 tmda-wrapper: RECIPIENT: magma-request at core3.amsl.com
May 11 23:50:28 core3 tmda-wrapper: EXTENSION:
May 11 23:50:28 core3 tmda-wrapper: EXTENSION:
May 11 23:50:28 core3 tdma-wrapper:  * Input piped to tmda-filter
May 11 23:50:28 core3 tdma-wrapper:  * Input piped to tmda-filter
May 11 23:50:31 core3 kernel: find invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
May 11 23:50:31 core3 kernel:  [<c0159b2a>] out_of_memory+0x69/0x1a7
May 11 23:50:31 core3 kernel:  [<c015b0be>] __alloc_pages+0x219/0x2d6
May 11 23:50:31 core3 kernel:  [<c01d7ec3>] copy_to_user+0x25/0x3a
May 11 23:50:31 core3 kernel:  [<c015b1a7>] __get_free_pages+0x2c/0x3a

Lots of TMDA stuff runs just before the oom-killer, then LOTS of entries for the oom-killer, and so forth and so on, until even syslog gets killed.

In the old days, I saw systems thrash and thrash, and I could still log in and recover them if I was willing to wait 5 minutes for a shell. Then a quick "nice --19 sh" got me some access.

But this oom_killer thing kills off syslogd, sshd, crond, apache, postfix, etc., until the system "is happy" again. So the server doesn't really "go down" - but, for all intents and purposes, it amounts to the same thing.

As an option, I'm looking into ways of disabling this process. So far, no luck - the required /proc entries are missing on core3. But I will keep looking.
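For anyone who wants to check a box of their own, here is a minimal sketch that reports which OOM-related /proc entries are present. The list of candidate paths is an assumption: which of them exist depends on the kernel version, and I am not claiming any particular one is present or missing on core3.

```python
import os

# Candidate knobs (an assumed list; availability varies by kernel version).
CANDIDATES = [
    "/proc/sys/vm/overcommit_memory",
    "/proc/sys/vm/overcommit_ratio",
    "/proc/sys/vm/panic_on_oom",
    "/proc/self/oom_adj",
    "/proc/self/oom_score_adj",
]


def available_knobs(paths=CANDIDATES):
    """Return {path: current value} for every knob that exists and is readable."""
    found = {}
    for path in paths:
        try:
            with open(path) as fh:
                found[path] = fh.read().strip()
        except OSError:
            # Missing or unreadable on this kernel; skip it.
            continue
    return found


if __name__ == "__main__":
    for path, value in available_knobs().items():
        print(path, "=", value)
```

Anything the script does not print is simply not offered by the running kernel, which would narrow down which approaches are even possible.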

- Do you have more detailed process logs? The natural thing to
- Have you tried turning on process accounting? That will tell
- Can you take tcpdumps around the time to see if there's some
- As a last ditch, why not just have a watchdog process on another

One of the things we wanted to do during the cutover was set up Nagios and/or other monitoring processes to do this work. Time did not permit us that option. I have one of my engineers working on that now, and it should be ready in a day or two.

I, of course, still want to determine the cause. So I'm also implementing those suggestions as appropriate for this situation - all of which were excellent.

Henrik wrote:

Not that I can see.  I have only 2 cronjobs which do external accesses once
per day, at local time (now CEST) 07:00 and 16:30.  Other cron jobs run once
per hour, the heaviest of those starts at NN:45 (45 minutes after the hour
every hour).

"The heaviest" would then be starting five minutes before the server ran out of memory...

... but again, things have been running just fine from the beginning, until this week.

I have a job that started at 11:30 PM... it's a find on the entire filesystem, so I thought, maybe that was the problem. But I ran it today by hand, and it took 8 minutes and ran with no visible resource impact.

Henrik then said:

I guess it really should be running as a daemon.

And Joel said:

If you're running it as part of local delivery just doing a little locking will put enough backpressure on it to keep it from spamming you with processes.

I'm much more in favor of this. Daemonization is over-working it a bit, but having messages spin on a lock file makes much more sense to me. I think something like that could be written into the initial wrapper we're using, and I will consult privately with the author about that.
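To make the lock-file idea concrete, here is a rough sketch of a wrapper that holds an exclusive lock while the filter runs. It is only an illustration, not the actual wrapper: the lock path and the function name run_serialized are hypothetical, and it assumes the filter reads the message on stdin.

```python
import fcntl
import subprocess

LOCK_PATH = "/var/lock/tmda-wrapper.lock"  # hypothetical location


def run_serialized(cmd, lock_path=LOCK_PATH, stdin=None):
    """Run cmd while holding an exclusive lock on lock_path.

    Concurrent callers block in flock() until the lock is free, so at most
    one filter process runs at a time and the rest wait cheaply.
    """
    with open(lock_path, "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            return subprocess.call(cmd, stdin=stdin)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


# Example (hypothetical command line):
#   status = run_serialized(["tmda-filter"], stdin=sys.stdin)
```

A single lock serializes delivery completely. If that turns out to be too strict, a small pool of lock files would allow a fixed number of filters to run in parallel while still capping the total.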

Finally, Jari wrote:
I feel your pain.

THANK YOU, Jari, for your support.

We're just trying to accommodate everything the IETF wants here. It's VERY frustrating and stressful to have a brand new huge server start failing without clear cause in the middle of the night. So be assured we're all scrambling here to get this resolved as quickly as possible.

For now, I really would like to try to solve the problem directly rather than just throw another server at it. I have a second server, right next to the first, equally big (8 3GHz cores, 8 GB RAM, 2x 300GB mirrored drive arrays, 1333 bus, etc. etc.) sitting as a hot backup for the production machine. It would not be a big step to reorganize that machine to be the mail exchanger server and let it bear that load.

But the bottom line is that "normally", this big huge server runs at a load average of less than 4, with 3-6% of CPU utilized, and 25% of RAM utilized, and that's the case even during cron jobs. So for this server to start running oom-killer, something particularly ugly must be happening. And I don't want to just "cover up" that problem by throwing another $10,000 server at it. I really want to *find* the problem and fix it.

Which will take several days of watching, and the potential for another crash or two, but which, I think, is worth it.

I don't want to just "dismiss" a problem like this (or any problem). I want to solve it.

So unless there is strong objection, my plan is to put in the additional monitoring we've all discussed, and watch and see if we can "catch this in the act" again. If everyone is willing to go along with that plan, I think it will provide the best possible outcome.
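As a rough idea of the kind of lightweight logging that could run alongside the real monitoring, here is a sketch that appends a memory snapshot to a file every few seconds, so that the record leading up to a crash survives even if syslog is killed. The function names, the log path and the interval are all assumptions for illustration, and it only records system-wide figures from /proc/meminfo, not per-process usage.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def snapshot(meminfo_path="/proc/meminfo"):
    """Return one line summarizing free memory and swap (values in kB)."""
    with open(meminfo_path) as fh:
        info = parse_meminfo(fh.read())
    fields = ("MemTotal", "MemFree", "SwapTotal", "SwapFree")
    return " ".join("%s=%s" % (k, info.get(k, "?")) for k in fields)


def watch(log_path, interval=10, iterations=None):
    """Append a timestamped snapshot every `interval` seconds.

    The file is reopened for each write so every line is flushed to disk
    before the next sleep. iterations=None means run until killed.
    """
    count = 0
    while iterations is None or count < iterations:
        stamp = time.strftime("%b %d %H:%M:%S")
        with open(log_path, "a") as log:
            log.write("%s %s\n" % (stamp, snapshot()))
        count += 1
        time.sleep(interval)


# Example (hypothetical path): watch("/var/log/memwatch.log", interval=10)
```

Since the trouble has been starting around 23:50, a short interval in the hour before midnight would show whether free memory drains gradually or collapses all at once.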

And if we find nothing else after several days, I'll bring that second server online as an inbound mail processor, and see how THAT goes.

Thank you all, again, for your support with this matter.

Glen