The MRDs were up but not responding, due to full log filesystems. The xproxies would time out on an ldapsearch() call, and then try to reconnect. They would then hang in an ldapbind() call.

Chris

Shufei Wen wrote:
Chris,

I might have missed your point. What's the conclusion on the root cause of the problem? Were the MRDs up or down at the time? First you said the xproxy processes hang when the MRDs were shut down, and later you indicated that the MRD returns ACKs?

Thanks,
Shufei

Chris Eastlund wrote:

Steve,

The problem was with the MRD, which filled the log filesystem, which halted the MRD. Both MRD machines filled their logs.

The proxy processes were hung waiting on the MRD. When the MRD was shut down, the proxy processes all recorded an MRD bind failure and a lot of sessions (5000) with sl=63000 or so. And then logins started again.

The question is why any proxy process kept trying an MRD query for 63,000 seconds, or 17.5 hours. The ldap search call has a timeout (default for m2k of 90 seconds) and gets tried twice by xproxy, and once (for timeout fails) in the libstdxdir library. This should take 3 minutes, max.

After the search fails, the proxy will attempt to close and reopen the session. That's where I think things hang. The MRD system returns an ACK, so the connection seems up and the connection timeout doesn't apply.

When ldapsearch is run against such a listener, truss shows:

    connect()                    # returns EINPROGRESS
    pollsys()
    time()
    write(4,....)                # seems to be the login sequence
    pollsys(0xFFBFF0E8,5,0,0)    # I think this means a poll() call with no timeout.

I can't find a pollsys() man page, as it's a Solaris internal call.

There are web pages noting this problem from about 2002, and our version of openldap is older than that.

Chris

Steve Prisco wrote:

including George

Steve

------------------------------------------------------------------------
*From:* Al Robinson [mailto:awr at maillennium.att.com]
*Sent:* Thursday, January 08, 2009 3:39 PM
*To:* 'M2K Development Team'
*Cc:* Mail Testers; PRISCO, STEVE (ATTLABS)
*Subject:* PXC02 OS patch

Patrick was running a load test on lzfwpxc02 last night and it was running fine until it wasn't.
It currently thinks all pop proxy processes on the blpop interface are busy. At least that's what the logs are saying and the response from mailman when a new connection is attempted.

The offered load was 2161 simultaneous sessions. For most of the night, mailman reported hiwater at approximately 3500/5000. At 4:32 it reported a hiwater of 4439/5000 followed by an XSFLOOD at 4:34. At this point, it doesn't seem to respond any more. Subsequent XSTAT logs with the load still active report hiwater of 0/5000 and 500k+ xnconns. The latest XSTAT showed a hiwater of 0/5000 with 347 xnconns.

We need development to look at the server and determine if it is a mailman problem or a problem with the OS patch. I assume a core file will be needed, but we haven't touched the system yet.

Al Robinson
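
The failure mode described in this thread (a listener that completes the TCP handshake but never answers, so the connect timeout never fires and the client waits forever in poll) can be reproduced in miniature. The sketch below is only an illustration in Python, not the xproxy or openldap code: `start_silent_listener` and `bind_with_deadline` are hypothetical names, and the "bind request" is a placeholder payload rather than a real LDAP message. It shows that putting a deadline on the read, and not just on the connect, turns the indefinite hang into a bounded failure.

```python
import socket


def start_silent_listener():
    """Hypothetical stand-in for a halted MRD: the kernel completes the
    TCP handshake for queued connections, but nothing ever replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)               # accept() is deliberately never called
    return srv


def bind_with_deadline(host, port, timeout_s):
    """Hypothetical stand-in for a bind call with a read deadline.

    The connect succeeds because the listener ACKs, so only the
    timeout on recv() keeps this from blocking indefinitely.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.settimeout(timeout_s)       # applies to send and recv as well
        conn.sendall(b"bind-request")    # placeholder payload, not real LDAP
        try:
            reply = conn.recv(1024)
        except socket.timeout:
            return "timed out"
        return "replied" if reply else "closed"


if __name__ == "__main__":
    srv = start_silent_listener()
    host, port = srv.getsockname()
    print(bind_with_deadline(host, port, 2.0))
    srv.close()
```

Without the `settimeout` call (or with a timeout of `None`), the `recv()` would block until the listener is shut down, which matches the behavior reported above where sessions only cleared once the MRD was stopped.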
_______________________________________________ Ietf mailing list Ietf at ietf.org https://www.ietf.org/mailman/listinfo/ietf
Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.