[Mageia-sysadm] Http server problem

Sat Nov 20 10:25:18 CET 2010

Hi,
as some of you may have seen, the http server on mageia.org was down this
morning, since 00:18 (CEST). The server is still hosted on zarb.org,
so I suspect no one could restart it except the admin team.
I restarted it at 09:35.

From what I have seen, the problem was caused by an http
process that was not killed at midnight. The restart was triggered by the
automated cfengine upgrade, after it installed the new php ([3] MDVSA-2010:239).

Munin [1] shows a peak in I/O on /var, /home and swap that somehow impacted
mysql, postfix and nagios. I suspect the cause was swap activity due to some
problem in the apache process.
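
For anyone who wants to check this kind of suspicion while the problem is
still happening, here is a minimal sketch of how per-process swap usage
could be listed. It assumes a Linux kernel that exposes a VmSwap line in
/proc/<pid>/status, which I have not checked on the server. The function
names are made up for illustration; this is not a tool we currently run.

```python
import os


def status_field(status_text, key):
    """Return the value of one 'Key: value' line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(key + ":"):
            return line.split(":", 1)[1].strip()
    return None


def swap_kb(status_text):
    """Return the VmSwap value in kB, or 0 if the line is absent."""
    value = status_field(status_text, "VmSwap")
    return int(value.split()[0]) if value else 0


def top_swappers(limit=5, proc="/proc"):
    """Return up to `limit` (swap_kb, pid, name) tuples, largest swap first."""
    usage = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited meanwhile, or not readable
        usage.append((swap_kb(text), int(entry), status_field(text, "Name") or "?"))
    usage.sort(reverse=True)
    return usage[:limit]


if __name__ == "__main__":
    for kb, pid, name in top_swappers():
        print(f"{kb:>10} kB  pid {pid:<7} {name}")
```

Run as root, this should show whether apache processes dominate swap. It
only gives a snapshot, so it complements the Munin graphs rather than
replacing them.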

My own analysis is that the apache restart script was unable to cope with the
fact that one process was not killed. As the systemd developer reminds us [2], this
is quite hard to do properly on Linux with current tools.
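
To illustrate what "coping" could look like, here is a minimal sketch of a
stop routine that sends SIGTERM, waits for a grace period, and falls back to
SIGKILL. This is not what the apache init script does; the function names
and the default timeouts are assumptions chosen for the example.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def stop_process(pid, grace=10.0, poll=0.2):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    Returns "gone" if the process did not exist, "terminated" if it exited
    after SIGTERM, and "killed" if SIGKILL was sent.
    """
    if not pid_alive(pid):
        return "gone"
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"  # exited between the check and the signal
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "terminated"
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"  # exited just before the fallback
    return "killed"
```

Even this only handles a single PID known in advance. It does nothing about
child processes, and a PID can be reused by an unrelated process between the
check and the signal, which is part of why doing this properly is hard.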

I also suspect that the php upgrade somehow interfered with apache,
causing weird behavior. This is likely unreproducible and undebuggable,
maybe a race condition due to the high activity of the server combined
with the slowness, I do not know. Anyway, I think we are safe until the next php
upgrade.

[1] https://www.zarb.org/munin/servers/ryu/index.html
[2] http://0pointer.de/blog/projects/systemd-for-admins-4.html
[3] http://www.mandriva.com/en/security/advisories?name=MDVSA-2010:239

-- 
Michael Scherer