[Mageia-sysadm] questions about our infrastructure setup & costs
boklm at mars-attacks.org
Mon Apr 2 21:41:56 CEST 2012
On Mon, 02 Apr 2012, Romain d'Alverny wrote:
> On Mon, Apr 2, 2012 at 17:49, nicolas vigier <boklm at mars-attacks.org> wrote:
> > Using paid hosting will not remove problems like bad RJ45 or switch
> > that stop working. If we want good availability, we need more servers
> > in different places.
> In paid hosting, (physical) server and link failure is to be directly
> handled by people that have a financial incentive to have it work. I
> expect (but may be wrong) that the availability will be higher than
> what we have today, and that it is still affordable for _some_
> services. It's not about going full speed to paid services or spending
> money unnecessarily, it's about using what we can (including money)
> to improve the availability of our systems.
It doesn't matter whether it's paid hosting or not: if a switch stops
working on Friday evening and nobody is available to go to the
datacenter to replace it, then everything will be offline until one of
us has time to go there. We can pay for very expensive datacenter
hosting, but they won't replace our switch if it stops working.
We can also pay for expensive datacenter hosting and still have power
outages, network problems because of a flood or some other reason,
air-conditioning problems, etc.
And we can also pay for expensive hosting at EC2 and have 2 days of downtime:
In 1 year we had only one major unexpected downtime on our servers,
because of a bad network cable, on a Friday evening, and hopefully this
kind of problem does not happen very often. Before this we had more
downtime on the servers hosted at Gandi, because of problems on their
storage servers that affected all their customers.
> The point is that: I don't know and I don't have the data to get an
> idea about that; and I'm not even sure the data needed is compiled
> somewhere at this time. And I suspect I'm not alone in this case. If I
> don't ask, someone else will later. Or even worse than that, won't
> dare to ask.
> That's why I'm asking for this for those two purposes: explaining more
> how it works, understanding how it could work.
> - functional split list => your skills/job
> - needs per functional unit => same
> - dependencies between units => same 
> - cost per unit in different contexts => can be spread around
We don't have a lot of servers, so there is no need for a complex
dependency graph to see that all of the servers are critical, and that
downtime of any one of them will cause problems somewhere. If we want to
reduce the risk of having a lot of services down at the same time, then
we need more servers, hosted in different places.
> And yes, it may be too expensive. Or it may not. But I suspect we
> don't know, or it's not obvious enough. On the other hand, having one,
> or several server downtime like this for 2/3 days also costs a lot to
> the project (loss of time, and reputation shift).
If we can't afford 2 days of downtime, then we should probably stop
everything now and do something else.
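To put that figure in perspective, here is a rough sketch of the arithmetic. The 2-day figure is the one discussed above; the 365-day year and the other sample durations are assumptions chosen for illustration, and the function name is hypothetical.

```python
def availability_percent(downtime_hours, period_hours=365 * 24):
    """Return the percentage of the period during which the service was up.

    Assumes a 365-day year by default; this is an illustrative simplification.
    """
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (period_hours - downtime_hours) / period_hours


if __name__ == "__main__":
    # Sample downtime durations per year (illustrative values).
    for label, hours in [("2 hours", 2), ("1 day", 24), ("2 days", 48)]:
        pct = availability_percent(hours)
        print(f"{label:>8} of downtime per year -> {pct:.2f}% available")
```

Under these assumptions, 2 days of downtime in a year still corresponds to roughly 99.45% availability.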
Projects with more money and more machines than us also have unexpected
downtime.
Fedora had almost 1 day of downtime on their build system in December:
And if we read their mailing list archives we can see 2 hours on many
services in January 2012, 1 hour for the build system in February 2012,
2 hours in January 2012, etc.
In April 2010 Debian had their buildd.debian.org server down on a Friday
and restored on the Monday, wiki.debian.org down for one day, and
forums.debian.org down for a few days:
The wiki was down for an unknown time in January 2010:
ftp-master was down in January 2011:
And I think it's the same for most projects.