[Mageia-sysadm] questions about our infrastructure setup & costs

Mon Apr 2 18:29:59 CEST 2012

On Mon, Apr 2, 2012 at 17:49, nicolas vigier <boklm at mars-attacks.org> wrote:
> Using paid hosting will not remove problems like a bad RJ45 or a switch
> that stops working. If we want good availability, we need more servers
> in different places.

With paid hosting, (physical) server and link failures are handled
directly by people who have a financial incentive to make things work. I
expect (but may be wrong) that availability would be higher than
what we have today, and that it is still affordable for _some_
services. It's not about rushing to paid services or spending
money unnecessarily; it's about using what we have (money included)
to improve the availability of our systems.

The point is: I don't know, and I don't have the data to form an
opinion about that; I'm not even sure the data needed has been compiled
anywhere yet. And I suspect I'm not alone in this. If I
don't ask, someone else will later. Or, even worse, won't
dare to ask.

That's why I'm asking for this, for two purposes: explaining better
how it works, and understanding how it could work.
 - functional split list =>  your skills/job
 - needs per functional unit => same
 - dependencies between units => same [1]
 - cost per unit in different contexts => can be spread around

And yes, it may be too expensive. Or it may not. But I suspect we
don't know, or that it's not obvious enough. On the other hand, having one
or several servers down like this for 2-3 days also costs the project
a lot (lost time, and damaged reputation).

> I think it could help to have a few other servers in a different
> datacenter, and duplicate services when it is easy to do (like ldap,
> mirrors list, etc ...). In other cases like build system scheduler /
> main mirror however, it's difficult to duplicate without conflicts
> between the different servers, and cannot be done in a few minutes (or
> hours), so I think we should migrate only in case of very long outage
> (when we know that it will take more than a few days). Having a few
> more servers available would help if we need to migrate some things in
> case of a big problem on one of the servers.

Yes.

>>   * would it be possible to have the systems hosting our services to
>> have a prefix in their fqdn with the city/country they are located in? [...]
>>   * what do you think about maintaining a separate blog [...]
> We are already adding posts on the official blog, and mails on the mailing
> list in case of problems, so what would be the use of another separate
> blog for this ?

Let's forget about these two.

[1] take the attached .dot file; it's rough, very incomplete (missing
subsystems, roles), probably incorrect, but it gives an idea of which
nodes need the maximum level of availability, which
ones should not affect each other, and where some decoupling
could be done. You could even add a layer showing who is using/responsible
for what, or where the systems are. This is far more telling than a
puppet configuration or a text file.
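
To illustrate what such a graph can tell you, here is a minimal sketch in
Python. The node names and edges are invented for the example (they are
NOT the contents of the attached file and probably don't match the real
setup), and the function names are hypothetical. It simply ranks each
node by how many other nodes would be affected, directly or indirectly,
if it went down.

```python
from collections import defaultdict

# Hypothetical edges, read as "key depends on each item in the list".
# Purely illustrative; not the real Mageia infrastructure layout.
DEPENDS_ON = {
    "website": ["ldap"],
    "build_scheduler": ["ldap", "main_mirror"],
    "mirrors_list": ["main_mirror"],
    "main_mirror": [],
    "ldap": [],
}


def transitive_dependents(depends_on):
    """Map each node to the set of nodes that need it, directly or not."""
    # Invert the graph: for each node, who depends on it directly?
    reverse = defaultdict(set)
    for node, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(node)

    result = {}
    for node in depends_on:
        seen = set()
        stack = [node]
        # Depth-first walk over the inverted edges.
        while stack:
            current = stack.pop()
            for dependent in reverse[current]:
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        result[node] = seen
    return result


def rank_by_impact(depends_on):
    """Nodes sorted by number of affected dependents, most critical first."""
    deps = transitive_dependents(depends_on)
    return sorted(deps, key=lambda name: (-len(deps[name]), name))


if __name__ == "__main__":
    for name in rank_by_impact(DEPENDS_ON):
        print(name)
```

Nodes at the top of that list are the ones where extra availability (or
a duplicate in another datacenter) would pay off most; nodes with no
dependents are candidates for decoupling or for cheaper hosting.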
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mga_systems.dot
Type: application/octet-stream
Size: 467 bytes
Desc: not available
URL: </pipermail/mageia-sysadm/attachments/20120402/ebe1a596/attachment.obj>