[Mageia-sysadm] questions about our infrastructure setup & costs

Mon Apr 2 22:12:59 CEST 2012

On Monday, 2 April 2012 16:59:59 Michael Scherer wrote:
> On Monday 2 April 2012 at 15:23 +0200, Romain d'Alverny wrote:
> > Hi,
> > 
> > following last weekend's incident (and I know that there are already
> > some reflections and discussions about it), I'm posting the following
> > questions/needs, with my treasurer/board hat on; some of these may
> > already have answers, so please just link me to them.
> > 
> > It comes down to:
> >  - board needs to have an up-to-date view of how much our
> >    infrastructure costs, and would cost in different setups; and this,
> >    split in separate, functional chunks;
> 
> That's a rather odd question, since with your treasurer hat, you should
> have all infos, so I do not really see what we can answer to that.
> 
> The list of servers is in puppet :
> http://svnweb.mageia.org/adm/puppet/manifests/nodes/
> 
> and each has some modules assigned to it; take these as functional chunks.
> Unfortunately, the servers are all doing more than one task, so
> splitting them in functional chunks does not mean much.
> 
> >  - how can we change our setup to: 1) reduce the impact of having one
> >    chunk (here a faulty RJ45 in Marseille) shut down so much of the
> >    project for such a long time and
> 
> That's easy to explain.
> 
> You identify each single point of failure ( or SPOF ) and you make sure
> to remove the 'single' from SPOF by making it redundant.
> 
> For example, have 2 redundant power supplies. Have 2 redundant ldap servers
> ( we already do it ), have 2 redundant network connections.
> 
> Of course, the downside is that it costs twice the price ( at least ),
> and it is more complex.
> 
> Another solution is to try to reduce the MTTR ( mean time to repair ).
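
To put rough numbers on the two options above (redundancy versus faster
repair), here is a minimal sketch using the usual steady-state formulas:
availability = MTBF / (MTBF + MTTR), and 1 - (1 - a)^n for n independent
redundant copies. The MTBF and repair times are made-up values for
illustration, not measurements from our hosting, and independence between the
two copies is an assumption.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant(avail: float, copies: int) -> float:
    """Availability of `copies` independent components in parallel
    (the service is up as long as at least one copy is up)."""
    return 1.0 - (1.0 - avail) ** copies


# Hypothetical numbers, for illustration only.
single = availability(mtbf_hours=2000, mttr_hours=48)        # slow repair
faster_repair = availability(mtbf_hours=2000, mttr_hours=4)  # same link, quick fix
doubled = redundant(single, copies=2)                        # two slow-repair links

for label, value in [("single link, 48h repair", single),
                     ("single link, 4h repair", faster_repair),
                     ("two links, 48h repair", doubled)]:
    print(f"{label}: {value:.4%}")
```

With these particular inputs both options improve on the single slow-repair
link; which one wins depends entirely on the real failure and repair figures.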
> 
> >  2) have a quick, automatic report about this (not only for
> >     sysadmins, but for all users of our infrastructure).
> 
> Personally, I think that the current xymon reports are sufficient.

There is some room for enhancements, but that requires a bit more knowledge 
regarding real (physical) dependencies than I have at present.

I can maybe add some examples which don't require that knowledge, so others 
can add some more.

> > So here is how I would put it:
> >  A. could you, as sysadmin, draw (graphically) the dependencies
> >     between services, at a certain functional scale + their current
> >     location/host;
> >    * goal: have an overview of Mageia infrastructure, from the outside
> >      of sysadmin team (and yes, again, that is needed);
> >    * can we get it produced from the puppet conf? => the goal being
> >      for now to have such a visual overview first, not to have it
> >      automated.
> >    * the function blocks I can think of would be (but add/split/fix
> >      accordingly):
> >      + core for communication & doc:
> >        - user accounts (LDAP, identity.m.o)
> >        - communications (mailing-lists, mail server)
> >        - documentation (Wiki, Bugzilla)
> >        - a specific code repository (not related to the build system)
> >          for adm and/or one dedicated to organization (paperwork, reports,
> >          constitution, etc.)
> > 
> >      + Web hosts (www, blog, planet, forums, security notifs, etc.)
> >      + core for building the distribution
> >        - code repo
> >        - buildsystem
> >        - translation tools
> >        - other?
> >      + core for distribution software
> >        - primary mirror
> >      + other?
> >  
> >  B. based on these functional chunks, for each, could you:
> >   * document what is needed for them: storage, bandwidth, what it
> >     represents in full hardware today, what it should grow to. Goals are:
> >     - to have a clear idea of how much it represents/costs: today, or
> >       if we would move to other hosting solutions (paid or not, hardware or
> >       virtual);
> >     - to know how much we need to budget in security for these services;
> >     - to know what our options (and needs) are for migrating some
> >       services to an architecture or a paid solution that would improve
> >       their availability (and accessibility in case of failure).
> 

Note that moving or duplicating some services may cost significantly more 
than twice the current cost, taking into account 'in-network' or 'in-cloud' 
traffic costs vs. transit to another cloud/network.

> so basically, if I take the price from OVH ( as they have a lot of
> choices and are rather cheap ) :
> 
> - alamut would cost around 84 e per month at ovh.fr. That's the closest
> server we can find in their offer.
> 
> - valstar has many more processors ( 16 cores ) and less RAM, so let's
> evaluate this at 100e to 110e per month ( processors are more expensive
> than memory )
> 
> - ecosse would be around the same as alamut, but there is less RAM, so 70
> to 80 euros per month
> 
> - jonund has more processors, so let's also say around 100 to 110e per
> month.
> 
> - fiona would likely be 30 to 40 euros per month, given the price of
> Kimsufi ( cheaper servers from OVH )
> 
> - I cannot connect to sukuc from my bastion, so I do not know, but since
> that's a brand new server, let's say 80e per month.
> 
> As we cannot rent arm boards, let's assume that we will rent the space
> to host them.
> 
> Housing can be found in Paris for 300e :
> http://www.online.net/serveur-dedie/offre-dedibox-housing-dedirack.xhtml
> 
> since that's too much space for 2 arm boards, I found a cheaper
> alternative :
> https://www.ovh.com/fr/housing/location_baie_1_a_3U.xml
> 99e
> 
> That makes around 570 to 600 euros per month for replacing the free
> hosting in LO with paid servers, hosted by one of the cheapest
> providers in the world. And for this price, we have of course no SSD on
> the builder ( there are some offers with a small SSD, count 10 euros more
> per month and per server ) etc.
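
For anyone who wants to redo the sum with other prices, here is a small
sketch that adds up the per-server estimates quoted above. The figures are
copied from the list (a single quoted price is used as both low and high
bound); the dictionary layout and names are just for illustration.

```python
# Monthly estimates in euros as (low, high), copied from the list above.
estimates = {
    "alamut": (84, 84),
    "valstar": (100, 110),
    "ecosse": (70, 80),
    "jonund": (100, 110),
    "fiona": (30, 40),
    "sukuc": (80, 80),
    "arm housing (1 to 3U)": (99, 99),
}

low = sum(lo for lo, _ in estimates.values())
high = sum(hi for _, hi in estimates.values())
print(f"total: {low} to {high} euros per month")
print(f"per year: {12 * low} to {12 * high} euros")
```

This gives 563 to 603 euros per month, consistent with the rounded 570 to 600
above. The SSD option mentioned above is not included.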
> 
> If we want to just host them in Paris, I think it would be around 600 euros
> per month just for the housing, since we would use more than 3U ( I do
> not know exactly how much ).
> 
> People can feel free to redo the cost analysis on amazon EC2 or
> rackspace. I was not able to understand how much alamut would cost at
> rackspace ( nor even whether it is possible to have a server where we
> are in charge ), and amazon ec2 pricing is to hosting what java is to my
> abacus.
> 
> And to be complete, I also searched random hosters around the
> world :
> 
> I found this
> http://www.razorservers.com/solutions/dedicated-servers/pricing/
> so a server with the same spec as alamut is around 200$ for a more
> classic provider.
> 
> I found this
> http://www.server4you.com/root-server/server-details.php?products=3
> which would make 85$ ( since there is a setup fee for each month ).
> Server4you is more like OVH.
> 
> and several others where the price is more around 150$ than 100$.
> 
> And of course, most of them have metered network connections that would
> maybe not be suitable for something like valstar, which acts as a primary
> mirror. For reference, since the server was started :
> 
> RX bytes:453228974131 (422.1 GiB)
> TX bytes:9311461347504 (8.4 TiB)
> 
> Uptime is 60 days.
> That's around 4 TiB of transfer per month.
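
The monthly figure can be rechecked from the counters quoted above; a quick
sketch, assuming a 30-day month and that the counters cover exactly the 60
days of uptime:

```python
TX_BYTES = 9_311_461_347_504   # valstar TX counter quoted above
RX_BYTES = 453_228_974_131     # valstar RX counter quoted above
UPTIME_DAYS = 60
DAYS_PER_MONTH = 30            # assumption: a 30-day month

TIB = 2 ** 40
tx_per_month = TX_BYTES / UPTIME_DAYS * DAYS_PER_MONTH
rx_per_month = RX_BYTES / UPTIME_DAYS * DAYS_PER_MONTH
print(f"TX per month: {tx_per_month / TIB:.2f} TiB")
print(f"RX per month: {rx_per_month / TIB:.2f} TiB")

# Average outbound rate over the whole uptime, in megabits per second.
avg_mbit = TX_BYTES * 8 / (UPTIME_DAYS * 86400) / 1e6
print(f"average TX rate: {avg_mbit:.1f} Mbit/s")
```

The average rate is the more useful number when comparing against offers
that cap bandwidth rather than volume, though peaks will be well above the
average.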

How much of this is internal to the hosting provider?

> 
> That's for alamut, to compare :
> RX bytes:30792994686 (28.6 GiB)
> TX bytes:215624995862 (200.8 GiB)
> 
> While hosters often advertise "unlimited transfer", most do not really
> offer it; they use "unlimited" in the same way that phone providers do.
> So we need to be wary on this point if we want to go further in the cost
> analysis.
> 
> >  C. various questions:
> >   * could both documents above (A and B) be kept up to date as things
> >   change;
> 
> That depends on how they will be done, but I do not foresee someone
> volunteering for that, and since the puppet information is not sufficient
> to express that in an automated manner ( there is support for graphing
> deps between modules but not between servers ), I doubt it will be
> written soon.
> 
> Nagios does support some form of graphs, but we already have a
> working monitoring system, and there is more important stuff to do
> before changing it ( for example, making sure that the current one is
> read by people by reducing the amount of crap sent on the ml, and this
> would require someone fixing #4591, among others )

"depends" notation can be used to describe the dependencies between services 
(including some logical tests). I have some scripts that draw diagrams with 
near-real-time status (mostly network ones, e.g. links between manageable 
switches, sometimes termed 'weather maps'), which could be extended to do some 
automated diagrams based on the depends notation. I would actually like to 
have this at work too, but at present I really only have (work) time to work 
on 'official' projects that have project managers and budgets :-(.
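
As a rough idea of what such an automated diagram could look like, here is a
minimal sketch that turns a dependency table plus a status table into
Graphviz DOT text. The service names, the dependencies and the statuses are
all invented for illustration; extracting them from the real monitoring
configuration is the part that still needs the knowledge mentioned above.

```python
# Hypothetical service dependencies: service -> services it depends on.
# These names are illustrative, not the real Mageia layout.
DEPENDS = {
    "wiki": ["ldap", "database"],
    "bugzilla": ["ldap", "database", "mail"],
    "mailing-lists": ["ldap", "mail"],
    "buildsystem": ["ldap", "code-repo"],
}

# Hypothetical current status per service, as a monitoring system might
# report it; services without a known status are drawn in grey.
STATUS = {"ldap": "green", "database": "red"}


def to_dot(depends, status):
    """Return Graphviz DOT text with one node per service, coloured by
    status, and one edge from each service to each of its dependencies."""
    lines = ["digraph services {", "  rankdir=LR;"]
    nodes = sorted(set(depends) | {d for deps in depends.values() for d in deps})
    for node in nodes:
        color = status.get(node, "grey")
        lines.append(f'  "{node}" [style=filled, fillcolor={color}];')
    for service in sorted(depends):
        for dep in depends[service]:
            lines.append(f'  "{service}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(DEPENDS, STATUS)
print(dot_text)
```

The output can be rendered with Graphviz, e.g. by saving it to a file and
running "dot -Tpng" on it.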

> 
> >   * would it be possible for the systems hosting our services to
> >     have a prefix in their fqdn with the city/country they are located in?
> >     Goal: being more explicit about where a service is located at this
> >     time, so that a $ host www.mageia.org can answer me something like
> >     champagne.paris.fr.mageia.org - for instance. I don't mean to change
> >     all that, but I'm wondering about the opportunity.
> 
> What problem would it solve ?

Vs. what problems would it create. Note that many mail servers or anti-spam 
systems score servers negatively for mismatch in forward/reverse records, and 
email RFCs forbid pointing an MX at a CNAME.
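
The forward/reverse consistency that those mail systems look at can be
checked with a few lines; a sketch, using only the standard socket module.
The resolver functions are passed in as parameters so the check can be
exercised offline; the host name and addresses in the demonstration are
placeholders (192.0.2.0/24 is a documentation range), and
socket.gethostbyname_ex only covers IPv4.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if the PTR name of `ip` resolves back to `ip`
    (forward-confirmed reverse DNS), False on mismatch or lookup failure."""
    try:
        name, _aliases, _addrs = reverse(ip)
        _name, _aliases, addrs = forward(name)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return False
    return ip in addrs


# Offline demonstration with stand-in resolvers (hypothetical data).
def fake_reverse(ip):
    return ("champagne.paris.fr.example.org", [], [ip])


def fake_forward_match(name):
    return (name, [], ["192.0.2.10"])


def fake_forward_mismatch(name):
    return (name, [], ["192.0.2.99"])


print(forward_confirmed("192.0.2.10", fake_reverse, fake_forward_match))     # True
print(forward_confirmed("192.0.2.10", fake_reverse, fake_forward_mismatch))  # False
```

Calling forward_confirmed() with just an address uses the system resolver.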

DNS isn't supposed to be an ITIL-compliant CMDB ...

Network engineers (mistakenly in some cases, IMHO) put reverse DNS on router 
interfaces to be able to easily understand traceroutes. My opinion is that 
network segments (typically following VLANs) should be named, and that the 
network segment name should be used instead of the interface name for 
non-point-to-point links on routers.

> The grouping of servers is already visible on xymon.mageia.org :
> http://xymon.mageia.org/xymon/servers/servers.html
> 
> I pondered adding support for this in puppet, but in the end, I
> didn't find any good reason to do that for now ( it would help if we had
> enough servers, to set up ntp based on d-c, bastion server acl, etc, but
> we are not there yet ).
> 
> >   * what do you think about maintaining a separate blog (for
> >     opening/closing tickets + a global summary of what xymon provides
> >     already) under status.mageia.org (or maybe a different domain, for
> >     that matter)? (something similar to status.twitter.com)
> 
> Again, that solves none of our problems at all.
> 
> That solves a problem for a startup when they want to say "we care about
> our customers, we give access to some form of monitoring", but we
> already give full access to our monitoring, so that would be redundant.
> 
> Now, maybe the current access is not nice enough, and I am sure we can
> do some css work to enhance that, but as an aesthetic issue, I would not
> make this a priority.
> 
> And I have seen no one saying that the current blog is not enough. If
> people do not read it, they will not read another web site.

Can we decide on what problem(s) we are trying to resolve?

A) That we should be able to know when our network connection is down?

IMHO, the hosting provider should be monitoring this (our data center business 
does this for all managed hosting customers). If the hosting provider is not 
able to do that, then our options are:
1) Monitor servers in one location from servers in another, and ensure that 
they can inform us without requiring the servers in the first location to be 
available.
2) Monitor the network interfaces etc. from inside the network, but have a 
non-network notification system, such as an SMS modem or an old cell phone 
(the Nokia 5110 was the usual choice a few years ago). Alternatively, a 3G 
dongle for IP-based access could be considered.
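
A minimal sketch of option 1, as it could run from a machine at the other
site: try a TCP connection to each service and collect the ones that do not
answer. The host names and ports in TARGETS are placeholders, not our real
layout, and the notification step (mail through a relay outside the failed
site, SMS, etc.) is deliberately left out.

```python
import socket

# Hypothetical (host, port) pairs to probe from the *other* site.
TARGETS = [
    ("www.example.org", 80),
    ("ldap.example.org", 389),
    ("mail.example.org", 25),
]


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unroutable, or name lookup failed
        return False


def unreachable(targets, timeout=5.0):
    """Return the subset of (host, port) targets that did not answer."""
    return [(host, port) for host, port in targets
            if not is_reachable(host, port, timeout)]
```

Run unreachable(TARGETS) from cron on the remote machine and alert when the
returned list is non-empty. A single failed probe can be a transient glitch,
so in practice one would want a few consecutive failures before alerting.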

(note that so far, there is minimal cost involved) 

B) Ensure that a single network connection can fail without a whole site 
failing.

Cost: 2 manageable layer 3 switches with VRRP or HSRP (Cisco), an additional 
port from the provider, and their work to implement their side. Cisco 3560 is 
probably the best entry-level switch for this (but I haven't consulted a CCNP 
or CCIE yet ...).

C) That we should be able to continue development of the distribution when a 
site has failed?
The cost to implement this correctly/reliably is usually only justifiable for 
a real-time commerce system (a bank, a very high volume ecommerce site 
competing with Amazon in a specific region, a real-time billing system for a 
mobile phone operator).

D) That once we are aware of a failure, we are able to inform users of the 
systems of an outage.

Beware that over-documenting things (we are doing ISO 20000 and Business 
Continuity efforts at work) does not necessarily result in a better system; it 
just makes sure that you have killed lots of trees so that someone 
will be able to read (but not necessarily understand) what needs to be done to 
continue business in a disaster. The puppet-based config we have is superior 
to what many companies have ... what we may rather need to do (if we can find 
the time) is to hold some business continuity thought exercises.

Regards,
Buchan