[Mageia-sysadm] Databases down?

Mon May 30 13:53:47 CEST 2011

On Monday 30 May 2011 at 11:37 +0200, Romain d'Alverny wrote:
> On Mon, May 30, 2011 at 10:59, Thierry Vignaud
> <thierry.vignaud at gmail.com> wrote:
> > On 28 May 2011 22:02, Michael Scherer <misc at zarb.org> wrote:
> >>> Long term solutions would be :
> >>> - having a second mirrored ldap server in the other datacenter
> >> irrelevant to the sympa issue we are talking about, and as said in the
> >> blog, and on the others lists, this is already planned since some months
> >>
> >>> - as for svn, maybe having a second realtime mirrored server. the ip
> >>> would be a virtual one
> >>>   and we could switch on issues (hard to do)
> >> Or we can use git :)
> >
> > That lets us keep working in good conditions :-), but not publish
> > new versions or release new packages :-(
> 
> No, but at least it helps a little while others are working on fixing
> the servers.
> 
> On the same line, would it be a good idea to try to totally decouple
> the release production servers (stable releases RPMs, updates, ISOs,
> services) from development servers (build system, dev
> packages/updates/ISOs mirrors, etc.) so that if one fails, the other
> is not directly impacted?

It depends on how it is done.
First, as you said, that means twice the resources and twice the work.
It also means twice the risk of a problem, if we go with the idea of
having 1 BS (build system) in one datacenter and 1 BS in a second one
(i.e., when one fails, we lose half of the functionality).

I also fear that sooner or later we will start doing things on the
first one and not on the second, and end up with the two out of sync,
leading to various problems later (like more complex interactions
with both).

I also think we should make sure packagers work on stable releases
too, and to do that, we should make sure that stable does not differ
from cauldron from a process point of view.

So a better way would be to think about how we could make the system
more redundant, i.e., 2 BS instances in 2 DCs (datacenters), both
usable for stable and dev.

This is easy to do for the builders (quite trivial, just add more
servers).

The scheduler would require some state synchronisation, as firewalls
do, so this is also doable (using couchdb as a backend would do it for
us :p).

The signature part is like the builders: easy to replicate (except
that we have to copy the private key).

The main problems are the main mirror and svn (or another vcs, even if
git would likely ease that). Both are reference points that would be
hard to replicate in write mode. So we would have to do some kind of
continuous replication in both directions, and add some logic to youri
to manage failover.

For mirrors, we could do something with a custom sync system (something
that uses the NEVR of each rpm to sync them), and for svn, I guess
using a dvcs would make things much easier to do.
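
To make the NEVR idea a bit more concrete, here is a rough sketch in
Python, nothing more. It assumes each site can already produce a list
of "name-[epoch:]version-release" strings for the rpms it holds; that
input format and the names Nevr, parse_nevr and plan_sync are made up
for illustration and are not part of youri or of any existing tool. It
only computes what each side is missing, it does not copy anything or
deal with conflicting writes.

```python
from typing import NamedTuple, Set, Tuple


class Nevr(NamedTuple):
    """Name, epoch, version, release of one package (illustrative type)."""
    name: str
    epoch: int
    version: str
    release: str


def parse_nevr(text: str) -> Nevr:
    """Parse 'name-[epoch:]version-release' (assumed input format)."""
    # The name may itself contain dashes, so split from the right.
    name, version, release = text.rsplit("-", 2)
    epoch = 0  # a missing epoch is treated as 0
    if ":" in version:
        epoch_text, version = version.split(":", 1)
        epoch = int(epoch_text)
    return Nevr(name, epoch, version, release)


def plan_sync(site_a: Set[Nevr], site_b: Set[Nevr]) -> Tuple[Set[Nevr], Set[Nevr]]:
    """Return (packages to copy from A to B, packages to copy from B to A)."""
    return site_a - site_b, site_b - site_a


if __name__ == "__main__":
    a = {parse_nevr("foo-1.0-1.mga1"), parse_nevr("bar-1:2.0-3.mga1")}
    b = {parse_nevr("foo-1.0-1.mga1"), parse_nevr("baz-0.5-2.mga1")}
    to_b, to_a = plan_sync(a, b)
    print("copy A -> B:", sorted(to_b))
    print("copy B -> A:", sorted(to_a))
```

Comparing on NEVR rather than on file names or timestamps means the
two sites only need to agree on package identity, which is the point
of the suggestion above.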

Now, we can also say that such problems are exceptional, and had
valstar/jonund started automatically thanks to a proper BIOS setup, we
would have faced only 1 hour of downtime, which is fully bearable for a
community project (heck, the Debian security archive server got burnt
in 2002 and they survived :) ).

> That requires more money (we can provision that), more work and some
> more duplication and tricks to manage sync/push from dev to prod but
> this would also help avoid impacting users while the dev factory
> is down, and vice versa.

Since we already had enough trouble getting everything hosted in 1
datacenter, I am not sure we can really provision enough for the long
term :/

And users were not affected at all; the mirrors are there for that.

-- 
Michael Scherer