I wanted to check my LiveJournal friends feed: no luck. Down... Apparently there's a power failure affecting all the servers.
Our data center (Internap, the same one we've been at for many years)
lost all its power, including redundant backup power, for some unknown
reason (unknown to us, at least). We're currently
verifying the correct operation of our 100+ servers. Not fun. We're not
happy about this. Sorry... :-/ More details later.
Update #1, 7:35 pm PST:
We have power again, and we're working to assess the state of the
databases. The worst thing we could do right now is rush the site up in
an unreliable state. We're checking all the hardware and data, making
sure everything's consistent. Where it's not, we'll be restoring from
recent backups and replaying all the changes since that time, to get to
the current point in time, but in good shape. We'll be providing more
technical details later, for those curious, on the power failure (when
we learn more), the database details, and the recovery process. For
now, please be patient. We'll be working all weekend on this if we have
to.
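
To illustrate the "restore from backup, then replay changes" idea described above, here is a minimal sketch in Python. It is not LiveJournal's actual tooling: the in-memory dictionary standing in for a database, the `Change` record, and the `restore_and_replay` function are all hypothetical names made up for this example. The only point is the shape of the process: start from the backup, skip every change the backup already contains, and apply the rest in order.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    """One logged change (hypothetical format, for illustration only)."""
    position: int          # increasing position of the change in the log
    key: str
    value: Optional[str]   # None means "this key was deleted"


def restore_and_replay(
    backup: Dict[str, str],
    backup_position: int,
    change_log: Iterable[Change],
) -> Dict[str, str]:
    """Rebuild current state from a backup plus the changes made after it."""
    state = dict(backup)  # copy, so the backup itself stays untouched
    for change in sorted(change_log, key=lambda c: c.position):
        if change.position <= backup_position:
            continue  # already reflected in the backup
        if change.value is None:
            state.pop(change.key, None)
        else:
            state[change.key] = change.value
    return state


if __name__ == "__main__":
    backup = {"a": "1", "b": "2"}   # snapshot taken at log position 2
    log = [
        Change(1, "a", "1"),
        Change(2, "b", "2"),
        Change(3, "a", "9"),
        Change(4, "b", None),
        Change(5, "c", "3"),
    ]
    print(restore_and_replay(backup, 2, log))
```

The replay takes time proportional to the number of changes since the backup, which is why an older backup means a longer recovery.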
Update #2, 10:11 pm: So far so good. Things are
checking out, but we're being paranoid. A few annoying issues, but
nothing that's not fixable. We're going to be buying a bunch of
rack-mount UPS units on Monday so this doesn't happen again. In the
past we've always trusted Internap's insanely redundant power and UPS
systems, but now that this has happened to us twice, we realize the
first time wasn't a total freak coincidence. C'est la vie.
Update #3, 2:42 am:
We're starting to get tired, but all the hard stuff is done at least.
Unfortunately a couple of machines had lying hardware that reported writes
as committed to disk when they weren't, so InnoDB's durability wasn't so
durable (through no fault of InnoDB). We restored those machines from a recent backup and
are replaying the binlogs (database changes) from the point of backup
to present. That will take a couple of hours to run. We'll also be
replacing that hardware very shortly, or at least seeing if we can
find/fix the reason it misbehaved. The four of us have been at this
almost 12 hours, so we're going to take a bit of a break while the
binlogs replay... Again, our apologies for the downtime. This has
definitely been an experience.
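
For those curious what "commit to disk when asked" means in practice, here is a rough Python sketch of how a program asks for durability. The `durable_append` function and the file name are hypothetical and this is not how InnoDB is implemented. The call that matters is `os.fsync`, which asks the operating system to push the file's data to the storage device before returning. The catch described above is that a drive or controller with a volatile write cache can acknowledge that request before the data is actually on stable media, and software has no way to tell the difference until the power goes out.

```python
import os


def durable_append(path: str, record: bytes) -> None:
    """Append a record and ask the OS to flush it to the storage device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Request that the data reach the device. This is only as
        # trustworthy as the hardware's acknowledgement.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    durable_append("example.log", b"entry 1\n")  # hypothetical file name
```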