CS outages on the increase?

Published by on 03 March 2008 at 15:29 in Uncategorized. 13 Comments Tags: downtime, tech.

Is it just me, or have the number and duration of CS outages increased significantly over the last six weeks or so?

I log into CS several times a day, or at least I try to, to keep up to speed with my surf requests, but lately the system seems to be failing even more than usual. I was wondering: is disclosing uptime and downtime something the CS ‘management’ has to decide on, or is there another service/website/individual/organisation keeping track of that?

13 Responses to “CS outages on the increase?”

Diederik
4 years, 2 months ago.

Erm, from what I know, we didn’t have any “major” outages lately. I was going to set up a Nagios check from outside, and will post the uptime as seen from my own machine.
bentivogli
4 years, 2 months ago.

thanks diederik. Transparency above all!
tgoorden
4 years, 2 months ago.

“Downtime” doesn’t really count here because, in a certain way, the server(s) *do* keep on running (so web statistics wouldn’t say anything) even when there is a “site down”. However, I have the impression they created a new “feature” where the “site down” message shows up once in a while, probably to decrease the overall load.

In general, this has been a common approach in CS: when the servers are overloaded, they start cutting features and now entire pageviews arbitrarily from the site. None of this is communicated, except perhaps in the amb groups. Also, it is all very ad-hoc, since they never really know what’s causing the problems.

In general, this has more troublesome underlying causes (Doogy, correct me if I’m not up-to-date):
- The main bottleneck is and has always been the database, which Casey handles pretty much singlehandedly (without any formal training in the field) because of alleged security concerns. We had a MySQL employee (!) on board for a while (Morgan), but everyone probably knows what happened there. To put it simply: the database is not up-to-date (not enough experience to be able to migrate), the fail-over was absent and if they have it now, it doesn’t seem to work very well. If that doesn’t scare you enough, remember that Casey probably caused the site crash before Montreal by dropping the production database accidentally.
- CS uses a custom made and extremely crude caching mechanism. Overall, Casey hardly ever uses (proven) open source code, preferring his own “solutions”. This code doesn’t scale properly for a lot of features.
- There is no forward thinking in IT planning. Most of the upgrades/performance boosts are done as a last-minute measure, usually when the site is already deeply in trouble. Remember, they added new machines recently, but it already seems insufficient and the summer is still to come… There’s no resource planning (replaced by all-nighters and collectives), no deadlines (replaced by “we’ll announce it when it sort of works”), no priority list (replaced by whatever Casey feels like working on and “the mission”) and even bug tracking has gone into the crapper (replaced by CUQ).
- For obvious reasons, volunteers only stick around the IT team for a short while, which causes: constant code rewrites (“I don’t understand this, so I’ll replace it.”), dead code all over the place and no documentation or testing to speak of. Again, most of this is also due to a complete lack of specific IT management skills and techniques. OK, except nonviolent communication
Diederik
4 years, 2 months ago.

@ Thomas:

What I was planning to do (and will do) was to have Nagios do a check on the size of the page. With that, you can see that what should be there obviously is not there.
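For anyone curious what such a size check could look like, here is a minimal sketch in Python. It is not the actual CS monitoring setup. The URL and byte thresholds are whatever you pass in, and the names `classify_size`, `check_page` and `main` are made up for illustration. The only thing taken from Nagios itself is the plugin convention of one status line on stdout plus exit code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
#!/usr/bin/env python3
"""Nagios-style check: flag a page whose body is suspiciously small."""
import sys
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_size(size_bytes, warn_below, crit_below):
    """Map a response body size to a Nagios status code and message."""
    if size_bytes < crit_below:
        return CRITICAL, f"CRITICAL - page is only {size_bytes} bytes"
    if size_bytes < warn_below:
        return WARNING, f"WARNING - page is only {size_bytes} bytes"
    return OK, f"OK - page is {size_bytes} bytes"


def check_page(url, warn_below, crit_below, timeout=10):
    """Fetch url and classify the body size; any fetch error is CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
    except Exception as exc:  # HTTP errors, timeouts, DNS failures
        return CRITICAL, f"CRITICAL - fetch failed: {exc}"
    return classify_size(len(body), warn_below, crit_below)


def main(argv):
    """argv is [URL, WARN_BYTES, CRIT_BYTES]; returns the exit code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_page_size.py URL WARN_BYTES CRIT_BYTES")
        return UNKNOWN
    status, message = check_page(argv[0], int(argv[1]), int(argv[2]))
    print(message)
    return status
```

To use it as a plugin you would end the file with `sys.exit(main(sys.argv[1:]))`. The idea is that a stripped-down “site down” page is much smaller than the real page, so it trips the threshold even though the web server itself answers normally.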

Some other small things:

===

“Downtime” doesn’t really count here because, in a certain way, the server(s) *do* keep on running (so web statistics wouldn’t say anything) even when there is a “site down”.

From my point of view: site down is site down. E.g.: the services I want to have (spamming, checking others’ references, moaning in Brainstorm and such) are not available within reasonable time.

So no SLA-shitty things that you would find in contracts.

…probably knows what happened there. To put it simply: the database is not up-to-date (not enough experience to be able to migrate), the fail-over was absent and if they have it now, it doesn’t seem to

Currently, the database *IS* up to date. That is: the latest package from CentOS is installed. I’m not a DBA, so the tables, the data in the tables etc. are in an unknown state (but I can guess), but the version of the DB is up to date. To be more honest: every webserver is currently up-to-date.

There is no forward thinking in IT planning. Most of the upgrades/performance boosts are done as a last-minute measure, usually when the site is already deeply in trouble. Remember, they added new machines recently, but it already seems insufficient and the summer is still to come…

Here, too, I disagree with you. Previously I had my problems with couchsurfing (and still have them to some extent), but with Weston I feel that we have *MORE* forward-thinking. Not that we have reached the state we should be in, but I do feel that we are getting somewhere.

For obvious reasons, volunteers only stick around the IT team for a short while, which causes: constant code rewrites (“I don’t understand this, so I’ll replace it.”), dead code all over the place and no documentation or testing to speak of. Again, most of this is also due to a complete lack of specific IT management skills and techniques. OK, except nonviolent communication

No comment
Diederik
4 years, 2 months ago.

(OT: grrrrrrrrrr, can someone fix my post with the quotes? Thanks!)
tgoorden
4 years, 2 months ago.

@Diederik
I fixed your tags (you can use regular HTML).

I agree that downtime should really be measured by functionality, which is why something like Alexa is probably wildly inaccurate in this case. The ad-hoc cutting of features makes it even harder to measure.
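To make “measured by functionality” concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `Probe` record, the marker strings and the function names are hypothetical, not anything CS actually runs. A probe only counts as “up” if the server answered with 200, the page does not contain the “site down” message, and the text you expected to see is really there.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One recorded page fetch (hypothetical record format)."""
    status_code: int
    body: str


def is_functional(probe, required_marker, down_marker):
    """A probe is 'up' only if the page actually delivered the feature."""
    if probe.status_code != 200:
        return False
    if down_marker in probe.body:
        return False
    return required_marker in probe.body


def functional_availability(probes, required_marker, down_marker):
    """Fraction of probes that were functionally up (0.0 to 1.0)."""
    if not probes:
        raise ValueError("no probes recorded")
    up = sum(is_functional(p, required_marker, down_marker) for p in probes)
    return up / len(probes)
```

A server that happily returns 200 with a “site down” page scores zero here, which is exactly the case plain web statistics would miss.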

What version of MySQL is CS running? If it’s the “last stable version” of 4.1, then I wouldn’t say it’s up-to-date. Updating CentOS is *not* going to bump a 4.1 DB to 5.0. (However, it might have been done anyway, do you know?)

Weston does seem to be a very valuable asset to CS. However, he is not the CTO and I haven’t seen him communicate in any form of coordinator function. If he’s a lone coder, that is probably worse in the long term. Also, the site doesn’t seem to be *that* much more stable with him on board…
Diederik
4 years, 2 months ago.

Currently, we run version 5. More to come up
James
4 years, 2 months ago.

Just to be clear, since I think I’ve heard a reference to the ambs groups more than once here: the ambassadors do not get adequate communication on the issues. There are often threads full of posts of people asking for information about what is going on at the site. Then a distorted account of the situation comes in through Donna or someone else with direct communication from the LT and Admins. There is no direct communication with the ambs, although we occasionally do get thrown a few breadcrumbs such as the announcement that CS was applying for 501c3 status. Sometimes ambs get advance information but on the whole we are not told what is going on. I hope this isn’t too off-topic, I just saw an early post in this thread mentioning the ambs groups.
Diederik
4 years, 2 months ago.

James,

I guess that the lack of communication has to do with 2 things (please do mind this is *MY* view):

* Ignorance of users. Unfortunately, this is person-bound, and cannot be changed easily. People need to know why it is important to say what you are working on, why the server was down (db upgrade? rm -rf whatever?) etc;
* People that are willing to communicate do want to know what is going on, but lack information.
Callum
4 years, 2 months ago.

Purely anecdotally, I’ve started to notice that the site is down more frequently recently.

Having recently been learning about building scalable web sites, it saddens me how easily CS’s scale problems could be improved. So much could be done to improve site performance. Alas, as Thomas outlined, it’s unlikely we’ll see any significant changes any time soon.
Kasper Souren
4 years, 2 months ago.

@Diederik: Ignorance of users is not an issue. Users are ignorant by default. But when the communication of the tech team was much more open, the less ignorant users started functioning as a buffer, communicating information from geeks to ignorant users. I think the ignorance is on the side of the people who closed down all means of communication.
Diederik
4 years, 2 months ago.

Kasper,

I think you are right. What I meant was that users are being ignored.

@Callum: too sad that you think this. Although the development-part is not my thing, the sysadmin-part is doing real well…
zak0r
4 years, 1 month ago.

@James
your information is appreciated and spot on. keep in mind that most of the “dissenting pirates” are actually former ambs and/or developers and know the flipside. indeed as you say, the ambs usually know little more than the regular user.
the difference is usually that ambs are the cheerful pro active people who do free marketing for cs, without being properly appreciated, valued and integrated by the “core” of cs, aka the burningman bunch

Comments are currently closed.
