When bad system design leads to pain...

Submitted by gwolf on Thu, 05/17/2007 - 10:22
A long time ago, I wrote the system that still manages the Cuerpo Académico Historia del Presente group in the Universidad Pedagógica Nacional. Yes, I'm happy with a good portion of my project, which took me over a year of work... But I must admit to a fair deal of shame as well.
Of course, the shame comes from not properly understanding the domain data and the information volume my system would be working with - and from coming up with a stupid way to implement searches. I won't go into too much detail because, even if you had access to the full search facility in the system (no, it's not available to the general public), I would not like a swarm of curious people to make last week's events come back... Anyway, the group works by filling in tens or hundreds of articles in the system daily, and running some interesting search sessions every couple of months.
I knew the performance problem was caused by an inefficient searching mechanism (specifically, category exclusion is the prime killer). I knew loadavg jumped through the roof, and memory usage did so as well... But it was not until some weeks ago, when we installed the mighty Munin on the machines at UPN, that we got this jewel - thanks, Victor, for putting the graphics somewhere they can be shown! ;-)
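The post doesn't show the actual queries, so purely as an illustration (every name and data layout below is hypothetical, not taken from the real system), here is a sketch of why naive category exclusion can murder a server: re-scanning the whole corpus and re-checking categories on every search, versus a one-time inverted index that turns the same exclusion into cheap set arithmetic.

```python
# Hypothetical sketch - names, corpus size and category layout are invented
# for illustration only; this is not the real system's code.
from collections import defaultdict

articles = [
    {"id": i, "categories": {i % 7, i % 11}}  # toy corpus of 100k articles
    for i in range(100_000)
]

def search_naive(excluded):
    """Scan every article and re-check its categories on every search."""
    return [a["id"] for a in articles
            if not (a["categories"] & excluded)]

# One-time preprocessing: an inverted index from category -> article ids.
by_category = defaultdict(set)
for a in articles:
    for c in a["categories"]:
        by_category[c].add(a["id"])

all_ids = {a["id"] for a in articles}

def search_indexed(excluded):
    """Subtract the union of the excluded categories' postings from the corpus."""
    hit = set().union(*(by_category[c] for c in excluded)) if excluded else set()
    return all_ids - hit

# Both approaches return the same result set; only the cost differs.
assert set(search_naive({3, 5})) == search_indexed({3, 5})
```

The naive version does work proportional to the corpus on every single search; the indexed one pays that cost once and answers each search with a handful of set operations - roughly the difference between a search that a couple of users per month can run quietly and one that pins the CPU.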
So... How much does memory usage increase during searches?

Whoa. The system has 640MB of real RAM, plus 1GB of swap. Don't ask me how the hell it reports using ~2GB of swap - but still... And how is our load average?

Have you ever seen a (single-CPU, Pentium 4 1.7GHz) Linux system with a loadavg of 80?! For those who don't know, loadavg tells you, roughly, how many jobs are waiting to be scheduled by the CPU. 1 means that all of the CPU's time during the measured timeframe was used (and, on single-core systems, it's the optimal usage level). On this machine, things start getting uncomfortable at 6 or 7. I had never before seen values even half this large.
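The 1-, 5- and 15-minute averages the graphs show can be read straight from the kernel; on Unix-like systems, Python's `os.getloadavg()` wraps the same numbers that uptime(1) and Munin report. A minimal sketch (the "compare against core count" rule of thumb is my addition, not something from the post):

```python
# Read the system load averages - the same values Munin was graphing.
# os.getloadavg() is only available on Unix-like systems.
import os

one, five, fifteen = os.getloadavg()
print(f"loadavg: {one:.2f} (1 min), {five:.2f} (5 min), {fifteen:.2f} (15 min)")

# Rough rule of thumb: a sustained 1-minute average well above the number
# of CPU cores means runnable jobs are queueing for CPU time.
if one > (os.cpu_count() or 1):
    print("more runnable jobs than cores - interactive work will feel it")
```

On the machine in the post, a single core with a loadavg of 80 means roughly eighty jobs competing for one CPU's worth of time.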
Sigh... Well, in my defense, I must say I've been warning them about this problem for over two years. My contract with them has long since ended - I've repeatedly recommended that they hire somebody to fix it. So far, they have not.
Comment by Joe Buck

Re: When bad system design leads to pain...

No, 1 is not the optimal usage level on a single-core machine. What you are forgetting is that the number is a time average. If the average is 1, that means sometimes it is above 1 (and a process is waiting for its turn), and sometimes it is below 1 (meaning that the CPU is idle). You get a higher throughput if it's a higher number. If it's ridiculously high, the CPU spends as much time thrashing as doing actual work, but if it's in the range of 2-2.5, you're in decent shape (unless you're optimizing for latency rather than throughput and care about only one job).
Comment by garaged

Re: When bad system design leads to pain...

The system was lucky to survive the previous peak! That's good :) I had the sad experience of seeing the main firewall simply panic :(, and my Munin graphs don't have anything useful, nor do the logs :( Hopefully it won't crash again for a few months, so I won't have to worry much about it.