The job board blues

The last time I talked to my good friend Rick Myers over at TalentZoo.com he mentioned that it would be fun to hear more about the obstacles we have faced (without giving the impression that we are losers, lol). So here we go.

"Enterprise" class my ***!
Outsource every #!@*&% thing you can...

In some previous posts I have mentioned that we have been plagued by some technology problems this year. One of the things 'driving' us crazy has been Seagate hard drive failures - and not on their "enterprise" class drives, which according to user reviews seem to have even higher initial failure rates.

Fortunately, these drive failures have not affected LatPro.com, just our Drupal-based websites like this blog. So I thought you might enjoy reading this conversation I had this morning on Yahoo Messenger with our CTO:

eric (6:16:19 PM): Still working on the server problem?
paul (7:48:45 AM): Yes - I'm working on it. Remember the Seagate drives we used for these two servers? Well .... surprise, surprise ... results are the same.
eric (7:18:49 AM): can you give me more details about those servers?
eric (7:18:53 AM): I'm really really steamed
eric (7:19:11 AM): seems that every time we are about to make progress we have more setbacks
paul (7:19:11 AM): I'm running tests and collecting data now.
eric (7:19:45 AM): not just the money we pay you and Danny to spend weeks fighting this stuff, but the opportunity cost of not developing our projects...
eric (7:19:52 AM): it is very painful
eric (7:20:40 AM): is there anything Seagate could do for us?
eric (7:21:05 AM): How much do those hard drives cost?
paul (7:21:15 AM): I'm sure that they'll be happy to replace the failed drives. But that doesn't make me happy.
paul (7:21:48 AM): The drives are getting cheaper and cheaper .... these were something on the order of $85 a piece.
eric (7:21:58 AM): Oh my god
eric (7:22:34 AM): What you think Amazon.com uses? Do you think they only spend $85 a drive?
paul (7:23:15 AM): E - they have a totally different sort of setup, issues and resources.
paul (7:23:24 AM): stand by for a moment.
paul (7:24:21 AM): Shipped from NJ - Tracking #: 1Z039AF20302651405
  Qty: 21
  Product: Seagate Barracuda 7200.10 ST3320620AS (Perpendicular Recording Technology) 320GB 7200 RPM SATA 3.0Gb/s Hard Drive - OEM
  Item #: N82E16822148140
  Limited 30-Day Return Policy. This item is serviced by Seagate. Please call 800-732-4283 for service.
  Price: $1,679.79
eric (7:25:51 AM): So maybe we want a refund? Do you have in mind some higher-quality drive to replace them with?
paul (7:25:58 AM): Cost no object? Try these.
eric (7:26:37 AM): How many do we need?
paul (7:26:45 AM): See, E --- here's the problem - check the specs on the Seagates for MTBF --- they simply should NOT be failing in the numbers that they are.
paul (7:27:07 AM): 20 pieces plus spares, almost $11,000. That's why I do not use those drives.
eric (7:27:17 AM): I see
paul (7:27:42 AM): Let me find you the MTBF rating on the Seagates ...
eric (7:29:36 AM): Did you see the review on the Fujitsu's: "Cons: We placed 4 of these in a new server hosting VMWare ESX, configured as RAID 10. Within 60 days, one of the drives failed. Hopefully, we won't have problems on the warranty replacement - but for this type of drive, I expected more reliability. Other Thoughts: Always configure with a hot-spare so you have time to get a replacement."
paul (7:29:59 AM): Actually, on these I think the decision was made to stay with the non-enterprise version because the user reviews on the enterprise version were so much WORSE for initial failures.
eric (7:30:38 AM): Do you mean the drives that failed were not enterprise?
paul (7:31:06 AM): 700,000 HOURS MTBF, Annualized failure rate of 0.34% according to their data sheet.
eric (7:31:39 AM): Could there be something about our installation causing faster wear and tear?
paul (7:32:18 AM): Not from 700,000 hours to 6700. The temps are just fine (they are monitored and recorded at the drive).
eric (7:32:42 AM): So we probably got a bad batch
paul (7:33:56 AM): I'm checking to see if there were any initial dead ones in this build.
paul (7:35:27 AM): I replaced one at the end of January.
paul (7:36:43 AM): From the SMART data I've got, it looks like we'll want to replace about three more. One totally dead (making the server hang), and two that don't look so hot.
paul (7:37:54 AM): I think that equals an annualized failure rate of about 30%
paul (7:38:54 AM): When I had them power-cycle the system, the tech told me we had an amber light on one of the power supplies, so I'm going to get a replacement for that too.
eric (7:39:06 AM): What do you think we should do? Try to get Seagate to replace them all with a more expensive Seagate product at no cost?
paul (7:39:35 AM): Hell no! Do you have any idea of how disruptive that little job would be????
paul (7:41:36 AM): Ideally, we should have gotten notified of the drive failure, and the system *should* have kept running - but the O.S. hung up in the disk driver, and it all went south.
eric (7:42:45 AM): Is drive failure something that can be tested? So that this can be debugged before it happens?
paul (7:43:04 AM): If I had used a hardware RAID controller instead of software RAID, it might not have done that. Opinions vary - you'll find just as many people saying that the hardware controllers cause more trouble than they solve.
paul (7:43:43 AM): Yes and no. There is more monitoring that we can (and will shortly) be doing.
eric (7:43:58 AM): Well, this is something we need a solution to.
paul (7:44:01 AM): But when stuff doesn't do what it's supposed to do ...
eric (7:44:13 AM): That's why I asked about testing.
paul (7:46:00 AM): I'm manually running the tests now. We'll have them on a scheduled automatic system shortly. I do not know if it would have given us enough info for us to know that it was time to proactively replace the drive, or not.
eric (7:47:15 AM): That sounds like a good idea. What do you want to replace those three drives with?
eric (7:47:46 AM): If we're only replacing one or a couple at a time, maybe we should go with something much more expensive like the Fujitsu...
paul (7:47:52 AM): As crazy as it sounds, the same model ...
paul (7:49:45 AM): Read the third review. The problem in a nutshell ...
paul (7:49:47 AM): http://www.newegg.com/Product/Product.aspx?Item=N82E16822148309
eric (7:51:10 AM): So we are locked in?
paul (7:52:29 AM): Not really, but having them match is not a bad thing. Switching to that SAS drive might not work ... it's a radically different animal. I just want to illustrate what I would have used if cost was not an object.
paul (7:52:47 AM): The drives are part of arrays.
eric (7:53:41 AM): So you don't know what to do yet? Is there somebody you could talk to about this? How about Jon...
paul (8:07:30 AM): My plan is to replace the drives with new ones. I could also add a hardware RAID controller - I do have one here that I could use.
paul (8:08:01 AM): Although I need to check for compatibility.
eric (8:10:25 AM): you don't think talking to Jon could be helpful?
eric (8:11:35 AM): Do we want to use something just because you have it around? I just want to make sure you are not setting us up for some other unforeseen crisis by not thinking through all the possibilities...
paul (8:15:33 AM): Jon doesn't know very much about hardware. He usually calls me.
eric (8:17:49 AM): who does know?
paul (8:17:54 AM): The most important attribute is whether the controller is properly supported by the Linux kernel --- and how long it has been (= driver maturity). I know the 3ware controller I have works, works with Supermicro motherboards and works with Linux ---- because we're already using them, and I have used them here too.
paul (8:18:31 AM): Google is one resource ... Newegg's reviews another.
paul (8:24:29 AM): Truly, there doesn't seem to be a solution, other than having lots of spares.
paul (8:25:54 AM): I could very easily have vh6 back up and running right now. The problem with that is that RAID minus one drive cannot tolerate a second failure.
paul (8:26:38 AM): And with the other 3 drives that look shaky, it just wouldn't be prudent.
paul (8:28:00 AM): But by manually removing (in a virtual sense) the bad drive from the array we'd be up and running without data loss.
paul (8:29:31 AM): Now you might want to bitch at Dani a bit over the fact that his MySQL setup was only doing replication and load-balancing, but did not have any provisions for fail-over when his master DB went down.
eric (8:30:25 AM): Okay, I will get out of your hair. Just consider that without reliable hardware, this company will not reach the next level. If we keep having failures at this pace.... so please if you need to take time to just think or meditate or talk to someone about it, please do whatever it takes
paul (8:33:03 AM): All ten drives in vh7 look ok, based on recorded hard error counts. That's the one that had the failure on one drive back in January.
paul (8:33:35 AM): Trying to complete self-test on them, but with the server busy, it seems to take forever.
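
For anyone who wants to check the failure-rate arithmetic Paul quotes above, here is a rough Python sketch. It assumes a constant failure rate (the usual exponential model behind MTBF figures), and the 2,400 power-on hours per year is my assumption about how desktop-drive datasheets of that era arrived at their AFR numbers. The function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # a server drive runs 24x7


def afr_from_mtbf(mtbf_hours, power_on_hours_per_year=HOURS_PER_YEAR):
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-power_on_hours_per_year / mtbf_hours)


def mtbf_from_afr(afr, power_on_hours_per_year=HOURS_PER_YEAR):
    """MTBF in hours implied by an annualized failure rate (inverse of the above)."""
    return -power_on_hours_per_year / math.log(1.0 - afr)


datasheet_mtbf = 700_000  # hours, the figure quoted from the Seagate data sheet

# Assumption: the datasheet AFR is based on about 2,400 power-on hours per year.
desktop_afr = afr_from_mtbf(datasheet_mtbf, 2400)

# The same MTBF for a drive that never powers down.
server_afr = afr_from_mtbf(datasheet_mtbf)

# The MTBF that a 30% annualized failure rate would imply at 24x7.
implied_mtbf = mtbf_from_afr(0.30)

print(f"Datasheet AFR at 2,400 h/year: {desktop_afr:.2%}")
print(f"Datasheet AFR at 24x7:         {server_afr:.2%}")
print(f"MTBF implied by a 30% AFR:     {implied_mtbf:,.0f} hours")
```

Under those assumptions, the data sheet's 0.34% only holds for a drive that is powered on about a quarter of the time. The same 700,000-hour MTBF run around the clock works out to a bit over 1% a year, which is still nowhere near the roughly 30% Paul estimates for our drives.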

So did anyone read this far?  Are you feeling better now about your own issues? Have some technical advice you can share?



Eric -- I trust that the photo wasn't a peek into the future of your hairline, but if issues like this continue then that future might not be too far away. In all seriousness, I feel your pain. We've definitely had serious technology problems, as has any web site that generates the kind of traffic that LatPro does.

** es **  thank you Steven! I am a big fan of the Dalai Lama and usually stay on a pretty even keel myself.  Sometimes when my temperature rises I try to meditate away the anger and thinking about the 'angry meditator' gave me a laugh. The Dalai Lama has a great sense of humor, so I'm sure he won't mind, lol. It's true this stuff doesn't make us any younger...

July 24, 2008 - 2:38pm
