Sunday, November 28, 2010

to RAID or not RAID.... what was the question again?

Hi everyone,
In my basement I've got a nice virtualized setup. A single VMware ESXi 4 server boots from a 4 GB pen drive and then connects to its storage via iSCSI.
Supporting the connections is an SMC gigabit switch.
Supporting the data are 2 QNAP NAS units: a 6-drive 639 and a 4-drive 459 Pro.
Each one is fully populated with Samsung and Western Digital hard drives, mostly 2 TB.
The 639 is configured with a RAID5 of 4x 2 TB 7200 rpm, 32 MB cache Samsung drives, plus 2 separate 1 TB drives.
The 459 is configured with a RAID5 of 4x 2 TB 7200 rpm, 32 MB cache Samsung drives.
On top of this, the 639 has:
 - 1x 320 GB USB
 - 3x 1.5 TB USB
 - 1x 750 GB USB
 - 1x 1.5 TB eSATA
 - 1x 2 TB eSATA
The 459 has:
 - 4x 1 TB USB
 - 1x 1.5 TB eSATA
 - 1x 2 TB eSATA

The event:
A power outage lasted longer than the 10-minute UPS threshold, then a second one hit while the NAS was booting up again. The Portuguese power company is probably the worst in existence. I pay 200€/month for my house power supply (yes, 200€/month) and I still get frequent power outages, and every time I file a complaint... it's anyone's fault but theirs.

The consequences:
The 459 NAS held out fine: it consumes less power, so it never shut down between outages.
The 639 came back up after 2 abrupt shutdowns with a 1-disk failure. The hard drive itself was fine, but the array just didn't rebuild. I then replaced the hard drive, and the NAS declared the new one a spare, switched the RAID to degraded mode and made everything read-only.

The solution:
With the NAS in degraded, read-only mode, all I could do was back up my data and rebuild the RAID from scratch.

The problem:
1 - 4.7 TB of data is not an easy backup.
2 - Most of it was in an iSCSI virtual volume.
3 - The data inside the virtual volume is formatted as a VMFS partition.
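
To put a number on "not an easy backup", here is a rough Python sketch of the arithmetic. The throughput figures are my own ballpark assumptions for sustained copy speeds, not measurements from this setup, and the function names are made up for the example.

```python
import math

# Assumed sustained copy speeds in MB/s (ballpark guesses, not measured here).
ASSUMED_THROUGHPUT_MB_S = {
    "USB 2.0 disk": 30,
    "eSATA or gigabit LAN": 100,
}


def backup_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy data_tb terabytes at a sustained speed in MB/s."""
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return data_mb / throughput_mb_s / 3600


def drives_needed(data_tb: float, drive_tb: float) -> int:
    """Number of drives of drive_tb each needed to hold data_tb (no overhead)."""
    return math.ceil(data_tb / drive_tb)


if __name__ == "__main__":
    data_tb = 4.7
    for label, speed in ASSUMED_THROUGHPUT_MB_S.items():
        print(f"{label}: about {backup_hours(data_tb, speed):.0f} hours")
    print(f"2 TB drives needed: {drives_needed(data_tb, 2.0)}")
```

Even with generous assumptions the copy runs for many hours in each direction, and it has to be done twice: once off the NAS and once back onto the rebuilt RAID.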

So how do you solve this:
1 - Go buy 2x 2 TB USB hard drives.
2 - Back up everything onto the existing USB hard drives.
3 - Clear the RAID, test the drives independently, and replace any drive that actually has problems (some QNAP reports aren't fully accurate about drive problems).
4 - Re-create the RAID.
5 - Move the iSCSI VMFS drive contents back to the now clean RAID.
6 - The most important part: set up full, constant replicas between the two devices.

Did QNAP support help? Not quite. The first email was the obvious... back up your data and do it all over again. After a twisted reply from my side ("is this what you think of how a RAID works? a drive fails, the new one doesn't rebuild and it's backup-and-do-it-all-over-again time? should I start looking for an alternative solution and return my QNAPs?") the answer changed: they asked for a link so they could remotely access and analyse the problem. That happened, however, 1 week after the event... which is anything but good support. The conclusion was that I had to back up the NAS and redo everything... jeez, thanks QNAP, that's really pro!

My Setup Today:
NAS01                                   NAS02
  RAID5 (SATA2,3,4,5)                     RAID5 (SATA1,2,3,4)
      VMachinesDepot1            <---->       ReplicaOfVmachinesDepot1
      ReplicaOfVmachinesDepot2   <---->       VMachinesDepot2
      LocalWorkShare             <---->       ReplicaOfLocalWorkShare
      ReplicaOfVPNWork           <---->       VPNWork
      FamilyMedia                <---->       ReplicaOfFamilyMedia
  eSATA1                                  eSATA1
      TVSeries                                Movies
  USB1                                    USB1
      BackupsFromClients                      BackupsFromServers
  USB2                                    USB2
      KidsMovies                              MusicAndVideos
  eSATA2                                  eSATA2
      Downloads                               Temp
  USB3                                    USB3
      Software                                Library
  SATADisk1                               USB4
      CompanyWork                <---->       ReplicaCompanyWork

So in conclusion:
 Use the RAID5 for safety... and don't trust it, so replicate to the other RAID.
 So in short... today's "hardware RAID" is actually a software RAID with a specific hardware appliance and controlled software packaging and configuration. QNAP (and the other players like it) can sell systems that look professional, but in truth they are no different from a standard PC with a SATA card, running Linux and mdadm, with a simple interface and a fancy hot-plug drawer system.
 Support is as "pro" as the standard product is.
Not a bad product, especially price-wise, and that's the line you should think of. It's a great solution, not because it costs 1/4 or less, but rather because that 1/4 price tag allows you to buy twice as much as you intended and have full redundancy. So in truth it's NOT as cheap as it claims to be, but taking some simple precautions it can be 1/2 of what a pro system would cost.

Saturday, October 30, 2010

Nagios alarm reporting via SMS

Hi all.
One of the recent projects was the configuration of a NAGIOS alarm system.
Each and every I.T. manager (like any normal human) loves a good night of relaxed sleep.
There is only one thing worse than a night cut in half because a server decided to crash and someone wakes you up in agony to "go-fix-it-pleeeeeaase": walking into your company on a relaxed Monday morning and entering the "chaos zone" of every user finding out the servers have crashed.

This of course is something I've never experienced in my life. There are several reasons for that:
 1 - I use Linux and Unix... so this alone gets the blue screens, exploits, memory leaks and random reboots out of the equation.
 2 - I virtualize everything, from servers to vital workstations, to storage and networking. And this is not just about eliminating the physical layer of the thing... it's also about configuring resource watchdog routines and procedures to move resources on demand.
 3 - I create multi routes and multi paths.
 4 - I've always used SNMP and solutions like HP OpenView and the beautiful NAGIOS

Now OpenView is common between most I.T.Managers, but at a cost... a HUGE cost.
Most people are so microsoftized that they see nothing other than that poor SMS server (Microsoft's Systems Management Server, not the text messages)... and then abandon it for OpenView or IBM's Tivoli, paying the price.
Don't get me wrong... I'm not saying HP's OpenView or IBM's Tivoli are overpriced for what they do. In time, the investment pays for itself... but still, I'd rather have that money spent in more vital areas.

This is where Nagios comes in. Nagios is open source and very flexible. It's so flexible that people usually say it's a bitch to configure.
This statement is not entirely true. There is a trick to configuring Nagios fast and without slashing your wrists: make a lab using an already installed and configured appliance... and then use that knowledge to configure one from scratch.

But back to the original title:
In the company where I'm working today, the I.T. manager uses Nagios to monitor the servers and their services. The problem is finding out what went wrong when he is not looking at the console and its warnings.
So he set up an old Nokia phone and plugged it in via USB.
We installed gsmsendsms and configured it as 2 Nagios commands (user and service commands).
There is no black magic to this, and if you Google, you'll discover hundreds of articles explaining how to do this. This one HERE is a good example.

The problem most posts don't explain is that Nagios will execute gsmsendsms as the 'nagios' user, and gsmsendsms needs to send data directly to the USB device port, which requires sudo rights.
If that's the case, a look at the Nagios log will make it easy to see that Nagios is unable to open the device to send data to.

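The solution? Simple.
1st, configure the Nagios command files and make sure you type "sudo " before the gsmsendsms command.
2nd, allow the nagios user to run it without a password (visudo is the safer way to edit this file, since it checks the syntax before saving):
sudo nano /etc/sudoers
Then add the following line (adjust the path to wherever gsmsendsms is installed on your system; "which gsmsendsms" will tell you):
nagios ALL=(root) NOPASSWD: /etc/bin/gsmsendsms
Write out and exit.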
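
If you prefer to keep the Nagios command definition short, the sudo call can live in a small wrapper script. This is only a sketch in Python, not what we actually deployed. The device path /dev/ttyUSB0, the function names and the phone number are assumptions for the example. The -d (device) option is how I remember gsmsendsms from gsmlib working, so check its man page on your system.

```python
import subprocess

SMS_MAX_CHARS = 160  # limit of a single-part GSM text message


def build_sms_command(number, text, device="/dev/ttyUSB0", use_sudo=True):
    """Return the argument list that sends text to number through gsmsendsms."""
    # "sudo -n" never prompts: it fails at once if the sudoers rule is missing.
    argv = ["sudo", "-n"] if use_sudo else []
    # Truncate so a long Nagios message still fits in one SMS.
    argv += ["gsmsendsms", "-d", device, number, text[:SMS_MAX_CHARS]]
    return argv


def send_sms(number, text, **kwargs):
    """Run the command; raises CalledProcessError if gsmsendsms fails."""
    subprocess.run(build_sms_command(number, text, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical number and message, printed instead of sent.
    print(build_sms_command("+351000000000", "PROBLEM: web01 HTTP is CRITICAL"))
```

Passing the arguments as a list (no shell) means odd characters in the alarm text cannot break the command line.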

That's it! Nagios will start sending SMS messages whenever the configured alarms fire. All you have to do now is configure those alarms and top up that mobile phone's SIM card with some money every once in a while.

Thursday, October 28, 2010

Project Doesn't do Montecarlo! Is that so?

I've been teaching Project Management for over 10 years now. I've always tried to improve and give my students more than they would get in a normal course.
Evidently I had to push Microsoft Project to new boundaries.
A lot of people asked me "why don't you put Project aside and use something better?". The answer is simple. Most Portuguese companies use Microsoft... and most of them are just starting to apply project management. Not only do they not need to push Project to its absolute limits yet, they also already have it licensed in their Select and TechNet packages... so it's the most commonly used software.
And you know what? Price for features... Project is actually an excellent product and very hard to beat.
So when it comes to risk management... how do I take the theory of Macro Risk Management, Micro Risk Management and Monte Carlo analysis into practice?

The same problem popped up years ago when I was the Risk Management speaker at an international conference on project management.
Back then I crawled the web and discovered a Monte Carlo implementation (a very simple routine) using VBA and Project, from Jack Dahlgren, called the Black Jack. I copied the code into Project's VBA and immediately started to make bug corrections and improvements (several improvements).

As a result I now have a Monte Carlo simulator working in Project and use it in every Risk Management module. It's a blast. Students love it and understand the brilliance of project risk management. It makes me very proud because they can now put these teachings into practice in real life... maybe in a few years' time we'll start seeing true project management techniques in practice at Portuguese companies.
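
For readers who have never seen the technique, here is the general idea as a minimal Python sketch. It is not the VBA code that runs inside Project. The three tasks, their three-point estimates and the choice of a triangular distribution are assumptions made up for the example, and the tasks are simply added up as one sequential chain.

```python
import random
import statistics

# Hypothetical sequential tasks: (optimistic, most likely, pessimistic) in days.
TASKS = {
    "Design": (3, 5, 10),
    "Build": (8, 12, 20),
    "Test": (2, 4, 9),
}


def simulate_once(rng, tasks):
    """Draw one duration per task and return the total project duration."""
    # random.triangular takes its arguments as (low, high, mode).
    return sum(rng.triangular(low, high, mode)
               for low, mode, high in tasks.values())


def run_simulation(tasks, runs=10_000, seed=42):
    """Return the sorted total durations of `runs` simulated projects."""
    rng = random.Random(seed)
    return sorted(simulate_once(rng, tasks) for _ in range(runs))


def percentile(sorted_values, p):
    """Value below which roughly p percent of the simulated runs fall."""
    index = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    totals = run_simulation(TASKS)
    print(f"mean duration: {statistics.mean(totals):.1f} days")
    for p in (50, 80, 95):
        print(f"P{p}: {percentile(totals, p):.1f} days")
```

The interesting output is not the mean but the percentiles: they let you tell a sponsor "there is an 80% chance we finish within N days" instead of quoting a single date.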

There is also something interesting. I'm porting this code to .NET into a closed source solution (a lot beefier and with better features). If I had used another tool, most of my work would not be portable with ease, making the development effort pointless.

The Demo Videos:
Macro Risk Management using Montecarlo
Micro Risk Management using Montecarlo

The Windows7 MultiTouch

About 2 years ago, I was asked to build a prototype that could use multitouch and gestures applied to a 3D rendered model of an object.

The prototype was meant for a museum, enabling visitors to use conventional multitouch devices (HP, for instance) and browse the museum's objects freely without compromising the "object" itself.
The first object (one of the first coins) was digitized in very high resolution and then mapped to a 3D object in 3DStudioMAX by Illusive. They are not programmers, so the first solution they found was to use the QUEST3D engine and make a demo that was mouse scrollable.

I was then called into action. I was working at PSIEngine at the time, so I pulled the project in.
Either I created something from scratch, implementing some kind of 3D engine on top of the Windows 7 multitouch layer and the Windows 7 beta... or I would have to use Quest 3D and make something out of it.

Risk Management:
Windows 7 is still beta!
Windows 7 touch and gestures are even worse... the examples in the downloadable alpha versions are not written (just the object structure) and the videos from the Microsoft evangelists are very shallow.
Quest 3D uses a Channel object structure... you drop the DLL into quest, you drag the object to the set and use the connectors... but the SDK is built for Visual Studio 6 and C++ (Quest needs speed so the running away from .NOT makes perfect sense.)
The Windows / touch objects are built for the .NOT framework.

Use Visual Studio 2005 and ignore all the errors it fires in Windows 7 Beta, create a compatibility project using Quest's C++ DLL example... then ignore the Windows gestures, grab the 2 touch points and create your own gesture manipulation routine... compile, debug, wrap it up and make a bang.
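
The "grab the 2 points" part boils down to a bit of geometry. The sketch below is Python rather than the C++ that went into the Quest DLL, and it is not the original routine. The function name and the (x, y) tuple format are assumptions for the example. It only shows how zoom, rotation and pan can be derived from where two fingers start and end.

```python
import math


def two_point_gesture(start_a, start_b, end_a, end_b):
    """Derive (scale, rotation_degrees, pan) from two touch points.

    Each argument is an (x, y) tuple: fingers A and B at the start of the
    gesture and at its current position.
    """
    # Vector from finger A to finger B, before and after the movement.
    sx, sy = start_b[0] - start_a[0], start_b[1] - start_a[1]
    ex, ey = end_b[0] - end_a[0], end_b[1] - end_a[1]

    start_len = math.hypot(sx, sy)
    if start_len == 0:
        raise ValueError("the two starting points must be different")

    # Pinch or stretch: ratio between the finger distances.
    scale = math.hypot(ex, ey) / start_len

    # Twist: change of the angle of the A-to-B vector, kept in [-180, 180).
    # With screen coordinates (y grows downwards) positive means clockwise.
    rotation = math.degrees(math.atan2(ey, ex) - math.atan2(sy, sx))
    rotation = (rotation + 180) % 360 - 180

    # Drag: movement of the midpoint between the two fingers.
    pan = ((end_a[0] + end_b[0]) / 2 - (start_a[0] + start_b[0]) / 2,
           (end_a[1] + end_b[1]) / 2 - (start_a[1] + start_b[1]) / 2)
    return scale, rotation, pan


if __name__ == "__main__":
    # Hypothetical touch coordinates in pixels.
    print(two_point_gesture((100, 100), (200, 100), (100, 100), (100, 300)))
```

The three values can then be fed straight into the object's scale, rotation and position.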

Special Thanks:
Revelino Mateus. A colleague from PSIEngine who simply speaks C++ better than Camões (the famous Portuguese poet) spoke Portuguese. Thanks Revs, you are the best.

These are the 2 Demonstration Videos:
The Quest Channel Properties
The Gestures working