Wednesday, March 6, 2013

Ubuntu 12.10 random file system in read-only mode...

Hi all,
If, like me, you are a heavy Linux user, and you have some SATA II or SATA III hardware installed on your PC, there is a good chance this post will sound familiar.
You see, most computers out there are Intel... hell, most are Win-Tel. But ignoring software choices, hardware-wise, most computers out there run on Intel hardware. These days, running on Intel means having an Intel chipset, which in turn means having Intel Serial ATA controllers.
If you have Intel SATA controllers, you are lucky enough to enjoy the widely used (and, as a consequence, widely debugged) Intel SATA drivers for Linux.
If, on the other hand, you hate monopolies (on the grounds that a monopoly kills its competitors, leaves you without options and makes you a slave of one company, which can then evolve its products as slowly and price them as high as it likes), you'll probably be running on AMD hardware... and probably using some different hardware for your SATA controllers.

Enter NCQ - Native Command Queuing. Back in Parallel ATA times, TCQ was invented. TCQ stands for Tagged Command Queuing. In simple terms, instead of asking your hard drive to fetch one piece of information at a time, you send it a bunch of requests at once and let the drive choose which one it serves first. The drive then uses the position and movement of its heads, relative to the position of the data blocks on the disks, to trace a path that fetches the information faster, but not in the requested sequence. It's no different from a mailman running his delivery route. If he delivered one letter, returned to the post office to pick up the next one, and then went out again, it would take ages compared to a well-planned delivery run in which he delivers all the letters along a route, regardless of the order in which the letters were sent.
It was a good idea. However, Parallel ATA inherited its way of communicating from the ISA bus days, and that means the CPU has to get involved in every transfer: the drive cannot start moving data to memory on its own, so it has to interrupt the host and wait to be serviced for each queued command. As a result, TCQ was a good idea without support from the hardware communication protocol, and it resulted in a huge CPU overhead. It's as if the letters had to be handed to the mailman by his boss one at a time, with the boss going back and forth to fetch each letter... pointless!
SATA controllers, on the other hand, grab a bunch of DMA (Direct Memory Access) addresses and use them at will. This means that the SATA version of TCQ (called Native Command Queuing) does not need the CPU to babysit every transfer. Going back to our example, the DMA ability of the SATA bus is just like the mailman's bag, which can hold all the letters he takes before leaving the post office.
Unlike TCQ, NCQ is a success and has been implemented in practically every SATA II and SATA III controller.
NCQ has one more advantage over TCQ: it also helps the new SATA-interfaced SSD drives. A lot of you would think that it's pointless to have NCQ if the drive has no moving parts, since there is no head path to optimize... and you would be right about that. However, SSDs are so fast that the bottleneck becomes the host controller. So NCQ is used to tell the drive what to fetch next while it's waiting for the controller to signal that it is ready to receive.
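To put some numbers on the mailman analogy, here is a minimal sketch in Python. It is not how any real drive firmware works (real drives also account for platter rotation, among other things); it only compares how far a single head would travel when serving requests in arrival order versus in one sweep up and back. The block positions, the starting position and both function names are made up for illustration.

```python
def head_travel(start, positions):
    """Total distance the head moves when visiting positions in the given order."""
    travel, current = 0, start
    for pos in positions:
        travel += abs(pos - current)
        current = pos
    return travel


def reorder_one_sweep(start, positions):
    """Serve everything at or ahead of the head going up, then the rest coming back down."""
    ahead = sorted(p for p in positions if p >= start)
    behind = sorted((p for p in positions if p < start), reverse=True)
    return ahead + behind


requests = [90, 10, 80, 20, 70]  # made-up block positions, in arrival order
start = 50                       # made-up starting head position

print(head_travel(start, requests))                            # 300
print(head_travel(start, reorder_one_sweep(start, requests)))  # 120
```

Same five letters, less than half the walking. That is the whole point of letting the drive pick the order.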

So NCQ seems like a good thing. It is, but to take advantage of it the driver has to buffer things. You see, if the operating system requests 1, 2, 3 and the drive replies 2, 1, 3, the driver has to keep track of both the requests and the replies in order to match them up and keep things coherent.
THAT is the problem. The Linux drivers for some non-Intel controllers seem to have a bug in this buffering. The result is that the operating system often receives data that is out of order and thinks you have disk corruption... and other times you really do have disk corruption because of the resulting re-writes. This triggers a Linux kernel re-mount of the problematic file system in read-only mode, allowing you to read your data and ultimately back it up. The more you stress your system, the more likely this is to happen. Many people out there think their drive is failing and, as a result, are buying new hard disks.
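The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the kernel's libata code: every request goes out with a tag (NCQ allows up to 32 commands in flight), the drive answers in whatever order it likes, and the driver uses the tag to pair each answer with the request it belongs to. The function name and the sample data are hypothetical.

```python
def match_completions(issued, completions):
    """Pair each completion with the request that carried the same tag.

    issued: list of (tag, request) in the order the OS sent them.
    completions: list of (tag, data) in the order the drive finished them.
    Returns a list of (request, data) in the original issue order.
    """
    pending = {tag: request for tag, request in issued}
    finished = {}
    for tag, data in completions:
        if tag not in pending:
            # An answer nobody asked for: exactly the kind of mix-up that
            # would look like corruption further up the stack.
            raise KeyError(f"completion for unknown tag {tag}")
        finished[tag] = data
    return [(request, finished[tag]) for tag, request in issued if tag in finished]


issued = [(0, "read A"), (1, "read B"), (2, "read C")]
completions = [(1, "data B"), (0, "data A"), (2, "data C")]  # drive replied 2, 1, 3

for request, data in match_completions(issued, completions):
    print(request, "->", data)
```

If the tag bookkeeping goes wrong anywhere, "read A" ends up holding the data for B, and the file system above it has every reason to panic.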

My case: I am now running the latest Kernel 3.5.0-25 in Ubuntu-Studio 12.10.
And my workstation hardware is ...AMD!
The first time this happened was after a kernel update. So it was clear to me that I had a problem with some of the hardware drivers on that kernel. I then decided to investigate and found out that:
1- Most of the other people having this problem were running on AMD.
2- Some of them solved the issue by re-configuring the SATA controller in the BIOS to SATA I mode.
3- Some solved the issue by just changing to a PCI-Express Intel-based SATA controller.
4- Some solved it by disabling NCQ with a kernel boot parameter.

Solution until a bug fix happens:
The best way to solve this without changing hardware or rolling back to SATA I?
Just edit the file /etc/default/grub with superuser rights.
Change the line GRUB_CMDLINE_LINUX="" to GRUB_CMDLINE_LINUX="libata.force=noncq"
Save, run sudo update-grub so the change reaches the generated boot configuration, and reboot.
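If you would rather script the edit than do it by hand, here is a minimal sketch in Python. The function name add_kernel_param is my own invention, and the sketch assumes the GRUB_CMDLINE_LINUX value is wrapped in double quotes, as it is in a stock Ubuntu file. It works on the text of the file only, so you can try it on a copy first. It keeps whatever parameters are already on the line and leaves GRUB_CMDLINE_LINUX_DEFAULT alone.

```python
import re


def add_kernel_param(grub_text, param="libata.force=noncq"):
    """Return grub_text with param added to GRUB_CMDLINE_LINUX, if not already there."""
    # Matches only the GRUB_CMDLINE_LINUX="..." line, not GRUB_CMDLINE_LINUX_DEFAULT.
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX=")([^"]*)(")[ \t]*$', re.MULTILINE)

    def repl(match):
        current = match.group(2).split()
        if param not in current:
            current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    return pattern.sub(repl, grub_text)


sample = (
    'GRUB_DEFAULT=0\n'
    'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    'GRUB_CMDLINE_LINUX=""\n'
)
print(add_kernel_param(sample))
```

Running it twice gives the same result as running it once, so it is safe to re-apply. Writing the result back to /etc/default/grub still needs superuser rights, and the boot configuration still has to be regenerated with sudo update-grub afterwards.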

Next time you upgrade your kernel, switch the line back, run sudo update-grub, reboot and see whether the problem is gone.
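To check whether NCQ is actually on or off after a reboot, you can look at the queue depth the kernel reports for the disk under /sys/block. A depth of 1 means commands are not being queued; with NCQ active you will usually see a larger number such as 31. The sketch below is Python again; the device name sda and both function names are assumptions, so adjust them to your system.

```python
def ncq_looks_enabled(queue_depth_text):
    """A queue depth above 1 suggests the kernel is queuing commands on this device."""
    return int(queue_depth_text.strip()) > 1


def read_queue_depth(device="sda"):
    """Read the queue depth the kernel reports for one disk (device name is an assumption)."""
    with open(f"/sys/block/{device}/device/queue_depth") as handle:
        return handle.read()


print(ncq_looks_enabled("31\n"))  # True: a typical value with NCQ on
print(ncq_looks_enabled("1\n"))   # False: what to expect with libata.force=noncq
```

On the real machine you would call ncq_looks_enabled(read_queue_depth("sda")) instead of passing in sample strings.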