Awful. Hardware problems are always a nightmare, as expected, been there done that, but still an awful experience. I like the absolute reliability of code, as a mathematical entity.
But, nu, postponed enough, now have to deal with it.
- Been getting repeated errors for some time:
$ tail -f /var/log/messages Apr 7 23:23:21 hostname kernel: [473205.958835] ata2.00: limiting speed to UDMA/100:PIO4 Apr 7 23:23:21 hostname kernel: [473205.958853] res 40/00:00:30:00:00/40:00:30:00:00/e0 Emask 0x14 (ATA bus error) Apr 7 23:23:21 hostname kernel: [473205.958860] ata2: hard resetting port Apr 7 23:23:22 hostname kernel: [473206.994283] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Apr 7 23:23:22 hostname kernel: [473207.011050] ata2.00: ata_hpa_resize 1: sectors = 1953525168, hpa_sectors = 1953525168 Apr 7 23:23:22 hostname kernel: [473207.018900] ata2.00: ata_hpa_resize 1: sectors = 1953525168, hpa_sectors = 1953525168 Apr 7 23:23:22 hostname kernel: [473207.018909] ata2.00: configured for UDMA/100 Apr 7 23:23:22 hostname kernel: [473207.018923] ata2: EH complete Apr 7 23:23:22 hostname kernel: [473207.042038] SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB) Apr 7 23:23:22 hostname kernel: [473207.042200] sdb: Write Protect is off Apr 7 23:23:22 hostname kernel: [473207.046095] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA Apr 7 23:23:56 hostname kernel: [473241.701613] ATA: abnormal status 0x80 on port 0x0000000000010977 Apr 7 23:23:56 hostname kernel: [473241.711921] ATA: abnormal status 0x80 on port 0x0000000000010977 Apr 7 23:23:56 hostname kernel: [473241.722227] ATA: abnormal status 0x80 on port 0x0000000000010977
- /dev/sdb, but which partition? And what’s the problem?
- Later, maybe elsewhere, getting different errors:
Apr 13 15:53:31 hostname kernel: [5192358.304018] ReiserFS: dm-5: warning: zam-7001: io error in reiserfs_find_entry
- And then, lost access to some files!
Apr 14 22:39:02 hostname kernel: [5303037.750071] ReiserFS: dm-2: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [22594 66995 0x0 SD]
- Yeah, ReiserFS; these are ancient LVM volumes from the 2.4 kernel days that I’ve been migrating between harddisks. Host now runs 2.6.20. Where’s the source code?
- Some reports seem to suggest these aren’t real hardware problems.
Apr 17 12:58:32 hostname kernel: [ 526.391826] ReiserFS: warning: is_tree_node: node level 63684 does not match to the expected one 1 Apr 17 12:58:32 hostname kernel: [ 526.391831] ReiserFS: dm-5: warning: vs-5150: search_by_key: invalid format found in block 45842437. Fsck? Apr 17 12:58:32 hostname kernel: [ 526.391838] ReiserFS: dm-5: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [39948 39954 0x0 SD]
Fsck!
- Rebooting only replays journal, doesn’t run fsck? Manually, then.
- Who are “dm-5” and “dm-2”?!
- VLs are data, not OS, so I can unmount and fsck them without rebooting into single mode:
Checking internal tree..block 2523426: The level of the node (11888) is not correct, (1) expected the problem in the internal node occured (2523426), whole subtree is skipped Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs. Bad nodes were found, Semantic pass skipped 1 found corruptions can be fixed only when running with --rebuild-tree
- Everyone says scary things about “–rebuild-tree”, so let’s snapshot before unleashing ReiserFS:
dd if=/dev/vg/label conv=noerror | gzip --stdout > /mnt/sda9/label.img.gz
- This one is small enough, but still takes forever (old machine, PATA disks)! Should’ve piped thru pv to monitor progress precisely. Instead, watching image grow:
watch --diff du /mnt/sda9/label.img.gz
CPU goes to 100%, and overheats! Gzip wasn’t a significant gain because volume was full, and lots of tarballs, PDFs — stuff that’s already compressed.
- Oy, hardware errors confirmed — low-level image creation doesn’t go thru ReiserFS:
Apr 17 21:46:39 hostname kernel: [18722.034559] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Apr 17 21:46:39 hostname kernel: [18722.034565] ata2.00: (BMDMA stat 0x25) Apr 17 21:46:39 hostname kernel: [18722.034572] ata2.00: cmd 25/00:98:0f:13:c5/00:00:5f:00:00/e0 tag 0 cdb 0x0 data 77824 in Apr 17 21:46:39 hostname kernel: [18722.034574] res 51/40:00:5f:00:00/40:00:5f:00:00/e0 Emask 0x9 (media error)
- Not sure how would’ve used this snapshot to recover, but, nu.
# fsck.reiserfs --rebuild-tree /dev/vg/label
CPU is 60% waiting for I/O, and disk churning…
-
reiserfsck 3.6.19 (2003 www.namesys.com) ************************************************************* ** Do not run the program with --rebuild-tree unless ** ** something is broken and MAKE A BACKUP before using it. ** ... ************************************************************* Will rebuild the filesystem (/dev/vg/label) tree Will put log info to 'stdout' Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes Replaying journal.. Reiserfs journal '/dev/vg/label' in blocks [18..8211]: 0 transactions replayed ########### reiserfsck --rebuild-tree started at Thu Apr 17 22:06:04 2014 ########### Pass 0: ####### Pass 0 ####### Loading on-disk bitmap .. ok, 3309832 blocks marked used Skipping 8370 blocks (super block, journal, bitmaps) 3301462 blocks will be read 0%....20%....40%....60%....80%....100% 75525 directory entries were hashed with "r5" hash. "r5" hash is selected Flushing..finished Read blocks (but not data blocks) 3301462 Leaves among those 14946 Objectids found 75545 Pass 1 (will try to insert 14946 leaves): ####### Pass 1 ####### Looking for allocable blocks .. finished 0%....20%....40%....60%....80%....100% Flushing..finished 14946 leaves read 14862 inserted 84 not inserted ####### Pass 2 ####### Pass 2: 0%....20%....40%....60%....80%....100% Flushing..finished Leaves inserted item by item 84 Pass 3 (semantic): ####### Pass 3 ######### rebuild_semantic_pass: The entry [28815 75561] ("Duplicity") in directory [409 28815] points to nowhere - is removed vpf-10650: The directory [409 28815] has the wrong size in the StatData (1000) - corrected to (968) rebuild_semantic_pass: The entry [28846 39978] ("FUSE") in directory [409 28846] points to nowhere - is removed rebuild_semantic_pass: The entry [28846 29535] ("LVM-HOWTO.html") in directory [409 28846] points to nowhere - is removed vpf-10650: The directory [409 28846] has the wrong size in the StatData (408) - corrected to (352) rebuild_semantic_pass: The entry [28886 33784] ("foo.html") in directory [81 28886] points to nowhere - is removed vpf-10650: The directory [81 28886] has the wrong size in the StatData (312) - corrected to (272) Flushing..finished Files found: 70246 Directories found: 4995 Symlinks found: 282 Names pointing to nowhere (removed): 4 Pass 3a (looking for lost dir/files): ####### Pass 3a (lost+found pass) ######### Looking for lost directories: Looking for lost files: The object [28815 75560] has wrong mode (?---------) - corrected to -rw------- vpf-10670: The file [28815 75560] has the wrong size in the StatData (0) - corrected to (1824) vpf-10680: The file [28815 75560] has the wrong block count in the StatData (0) - corrected to (8) Flushing..finished Objects without names 18 Files linked to /lost+found 18 Pass 4 - finished Deleted unreachable items 1 Flushing..finished Syncing..finished ########### reiserfsck finished at Thu Apr 17 22:11:07 2014 ###########
- “Lost” files found:
# ll /vol/label/lost+found/ total 2689 -rw-r--r-- 1 me users 71084 2012-11-14 19:49 22613_673 -rw-r--r-- 1 me users 141818 2008-11-06 21:48 28815_29497 -rw-r--r-- 1 me users 14733 2008-11-07 00:11 28815_29763 -rw-r--r-- 1 me users 28248 2008-09-21 17:21 28815_29911 -rw-r--r-- 1 me users 191413 2009-03-16 12:43 28815_43711 -rw-r--r-- 1 me users 177946 2009-03-31 13:32 28815_43721 -rw-r--r-- 1 me users 187650 2009-04-01 19:30 28815_43724 -rw-r--r-- 1 me users 130026 2009-04-09 13:36 28815_43725 -rw-r--r-- 1 me users 128074 2009-04-09 13:37 28815_43726 -rw-r--r-- 1 me users 112966 2009-04-09 13:37 28815_43727 -rw-r--r-- 1 me users 1235340 2013-04-24 08:04 28815_75556 -rw-r--r-- 1 me users 24170 2013-04-24 08:37 28815_75557 -rw-r--r-- 1 me users 1862 2013-04-24 13:42 28815_75558 -rw-r--r-- 1 me users 2135 2013-04-24 13:42 28815_75559 -rw------- 1 root root 1824 1970-01-01 02:00 28815_75560 -rw-r--r-- 1 me users 115384 2008-12-07 18:28 39978_39980 -rw-r--r-- 1 me me 53224 2008-12-07 18:48 39978_39981 -rw-r--r-- 1 me users 102817 2008-12-07 19:01 39978_39982
- And repeat for other volumes. Till it breaks:
Apr 21 18:58:46 hostname kernel: [205637.923244] scsi 1:0:0:0: rejecting I/O to dead device Apr 21 18:58:46 hostname kernel: [205637.923252] printk: 154 messages suppressed. Apr 21 18:58:46 hostname kernel: [205637.923256] Buffer I/O error on device dm-5, logical block 65444863
ReiserFS says:
The problem has occurred looks like a hardware problem. If you have bad blocks, we advise you to get a new hard drive, because once you ... bread: Cannot read the block (65444863): (Input/output error). Warning... fsck.reiserfs for device /dev/vg/noise exited with signal 6.
I’m screwed$#@!
- Now to solve the hardware problems…