Recovering from ReiserFS/Disk Failures

Awful. Hardware problems are always a nightmare, as expected, been there done that, but still an awful experience. I like the absolute reliability of code, as a mathematical entity.

But, nu, postponed enough, now have to deal with it.

  1. Been getting repeated errors for some time:
    $ tail -f /var/log/messages
    Apr  7 23:23:21 hostname kernel: [473205.958835] ata2.00: limiting speed to UDMA/100:PIO4
    Apr  7 23:23:21 hostname kernel: [473205.958853]          res 40/00:00:30:00:00/40:00:30:00:00/e0 Emask 0x14 (ATA bus error)
    Apr  7 23:23:21 hostname kernel: [473205.958860] ata2: hard resetting port
    Apr  7 23:23:22 hostname kernel: [473206.994283] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr  7 23:23:22 hostname kernel: [473207.011050] ata2.00: ata_hpa_resize 1: sectors = 1953525168, hpa_sectors = 1953525168
    Apr  7 23:23:22 hostname kernel: [473207.018900] ata2.00: ata_hpa_resize 1: sectors = 1953525168, hpa_sectors = 1953525168
    Apr  7 23:23:22 hostname kernel: [473207.018909] ata2.00: configured for UDMA/100
    Apr  7 23:23:22 hostname kernel: [473207.018923] ata2: EH complete
    Apr  7 23:23:22 hostname kernel: [473207.042038] SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
    Apr  7 23:23:22 hostname kernel: [473207.042200] sdb: Write Protect is off
    Apr  7 23:23:22 hostname kernel: [473207.046095] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    Apr  7 23:23:56 hostname kernel: [473241.701613] ATA: abnormal status 0x80 on port 0x0000000000010977
    Apr  7 23:23:56 hostname kernel: [473241.711921] ATA: abnormal status 0x80 on port 0x0000000000010977
    Apr  7 23:23:56 hostname kernel: [473241.722227] ATA: abnormal status 0x80 on port 0x0000000000010977
  2. /dev/sdb, but which partition? And what’s the problem?
  3. Later, maybe elsewhere, getting different errors:
    Apr 13 15:53:31 hostname kernel: [5192358.304018] ReiserFS: dm-5: warning: zam-7001: io error in reiserfs_find_entry
  4. And then, lost access to some files!
    Apr 14 22:39:02 hostname kernel: [5303037.750071] ReiserFS: dm-2: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [22594 66995 0x0 SD]
  5. Yeah, ReiserFS; these are ancient LVM volumes from the 2.4 kernel days that I’ve been migrating between harddisks. Host now runs 2.6.20. Where’s the source code?
  6. Some reports seem to suggest these aren’t real hardware problems.
    Apr 17 12:58:32 hostname kernel: [  526.391826] ReiserFS: warning: is_tree_node: node level 63684 does not match to the expected one 1
    Apr 17 12:58:32 hostname kernel: [  526.391831] ReiserFS: dm-5: warning: vs-5150: search_by_key: invalid format found in block 45842437. Fsck?
    Apr 17 12:58:32 hostname kernel: [  526.391838] ReiserFS: dm-5: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [39948 39954 0x0 SD]

    Fsck!

  7. Rebooting only replays journal, doesn’t run fsck? Manually, then.
  8. Who are “dm-5” and “dm-2”?!
  9. VLs are data, not OS, so I can unmount and fsck them without rebooting into single mode:
    Checking internal tree..block 2523426: The level of the node (11888) is not correct, (1) expected
     the problem in the internal node occured (2523426), whole subtree is skipped
    Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
    Bad nodes were found, Semantic pass skipped
    1 found corruptions can be fixed only when running with --rebuild-tree
  10. Everyone says scary things about “–rebuild-tree”, so let’s snapshot before unleashing ReiserFS:
    dd if=/dev/vg/label conv=noerror | gzip --stdout > /mnt/sda9/label.img.gz
  11. This one is small enough, but still takes forever (old machine, PATA disks)! Should’ve piped thru pv to monitor progress precisely. Instead, watching image grow:
    watch --diff du /mnt/sda9/label.img.gz

    CPU goes to 100%, and overheats! Gzip wasn’t a significant gain because volume was full, and lots of tarballs, PDFs — stuff that’s already compressed.

  12. Oy, hardware errors confirmed — low-level image creation doesn’t go thru ReiserFS:
    Apr 17 21:46:39 hostname kernel: [18722.034559] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Apr 17 21:46:39 hostname kernel: [18722.034565] ata2.00: (BMDMA stat 0x25)
    Apr 17 21:46:39 hostname kernel: [18722.034572] ata2.00: cmd 25/00:98:0f:13:c5/00:00:5f:00:00/e0 tag 0 cdb 0x0 data 77824 in
    Apr 17 21:46:39 hostname kernel: [18722.034574]          res 51/40:00:5f:00:00/40:00:5f:00:00/e0 Emask 0x9 (media error)
  13. Not sure how would’ve used this snapshot to recover, but, nu.
    # fsck.reiserfs --rebuild-tree /dev/vg/label

    CPU is 60% waiting for I/O, and disk churning…

  14. reiserfsck 3.6.19 (2003 www.namesys.com)
    
    *************************************************************
    ** Do not  run  the  program  with  --rebuild-tree  unless **
    ** something is broken and MAKE A BACKUP  before using it. **
    ...
    *************************************************************
    
    Will rebuild the filesystem (/dev/vg/label) tree
    Will put log info to 'stdout'
    
    Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
    Replaying journal..
    Reiserfs journal '/dev/vg/label' in blocks [18..8211]: 0 transactions replayed
    ###########
    reiserfsck --rebuild-tree started at Thu Apr 17 22:06:04 2014
    ###########
    
    Pass 0:
    ####### Pass 0 #######
    Loading on-disk bitmap .. ok, 3309832 blocks marked used
    Skipping 8370 blocks (super block, journal, bitmaps) 3301462 blocks will be read
    0%....20%....40%....60%....80%....100%
    75525 directory entries were hashed with "r5" hash.
    	"r5" hash is selected
    Flushing..finished
    	Read blocks (but not data blocks) 3301462
    		Leaves among those 14946
    		Objectids found 75545
    
    Pass 1 (will try to insert 14946 leaves):
    ####### Pass 1 #######
    Looking for allocable blocks .. finished
    0%....20%....40%....60%....80%....100%
    Flushing..finished
    	14946 leaves read
    		14862 inserted
    		84 not inserted
    ####### Pass 2 #######
    
    Pass 2:
    0%....20%....40%....60%....80%....100%
    Flushing..finished
    	Leaves inserted item by item 84
    Pass 3 (semantic):
    ####### Pass 3 #########
    rebuild_semantic_pass: The entry [28815 75561] ("Duplicity") in directory [409 28815] points to nowhere - is removed
    vpf-10650: The directory [409 28815] has the wrong size in the StatData (1000) - corrected to (968)
    rebuild_semantic_pass: The entry [28846 39978] ("FUSE") in directory [409 28846] points to nowhere - is removed
    rebuild_semantic_pass: The entry [28846 29535] ("LVM-HOWTO.html") in directory [409 28846] points to nowhere - is removed
    vpf-10650: The directory [409 28846] has the wrong size in the StatData (408) - corrected to (352)
    rebuild_semantic_pass: The entry [28886 33784] ("foo.html") in directory [81 28886] points to nowhere - is removed
    vpf-10650: The directory [81 28886] has the wrong size in the StatData (312) - corrected to (272)
    Flushing..finished
    	Files found: 70246
    	Directories found: 4995
    	Symlinks found: 282
    	Names pointing to nowhere (removed): 4
    Pass 3a (looking for lost dir/files):
    ####### Pass 3a (lost+found pass) #########
    Looking for lost directories:
    Looking for lost files:
    The object [28815 75560] has wrong mode (?---------) - corrected to -rw-------
    vpf-10670: The file [28815 75560] has the wrong size in the StatData (0) - corrected to (1824)
    vpf-10680: The file [28815 75560] has the wrong block count in the StatData (0) - corrected to (8)
    Flushing..finished
    	Objects without names 18
    	Files linked to /lost+found 18
    Pass 4 - finished
    	Deleted unreachable items 1
    Flushing..finished
    Syncing..finished
    ###########
    reiserfsck finished at Thu Apr 17 22:11:07 2014
    ###########
  15. “Lost” files found:
    # ll /vol/label/lost+found/
    total 2689
    -rw-r--r-- 1 me      users     71084 2012-11-14 19:49 22613_673
    -rw-r--r-- 1 me      users    141818 2008-11-06 21:48 28815_29497
    -rw-r--r-- 1 me      users     14733 2008-11-07 00:11 28815_29763
    -rw-r--r-- 1 me      users     28248 2008-09-21 17:21 28815_29911
    -rw-r--r-- 1 me      users    191413 2009-03-16 12:43 28815_43711
    -rw-r--r-- 1 me      users    177946 2009-03-31 13:32 28815_43721
    -rw-r--r-- 1 me      users    187650 2009-04-01 19:30 28815_43724
    -rw-r--r-- 1 me      users    130026 2009-04-09 13:36 28815_43725
    -rw-r--r-- 1 me      users    128074 2009-04-09 13:37 28815_43726
    -rw-r--r-- 1 me      users    112966 2009-04-09 13:37 28815_43727
    -rw-r--r-- 1 me      users   1235340 2013-04-24 08:04 28815_75556
    -rw-r--r-- 1 me      users     24170 2013-04-24 08:37 28815_75557
    -rw-r--r-- 1 me      users      1862 2013-04-24 13:42 28815_75558
    -rw-r--r-- 1 me      users      2135 2013-04-24 13:42 28815_75559
    -rw------- 1 root    root       1824 1970-01-01 02:00 28815_75560
    -rw-r--r-- 1 me      users    115384 2008-12-07 18:28 39978_39980
    -rw-r--r-- 1 me      me        53224 2008-12-07 18:48 39978_39981
    -rw-r--r-- 1 me      users    102817 2008-12-07 19:01 39978_39982
  16. And repeat for other volumes. Till it breaks:
    Apr 21 18:58:46 hostname kernel: [205637.923244] scsi 1:0:0:0: rejecting I/O to dead device
    Apr 21 18:58:46 hostname kernel: [205637.923252] printk: 154 messages suppressed.
    Apr 21 18:58:46 hostname kernel: [205637.923256] Buffer I/O error on device dm-5, logical block 65444863
    

    ReiserFS says:

    The problem has occurred looks like a hardware problem. If you have
    bad blocks, we advise you to get a new hard drive, because once you
    ...
    bread: Cannot read the block (65444863): (Input/output error).
    Warning... fsck.reiserfs for device /dev/vg/noise exited with signal 6.
    

    I’m screwed$#@!

  17. Now to solve the hardware problems…


Comments are closed.