My work sometimes involves dealing with large amounts of data stored in .tar.bz2 files. Recently, we found that some of the files were corrupted. The files are large enough that a full "bunzip2 -t" integrity test takes prohibitively long. As a shortcut, we can just check for the bzip2 magic number at the beginning of each file: the bzip2 format requires that every file start with the string "BZh". This only catches files whose header is wrong, so it is a quick screen rather than a full integrity test. Our files are stored in batches by calendar date, one directory per MMDD. I generate a log of the files' magic numbers like so:
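(for d in ????;
do
  for f in "$d"/*bz2;
  do
    # -n3 makes hexdump read only the first three bytes; the unquoted
    # substitution collapses its column padding to single spaces.
    echo "$f" $(hexdump -c -n3 "$f" | grep '0000000');
  done;
done) > magic_numbers.txt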
And then I can find the bad files like so:
grep -v 'B Z h' magic_numbers.txt
hexdump is a nice tool for inspecting binary files on the command line. This method of testing the files runs relatively fast, since it looks only at the start of each file instead of decompressing the whole archive.
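The same check can also be done in one pass, without the intermediate log. The following is only a sketch under a few assumptions: the function names has_bzip2_magic and list_bad_files are made up for illustration, the MMDD directories sit in the current directory, and od is used in place of hexdump so that the first three bytes can be compared as hex (42 5a 68 is "BZh" in ASCII).

```shell
# Hypothetical helper (not from the original post): succeeds if the file
# starts with the three-byte bzip2 signature "BZh" (hex 42 5a 68).
has_bzip2_magic() {
  # od prints the first 3 bytes as hex; tr strips the spaces and newline.
  [ "$(od -An -tx1 -N3 "$1" 2>/dev/null | tr -d ' \n')" = "425a68" ]
}

# Hypothetical helper: print every *bz2 file in the four-character
# (MMDD) directories whose first three bytes are not "BZh".
list_bad_files() {
  for f in ????/*bz2; do
    [ -e "$f" ] || continue          # glob matched nothing; skip it
    has_bzip2_magic "$f" || echo "$f"
  done
}
```

Running "list_bad_files > bad_files.txt" from the directory that holds the MMDD folders would then write one suspect file name per line. Empty or unreadable files are reported as bad too, since they cannot start with "BZh".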
