2012-09-28

Testing bzip2 files quickly using hexdump

My work sometimes involves large amounts of data stored in .tar.bz2 files. Recently, we found that some of the files were corrupted. The files are large enough that running a full "bunzip2 -t" on each one takes prohibitively long. As a shortcut, we can just check for the bzip2 magic number at the beginning of each file: the bzip2 format requires that every file start with the string "BZh". Our files are stored in batches by calendar date, one directory per date (MMDD). I generate a log of the files' magic numbers like so:

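(for d in ????;
do
  for f in "$d"/*bz2;
  do
    # Dump the first 3 bytes as characters. The unquoted $(...) collapses
    # hexdump's column padding, so a good file logs as "0000000 B Z h".
    echo "$f" $(hexdump -c -n3 "$f" | grep '0000000');
  done;
done) > magic_numbers.txt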

And then I can find the bad files like so:

grep -v 'B Z h' magic_numbers.txt

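hexdump is a nice tool for inspecting binary files on the command line. Because this method looks only at the very start of each file, it runs quickly even on very large archives. It only catches files whose header is damaged, though, so a file that passes can still be corrupted further in.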
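
A slightly stricter variant of the same check, as a rough sketch rather than what I actually ran: in a bzip2 file the "BZh" string is followed by a block-size digit from 1 to 9, so the first four bytes can be tested in one go. The function name has_bzip2_header is made up for this example, and it uses od instead of hexdump so the bytes can be compared as plain hex.

```shell
# Hypothetical helper (not part of the original workflow): succeeds when
# the file starts with "BZh" followed by a block-size digit 1-9.
has_bzip2_header() {
  # Read the first 4 bytes as hex with no offsets or spaces, e.g. 425a6839
  magic=$(od -An -tx1 -N4 -- "$1" | tr -d ' \n')
  case "$magic" in
    425a683[1-9]) return 0 ;;  # 42 5a 68 = "BZh", 31..39 = "1".."9"
    *)            return 1 ;;  # anything else, including an empty file
  esac
}
```

Called inside the same loop as above, something like has_bzip2_header "$f" || echo "$f" would print only the suspect files, with no intermediate log to grep.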
