Tips From The Blog XVI: fast check of two directories

Here is a quick method for checking parity between two directories.

Let’s say we have two directories dir1 and dir2. They are large and have thousands of files and subdirectories. How can we check that they have the same contents? I found myself in this situation recently during a server migration.

Method 1: rsync

To move the files in the first place we can do rsync -av --checksum path/to/dir1/ path/to/dir2 which will copy the contents of dir1 into dir2. The --checksum flag makes rsync decide which files need updating by comparing checksums rather than size and modification time. If dir2 is a new directory, or you use the --delete flag to erase anything in dir2 that is not in dir1, then once the transfer finishes it has also been checked. Barring any errors, both directories will be the same. If you just want to see what the command would do, add --dry-run.

The rsync command can be run again to check that everything went OK and that there are no more files to update or change. The problem is that with a large directory that has many files and subdirectories, this is slow, especially over a network connection.
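If you want to script that second "did anything get missed?" pass, here is a minimal sketch in Python. It assumes rsync is on the PATH; the function names are my own, and --itemize-changes is an extra flag I have added (it is not in the command above) so that the output lists only the items rsync would touch.

```python
import subprocess


def rsync_check_command(src, dst, delete=False):
    """Build an rsync command that reports differences without copying."""
    cmd = ["rsync", "-a", "--checksum", "--dry-run", "--itemize-changes"]
    if delete:
        # Also report files in dst that are missing from src.
        cmd.append("--delete")
    # A trailing slash on the source means "the contents of src".
    cmd += [src.rstrip("/") + "/", dst]
    return cmd


def rsync_differences(src, dst, delete=False):
    """Run the dry run and return one line per item rsync would change."""
    result = subprocess.run(
        rsync_check_command(src, dst, delete),
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


# Hypothetical usage:
# for line in rsync_differences("path/to/dir1", "path/to/dir2", delete=True):
#     print(line)
```

An empty list suggests the directories agree. Lines can also appear for attribute-only changes such as timestamps, so a non-empty list is a prompt to look closer and not proof of missing files. This is still as slow as running rsync by hand, since every file is read to compute its checksum.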

Method 2: diff

Using diff to check for differences between the contents of dir1 and dir2 is a great way to check parity.

diff -rq path/to/dir1 path/to/dir2

Here -r recurses into subdirectories and -q reports only that files differ, not how. The output lists any files that are present in one directory but absent from the other, and it also shows which files have changed.

Again the problem here is speed. If the diff is done in the shell of a computer with two network volumes, particularly over a slow connection, this can take a long time.

Method 3: python

Python has a library filecmp which could help us.

import filecmp

dir1 = 'path/to/dir1'
dir2 = 'path/to/dir2'
# Print a report for dir1 vs dir2 and, recursively, every common subdirectory
filecmp.dircmp(dir1, dir2).report_full_closure()

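We can get lists of contents and it works fast, but the results are difficult to make sense of. Part of the speed comes from the comparison being shallow by default: files whose type, size and modification time match are treated as identical without their contents being read.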
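One way to make the output easier to read is to walk the dircmp object yourself and flatten it into three lists. This is a rough sketch and not part of filecmp: the summarize function and the dictionary keys are names I made up. The attributes it reads (left_only, right_only, diff_files, subdirs) are the standard ones on dircmp.

```python
import filecmp
import os


def _collect(dcmp, prefix, summary):
    """Add this level's differences, then recurse into common subdirectories."""
    summary["only_in_dir1"] += [os.path.join(prefix, n) for n in dcmp.left_only]
    summary["only_in_dir2"] += [os.path.join(prefix, n) for n in dcmp.right_only]
    summary["differ"] += [os.path.join(prefix, n) for n in dcmp.diff_files]
    for name, sub in dcmp.subdirs.items():
        _collect(sub, os.path.join(prefix, name), summary)


def summarize(dcmp):
    """Flatten a filecmp.dircmp tree into sorted lists of relative paths."""
    summary = {"only_in_dir1": [], "only_in_dir2": [], "differ": []}
    _collect(dcmp, "", summary)
    for paths in summary.values():
        paths.sort()
    return summary


# Hypothetical usage:
# report = summarize(filecmp.dircmp("path/to/dir1", "path/to/dir2"))
# for heading, paths in report.items():
#     print(heading, *paths, sep="\n  ")
```

Files that could not be compared (funny_files in filecmp's terms) are ignored here, so treat the result as a summary and not a guarantee.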

Method 4: easy/quick method

Let’s say we are pretty sure that the directories are the same, but we just want to make sure. Imagine you did the migration a few days ago but aren’t sure whether anything has changed in dir1 or dir2 since then. If others have read-write access to dir1 or dir2, there might be files added or removed that you don’t know about.

Here’s a simple solution in zsh.

Get a list of files in both directories and put them in two text files on the desktop.

cd path/to/dir1
find -L . > ~/Desktop/dir1.txt
cd path/to/dir2
find -L . > ~/Desktop/dir2.txt

Changing into each directory first means that the relative paths of all files and folders listed in the text files are comparable. The two outputs we get for a dummy set of files (created using this script) look like this

.
./not_the_same
./dir_only_in_dir1
./file_only_in_dir1
./common_file
./file_in_dir1
./common_dir
./common_dir/dir2
./common_dir/dir2/not_the_same
./common_dir/dir2/common_file
./common_dir/dir2/file_in_dir1
./common_dir/dir2/dir_only_in_dir2
./common_dir/dir2/common_dir
./common_dir/dir2/file_only_in_dir2
./common_dir/dir1
./common_dir/dir1/not_the_same
./common_dir/dir1/dir_only_in_dir1
./common_dir/dir1/file_only_in_dir1
./common_dir/dir1/common_file
./common_dir/dir1/file_in_dir1
./common_dir/dir1/common_dir

and

.
./not_the_same
./common_file
./file_in_dir1
./dir_only_in_dir2
./common_dir
./common_dir/dir2
./common_dir/dir2/not_the_same
./common_dir/dir2/common_file
./common_dir/dir2/file_in_dir1
./common_dir/dir2/dir_only_in_dir2
./common_dir/dir2/common_dir
./common_dir/dir2/file_only_in_dir2
./common_dir/dir1
./common_dir/dir1/not_the_same
./common_dir/dir1/dir_only_in_dir1
./common_dir/dir1/file_only_in_dir1
./common_dir/dir1/common_file
./common_dir/dir1/file_in_dir1
./common_dir/dir1/common_dir
./file_only_in_dir2

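This step is seriously fast. On the large directories I was scanning, it took just a few minutes for find versus hours for rsync or diff. The reason is that find only reads directory entries and never opens the files. The flip side is that we are comparing names only: a file whose contents have changed will not show up.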
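If you would rather stay in Python for the listing step, something like the sketch below gives roughly the same output as find -L . without needing to change directory. The function names are mine, and this is an approximation of find and not a drop-in replacement: in particular os.walk with followlinks=True can loop forever if a symlink points back at a parent directory.

```python
import os


def list_tree(root):
    """Return every path under root, relative to root, in the style of `find -L .`."""
    paths = ["."]
    # followlinks=True descends into symlinked directories, like find's -L.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        rel = os.path.relpath(dirpath, root)
        base = "." if rel == "." else os.path.join(".", rel)
        for name in dirnames + filenames:
            paths.append(os.path.join(base, name))
    return paths


def write_listing(root, out_path):
    """Write the listing to a text file, one path per line."""
    with open(out_path, "w", encoding="utf-8", errors="surrogateescape") as fh:
        fh.write("\n".join(list_tree(root)) + "\n")


# Hypothetical usage:
# write_listing("path/to/dir1", os.path.expanduser("~/Desktop/dir1.txt"))
# write_listing("path/to/dir2", os.path.expanduser("~/Desktop/dir2.txt"))
```

The order of the lines depends on the filesystem, just as with find, so the listing still needs sorting before it is compared.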

Now we can sort the text files into alphabetical order and write each sorted list to a new file:

cd ~/Desktop
sort dir1.txt > dir1sort.txt
sort dir2.txt > dir2sort.txt

Then we can just diff the text files by doing diff dir1sort.txt dir2sort.txt
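The sort-then-diff step can also be sketched in Python with the standard difflib module. The function names below are hypothetical, and the sample paths are a few taken from the dummy listings earlier in the post. The first function returns a unified diff like diff -u, and the second simply says which paths are unique to each side.

```python
import difflib


def diff_listings(lines1, lines2):
    """Sort both listings and return a unified diff as a list of lines."""
    return list(difflib.unified_diff(
        sorted(lines1), sorted(lines2),
        fromfile="dir1sort.txt", tofile="dir2sort.txt", lineterm="",
    ))


def only_in_each(lines1, lines2):
    """Return (paths only in the first listing, paths only in the second)."""
    s1, s2 = set(lines1), set(lines2)
    return sorted(s1 - s2), sorted(s2 - s1)


# A few entries from the two example listings
dir1 = [".", "./not_the_same", "./dir_only_in_dir1", "./file_only_in_dir1", "./common_file"]
dir2 = [".", "./not_the_same", "./common_file", "./dir_only_in_dir2", "./file_only_in_dir2"]

print("\n".join(diff_listings(dir1, dir2)))
print(only_in_each(dir1, dir2))
```

Python's sorted orders by code point, whereas the order used by the sort command depends on your locale, so the two may arrange some names differently. That does not matter for the comparison as long as both listings are sorted the same way.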

However, I really like diff2html for a more visual way to observe the output. Using:

diff -u dir1sort.txt dir2sort.txt | diff2html -i stdin

We get this:

[Image: diff2html rendering of the diff between dir1sort.txt and dir2sort.txt]

The sorting step helps with diff, because contiguous blocks of files in subdirectories will be grouped together following the sort.

The example I’m showing here is very simple and the power of this method comes when a complicated directory tree needs to be quickly compared.

This post is part of an occasional series of tech tips.
