Help Topic: The DULoad Toolbox Rsync Wrappers
Maintained by: mikerb@mit.edu Get PDF
The DULoad Toolbox Rsync Wrappers
The DULoad Toolbox is designed to facilitate the use of rsync in maintaining a remote directory of files. It contains two perl scripts, one for downloading and one for uploading files, which are essentially wrappers for rsync. It also utilizes a meta-file on the local machine to facilitate regular transactions with much less complexity than the initial transaction. The rsync command is distributed by default in GNU/Linux and MacOS, and is very powerful, but also involves a set of lengthy command-line arguments which can be error-prone. The goals of the DULoad utilities are:
- Hide the complexities of rsync to make operations easier to execute and less error-prone.
- Implement key metafiles on both the remote and local machines to facilitate synchronizing sub-trees.
- Implement support for masking out "raw" versions of video and images.
The DULoad package is designed for a user who needs access to a portion of a large data archive, where the whole archive far exceeds the capacity of the user's local disk storage budget. Typically this may be an archive with large amounts of video or image files. The user may want the whole directory structure locally (without files), but may want the ability to selectively populate portions of the directory structure with files of current interest. The DULoad package implements further support for a directory structured with raw and low-resolution versions of media files. This enables users to download low-resolution versions as a preview, and only pull-down higher-resolution files if needed.
Getting the DULoad Package [top]
The DULoad Toolbox consists of two perl scripts: dload.pl and uload.pl, and a Bash script rrf.sh. If you are part of the PavLab group, already with access to the Oceanai server, you can just check out this tree via SVN:
$ svn co svn+ssh://oceanai.mit.edu/svn/repos/project-pavlab
The directory project-pavlab/utils/bin should then be added to your shell path.
For members outside the lab, the DULoad Toolbox scripts can be download in a single tar file from here:
$ wget http://oceanai.mit.edu/duload/project-duload.tar
Once these scripts are added to your shell path, they're ready to go.
Also see [1].
Basic Usage and Working Example [top]
The duload utilities are designed to be invoked on your local machine, working with another machine acting as a remote server, e.g., the oceanai server used in our lab. It is assumed that you have an account on the remote server with the ability to ssh onto that machine (preferably with ssh-keys). We'll use the below directory as an example. It is a simple event from which two videos were made. The raw videos are stored, along with a lower-resolution version of each.
event_aug1616/ raw_vids/ in_water.mov 1.5 GB on_shore.mov 1.8 GB vids/ in_water.mp4 181 MB on_shore.mp4 472 MB
Let's assume it exists on a machine with IP address 18.38.2.158, and on the local file system as /raiddrive/archives/. Assuming you have read access on that machine for this directory, you could copy it onto your local machine with:
$ scp -rp 18.38.2.158:/raiddrive/archives/event_aug1616 .
The drawback of scp however is that if this transmission is interupted before completion, a subsequent scp invocation would start over again. A better approach with rsync is:
$ rsync -aP --partial 18.38.2.158:/raiddrive/archives/event_aug1616 .
If the above rsync transmision were interrupted, the subsequent invocation would only copy files that were not copied on the first transmission. With the --partial option, a file interrupted mid-transmission would resume only to copy the remaining part of the file.
So rsync is clearly advantageous. The dload.pl utility uses rsync and implements a few other important features.
Basic Usage of the dload.pl utility [top]
Using the dload.pl utility, an initial synchronization begins with:
$ dload.pl --init=18.38.2.158:/raiddrive/archives/event_aug1616
This will copy all the remote directories, but not the files:
$ cd event_aug1616 $ ls raw_vids/ vids/
Once the tree has been initialized locally, we can fill it in later at our discretion. Even though we have only downloaded the directories, we can peek at the contents with:
$ cd event_aug1616 $ dload.pl --all --raw --list-only drwxrwxr-x 4096 2017/07/10 16:50:32 . drwxrwxr-x 4096 2017/07/10 16:50:32 raw_vids -rw------- 1565214170 2017/07/10 16:48:34 raw_vids/in_water.mov -rw------- 1911673613 2017/07/10 16:49:33 raw_vids/on_shore.mov drwxrwxr-x 4096 2017/07/10 16:52:53 vids -rw-r--r-- 189222648 2017/07/10 16:51:25 vids/in_water.mp4 -rw-r--r-- 494034665 2017/07/10 16:51:15 vids/on_shore.mp4
In this example, we may recall that the in_water video may have been interesting. So now that we can see the exact file name, it can be pulled down for preview with:
$ cd event_aug1616/vids $ dload.pl in_water.mp4
And if it turns out that the high-resolution version is wanted, it can then be pulled down with:
$ cd event_aug1616/raw_vids $ dload.pl in_water.mov
In practice, it is common to download a directory in its entirety except all the raw files. By default this is the mode dload operates in. Raw files are all files who's filename or directory name begins with string "raw_", case insensitive. In our example, if the user wanted to download the entire event_aug1616 directory without raw files:
$ cd event_aug1616/ $ dload.pl --all $ ls * raw_vids: vids: in_water.mp4 on_shore.mp4
Using the color=#555555--dry-run Command Line Option [top]
The rsync utility supports a --dry-run commnand line switch which shows the user which files would be transferred if actually invoked without this option. Returning to our example, in this case with raw videos not yet downloaded:
$ cd event_aug1616/ $ ls * raw_vids: <-- note raw videos are missing vids: in_water.mp4 on_shore.mp4 $ dload.pl --all --raw --dry-run receiving file list ... 8 files to consider raw_vids/ <-- raw videos *would* be downloaded raw_vids/in_water.mov raw_vids/on_shore.mov sent 98 bytes received 262 bytes 720.00 bytes/sec total size is 4175680306 speedup is 11599111.96 $ ls * raw_vids: <-- raw videos are still not here vids: in_water.mp4 on_shore.mp4
The --list-only switch only lists the directory contents on the remote machine, and does not show you which remote files are locally missing, and vice versa. The --dry-run switch presents you the differences between local and remote machines. Using --dry-run, or simply -d, is a good habit to use prior to most operations, just to confirm that you're actually doing what you think you're doing.
Note: In rsync the --dry-run option may be abbreviated to -n. In the dload.pl and uload.pl utilities, this is abbreviate to -d.
Dowloading Only a Portion of a Directory [top]
If a portion of a remote archive is all that is needed, this portion can be just dowloaded directly with the appropriately extended URL. In our example, if only the vids are wanted:
$ dload.pl --init=18.38.2.158:/raiddrive/archives/event_aug1616/vids $ cd vids $ dload.pl --all
(In the rare case that you later decide you want a portion of the tree above the original sub-tree dowloaded, you have three options. (1) The brute force option is to simply start over by downloading the higher level tree. This may be fine if the amount of data is relatively small and bandwidth is high. (2) Download the higher level tree (not yet with files), and move the previously downloaded files into place. (3) Create the higher-level parent directories by hand, move the .rsync_info to the root of the tree, and edit this file accordingly.)
Basic Usage of the uload.pl utility [top]
The uload.pl utility is used for uploading files to tree that has been previously downloaded via dload.pl. By example, let's add another video to our local vids folder and upload it:
$ cd event_aug1616/vids $ ls in_water.mp4 on_shore.mp4 $ mv ~/in_air.mp4 . <-- move a new video into local folder $ uload.pl in_air.mp4 Will handle file: in_air.mp4 building file list ... 1 file to consider in_air.mp4 15535210 100% 7.00MB/s 0:00:02 (xfer#1, to-check=0/1) sent 15537234 bytes received 42 bytes 6214910.40 bytes/sec total size is 15535210 speedup is 1.00ls
The --all (or -a) switch can be used to upload everything in the local directory (including all local directories recursively) to the remote machine:
$ uload.pl -a
The --user=USER switch can be used if the local username is different than the remote username.
$ uload.pl --username=jane
Local Storage of the rsync Root Location [top]
A key motivation for implementing the dload.pl and uload.pl scripts is to ease the user burden on getting complex command line arguments correct. We really want to be able to confidently remove any part of the local tree (re-claiming local storage) and get it back with a trivially simple operation. For example:
$ cd event_aug1616 $ ls * raw_vids: in_water.mov on_shore.mov vids: in_water.mp4 on_shore.mp4 $ rm -rf raw_vids $ ls * vids: in_water.mp4 on_shore.mp4 $ dload.pl -a -r raw_vids: in_water.mov on_shore.mov vids: in_water.mp4 on_shore.mp4
Remember the tool is designed for the scenario where the user is working with a portion of an archive that may be too large to hold in its entirety on one's local computer. Easily deleting and restoring parts of the tree is a primary design goal of dload.pl.
In the above example, after removing the raw_vids directory, rsync could have been used directly to restore the tree with
$ cd event_aug1616 $ rsync -aP 18.38.2.158:/raiddrive/archives/event_aug1616/raw_vids .
Easy, right? You just have to precisely remember the original URL including the server IP address and the location of the files on the remote machine. And, if you accidentally put a trailing "/", as in ../raw_vids/, then you get the contents of raw_vids instead of the directory. The dload.pl utility accomplishes the same with:
$ cd event_aug1616 $ dload.pl -a -r (-r is short for --raw, -a is short for --all)
To accomplish this simplicity, a bit of behind-the-scenes magic needs to happen. When the (initially empty) tree is first downloaded, a single hidden file, .rsync_info, is created in the root of the tree. This file contains the "root" of the original tree, URL and all:
$ dload.pl --init=18.38.2.158:/raiddrive/archives/event_aug1616 $ cd event_aug1616 $ cat .rsync_info 18.38.2.158:/raiddrive/archives/event_aug1616
Whenever dload.pl or dload.pl is invoked, the script initially looks up the current full path until it finds the .rsync_info file. It uses this information, combined with the full present working directory, to re-constuct the proper rsync command line argument. Neither dload.pl or uload.pl will work without it:
$ cd event_aug1616 $ rm -f .rsync_info $ dload.pl -a The .rsync_info file was not found. Exiting.
References
1. | Michael Benjamin, The DULoad Toolbox Scheme, http://oceanai.mit.edu/pavlab/help/duload_scheme, 2017. |
Page built from LaTeX source using texwiki, developed at MIT. Errata to issues@moos-ivp.org. Get PDF