Optimizing HPSS file retrieval

Updated 10/25/2019 – In the batch script example, the partition to use is now hpss rather than dav. The "module reset" line has been removed. Please update any scripts you use that are based on that example.

Retrieving files from HPSS can be time-consuming, especially if you need numerous files that are stored on different tapes. When the files are scattered across many tape cartridges, grouping your requests by tape location can speed up retrieval in some cases.

Here's what to do:

  • Generate a list of files sorted by tape cartridge.
  • Submit an interactive request for files that are all on the same tape,
    or submit a batch job to retrieve larger numbers of files.
  • When you have all the files you need from the first tape, request files from the next tape, and so on.

Details are in the examples below.

Listing files by tape cartridge

Run a script, such as this example, to list your files and identify the HPSS tape where each one resides. This script does not have error control and is not guaranteed to work in all cases.

hsi -P ls -PR target_dir | awk 'BEGIN { FS = "\t" } { if ($1 == "FILE") print $6, $2 }' | sort

The example assumes you have read permissions on the HPSS target directory that you specify.

In the following sample output, the tape number precedes each file's pathname. Files that appear without a tape number are zero-length files.

80787200  /home/juser/mpilsf.657862.out
80787200  /home/juser/mpilsf.657862.err
80787600  /home/juser/mpilsf.657904.out
80787600  /home/juser/mpilsf.657904.err
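
Counting the files on each tape can help you decide whether a tape's files warrant an interactive request or a batch job. The following is a minimal sketch rather than part of the documented procedure. It assumes that the listing was saved to a file in the format shown above and that pathnames contain no spaces. The file name tape_list.txt and the function name count_files_per_tape are hypothetical.

```shell
#!/bin/sh
# Sketch: count how many files reside on each tape.
# Assumed input format (from the sample output): "<tape number> <pathname>".
# Lines with only one field (zero-length files, which have no tape number)
# are skipped. Pathnames containing spaces are not handled.

count_files_per_tape() {
    # $1: file holding the saved listing (hypothetical name: tape_list.txt)
    awk 'NF >= 2 { n[$1]++ } END { for (t in n) print t, n[t] }' "$1" | sort
}

# Run only when a listing file is given on the command line.
if [ $# -gt 0 ]; then
    count_files_per_tape "$1"
fi
```

Each output line shows a tape number followed by the number of files found on that tape.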

Requesting files from tape

Create an input file that contains the get or cget commands that you want to run and that specify the files that you want to retrieve. The order in which you list the files does not matter, but they should be on the same tape.

Here's an example of what your input file might include:

get pathnameA/filenameA
get pathnameB/filenameB
get pathnameC/filenameC
get pathnameD/filenameD
get pathnameE/filenameE
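
If you saved the sorted listing described earlier, you can generate one input file per tape instead of writing each one by hand. The following is a minimal sketch under stated assumptions, not a tested site utility. It assumes that the listing is sorted by tape number, that each line has the form "tape number, whitespace, pathname," and that pathnames contain no spaces. The names tape_list.txt, get_<tape>.in, and make_input_files are hypothetical.

```shell
#!/bin/sh
# Sketch: write one HSI input file per tape from a sorted listing.
# Assumed input format: "<tape number> <pathname>", sorted by tape number.
# Output files are named get_<tape number>.in in the current directory
# (a hypothetical naming scheme). Lines with no tape number (zero-length
# files) are skipped, and pathnames containing spaces are not handled.

make_input_files() {
    # $1: file holding the saved listing (hypothetical name: tape_list.txt)
    awk '
        NF >= 2 {
            out = "get_" $1 ".in"
            if (out != prev) {
                # Input is sorted, so the previous tape is finished.
                if (prev != "") close(prev)
                prev = out
            }
            print "get " $2 > out
        }
    ' "$1"
}

# Run only when a listing file is given on the command line.
if [ $# -gt 0 ]; then
    make_input_files "$1"
fi
```

Each resulting file contains only get commands for files on a single tape, so you can pass the files to HSI one at a time. Running the sketch again overwrites any existing get_<tape>.in files.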

Submit an interactive request or a batch job to retrieve the files. To avoid exceeding your concurrent transfer limit, do not run more than five interactive requests or batch jobs concurrently.

Interactive request

Use the HSI in command followed by the name of your input file.

hsi in input_file_name

Batch job

Submit a batch job if you need to retrieve hundreds or thousands of files. A batch job doesn't stop when you log out, and you don't have to babysit a potentially long process.

Use a script like the following example to submit a sequential batch job. You can request as many files as you want in one job if you allow sufficient wall-clock time. Adjust the wall-clock time after you have run a few representative jobs and have a feel for what you really need.

Be sure to place the batch script and the input file in the same directory where the retrieved files are to be placed. (Use bash instead of tcsh if you prefer.)

#!/bin/tcsh
### bash users replace /tcsh with /bash -l
#SBATCH --job-name=HPSS_job
#SBATCH --account=project_code
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=24:00:00
#SBATCH --partition=hpss
#SBATCH --output=HPSS_job.out.%j

setenv TMPDIR /glade/scratch/$USER/temp
### bash users - export TMPDIR=/glade/scratch/$USER/temp
mkdir -p $TMPDIR

hsi in input_file