Optimizing HPSS file retrieval

Retrieving files from HPSS can be time-consuming, especially if you need numerous files that are stored on different tapes. When a large number of files is scattered across many tape cartridges, organizing your requests by tape location can speed up retrieval in some cases.

When storing large numbers of files, use HTAR to group many small files into one archive that can be stored on a single tape. Your files will be easier to retrieve.

Here's what to do:

  • Generate a list of files sorted by tape cartridge.
  • Submit an interactive request for files that are all on the same tape,
    or submit a batch job to retrieve larger numbers of files.
  • When you have all the files you need from the first tape, request files from the next tape, and so on.

Details are in the examples below.

Listing files by tape cartridge

Run a script such as the following example to list your files and identify the HPSS tape on which each one resides. The script has no error handling and is not guaranteed to work in all cases.

hsi -P ls -PR target_dir | awk 'BEGIN { FS = "\t"};{ if($1 == "FILE") print $6, $2}' | sort

The example assumes you have read permissions on the HPSS target directory that you specify.

In the following sample output, the tape number precedes each file's pathname. Files listed without a tape number are zero-length files.

80787200  /home/juser/mpilsf.657862.out
80787200  /home/juser/mpilsf.657862.err
80787600  /home/juser/mpilsf.657904.out
80787600  /home/juser/mpilsf.657904.err
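
Once you have the sorted listing, you can split it into one input file per tape. The following is a minimal sketch, not part of the original procedure. It assumes that the listing was saved to a file named tape_list.txt and that no pathname contains spaces; the get_<tape>.in file names are hypothetical. The sample data repeats the output above and adds one made-up zero-length file (/home/juser/empty.log), which is skipped because it has no tape number.

```shell
# Sample listing in "tape pathname" form; tape_list.txt is a hypothetical file name.
# The last line stands in for a zero-length file, which has no tape number.
cat > tape_list.txt <<'EOF'
80787200  /home/juser/mpilsf.657862.out
80787200  /home/juser/mpilsf.657862.err
80787600  /home/juser/mpilsf.657904.out
80787600  /home/juser/mpilsf.657904.err
  /home/juser/empty.log
EOF

# Write one file of get commands per tape (get_<tape>.in).
# Lines with fewer than two fields (no tape number) are skipped.
sort tape_list.txt | awk 'NF >= 2 {
    out = "get_" $1 ".in"
    if (out != prev) {            # input is sorted, so each tape forms one run of lines
        if (prev != "") close(prev)
        prev = out
    }
    print "get " $2 > out
}'

# Show the input files that were created.
ls get_*.in
```

Each resulting file lists only files from a single tape, so it can be passed to HSI as one request.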

Requesting files from tape

Create an input file that contains the get or cget commands for the files that you want to retrieve. The order in which you list the files does not matter, but they should all be on the same tape.

Here's an example of what your input file might include:

get pathnameA/filenameA
get pathnameB/filenameB
get pathnameC/filenameC
get pathnameD/filenameD
get pathnameE/filenameE

Submit an interactive request or a batch job to retrieve the files. To avoid exceeding your concurrent transfer limit, do not run more than five interactive requests or batch jobs at the same time.

Interactive request

Use the HSI "in" command followed by the name of your input file.

hsi in input_file
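
To work through several tapes in turn, as described in the steps at the top of this page, you could loop over per-tape input files and run one request at a time, which also keeps you under the concurrent transfer limit. This is a sketch under assumptions: the get_*.in file names and the HSI_CMD variable are hypothetical, and the loop only prints the commands unless you set HSI_CMD to hsi.

```shell
# Dry run by default: prints each command instead of contacting HPSS.
# Set HSI_CMD=hsi in your environment to perform the retrievals.
HSI_CMD="${HSI_CMD:-echo hsi}"

# Two tiny sample input files so the sketch runs on its own (hypothetical names).
# Remove these two lines when you have real per-tape input files.
printf 'get pathnameA/filenameA\n' > get_tape1.in
printf 'get pathnameB/filenameB\n' > get_tape2.in

# One request per tape, one at a time; stop at the first failure.
for input_file in get_*.in; do
    $HSI_CMD in "$input_file" || { echo "request failed: $input_file" >&2; break; }
done
```

Running the requests sequentially means that each tape is finished before the next one is requested.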

Batch job

Submit a batch job if you need to retrieve hundreds or thousands of files. A batch job doesn't stop when you log out, and you don't have to babysit a potentially long process.

Use a script like the following example to submit a sequential batch job. You can request as many files as you want, provided that you allow sufficient wall-clock time. Adjust the wall-clock time after you have run a few representative jobs and have a feel for what you really need.

Be sure to place the batch script and the input file in the same directory where the retrieved files are to be placed. (Use bash instead of tcsh if you prefer.)

#!/bin/tcsh
### bash users replace /tcsh with /bash
#SBATCH -J job_name
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 24:00:00
#SBATCH -A project_code
#SBATCH -p hpss
#SBATCH -e job_name.err.%J
#SBATCH -o job_name.out.%J

source /glade/u/apps/opt/slurm_init/hpss.csh
### bash users replace /hpss.csh with /hpss.sh

mkdir -p /glade/scratch/username/temp
setenv TMPDIR /glade/scratch/username/temp
### bash users - export TMPDIR=/glade/scratch/username/temp

hsi in input_file
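
For bash users, applying the substitutions noted in the comments gives a script along the following lines. This is a sketch rather than a tested site script: the file name retrieve_job.sh is hypothetical; job_name, project_code, username, and input_file are the same placeholders as above; and the #!/bin/bash first line assumes that bash is installed at that path. The snippet writes the script to a file and checks its syntax. Submitting it with sbatch is left commented out.

```shell
# Write the bash version of the batch script to a file (hypothetical name).
cat > retrieve_job.sh <<'EOF'
#!/bin/bash
#SBATCH -J job_name
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 24:00:00
#SBATCH -A project_code
#SBATCH -p hpss
#SBATCH -e job_name.err.%J
#SBATCH -o job_name.out.%J

source /glade/u/apps/opt/slurm_init/hpss.sh

mkdir -p /glade/scratch/username/temp
export TMPDIR=/glade/scratch/username/temp

hsi in input_file
EOF

# Check the syntax locally without running anything.
bash -n retrieve_job.sh && echo "syntax OK"

# Then submit it from the directory that holds input_file:
# sbatch retrieve_job.sh
```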
