/lustre

 

The /lustre filesystem, as its name suggests, is a Lustre v2.5 parallel filesystem with 16PB of total available disk space. Underpinning the filesystem is a Cray Sonexion 2000 Storage System consisting of 12 cabinets containing a total of 5988 4TB SAS disk drives. The cabinets are interconnected by an FDR Infiniband fabric with Fine Grained Routing, in which optimal pathways are used to transport data between compute nodes and OSSes (see below).

There are 73 LNET router nodes providing connectivity from the compute nodes to the storage over an Infiniband Fabric, with a performance capability of over 500GB/sec.

Each cabinet can contain up to 6 Scalable Storage Units (SSU); Shaheen II has a total of 72 SSUs. Each SSU contains 82 disks configured as two GridRAID [41(8+2)+2] arrays. GridRAID is Sonexion's implementation of a Parity Declustered RAID. An SSU also contains 2 embedded SBB form factor servers, configured as Lustre Object Storage Servers (OSS). The OSSes are configured with FDR Infiniband for communicating with the LNET routers. An Object Storage Target (OST) is an ext4 filesystem that is managed by an OSS. These OST ext4 filesystems are effectively aggregated to form the Lustre filesystem. In the case of Shaheen II, there is a one-to-one relationship between OSSes and OSTs. As there are 2 OSS/OSTs for each SSU, this means that there are 144 OSTs in total (use the command lfs osts to list the OSTs).
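The OST count quoted above follows directly from the SSU figures. As a quick check, the arithmetic can be reproduced in the shell; the variable names are illustrative only, and the calculation covers only the drives inside SSUs (it is not meant to reproduce the 5988-drive total quoted earlier).

```shell
# Figures taken from the description above; variable names are illustrative.
ssus=72            # Scalable Storage Units in Shaheen II
oss_per_ssu=2      # embedded OSS servers per SSU (one OST each)
disks_per_ssu=82   # disks per SSU, split across two GridRAID arrays

total_osts=$(( ssus * oss_per_ssu ))            # 72 * 2
ssu_disks=$(( ssus * disks_per_ssu ))           # 72 * 82
disks_per_ost=$(( disks_per_ssu / oss_per_ssu ))  # 82 / 2

echo "OSTs:              $total_osts"
echo "Disks in SSUs:     $ssu_disks"
echo "Disks behind 1 OST: $disks_per_ost"
```

The first value should agree with the number of entries listed by lfs osts.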

[Figure: Lustre layout (Lustre_layout.jpg)]

Lustre Stripe Count

Striping a file across multiple OSTs can significantly improve performance, because the I/O bandwidth is spread over the OSTs in round-robin fashion. The /lustre filesystem default stripe size is 1MB and, following analysis of typical Shaheen file sizes, the default stripe count is 1, i.e. by default each file resides on a single OST. Users can, however, increase the stripe count to any number up to the number of OSTs available (144 for Shaheen II), at either the directory or the file level. When a file is larger than the stripe size (1MB), it is broken down into 1MB chunks that are spread across the specified number (the stripe count) of OSTs.
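To make the round-robin placement concrete, the following minimal sketch computes which of a file's stripes holds a given byte offset. The function name stripe_slot is hypothetical and not part of Lustre; the result is the position within the file's own list of stripes (0 to stripe count minus 1), not the OST number (obdidx), which Lustre assigns when the file is created.

```shell
# stripe_slot OFFSET STRIPE_SIZE STRIPE_COUNT
# Prints the stripe (0 .. STRIPE_COUNT-1) that holds the byte at OFFSET,
# assuming chunks of STRIPE_SIZE bytes are dealt out round robin.
# Hypothetical helper for illustration; it does not query Lustre.
stripe_slot() {
    echo $(( ($1 / $2) % $3 ))
}

# A file with 1MB stripes spread over 4 OSTs:
stripe_slot 0        1048576 4   # first chunk   -> stripe 0
stripe_slot 1048576  1048576 4   # second chunk  -> stripe 1
stripe_slot 4194304  1048576 4   # fifth chunk wraps around -> stripe 0
```

With the default stripe count of 1, every offset maps to stripe 0, which is why an unstriped file is limited to the bandwidth of a single OST.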

Example - Viewing the stripe information for a file

$ lfs getstripe file1
file1
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  19
	obdidx		 objid		 objid		 group
	    19	       14262392	     0xd9a078	             0

where obdidx is the OST number.

Example - Viewing the stripe information for a directory

$ lfs getstripe -d dir1
dir1
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1

Example - Creating a new file with a stripe count of 10

$ lfs setstripe -c 10 file2

Example - Setting the default stripe count of a directory to 4

$ lfs setstripe -c 4 dir1

Example - Creating a new file with a stripe size of 4MB (stripe size value must be a multiple of 64KB)

$ lfs setstripe -s 4M filename2
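Because the stripe size must be a multiple of 64KB, it can be worth checking a value before passing it to lfs setstripe. The helper below is a minimal sketch; the name valid_stripe_size is hypothetical, and it only performs the arithmetic check on a size given in bytes.

```shell
# valid_stripe_size BYTES
# Succeeds (exit status 0) if BYTES is a positive multiple of 64KB (65536 bytes).
# Hypothetical helper for illustration; it does not call lfs.
valid_stripe_size() {
    [ "$1" -gt 0 ] && [ $(( $1 % 65536 )) -eq 0 ]
}

valid_stripe_size 4194304 && echo "4194304 bytes (4MB) is acceptable"
valid_stripe_size 100000  || echo "100000 bytes is not a multiple of 64KB"

# Possible use before creating a file (not run here):
#   valid_stripe_size 4194304 && lfs setstripe -s 4194304 filename2
```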

 

More detailed information about efficient use of Lustre and stripes can be found in our Training slides.

 

Filesystem Layout

The /project and /scratch filesystems are links to subdirectories of /lustre: /lustre/project and /lustre/scratch. The /scratch directory should only be used for temporary data utilised by running jobs, as it is subject to a rigorous purge policy described below. Any files that you need to keep for longer-term use should reside in the /project directory.

Any files created in /project will have a copy made to tape within 8 hours of creation by an automatic process called TAS HSM. The tape copy will become the active version of the file if the file is subject to a purge action (see below). The tape library should be considered an extension of the disk filesystem, but on a slower medium. There is no tape archiving: when a file is removed or overwritten, previous versions will not remain on tape.

Please note that as /scratch is designated as temporary storage, the data is NOT copied to tape.

Purge Policies

  • /scratch/<username> and /scratch/project/<projectname>: files not modified AND not accessed in the last 60 days will be deleted.
  • /scratch/tmp: temporary folder - files not modified AND not accessed in the last 3 days will be deleted.
  • /project/<projectname>: 20 TB limit per project. Once a project has used 20TB of disk storage, files will be automatically deleted from disk with a weighting based on date of last access. Stub files will remain on disk that link to the tape copy, so from a user's perspective the file will still be visible on disk using normal commands such as ls, but will take time to recover from tape back to disk if the file needs to be read.
  • All data in /project/<projectname> and /scratch/project/<projectname> will be deleted permanently 1 month after core hour allocations for the project have expired, unless a further application has been submitted for RCAC consideration.
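To get a rough preview of which files fall under the "not modified AND not accessed" rules above, the standard find command can test both timestamps. This is only an approximation under stated assumptions: the function name purge_candidates is hypothetical, find counts age in whole days, and the actual purge process may apply its criteria differently.

```shell
# purge_candidates DIR DAYS
# Lists regular files under DIR that were neither modified nor accessed
# in more than DAYS days. Hypothetical helper; approximate preview only.
purge_candidates() {
    find "$1" -type f -mtime +"$2" -atime +"$2" -print
}

# Possible use (not run here):
#   purge_candidates "/scratch/$USER" 60
#   purge_candidates /scratch/tmp 3
```

Scanning a very large directory tree generates metadata traffic of its own, so it is better to run this on a specific subdirectory than on the whole of /scratch.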

To query the HSM status of a particular file in /project, run the following command:

$ lfs hsm_state /project/<projectname>/<filename>

A response of this type:

/project/<projectname>/<filename>: exists archived, archive_id:1

indicates that a copy of the file has been made to tape, whereas:

/project/<projectname>/<filename>: released exists archived, archive_id:1

indicates that the file has been purged from disk and only the stub file remains. A simple read action will cause the file to be recalled from tape back to disk.

The fastest and most efficient way to restore multiple files from tape is to use:

$ lfs hsm_restore FILE1 [FILE2 ...]
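When only some files in a directory have been released, the two commands above can be combined so that only those files are restored. The filter below is a sketch that assumes the hsm_state output format shown above (path, a colon, then the state flags); the function name released_paths is hypothetical, and the -r and -d options in the usage comment are GNU xargs options.

```shell
# released_paths
# Reads "lfs hsm_state" output on stdin and prints the path of every
# file whose state includes "released". Hypothetical helper; it assumes
# the "<path>: <flags>" line format shown above.
released_paths() {
    sed -n 's/: [^:]*released.*$//p'
}

# Possible use (not run here; <projectname> is a placeholder):
#   lfs hsm_state /project/<projectname>/* | released_paths \
#       | xargs -r -d '\n' lfs hsm_restore
```

Passing all released files to a single lfs hsm_restore call, rather than reading them one by one, follows the recommendation above for restoring multiple files.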

 

Removing multiple files efficiently

 

Using the standard Linux command rm to delete large numbers of files on a Lustre filesystem is not recommended. Deleting huge numbers of files with rm is very slow and places an increased load on the metadata server, which can make the filesystem unstable and therefore affect all users.

It is recommended to use munlink, an optimized Lustre-specific command, as in the following example:

find ./my_testrun1 -type f -print0 | xargs -0 munlink
  • find ./my_testrun1 -type f -print0 : searches for files (-type f) in the directory my_testrun1 and all its subdirectories, and prints their names separated by null characters (-print0)
  • | xargs -0 munlink : xargs reads the null-separated list of files and passes the names as arguments to munlink. The -0 flag tells xargs to expect null-separated names; if you use -print0 in the find command you must use -0 in the xargs command.

Once all of the files are deleted, the directory and its subdirectories can be deleted as follows:

find ./my_testrun1 -type d -empty -delete
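The two steps can be wrapped in a small shell function so that they are always run together and in the right order. This is a minimal sketch, not a site-provided tool: the function name remove_tree and the UNLINK_CMD override are hypothetical, and the -r option (do nothing when no files are found) is a GNU xargs option.

```shell
# remove_tree DIR
# Step 1: unlink every regular file under DIR with munlink.
# Step 2: remove the directories that are now empty, including DIR itself.
# UNLINK_CMD is a hypothetical override, useful for trying the function
# on a non-Lustre filesystem where munlink is not installed.
remove_tree() {
    [ -d "$1" ] || { echo "remove_tree: not a directory: $1" >&2; return 1; }
    find "$1" -type f -print0 | xargs -0 -r ${UNLINK_CMD:-munlink}
    find "$1" -type d -empty -delete
}

# Possible use (not run here):
#   remove_tree ./my_testrun1
```

As with the original commands, only regular files are unlinked; directories that still contain other entries, such as symbolic links, are left in place by the second step.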