/lustre

 

The /lustre filesystem, as its name suggests, is a Lustre v2.7 parallel filesystem with 17.6PB of total available disk space. Underpinning the filesystem is a Cray Sonexion 2000 Storage System consisting of 12 cabinets containing a total of 5988 4TB SAS disk drives. The cabinets are interconnected by FDR Infiniband Fabric with Fine Grained Routing, where optimal pathways are used to transport data between compute nodes and OSSes (see below).

There are 73 LNET router nodes providing connectivity from the compute nodes to the storage over an Infiniband Fabric, with a performance capability of over 500GB/sec.

Each cabinet can contain up to 6 Scalable Storage Units (SSU); Shaheen II has a total of 72 SSUs. Each SSU contains 82 disks configured as two GridRAID [41(8+2)+2] arrays. GridRAID is Sonexion's implementation of a Parity Declustered RAID. An SSU also contains 2 embedded SBB form factor servers, configured as Lustre Object Storage Servers (OSS). The OSSes are configured with FDR Infiniband for communicating with the LNET routers. An Object Storage Target (OST) is an ext4 filesystem that is managed by an OSS. These OST ext4 filesystems are effectively aggregated to form the Lustre filesystem. In the case of Shaheen II, there is a one-to-one relationship between OSSes and OSTs. As there are 2 OSS/OSTs for each SSU, this means that there are 144 OSTs in total (use the command lfs osts to list the OSTs).
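
For example, the OSTs that make up /lustre can be listed with the command below; each line of output typically shows an OST index, its UUID, and its status (the exact format depends on the Lustre version):

$ lfs osts /lustre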

[Figure: Lustre filesystem layout (Lustre_layout.jpg)]

Lustre Stripe Count

Striping a file across multiple OSTs can significantly improve performance, because the I/O bandwidth is spread over the OSTs in a round-robin fashion. The /lustre filesystem default stripe size is set at 1MB and, following analysis of typical Shaheen file sizes, the default stripe count is set to 1, i.e. by default an individual file resides on one OST only. The stripe count, however, can be increased by users to any number up to the maximum number of OSTs available (144 for Shaheen II). This can be done at the directory or file level. When the size of the file is greater than the stripe size (1MB), the file is broken down into 1MB chunks and spread across the specified (stripe count) number of OSTs.

Example - Viewing the stripe information for a file

$ lfs getstripe file1
file1
lmm_stripe_count:  8
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 130
    obdidx         objid         objid         group
       130         408781270       0x185d81d6                 0
       104         419106989       0x18fb10ad                 0
       102         411979658       0x188e4f8a                 0
        36         403139579       0x18076bfb                 0
       112         409913235       0x186ec793                 0
        27         408240053       0x18553fb5                 0
        72         407370211       0x1847f9e3                 0
        97         403688203       0x180fcb0b                 0

In this example, the file is striped across 8 OSTs with a stripe size of 1 MB. The obdidx numbers listed are the indices of the OSTs used in the striping of this file.

Example - Viewing the stripe information for a directory

$ lfs getstripe -d dir1
dir1
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1

Example - Creating a new file with a stripe count of 10

$ lfs setstripe -c 10 file2

Example - Setting the default stripe count of a directory to 4

$ lfs setstripe -c 4 dir1

Example - Creating a new file with a stripe size of 4MB (stripe size value must be a multiple of 64KB)

$ lfs setstripe -s 4M filename2
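
Stripe count and stripe size can also be set together in a single command. The following is an illustrative example (the filename and the values chosen here are arbitrary):

$ lfs setstripe -c 8 -s 4M file3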

Note: Once a file has been written to Lustre with a particular stripe configuration, you cannot simply use setstripe to change it. The file must be re-written with a new configuration. Generally, if you need to change the striping of a file, you can do one of two things:

  • using setstripe, create a new, empty file with the desired stripe settings and then copy the old file to the new file (see the example below), or
  • set up a directory with the desired configuration and cp (not mv) the file into the directory
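
For example, a file could be re-striped to a stripe count of 8 by creating a new file with the desired layout, copying the data into it and then replacing the original (the filenames are illustrative):

$ lfs setstripe -c 8 file1.new
$ cp file1 file1.new
$ mv file1.new file1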

General Considerations

Large files benefit from higher stripe counts. By striping a large file over many OSTs, you increase bandwidth for accessing the file and can benefit from having many processes operating on a single file concurrently. Conversely, a very large file that is only striped across one or two OSTs can degrade the performance of the entire Lustre system by filling up OSTs unnecessarily. A good practice is to have dedicated directories with high stripe counts for writing very large files into.

Another scenario to avoid is having small files with large stripe counts. This can be detrimental to performance due to the unnecessary communication overhead to multiple OSTs. A good practice is to make sure small files are written to a directory with a stripe count of 1—effectively, no striping.
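
For example, dedicated directories for large and small files could be set up as follows (the directory names and the stripe count of 16 are only illustrative):

$ mkdir big_files small_files
$ lfs setstripe -c 16 big_files
$ lfs setstripe -c 1 small_files

New files created in big_files will then be striped across 16 OSTs by default, while files written to small_files will not be striped.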

More detailed information about efficient use of Lustre and stripes can be found in our Training slides.

 

Filesystem Layout

The /project and /scratch filesystems are links to subdirectories of /lustre - /lustre/project and /lustre/scratch. The /scratch directory should only be used for temporary data utilised by running jobs, as it is subject to a rigorous purge policy described below. Any files that you need to keep for longer-term use should reside in the /project directory.

Any files created in /project will have a copy made to tape within 8 hours of creation by an automatic process called TAS HSM. The tape copy will become the active version of the file if said file is subject to a purge action (see below). The tape library should be considered as an extension of the disk filesystem, but on slower media. Note that this is not a versioned archive: when a file is removed or overwritten, previous versions will not remain on tape.

Please note that as /scratch is designated as temporary storage, the data is NOT copied to tape.

Purge Policies

  • /scratch/<username> and /scratch/project/<projectname>: files not modified AND not accessed in the last 60 days will be deleted (see the example after this list).
  • /scratch/tmp: temporary folder - files not modified AND not accessed in the last 3 days will be deleted.
  • /project/<projectname>: 20 TB limit per project. Once a project has used 20TB of disk storage, files will be automatically migrated from disk to tape with a weighting based on date of last access. Stub files will remain on disk that link to the tape copy, so from a user's perspective the file will still be visible on disk using normal commands such as ls, but will take time to recover from tape back to disk if the file needs to be read.
  • all data in /project/<projectname> and /scratch/project/<projectname> will be deleted permanently 1 month after core hour allocations for the project have expired unless a further application has been submitted for RCAC consideration.
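
To get an idea of which files in your /scratch area would currently fall under the 60-day purge policy, a find command along the following lines can be used (the path is a placeholder and the test is only an approximation of the policy):

$ find /scratch/<username> -type f -atime +60 -mtime +60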

To query the HSM status of a particular file in /project, run the following command:

$ lfs hsm_state /project/<projectname>/<filename>

A response of this type:

/project/<projectname>/<filename>: exists archived, archive_id:1

indicates that a copy of the file has been made to tape, whereas:

/project/<projectname>/<filename>: released exists archived, archive_id:1

indicates that the file has been purged from disk and only the stub file remains. A simple read action will cause the file to be recalled from tape back to disk.

The fastest and most efficient way to restore multiple files from tape is to use:

$ lfs hsm_restore FILE1 [FILE2 ...]
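
If a large number of files needs to be restored, the list can be generated with find and passed to lfs hsm_restore, for example (the directory path is illustrative; note that this issues restore requests for every file under the directory, including files that are already on disk):

$ find /project/<projectname>/run1 -type f -print0 | xargs -0 lfs hsm_restore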

 

Removing multiple files efficiently

 

Using the standard Linux command rm to delete multiple files on a Lustre filesystem is not recommended. Deleting huge numbers of files with rm is very slow and puts an increased load on the metadata server, which can lead to filesystem instability and therefore affect all users.

It is recommended to use munlink, an optimized Lustre-specific command, as in the following example:

find ./my_testrun1 -type f -print0 | xargs -0 munlink
  • find ./my_testrun1 -type f -print0 : searches for regular files (-type f) in the directory my_testrun1 and all its subdirectories, printing each filename terminated by a null character (-print0)
  • | xargs -0 munlink : xargs converts the resulting list of files into arguments for munlink. The -0 flag tells xargs to expect null-terminated input; if you use -print0 in the find command you must use -0 in the xargs command.

Once all of the files are deleted, the directory and its subdirectories can be deleted as follows:

find ./my_testrun1 -type d -empty -delete

 

Quotas

At present, quotas are implemented only for the file count. You can check your current usage with:

$ lfs quota -uh <user> /lustre
Disk quotas for user <user> (uid <UID>):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
        /lustre  2.586T      0k      0k       -  609893       0 1000000       -
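
In this example, the user has 609893 files against a file quota limit of 1000000; the disk space quota and limit are shown as 0k, meaning that no disk space quota is currently enforced.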