
Policy User Guide

Table of Contents

1. AFRL DSRC Use Policy

1.1. Interactive Use

Interactive usage of each of the AFRL DSRC High Performance Computing (HPC) systems is restricted to 15 minutes of CPU time per processor. Any interactive job/process that exceeds this 15-minute CPU time limit will automatically be killed by system monitoring software. Interactive usage on AFRL DSRC HPC systems should be limited to activities such as program development (including debugging and performance improvement), job preparation, job submission, and the preprocessing and post-processing of data.

AFRL DSRC HPC systems have been tuned for optimal performance in a batch system environment. Excessive interactive usage overloads these systems and considerably degrades system performance.

1.2. Session Lifetime

To provide users with a more secure high performance computing environment, the AFRL DSRC has implemented changes that limit the lifetime of all terminal/window sessions. Any idle terminal or window session connection to the AFRL DSRC shall be terminated after 10 hours. Regardless of activity, any terminal or window session connection to the AFRL DSRC shall be terminated after 20 hours. A 15-minute warning message shall be sent to each such session prior to its termination.

1.3. Batch Use

The primary resource used to schedule jobs on all systems is the CPU-hour (CPH). The CPH limit per user per project for a system is defined by multiplying half of the advertised number of system processors by forty-eight (48) hours; this value may be rounded for convenience. For example, a system composed of 2048 processors distributed among its nodes would have a CPH limit of 1024 processors times 48 hours, or 49,152 CPU-hours, which may be rounded to approximately 50,000 CPU-hours.
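As a worked illustration of this formula only, the following shell sketch computes the limit for a hypothetical 2048-processor system; the processor count is an example value, not a statement about any particular AFRL DSRC system.

    # Sketch of the CPH limit formula: (advertised processors / 2) * 48 hours.
    PROCS=2048                          # illustrative processor count
    CPH_LIMIT=$(( (PROCS / 2) * 48 ))   # 1024 * 48 = 49,152 CPU-hours (~50,000 after rounding)
    echo "CPH limit: ${CPH_LIMIT} CPU-hours"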

In the case where a system has nodes with more memory (large-memory nodes) than other nodes, the scheduler will place jobs requiring more memory than the default memory per processor on the large-memory nodes.

Due to limitations in resource checking, the scheduler will allocate processors and memory as multiples of an indivisible process unit (consisting of processors or cores and an amount of shared memory); in some cases, this process unit will consist of more than one processor or core. If the requested number of processors and memory is not an even multiple of process units, these resource requests will be increased to match the equivalent number of process units.

Although every attempt will be made to keep entire systems available, interrupts will occur, and more frequently on nodes with larger numbers of processors. To protect against system interrupts, users should use mechanisms to save the state of their jobs where available; most AFRL DSRC-supported applications can create restart files so that runs do not have to start from the beginning. Users running long jobs without saving job state run at risk with respect to system interrupts. Use of system-level checkpointing is not recommended.

All HPC systems have identical queue names: standard, debug, and background; however, each queue has different properties, as specified in Table 1. Each of these queues is assigned a priority factor within the batch system. Within the standard queue, job classes (urgent, high, frontier, and standard) are also assigned a priority factor. The relative priorities of the queues and job classes are shown in Table 2. In addition, jobs using more processors receive higher priority within a given job class. Job scheduling uses slot reservation based on these priority factors and increases system utilization via backfilling while jobs wait for resources to become available.

Mustang - 56,448 Cores - HPE SGI 8600
(Queues are listed from highest to lowest priority.)

Priority   Queue Name   Job Class    Max Wall Clock Time   Max Cores Per Job   Comments
Highest    urgent       Urgent       168 Hours             28,224              Jobs belonging to DoD HPCMP Urgent Projects.
           debug        Debug        1 Hour                1,152               User testing.
           high         High         168 Hours             28,224              Jobs belonging to DoD HPCMP High Priority Projects.
           frontier     Frontier     168 Hours             28,224              Jobs belonging to DoD HPCMP Frontier Projects.
           standard     Standard     168 Hours             28,224              Standard jobs.
           transfer     N/A          48 Hours              1                   Data transfer for user jobs.
Lowest     background   Background   120 Hours             48                  Unrestricted Access - no allocation charge.

In conjunction with the HPCMP Baseline Configuration policy for Common Queue Names across the allocated centers, the AFRL DSRC will honor batch jobs that specify the queue names urgent, high (high priority), or frontier. Although the Job Class priority is still assigned by the project, a batch script that requests one of the Common Queue Names will be accepted by the queuing system. Note: if the project number does not match the requested Job Class or queue, the job will run in the class assigned by the project number, not in the queue selected by name.
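As an illustration of requesting one of the Common Queue Names, the sketch below assumes a PBS-style batch system; the directive syntax, project ID, executable, and resource values are illustrative assumptions, not values prescribed by this policy.

    #!/bin/bash
    ## Hedged sketch only; adapt the directives to the scheduler on your system.
    #PBS -q frontier              # request the Common Queue Name "frontier"
    #PBS -A MYPROJECT01           # hypothetical project ID; the project determines the actual job class
    #PBS -l walltime=24:00:00     # stay within the queue's maximum wall clock time
    ./my_application              # hypothetical executable

If the project ID does not correspond to a Frontier project, the job still runs, but in the class assigned by the project number, as noted above.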

Any project with an allocation may submit jobs to the background queue. Projects that have exhausted their allocations will only be able to submit jobs to the background queue. A background job cannot start if there is a foreground job waiting in any queue.

If any job attempts to use more resources than were specified when the job was submitted for batch processing, the center staff reserves the right to kill that job and provide the user with a rationale. Center staff also reserves the right to manage system utilization as they deem appropriate.

1.4. Special Requests

All special requests for allocated HPC resources, including increased priority within queues, increased queue parameters for maximum number of CPUs and wall time, and dedicated use, should be directed to the HPC Help Desk. Request approval requires documentation of the requirement and associated justification, verification by the AFRL DSRC Computational Technology Center staff and PET lead, and approval from the designated authority, as shown in the following table. The AFRL DSRC Director may permit special requests for HPC resources independent of this model under exceptional circumstances.

Approval Authorities for Special Resource Requests

Resource Request                                               Approval Authority
Up to 10% of an HPC system/complex for 1 week or less          AFRL DSRC Director or Designee
Up to 20% of an HPC system/complex for 1 week or less          S/AAA
Up to 30% of an HPC system/complex for 2 weeks or less         Army/Navy/AF Service Principal on HPC Advisory Panel
Up to 100% of an HPC system/complex for greater than 2 weeks   HPCMP Program Director or Designee

1.5. Contact Information

If you have any questions concerning this policy, please contact the HPC Help Desk at 1-877-222-2039 or via email at help@helpdesk.hpc.mil.

1.6. Subject to Change Notice

The policies set forth in this document are subject to change without prior notice.

2. User Account Removal Policy

This policy covers the disposition or removal of user data when the user is no longer eligible for a given HPCMP account on any one or more systems in the HPCMP inventory.

At the time a user becomes ineligible for an HPCMP user account, the user's access to that account will be disabled.

The user and the Principal Investigator (PI) are responsible for arranging for the disposition of the data prior to account deactivation. The user may request special assistance or specific exemptions or extensions, based on such criteria as availability of resources, technical difficulties or other special needs. If the user does not request any assistance, then the respective center will promptly contact the user, the PI of the project, and the responsible S/AAA to determine the proposed disposition of the user's data. All data disposition actions will be performed as specified in the HPCMP's Data Protection Policy. If the center is unable to reach the aforementioned individuals, or if the contacted person(s) does not respond before the account is deactivated, the user's data stored on systems or home directories will be moved to archive storage, and one of the following two cases must hold:

  1. User has an account at another HPCMP center. In this case, the user, the PI of the project, or the responsible S/AAA, as appropriate, has one year to arrange to move the data from the archive to the HPCMP center where the user has an active account. After this time period has expired, the center may delete the user's data.
  2. User does not have an account at another HPCMP center. In this case, the user, the PI of the project, or the responsible S/AAA, as appropriate, has one year to arrange to retrieve the data from HPCMP resources. After this time period has expired, the center may delete the user's data.

Following the disposition of the user's data, the user account will be removed from the system.

In special cases, such as, but not limited to, security incidents or HPCMP resource abuse, access to a user account may be immediately prohibited and/or user data deleted, as appropriate for the circumstances as judged by the center or HPCMP.

Please note the following: exceptions to this general data disposition policy can and will be made as necessary, within the center's ability to fulfill such requests and given reasonable justification as judged by the center. Also, contracts requiring data maintenance beyond the conditions of this data disposition policy cannot be accommodated by the center if the center is not a signatory to the contract; such contracts may, however, be considered when exceptions are requested.

If you have any questions concerning this policy, please contact the HPC Help Desk Accounts Center at 1-877-222-2039 or via email at help@helpdesk.hpc.mil.

3. Information Interchange Policy

The key methods that the AFRL DSRC uses to communicate announcements and important information to our users about HPC systems and the environment include:

  • Mass emails sent to all users or those assigned to a particular HPC system
  • Maintenance notices posted on the AFRL DSRC public site at https://www.afrl.hpc.mil
  • Maintenance notices posted on the HPC Centers public site at https://centers.hpc.mil
  • System login messages posted to the appropriate HPC systems.

It is vital to the AFRL DSRC's communication process, and mutually beneficial to our users, that users understand the responsibilities of being a good citizen of the AFRL DSRC. We ask that users:

  • Please keep the AFRL DSRC apprised of your current email address so that we can ensure vital information about our Center reaches you. Please contact your S/AAA to have your email address updated. Note that if the email address you give us is behind a firewall, you will need to arrange for your local system administrator to allow email to pass through the firewall boundary between your work site and the AFRL DSRC.
  • Please check the website, which has up-to-date news and information on topics such as HPC resource availability, upcoming training opportunities, and updates to our user guides and the policies and procedures documentation.

Comments and questions are welcome and may be submitted by contacting the HPC Help Desk at 1-877-222-2039 or via email at help@helpdesk.hpc.mil.

4. System Availability Policy

A system will be declared down and made unavailable to users whenever a chronic and/or catastrophic hardware and/or software malfunction or an abnormal computer environment condition exists which could:

  1. Result in corruption of user data.
  2. Result in unpredictable and/or inaccurate runtime results.
  3. Result in a violation of the integrity of the DSRC user environment.
  4. Result in damage to the High Performance Computer System(s).

The integrity of the user environment defined in the AFRL DSRC User Guide is considered corrupted any time a user must modify his or her normal operation while logged into the DSRC. Examples of malfunctions are:

  1. User home ($HOME) directory not available.
  2. User Workspace ($WORKDIR, $JOBDIR) areas not available.
  3. Archive system not available (in this case, queues are suspended, but logins remain enabled).

When a system is declared down, based on a system administrator's and/or computer operator's judgment, users will be prevented from using the affected system(s) and all existing batch jobs will be prevented from running. Batch jobs held during a "down state" will be run only after the system environment returns to a normal state.

Whenever there is a problem on one of the HPC systems that could be remedied by removing a part of the system from production (an activity called draining), it must first be determined how much of the system will be impacted by the draining in order to brief the necessary levels of management and the user community.

Where the architecture of the HPC system allows a node to be removed from production with minimal impact to the system as a whole, the system administrators can make the decision to remove the node, with notification to the operators for information. Typically this pertains to cluster system architectures. In some cases, large SMP systems allow individual CPUs to be taken down; the administrator can make this determination and notify operations for information.

Where the architecture of the HPC system will allow significant portions of the system to be removed from production and still allow user production on a large part of the system to continue, then the system administrator along with government and contractor management can make the decision to remove that part of the system. The system should show that domain or SMP node as out of the normal queue for scheduling jobs so that the user community can determine current status. The system administrator will advise operations and the help center of this action.

In cases where workspace will be unavailable, or a complete system needs to be drained for maintenance, contractor and government director-level management will be notified. In cases involving an entire system, user services will email users the downtime schedule and the schedule for returning the system to production.

If you have any questions concerning this policy, please contact the HPC Help Desk at 1-877-222-2039 or via email at help@helpdesk.hpc.mil.

5. Workspace Policy

5.1. $WORKDIR

$WORKDIR is the local temporary file system (i.e., local high-speed disk) that is available on all AFRL DSRC high performance computing (HPC) systems and is available to all users.

$WORKDIR is not intended for use as a permanent file storage area by users.

$WORKDIR is intended for use by executing programs to perform file I/O that is local to that system in order to avoid file systems with space restrictions, such as a user's home ($HOME) or /tmp directories.

The $WORKDIR file system is NOT backed up or exported to any other system. In the event of file or directory structure deletion or a catastrophic disk failure, such files and directory structures are lost.

It is the user's responsibility to transfer files that need to be saved to a location that allows for permanent file storage, such as the user's archival ($ARCHIVE_HOME) or home ($HOME) directory locations. Please note that a user's archival storage area has no disk quota assigned to it, while a user's home directory area has a disk quota assigned.

5.2. Creation and Access of User $WORKDIR Directory

Each user is assigned ownership of a $WORKDIR sub-directory named $WORKDIR/username, where username is the user's AFRL DSRC login name. This sub-directory will be created for the user at login via the AFRL DSRC global .cshrc file whenever appropriate.

The environment variable $WORKDIR is created to point to the user's $WORKDIR/username directory. For example, to access $WORKDIR from the command line, type "cd $WORKDIR".

When a batch job is executed, the environment variable $JOBDIR is created and points to the user's $WORKDIR/username/jobid directory, where jobid is the job identifier number assigned by the batch submission process.
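As a brief illustration (the paths and job ID shown are hypothetical; actual mount points differ by system):

    # Interactively, after login, $WORKDIR points to your personal workspace directory:
    cd $WORKDIR
    pwd                    # e.g., /workdir/username (illustrative path only)

    # Inside a running batch job, $JOBDIR points to a per-job sub-directory of that workspace:
    echo $JOBDIR           # e.g., /workdir/username/12345, where 12345 is a hypothetical job ID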

It is recommended that user batch jobs perform the following steps:

  1. Copy needed input data files from your archive ($ARCHIVE_HOME) directory or home ($HOME) directory to either $WORKDIR or $JOBDIR. (Using $JOBDIR is recommended.)
  2. Execute your program.
  3. Copy output data files to be saved from $WORKDIR or $JOBDIR to either your archive ($ARCHIVE_HOME) directory or your home ($HOME) directory. Then delete the files from $WORKDIR and $JOBDIR in order to keep the $WORKDIR file system from becoming too full.

Sample batch submission scripts incorporating these steps are available for each HPC system.
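As a minimal sketch of these steps, assuming a PBS-style batch system and a hypothetical application and data set (the directives, file names, and project ID below are illustrative, not center-provided values):

    #!/bin/bash
    ## Hedged sketch of the recommended workflow; adapt the directives to your system's scheduler.
    #PBS -q standard                       # queue name
    #PBS -A MYPROJECT01                    # hypothetical project ID
    #PBS -l walltime=04:00:00              # requested wall clock time

    # Step 1: copy needed input files from permanent storage into the per-job workspace.
    cd $JOBDIR
    cp $ARCHIVE_HOME/case01/input.dat .    # hypothetical input file

    # Step 2: execute the program.
    ./my_solver input.dat > output.dat     # hypothetical executable and output file

    # Step 3: copy results back to permanent storage, then clean up the workspace.
    cp output.dat $ARCHIVE_HOME/case01/
    rm -f input.dat output.dat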

5.3. $WORKDIR Maintenance

In order to provide sufficient free $WORKDIR disk space to our users, the following $WORKDIR maintenance policy was implemented on all HPC systems:

  • A $WORKDIR scrubber program will run every day on all HPC systems.
  • All files and directory structures in the $WORKDIR directory location that are older than 30 days are subject to deletion. $JOBDIR is part of $WORKDIR but is an exception, as noted below. (A way to check the age of your own files is sketched after this list.)
  • All files and directory structures in the $JOBDIR directory location are subject to deletion once they have aged more than 30 days after the completion of the batch job associated with the $JOBDIR directory.

    For each system, a value of 30 days was selected that would be large enough to allow users to retain temporary files and directory structures in $WORKDIR but small enough to prevent, except under periods of unusually high volume, the need to delete files and directory structures less than 30 days old. Because workload varies, system administrators may need, on occasion, to delete $WORKDIR files and directory structures less than 30 days old, until sufficient disk space is freed up. To minimize the times when early deletion of $WORKDIR files and directory structures is required, users are encouraged to use $WORKDIR efficiently and economically.

  • If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, we will notify you via email 6 days prior to deletion.

    However, if a critical disk space shortage occurs, notifications will be sent only if time permits.
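As an illustrative way to see which of your $WORKDIR files are approaching the 30-day threshold (this is a user-side convenience sketch, not part of the center's scrubber):

    # List files under your workspace that have not been modified in more than 30 days.
    # The scrubber's actual age criterion may differ (e.g., it may also consider access time).
    find $WORKDIR -type f -mtime +30 -ls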

If you have any questions concerning this policy, please contact the HPC Help Desk at 1-877-222-2039 or via email at help@helpdesk.hpc.mil.

6. Data Import and Export Policy

This policy outlines the methods available to users to move files into and out of the AFRL DSRC environment. Users accept sole responsibility for the transfer and validation of their data after the transfer.

6.1. Network File Transfer

The preferred transfer method is file transfer over the network using the encrypted (Kerberos) file transfer programs rcp, scp, or ftp. In cases where large numbers of files (> 1,000) and/or large amounts of data (> 100 GBytes) must be transferred, users should contact the HPC Help Desk for assistance with the process. Depending on the nature of the transfer, transfer time may be improved by reordering the data retrieval from tapes, taking advantage of available bandwidth to/from the Center, or dividing the transfer into smaller parts; the AFRL DSRC staff will assist users to the extent they are able. Limitations such as available resources and network problems outside the Center can be expected, and users should allow sufficient time for their transfers.
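As a simple illustration of such a transfer (the destination host, user name, and file names are hypothetical, and the Kerberized tools at the Center may require prior authentication, such as a valid Kerberos ticket):

    # Bundle many small files into a single archive before transferring (run on the HPC system):
    tar -czf results.tar.gz $WORKDIR/case01                   # hypothetical results directory
    # Copy the archive to a remote workstation over the network:
    scp results.tar.gz user@workstation.example.org:/data/    # hypothetical destination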

6.2. Reading/Writing Media

There are currently no facilities or provisions available to import or export user data on tape using the resources of the mass storage/archival system.

If you have any questions concerning this policy, please contact the HPC Help Desk at 1-877-222-2039 or via e-mail at help@helpdesk.hpc.mil.

7. Password Sharing Policy

Users are responsible for all password(s), account(s), YubiKey(s), and associated Personal Identification Number (PIN(s)) issued to them. Users are not to share their password(s), account(s), YubiKey(s), or PIN(s) with any other individual for any reason. Doing so is a violation of the contract that users are required to sign in order to obtain access to DoD High Performance Computing Modernization Program (HPCMP) computational resources.

Upon discovery/notification of a violation of the above policy, the following actions will be taken:

  1. The account (i.e., username) will be disabled. No further logins will be permitted.
  2. All account assets will be frozen. File and directory permissions will be set such that no other users can gain access to the account assets.
  3. Any executing jobs will be permitted to complete; however, any jobs residing in input queues will be deleted.
  4. The Service/Agency Approval Authority (S/AAA) who authorized the account will be notified of the policy violation and the actions taken.

Upon the first occurrence of a violation of the above policy, the S/AAA has the authority to request that the account be re-enabled. Upon the occurrence of a second or subsequent violation of the above policy, the account will only be re-enabled if the user's supervisory chain of command, S/AAA, and the High Performance Computing Modernization Office (HPCMO) all agree that the account should be re-enabled.

The disposition of account assets will be determined by the S/AAA. The S/AAA can:

  1. Request that account assets be transferred to another account.
  2. Request that account assets be returned to the user.
  3. Request that account assets be deleted and the account closed.

If there are associate investigators who need access to AFRL DSRC computer resources, we encourage them to apply for an account. Separate account holders may access common project data as authorized by the project leader.

If you have any questions or concerns regarding this policy, please feel free to contact the HPC Help Desk at 1-877-222-2039 or by e-mail at help@helpdesk.hpc.mil.