The batch queue system currently installed is Sun Grid Engine.
Batch queues will be adjusted as necessary to optimize fairness and system usage, and to allow hardware maintenance.
Currently, there are two main queues:
We strongly discourage using a single CPU of a parallel machine. Single-CPU jobs use resources that would be more effectively used by other users' parallel jobs. Users who need computing cycles but are unwilling to parallelize their applications should make other arrangements, such as farming the unused computing cycles of systems belonging to their colleagues. A small number of single-CPU jobs will be tolerated, but only if they are not blocking resources needed by parallel jobs. Array jobs are preferable to multiple single-CPU jobs.
We do not currently prefer any particular method of parallelization. For example, shared memory jobs could use pthreads or OpenMP. MPI can be used to make jobs more flexible and able to use more than one node. Embarrassingly parallel jobs are also acceptable, but should be structured as an array job to minimize job scheduling overhead.
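As a rough illustration of structuring embarrassingly parallel work as an array job, the sketch below lets each task pick its own share of the inputs from its task index. Grid Engine sets the SGE_TASK_ID and SGE_TASK_LAST environment variables for array tasks. The helper names and the input file names are hypothetical, and the sketch assumes the array was submitted with task IDs running from 1 to N in steps of 1.

```python
import os


def env_int(name, default):
    """Read a non-negative integer from the environment, falling back to a default.

    Outside an array job Grid Engine may set SGE_TASK_ID to a non-numeric
    value, so anything that is not purely digits uses the default.
    """
    value = os.environ.get(name, "")
    return int(value) if value.isdigit() else default


def task_slice(items, task_id, n_tasks):
    """Return the items handled by 1-based task_id out of n_tasks tasks."""
    if not 1 <= task_id <= n_tasks:
        raise ValueError("task_id must be between 1 and n_tasks")
    # Strided split: task 1 takes items 0, n, 2n, ...; task 2 takes 1, n+1, ...
    return items[task_id - 1::n_tasks]


if __name__ == "__main__":
    task_id = env_int("SGE_TASK_ID", 1)
    n_tasks = env_int("SGE_TASK_LAST", 1)
    # Hypothetical input list; a real job would list its own data files.
    inputs = [f"case_{i:03d}.dat" for i in range(10)]
    for name in task_slice(inputs, task_id, n_tasks):
        print(f"task {task_id}/{n_tasks}: would process {name}")
```

A script like this would typically be submitted once with the qsub -t option (for example -t 1-4), so the scheduler handles one array job instead of many separate single-CPU jobs. The queue options appropriate for this cluster are not shown here.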
Nodes will be powered back on to match the needs of the batch queue system. Clusters with a large number of nodes that have gone unused for a long time will be deeply powered down, and powering those nodes back on requires manual intervention.
Each user will be allowed a maximum of 100G of space per head-node.
Exceptions can be made on request, but the disk space is limited and
not easily expandable. Disk quotas are implemented and enforced.
You can view your current disk quota with the command
quota -vs
You should get a warning when you log in if you are exceeding your soft limit.
When the soft limit is exceeded, you have one week to go below it.
When the time expires or if you exceed your hard limit, you will not be able
to write any more data to the disk, and editing files will cause data loss.
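The soft limit, hard limit, and grace period described above can be summarized as a small decision function. This is only a sketch of the rules as stated here, not the quota system's own code. The function name, the state labels, and the numbers in the usage line are hypothetical, and the policy does not say whether the 100G figure is the soft or the hard limit.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(weeks=1)  # "one week to go below it"


def quota_state(used, soft, hard, over_soft_since=None, now=None):
    """Classify disk usage as 'ok', 'grace', or 'blocked'.

    used, soft, and hard share one unit (for example gigabytes).
    over_soft_since is when usage first exceeded the soft limit, if known.
    """
    now = now or datetime.now()
    if used >= hard:
        return "blocked"  # hard limit reached: no more writes
    if used <= soft:
        return "ok"
    if over_soft_since is not None and now - over_soft_since > GRACE_PERIOD:
        return "blocked"  # grace period expired while over the soft limit
    return "grace"  # over the soft limit, but still within the week


# Hypothetical limits: soft 90, hard 100, currently using 95.
print(quota_state(95, 90, 100, over_soft_since=datetime.now() - timedelta(days=2)))
```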
Note that users must comply with the data retention policy, and not store large data files on the head nodes long term. Users violating this may get their files archived and their disk quota reduced.
Space on newton will be less limited, but you may need to ask for more if you run out. Newton also will not be backed up, but the RAID arrays on newton have a somewhat higher level of redundancy.
Not all users currently have newton accounts, but any user can get one by asking.
Data older than 6 months left on cluster head nodes will be archived and removed without warning, but with notice after the fact. Accounts that allow the grace period to expire after exceeding the soft quota limit may be archived before the 6 months expire. If accounts are archived, most likely the entire account will be archived (including current files) and all the files will be removed from the cluster head node. Archived files will be restored on newton only by specific request. If space on head nodes becomes short, some users may be asked to reduce space use or move files to newton. Accounts found storing data long term on head nodes may find their disk quota reduced below the default limits without warning.
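To stay ahead of the 6 month limit, it can help to list your own old files before they are archived. The sketch below walks a directory tree and reports files whose modification time is older than a cutoff. The policy does not say how file age is measured, so using the modification time and approximating 6 months as 183 days are both assumptions, and the function name is hypothetical.

```python
import os
import time

SIX_MONTHS_SECONDS = 183 * 24 * 3600  # rough stand-in for "6 months"


def old_files(root, max_age_seconds=SIX_MONTHS_SECONDS, now=None):
    """Return (path, size_in_bytes) for files under root older than the cutoff."""
    now = time.time() if now is None else now
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # do not follow symbolic links
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if now - info.st_mtime > max_age_seconds:
                found.append((path, info.st_size))
    return found


if __name__ == "__main__":
    for path, size in old_files(os.path.expanduser("~")):
        print(f"{size:>12d}  {path}")
```

Files reported this way are candidates for moving to newton or deleting.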
Events that impact current and future use of the cluster may also be put in the message of the day (viewed at each login) for a short period. Versions of these messages will also be posted in the blog.
Also, occasionally unexpected events occur (such as frequent power outages during Florida's lightning season). To keep mailing list traffic low, the mailing list will not usually be notified of such unplanned events. However, all events will be posted in the I2Lab blog, currently at http://www.i2lab.ucf.edu/blog. You can view the blog either as a web page or as an RSS feed. Ask for assistance if you would like help finding an appropriate RSS reader.
Events will be posted in the blog as soon as reasonable (usually as soon as we know about planned events, and within an hour to half a day after we find out about unplanned events).
Note that compute nodes may be rebooted without warning, although efforts will be made to avoid this while jobs are running. Head nodes will only be rebooted when absolutely necessary. Most (but not all) head nodes are on backup power; most compute nodes are not. Jobs using the batch queue system usually survive a head node reboot.
If other users might use the software, or you need assistance in installing it, you can ask for it to be installed in a system directory. Some system directories can be distributed to each compute node to improve performance (software must be in an RPM package to be distributed). If multiple users install the same large software package, they may be asked to switch to a shared version in a system directory.
Please note that the I2Lab cannot be responsible for adherence to software licenses unless the software was bought by the I2Lab. Installation support will be limited for software that requires license agreements to be signed.
If you are unsure if a software package is already installed, please ask. A partial list can be found at:
However, sometimes it is helpful to access a node your job is already running on for debugging purposes. As long as this access is not abused, it will remain open.
Users are also strongly discouraged from leaving open shells or idle processes on compute nodes when they are not currently running a job. Such idle shells will likely be automatically closed without notice.
If you have problems with another user's job disrupting or slowing your job down, please notify me.