Computing Resources

Computation

The datamining cluster has seven machines. All are Pentium IIIs running Red Hat Linux 7.3:

Name      Type                 Speed    Memory
helios    Dell PowerEdge 1550  1 GHz    1 GB
rhea      Dell PowerEdge 1550  1 GHz    1 GB
apollo    Dell PowerEdge 1550  1 GHz    1 GB
calypso   Dell PowerEdge 1550  1 GHz    1 GB
poseidon  Dell PowerEdge 1550  1 GHz    1 GB
vulcan    Dell Tower           800 MHz  512 MB
zeus      Dell Server (??)     1 GHz    1.5 GB

The first five (helios, rhea, apollo, calypso, poseidon) are for general computation; long-running processes should typically be run on these. Vulcan is intended for interactive work: logging in, editing, compiling, and so on. Zeus is the file server. All bets are off near a paper deadline, however: if you really need something finished, use any machine at any time, as long as you negotiate with others who need things finished at the same time.
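When starting a long job, it can help to check how busy each compute machine is first. Below is a rough Python sketch of one way to do this; it assumes passwordless ssh between the cluster machines, and the script itself is only an illustration, not an existing tool.

    # Check the one-minute load average on each general-computation machine
    # so a long-running job can be started on the idlest one.
    # Assumes passwordless ssh between the cluster machines.
    import subprocess

    compute_nodes = ["helios", "rhea", "apollo", "calypso", "poseidon"]

    loads = {}
    for node in compute_nodes:
        # The first field of /proc/loadavg is the one-minute load average.
        out = subprocess.check_output(["ssh", node, "cat", "/proc/loadavg"]).decode()
        loads[node] = float(out.split()[0])

    # Print the machines from least to most loaded.
    for node in sorted(loads, key=loads.get):
        print("%-10s load %.2f" % (node, loads[node]))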

Disk Space

The datamining cluster has about 1 TB of disk space divided into a series of network-accessible partitions that are mounted under /projects/dm/ (we may want to work on the names some):

/projects/dm/cold/
A 360 GB RAID 5 partition served from zeus (possibly less some space for the OS, etc.). This space is for archiving tarred and gzipped versions of our important data sets. The data here is protected by the redundant RAID, but is not regularly backed up. If you want something here backed up (and it is a very good idea to want our data sets backed up), you need to split it into 40 GB chunks (the size of our tape drive tapes) and send a special request to support; a rough sketch of packaging a data set this way appears after this list.
/projects/dm/high1/
A 36 GB partition served from dark. This space is for small files that change regularly and that we really want to keep, such as code, papers, and notes. It is backed up every day.
/projects/dm/high2/
A 36 GB partition served from helios. This space is for small files that change regularly and that we really want to keep, such as code, papers, and notes. It is backed up every day.
/projects/dm/high3/ and /projects/dm/www/
Two partitions that together add up to 36 GB, both served from helios. high3 is for small files that change regularly and that we really want to keep, such as code, papers, and notes; www is for the data mining web site. The division of space between them still needs to be discussed. Both are backed up every day.
/projects/dm/med1-4/
Four 36 GB partitions served from rhea. This space is for data from active projects that is hard (but not impossible) to reproduce. It is backed up once every two weeks. When a project is over, its data should be deleted or moved to /projects/dm/cold/ in compressed format.
/projects/dm/low1-7/
Seven 73 GB partitions served from various cluster machines (one on helios and two each on apollo, calypso, and poseidon). This space is for files that can be reproduced easily, such as uncompressed versions of the files in /projects/dm/cold/. It is never backed up.
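Since /projects/dm/cold/ holds compressed archives and the backup tapes hold 40 GB each, here is a rough Python sketch of one way to package a project directory for the cold partition and split the result into tape-sized pieces. The paths, file names, and buffer sizes are placeholders, not real project conventions.

    # Archive a project directory as a gzipped tarball on the cold partition,
    # then split the tarball into pieces no larger than one backup tape.
    # The paths below are placeholders, not real project directories.
    import os
    import tarfile

    SRC_DIR   = "/projects/dm/med1/myproject"          # placeholder
    ARCHIVE   = "/projects/dm/cold/myproject.tar.gz"   # placeholder
    TAPE_SIZE = 40 * 10**9       # 40 GB per tape (adjust to actual capacity)
    BUF_SIZE  = 64 * 1024**2     # copy in 64 MB buffers to keep memory low

    # Equivalent to: tar czf myproject.tar.gz myproject
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add(SRC_DIR, arcname=os.path.basename(SRC_DIR))

    # Split the archive into myproject.tar.gz.000, .001, ... pieces.
    with open(ARCHIVE, "rb") as whole:
        part = 0
        while True:
            buf = whole.read(min(BUF_SIZE, TAPE_SIZE))
            if not buf:
                break                      # end of archive
            with open("%s.%03d" % (ARCHIVE, part), "wb") as piece:
                written = 0
                while buf:
                    piece.write(buf)
                    written += len(buf)
                    if written >= TAPE_SIZE:
                        break              # this piece has reached tape size
                    buf = whole.read(min(BUF_SIZE, TAPE_SIZE - written))
            part += 1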

All of the machines except zeus also have a local 32 GB /scratch partition. This should be used for local caching of data. The /scratch disks are not backed up, and in fact may be cleaned out occasionally with very little warning. Be sure to keep copies of important files in the /projects/dm/ partitions.
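As an illustration of the caching pattern described above, the rough Python sketch below copies a shared data file from a /projects/dm/ partition to the local /scratch disk before a long run, refreshing the local copy only when it is missing or stale. The file names are placeholders.

    # Cache a shared data file on the local /scratch disk before a long run,
    # so repeated reads do not go over the network.  File names are placeholders.
    import os
    import shutil

    shared_copy = "/projects/dm/low1/webcrawl/pages.dat"          # placeholder
    local_copy  = "/scratch/%s/pages.dat" % os.environ["USER"]    # placeholder

    os.makedirs(os.path.dirname(local_copy), exist_ok=True)

    # Refresh the cache if it is missing or older than the shared copy.
    # /scratch may be wiped with little warning, so never keep the only copy here.
    if (not os.path.exists(local_copy)
            or os.path.getmtime(local_copy) < os.path.getmtime(shared_copy)):
        shutil.copy2(shared_copy, local_copy)

    # ... run the experiment against local_copy ...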

Users

The datamining cluster is intended to be used for datamining research. If you wish to use the cluster, you must get prior permission from Pedro Domingos (pedrod@cs.washington.edu). (The process for then being added to the pool of users is still to be worked out; we should ask jrp about this and whether to list his email here.)

To obtain write access to the /projects/dm/ partitions and the /scratch disks, you must be a member of the dm group. Once you have permission to use the cluster, email Geoff Hulten (ghulten@cs.washington.edu) to be added to the dm group.

Remember that this cluster is a shared resource; users coordinate with each other via email. In general the machines are fairly idle, so feel free to use as many as you need, but expect an email if another user is trying to use the same machines. Near paper deadlines especially, we ask that users who do not have deadlines refrain from using the cluster until the deadline has passed. So far there are few enough users that this has not been a problem.