Run programs overnight (HTCondor)
HTCondor is a system designed for High Throughput Computing in College computer rooms. The system allows you to submit a job (a program written in C, Fortran or Matlab or a simulation using Abaqus or StarCCM) to a central queue. It then takes jobs from the queue and finds a free computer to process the job. Once the job is completed, the results are returned to the central server and you are sent an email to let you know that your results are ready.
Applications other than those mentioned above can also be supported: if the software can be run from the command line, it should be possible to run it using HTCondor. To add other applications, contact the HTCondor team: firstname.lastname@example.org.
HTCondor uses an algorithm to prioritise requests. Find out more: HTCondor algorithm.
Prepare and submit your file
- You must write your code so that it doesn’t need any user interaction. For example, if you have written code that asks for input when the program starts (perhaps the name of a data file, number of iterations, location for output), you need to edit your code so that the program can get this information another way. The program must start and run to completion without prompting for anything.
- Programs and data files are uploaded using a web browser, and browsers usually cannot upload a single file larger than 2GB.
- Submit jobs to the queue: https://htcondor.cc.ic.ac.uk.
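As an illustration of running without user interaction, here is a minimal sketch in Python (used only for illustration; the same idea applies to C, Fortran or Matlab). It assumes your program used to ask for a data file name, an iteration count and an output file name. The option names and default values are hypothetical, so replace them with whatever your own program needs.

```python
import argparse


def parse_settings(argv=None):
    """Read run settings from the command line instead of prompting the user."""
    parser = argparse.ArgumentParser(
        description="Run a job without any user prompts.")
    # Hypothetical settings: replace with whatever your program used to ask for.
    parser.add_argument("--data-file", default="input.dat",
                        help="name of the uploaded data file (no folder path)")
    parser.add_argument("--iterations", type=int, default=1000,
                        help="number of iterations to run")
    parser.add_argument("--output-file", default="results.txt",
                        help="name of the output file (no folder path)")
    return parser.parse_args(argv)


if __name__ == "__main__":
    settings = parse_settings()
    print(f"Processing {settings.data_file} for {settings.iterations} iterations")
    print(f"Results will be written to {settings.output_file}")
```

Because every setting has a default, the program runs with no arguments at all, and you can still override a value on the command line when testing on your own machine.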
Length of run
Your code will be run on a College computer when it's not being used by another user. Typically, this will happen at night time so run times of 6-12 hours will cause no problem. If you think your code may need to run for longer than this then you should make sure that you use checkpointing or split your job into smaller programs.
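One way to split a long loop into smaller programs is to give each job its own share of the iterations. The sketch below is an illustration only: the function name and the idea of numbering your jobs from 0 are assumptions, not something HTCondor provides, and you would still need to combine the results from each job yourself.

```python
def chunk_range(total_iterations, num_jobs, job_index):
    """Return the (first, last) iteration, 1-based and inclusive, for one job."""
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index must be between 0 and num_jobs - 1")
    base, extra = divmod(total_iterations, num_jobs)
    # The first `extra` jobs each take one additional iteration,
    # so the whole range is covered with no gaps or overlaps.
    first = job_index * base + min(job_index, extra) + 1
    last = first + base - 1 + (1 if job_index < extra else 0)
    return first, last
```

For example, with a million iterations split across four jobs, job 0 would handle iterations 1 to 250,000 and job 3 would handle iterations 750,001 to 1,000,000.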
The HTCondor scheduler keeps track of every job. If someone reboots a computer running your code then the scheduler will re-submit that job to another machine. It will keep doing this until the job exits properly.
The HTCondor website shows a list of jobs being processed and those still in the queue. If you visit your job space on the server (link sent in your submission email), you will find a file giving information on the progress of your job.
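Checkpointing means that your code saves its state at regular intervals, say every 1,000 iterations. For example, let's say that your code has a loop which runs a million times. Your code is probably written so that the program starts with the loop counter set to 1, stops when the counter reaches 1,000,000 and only then writes out its results. If the computer is rebooted part-way through, all of that work is lost. With checkpointing, the program also saves the loop counter and its intermediate results every 1,000 iterations.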
Each time your program runs it checks to see if there's a saved state and, if it finds one, it starts from that point rather than starting again from the beginning. For some jobs you may have to split your code into multiple jobs, but for StarCCM you may need to use a coarser mesh or analyse a smaller part of the geometry in order to reduce computation time.
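The checkpointing pattern described above can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the file name checkpoint.json and the function names are hypothetical, the "work" is just a running sum, and it assumes the checkpoint file is still available to your job when it restarts.

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical name, kept in the job folder


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return (next_iteration, total), or a fresh start if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        return state["next_iteration"], state["total"]
    return 1, 0


def save_checkpoint(next_iteration, total, path=CHECKPOINT_FILE):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_iteration": next_iteration, "total": total}, f)
    # Renaming avoids leaving a half-written checkpoint if the machine
    # is rebooted in the middle of a save.
    os.replace(tmp, path)


def run(total_iterations=1_000_000, every=1_000, path=CHECKPOINT_FILE):
    """Run the loop, resuming from the last checkpoint if there is one."""
    start, total = load_checkpoint(path)
    for i in range(start, total_iterations + 1):
        total += i  # placeholder for the real work of one iteration
        if i % every == 0:
            save_checkpoint(i + 1, total, path)
    return total
```

If the job is interrupted at, say, iteration 450,500, the next run picks up from iteration 450,001 (the last saved point) rather than from 1, so at most 1,000 iterations are repeated.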
Once the job is completed, the results are returned to the central server, where they are stored for seven days. You will receive an email when the job is finished and another email two days before the files are deleted from the server.
Check the troubleshooting tips below for suggestions. If you are still having problems with your job(s), contact the HTCondor team: email@example.com.
Jobs stuck in the queue or terminating before completion
- When you upload your job to the server you will get an email which tells you where the job has been stored temporarily. If you look in that folder, you will see a number of other files being created. One will have the extension ".out" and another will have the extension ".log". The .out file contains any text that the program would print to the screen as it runs, and the .log file records what HTCondor is doing with your job and any errors that have occurred. If you don't understand them, contact the HTCondor team: firstname.lastname@example.org.
- If your job has been stuck in the queue for a long time, have a look at the global job queue because it's possible that there are lots of jobs in front of yours: https://htcondor.cc.ic.ac.uk/htcondor/systemqueue.aspx.
Compiled exe jobs
- If your compiled program doesn't give the results you expect, check to make sure that you have uploaded all the resources you need (for example, if you are running a program which processes a data file, you must upload it).
- If it appears to run but you don't get any results, check your code. Where is it trying to put the results? Don't specify a path for any output files, because you don't know which computer will run the code and it will also not be able to access your H: drive, for example.
- One way to check what's going on is to copy all the files from the HTCondor server folder to your own machine and run the command there. You may see errors or other messages that help you diagnose the problem.
- Matlab jobs must run without any user prompts and, critically, the final line of your main .m file must be exit. If that line is missing, Matlab will never terminate and you won't get any results.
- If you want to troubleshoot, you can copy all the files from the HTCondor server folder to your own machine, open a command prompt, change to the folder where you've saved the files and run the command runmatlab.cmd. This will run Matlab and your job and you should be able to see what's going on.
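The advice above about not specifying a path for output files can be sketched in Python as follows. This is an illustration only: the function names and the default file name results.txt are hypothetical. The idea is to write results using a bare file name, so that they land in the folder the job runs in, and to reject anything that looks like a folder or drive path.

```python
def check_bare_filename(filename):
    """Return filename unchanged, or raise if it contains a folder or drive path."""
    if any(marker in filename for marker in ("/", "\\", ":")):
        raise ValueError("use a bare file name, not a path: " + filename)
    return filename


def write_results(lines, filename="results.txt"):
    """Write one result per line to a file in the current working folder."""
    with open(check_bare_filename(filename), "w") as f:
        for line in lines:
            f.write(line + "\n")
    return filename
```

With this check in place, a name such as results.txt is accepted, while H:\results.txt or out/results.txt is rejected before the job wastes any run time.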
While the job is running you will see the .log and .out files appear in the folder on the HTCondor server. These will give you information about the job's progress.