wiki:HpcFaq

Q1: Why install on your own cluster?

While we allow all users onto our clusters and provide access for AUC data analysis, we do not have the resources or funding to support an unlimited number of users on our equipment. We strongly encourage anyone with access to their own cluster to contribute additional HPC cycles. We have to write grants for our cycles, and the funding situation is extremely competitive right now. Currently, we use several big-iron clusters administered under the National Science Foundation's XSEDE program (Ranger and Lonestar, together more than 100,000 cores shared by many users), private clusters belonging to pharmaceutical companies, smaller university clusters, and a few small, local resources.

Q2: Can the analysis software be installed on a local cluster?

The analysis software has two components. The first is a non-graphical C++ program that can be compiled on any Linux cluster that supports MPI and requires a few additional third-party libraries. The second is a platform-independent desktop GUI used for data processing, result visualization, and interpretation. The desktop component uses Qt threads, not MPI. The HPC component needs no X11 libraries or graphics support and is not run directly by the user; users do not need direct access to the supercomputer to run their analyses, as this is handled differently (see below).
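As a rough illustration of this split (a sketch, not UltraScan source code), the desktop component can keep its GUI responsive by running compute work on Qt worker threads, for example via QtConcurrent; the analysis function and data below are hypothetical:

    // Sketch only, not UltraScan source: offload a hypothetical analysis
    // step to a Qt worker thread so the GUI event loop stays responsive.
    #include <QtConcurrent/QtConcurrent>
    #include <QFutureWatcher>
    #include <QVector>

    // Hypothetical compute kernel standing in for a desktop-side analysis step.
    static QVector<double> analyzeScans(const QVector<double>& scans)
    {
        QVector<double> result;
        for (int i = 0; i < scans.size(); ++i)
            result.append(scans.at(i) * 0.5);   // placeholder arithmetic
        return result;
    }

    void startAnalysis(const QVector<double>& scans,
                       QFutureWatcher<QVector<double> >* watcher)
    {
        // QtConcurrent::run executes the function in Qt's global thread pool;
        // connect the watcher's finished() signal to a GUI slot to pick up results.
        watcher->setFuture(QtConcurrent::run(analyzeScans, scans));
    }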

Q3: Why do I need Globus grid services?

It's a little complicated. The GRAM5/Globus services are needed by GFAC to initiate HPC jobs on the user's behalf on the remote HPC resource, collect the results, and provide status information to the USLIMS3 instance (see below).

There is a central UltraScan grid service called GFAC (generic factory). It is a program running at Indiana University that brokers HPC requests for UltraScan jobs to the requested HPC instance. The requests come from users who log into the USLIMS3 system (http://www.uslims3.uthscsa.edu) to access their data in a web environment, which is also used to submit jobs to the remote HPC resource. The GFAC program communicates with two different MySQL databases running in Texas on our equipment (for the moment we can provide this part free to all users, including foreign users, and it also hosts the USLIMS3 infrastructure; with some effort, the MySQL databases can also run elsewhere):

The first MySQL database is one of many: there is one MySQL instance for each institution using this system. This database holds all of the client's data, secures all transactions over SSL, and requires authentication from the user. The second database is used by GFAC. A daemon running on the Texas system constantly monitors whether any user on any of the institutional MySQL databases has submitted a request. If so, it packages the input files, analysis parameters, etc., assigns a job ID, deposits the HPC request in the GFAC database, and sends the package via the GRAM5 Globus tools to the remote HPC instance. It then updates a status field in the GFAC database, which is constantly monitored by the queue monitor that tells the user how their job is progressing and whether there are any errors. The HPC jobs are monitored for stderr, stdout, and exit status, and this output is interpreted and presented to the user as readable status/exit information. The GFAC system also "knows" the particulars of each resource, including the number of cores per node, the total number of cores available, memory resources, and, importantly, the queuing software used, so it can generate the proper submission syntax for each job (the user never submits jobs directly to the cluster; they just see the results appear in their database).
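For illustration only (a sketch, not GFAC source code), the kind of per-cluster bookkeeping described above could be used to generate a PBS-style submission script; the struct fields, binary name, and script layout below are assumptions:

    // Sketch only, not GFAC source: build a PBS submission script from
    // hypothetical per-cluster particulars (queue name, cores per node, launcher).
    #include <cstdio>
    #include <string>

    struct ClusterInfo {            // hypothetical description of one cluster
        std::string queueName;      // e.g. "normal"
        int         coresPerNode;   // e.g. 16
        std::string mpiLauncher;    // e.g. "mpirun" or "ibrun"
    };

    std::string buildPbsScript(const ClusterInfo& c, const std::string& jobId,
                               int totalCores, int wallHours)
    {
        // Round the core request up to whole nodes.
        int nodes = (totalCores + c.coresPerNode - 1) / c.coresPerNode;

        char buf[512];
        std::snprintf(buf, sizeof(buf),
            "#!/bin/bash\n"
            "#PBS -N us_%s\n"
            "#PBS -q %s\n"
            "#PBS -l nodes=%d:ppn=%d\n"
            "#PBS -l walltime=%02d:00:00\n"
            "cd $PBS_O_WORKDIR\n"
            "%s -np %d ./us_mpi_analysis %s.tar\n",  // binary and package names are illustrative
            jobId.c_str(), c.queueName.c_str(), nodes, c.coresPerNode,
            wallHours, c.mpiLauncher.c_str(), totalCores, jobId.c_str());
        return std::string(buf);
    }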

While the job is running, it sends output messages (such as the Monte Carlo iteration, genetic algorithm generation number, meniscus fit position, etc.) over a UDP port to the Texas resource, where another daemon monitors these UDP messages from each configured cluster and posts them to a live web monitoring program (the "queue viewer") that the user can access from their USLIMS3 page. Once results are available, the GFAC program captures them, deposits them in the user's institutional database, and deletes the job from the GFAC database. The desktop visualization program (available for Linux/X11, Mac OS X, and Windows) automatically finds the results in the database, where they can be used for further processing. We keep copies of the stdout/stderr logs in the GFAC database for two weeks for post-mortem analysis and debugging; the user can access them, although they may be of limited usefulness. The Texas resource monitors can query the GFAC program at any time for additional status updates and reports. The Texas monitoring program periodically (once a minute) processes the queued and running jobs to see what has finished, what is pending, what has failed, and what is running, and if running, at what stage.
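To make the status channel concrete, here is a minimal sketch (not UltraScan source code) of how a running job might emit a one-line status message over UDP with Qt; the monitoring host, port, and message format are placeholders:

    // Sketch only, not UltraScan source: send a one-line status message
    // over UDP to a hypothetical monitoring host and port.
    #include <QUdpSocket>
    #include <QHostAddress>
    #include <QByteArray>
    #include <QString>

    void sendStatus(const QString& jobId, int mcIteration)
    {
        QUdpSocket socket;

        // Placeholder address and port; real values come from the site configuration.
        QHostAddress monitor("192.0.2.10");
        quint16 port = 9000;

        QByteArray message = QString("job=%1 mc_iteration=%2")
                                 .arg(jobId).arg(mcIteration).toUtf8();
        socket.writeDatagram(message, monitor, port);
    }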

I know this is complicated and probably requires further explanation and help, and I would be happy to assist. However, installation on a remote resource should be straightforward as long as the proper certs can be installed (issued by the Texas Advanced Computing Center at UT Austin) and the UDP ports for asynchronous sending of messages can be opened. Very important: Any remote resource is made available on a case by case basis to remote users, and each user may have a different subset of clusters they are allowed to use.

The GFAC system's purpose is to abstract the submission of jobs in UltraScan to remote clusters and figure out what each cluster's specifics are for submitting, queuing, hardware resources, etc.

Q4: Where can I find the HPC source code?

The HPC component source code is part of the svn checkout.

Q5: Is the HPC component using MPI?

The HPC program is based entirely on MPI for parallelization. Any recent copy of MPICH or Open MPI should work. Any hardware should work; we have used everything from AMD Opterons to various multi-core Intel platforms.
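As a quick sanity check that a cluster's MPI toolchain is usable, a trivial MPI program such as the sketch below (not UltraScan code) should compile with mpicxx and run across nodes:

    // Minimal MPI sanity check, not UltraScan code: each rank reports itself.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        std::printf("rank %d of %d is alive\n", rank, size);

        MPI_Finalize();
        return 0;
    }

If this compiles with mpicxx and launches with mpirun (or the site's preferred launcher), the MPI environment is ready for the UltraScan MPI component.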

Q6: What batch processing/grid middle ware software is needed?

You would need Globus and a batch queueing system such as PBS or SGE; both are required.

Q7: Can the MySQL database be accessed without the UltraScan software?

This will not be necessary. If you still want to access data in the database manually, you will need the proper authentication tokens, but yes, everything can be accessed over SSL with an appropriate interface. All methods are implemented as stored procedures that require SSL encryption, so the standard MySQL command-line interface will not work, but you will also not need it. If you really need access to a Texas-hosted database, we can provide a user account, give you the authentication tokens, and let you access it from the command line through an ssh connection to the MySQL server, but ordinarily that is discouraged.
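For illustration only (a sketch, not UltraScan's actual database layer), calling a stored procedure over an SSL-encrypted MySQL connection from C++/Qt could look like the following; the host, credentials, certificate paths, and procedure name are all placeholders:

    // Sketch only, not UltraScan source: open an SSL MySQL connection with
    // Qt's QMYSQL driver and call a hypothetical stored procedure.
    #include <QSqlDatabase>
    #include <QSqlQuery>
    #include <QSqlError>
    #include <QVariant>
    #include <QDebug>

    bool fetchExperimentCount()
    {
        QSqlDatabase db = QSqlDatabase::addDatabase("QMYSQL");
        db.setHostName("db.example.edu");       // placeholder host
        db.setDatabaseName("uslims_example");   // placeholder database
        db.setUserName("analyst");              // placeholder credentials
        db.setPassword("secret");
        // Require SSL; certificate paths are site-specific placeholders.
        db.setConnectOptions("SSL_CA=/etc/mysql/ca.pem;"
                             "SSL_CERT=/etc/mysql/client-cert.pem;"
                             "SSL_KEY=/etc/mysql/client-key.pem");

        if (!db.open()) {
            qWarning() << "connection failed:" << db.lastError().text();
            return false;
        }

        QSqlQuery query;
        // Hypothetical stored procedure name; the real procedures differ.
        if (!query.exec("CALL count_experiments()") || !query.next())
            return false;

        qDebug() << "experiments:" << query.value(0).toInt();
        return true;
    }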

Q8: How are HPC jobs started?

All batch scripts are executed by GFAC. All communication from GFAC goes through GRAM5 and is encrypted.

Q9: Would there have to be changes to the UltraScan GUI code for implementation on a new cluster?

The GUI never talks to the HPC system at all, only to the database, and the HPC system does not talk directly to either of them. GFAC handles most of the communication with the HPC system; the only other communication consists of the UDP output streams from the HPC system, which are monitored by the Texas resource. No modifications should be necessary. You only need to install the prerequisites and configure your cluster properly:

  • installation of ONE user account with sufficient disk space to support compilation and installation of the UltraScan C++ component and 3rd-party libraries
  • installation of GRAM5 globus toolkit
  • installation of the UltraScan MPI component and its Qt dependencies
  • installation of MPI, if not already available
  • requirement for a modern c++ compiler (a recent GCC version would work fine)
  • installation of grid certificates from UT Austin
  • configuration of UDP ports for posting status updates
  • set aside some time for testing.

Q10: What is the typical data size?

Data volume is trivial. Nothing needs to be stored on the HPC resource; whatever is calculated can be deleted after it has been deposited in the Texas database. Resulting models are VERY small, a few kilobytes. The experimental data are only needed during the fitting process and vary in size from a few tens of kilobytes to 1 MB. Memory needs are usually less than 1 GB for an 8-core node. A typical job runs well with 128-256 cores, although smaller configurations with 40-60 cores are routinely used and provide satisfactory performance, especially for some of the optimization routines. Larger core counts are helpful for Monte Carlo analyses and can be configured in GFAC to provide custom resource allocation.

Q11: How is the software licensed?

GNU General Public License. The entire program is 100% open source, so feel free to contribute if you are so inclined :-)

Last modified on Jul 18, 2012 2:26:35 PM