Version 11 (modified by dzollars, 13 years ago)

HPC Communications

UltraScan's communication with the High Performance Computer (HPC) or Grid Cluster is implemented according to the above drawings. The tasks are accomplished as described below. The original OpenOffice document is attached to this page.

Laboratory Information Management System (LIMS)

The purpose of this system is to let the user specify an analysis type, such as the Genetic Algorithm (GA) or Two Dimensional Spectrum Analysis (2DSA), together with the parameters that analysis needs, and to pass them to the HPC system. After the user specifies the needed data, it is packaged into a common data directory, uslims3.uthscsa.edu:/srv/www/htdocs/uslims3/uslims3_data. For each analysis request, LIMS creates a new GUID, associates it with the analysis data, and creates a new record in the HPCAnalysisRequest database table. This record is the parent record for everything relating to this HPC analysis, and the GUID serves as the common identifier of the analysis. For instance, the GUID is the name of the subdirectory in the common data directory where all the files relating to this particular analysis are stored when the job is submitted to the HPC system. As another example, LIMS prepends the string "US3-" to the GUID to form the gfacID, which the GFAC system, the listen script, and the grid control program all use to identify the job. The grid control program uses the database name and the gfacID together to associate the HPC job with the analysis request record in the LIMS database.
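The relationship between the request GUID, the gfacID, and the per-request subdirectory can be sketched as follows. This is a minimal illustration, not the LIMS code: the function names are hypothetical, and the use of a random (version 4) UUID is an assumption, since the text only says that LIMS "creates a new GUID".

```python
import uuid
from pathlib import PurePosixPath

# Path of the common data directory on the LIMS host, as given in the text.
COMMON_DATA_DIR = PurePosixPath("/srv/www/htdocs/uslims3/uslims3_data")


def new_request_guid() -> str:
    """Create a GUID for a new analysis request (random UUID is an assumption)."""
    return str(uuid.uuid4())


def gfac_id(request_guid: str) -> str:
    """Prepend "US3-" to the request GUID, as described in the text."""
    return "US3-" + request_guid


def request_dir(request_guid: str) -> PurePosixPath:
    """Subdirectory of the common data directory, named after the GUID."""
    return COMMON_DATA_DIR / request_guid


if __name__ == "__main__":
    guid = new_request_guid()
    print(gfac_id(guid))
    print(request_dir(guid))
```

Because both the gfacID and the directory name are derived from the same GUID, either one can be mapped back to the parent HPCAnalysisRequest record.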

Contents of the common data directory

Contents of file               | File name
Scan data itself               | runID.dataType.cell.channel.wavelength.auc
Edit profile                   | runID.editedDataID.dataType.cell.channel.wavelength.xml
Analysis parameters            | hpcrequest-server-database-requestID.xml
Noise files                    | random.ti_noise or random.ri_noise
Tar file with above content    | hpcinput-server-database-requestID.tar
Job submission parameters      | cluster_shortname-requestGUID-jobxmlfile.xml
Post-analysis job stats        | job_statistics.xml
Queue messages, stdout, stderr | database-requestID-messages.txt
Analysis results, if any       | analysis.tar

The LIMS job submit process

In LIMS, job submission happens in two stages. In the first stage, everything related to the data itself is collected and placed into a tar file: the scan data, the edit profile, the analysis parameters, and any noise files that have been selected. In the second stage, the job submission parameters are written to a job submit file and the job is submitted as an HTTP request, where the job submission xml file is the body of the request and the tar file is sent as a base-64 encoded, chunk-split attachment. LIMS writes the parent HPCAnalysisRequest record in the LIMS database, as well as the analysis record for the job in the local gfac database. The analysis table in the local gfac database contains the jobs from all the LIMS databases that are currently being processed.
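The first stage and the attachment encoding can be sketched as below. This is an illustration under stated assumptions rather than the LIMS implementation: the helper names are hypothetical, and the 76-character line width with a "\r\n" terminator is an assumption borrowed from the defaults of PHP's chunk_split, which the phrase "chunk-split" suggests but the text does not confirm. The HTTP request itself is not shown, because its endpoint and layout are not described here.

```python
import base64
import io
import tarfile


def build_tar(files: dict) -> bytes:
    """Pack {file name: content bytes} into an in-memory, uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def encode_chunked(data: bytes, width: int = 76, sep: str = "\r\n") -> str:
    """Base64-encode data and end every `width`-character chunk with `sep`."""
    b64 = base64.b64encode(data).decode("ascii")
    return "".join(b64[i:i + width] + sep for i in range(0, len(b64), width))


def decode_chunked(text: str) -> bytes:
    """Reverse encode_chunked: drop the separators, then base64-decode."""
    return base64.b64decode("".join(text.split()))
```

A receiver only needs to strip the line separators before decoding, so the chunking changes the transport form of the attachment but not its content.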

Supercomputer Queue

The queue is managed by the supercomputer system itself. It is responsible for controlling the jobs running on that system and for communicating with clients.

Communication tasks include receiving tasks, returning job status, and informing the client when a task has been completed or aborted.

MPI_Analysis (UltraScan HPC Analysis Program)

The MPI_Analysis program reads the job submission xml file and uses that as a guide to read other data files as needed to populate internal data structures. It then performs the analysis, writing any needed output to disk.

At the beginning of the program, periodically during execution, and at the end of processing, MPI_Analysis writes a UDP status datagram to a listener on the host and port specified in the control.xml file. Each datagram consists of the analysisRequestGUID and a status (e.g. started, iteration number, finished). UDP provides no delivery guarantee and no acknowledgement, so it is the responsibility of the listener to follow up and manage any missed messages.
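A fire-and-forget status datagram of this kind can be sketched with a plain UDP socket. The sketch is illustrative only: the function names are hypothetical, and the payload layout (GUID, one space, status text) is an assumption, since the text states what each datagram contains but not how it is encoded.

```python
import socket


def format_status(request_guid: str, status: str) -> bytes:
    """Build the payload; the "GUID status" text layout is an assumption."""
    return f"{request_guid} {status}".encode("utf-8")


def send_status(host: str, port: int, request_guid: str, status: str) -> None:
    """Send one datagram and return at once; no reply or retry is attempted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_status(request_guid, status), (host, port))
```

The sender never learns whether the datagram arrived, which is why the follow-up logic has to live on the listener side.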

listen.php

This PHP program runs as a daemon, receiving UDP packets from the MPI_Analysis program. It is responsible for updating the analysis table in the local gfac database and the HPCAnalysisResult table in the LIMS database with the current status. Current status possibilities include:

Current Status | Meaning
SUBMITTED      | Job is queued, waiting to be run.
SUBMIT_TIMEOUT | Job is queued, waiting to be run, however it's been waiting for more than 24 hours.
RUNNING        | Job is running.
RUN_TIMEOUT    | Job is running, however it's been running for more than 24 hours.
DATA           | Job has completed, however the data has not arrived.
DATA_TIMEOUT   | Job has completed and we're waiting on the data, however we've been waiting for more than an hour.
COMPLETE       | Job has completed and data has been delivered.
FAILED         | Job has failed.
CANCELED       | User canceled the job.
ERROR          | Grid control or listen has encountered an undocumented error.
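How a listener might turn an incoming datagram into one of these status values can be sketched as follows. Everything specific here is hypothetical: the "GUID status" payload layout is assumed, and so is the mapping from the messages reported by MPI_Analysis (started, iteration number, finished) to RUNNING and DATA, which the text does not spell out. The database updates are omitted.

```python
from typing import Optional, Tuple


def parse_datagram(payload: bytes) -> Tuple[str, str]:
    """Split an assumed "GUID status" payload into (guid, status message)."""
    guid, _, message = payload.decode("utf-8").strip().partition(" ")
    if not guid or not message:
        raise ValueError("malformed status datagram")
    return guid, message


def queue_status_for(message: str) -> Optional[str]:
    """Map a reported message to a status value (hypothetical mapping)."""
    if message == "started" or message.startswith("iteration"):
        return "RUNNING"
    if message == "finished":
        return "DATA"  # job has completed, results not yet delivered
    return None  # unknown message: leave the recorded status unchanged
```

Returning None for unrecognized messages keeps a stray or corrupted datagram from overwriting a valid status.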

grid-control.php

This PHP program is scheduled periodically via cron. It checks the jobs in the local gfac analysis table (the current jobs) and determines what actions to take, if any. The current status possibilities and the actions taken for each are:

Current Status | Actions taken                              | Timeout  | On timeout, change status to
SUBMITTED      | If > 10 mins, request status update        | 24 hours | SUBMIT_TIMEOUT
SUBMIT_TIMEOUT | Request a status update                    | 24 hours | FAILED
RUNNING        | If > 10 mins, request status update        | 24 hours | RUN_TIMEOUT
RUN_TIMEOUT    | Request a status update                    | 48 hours | FAILED
DATA           | Request status; request data every 5 mins  | 1 hour   | DATA_TIMEOUT
DATA_TIMEOUT   | Request status; request data every 15 mins | 24 hours | FAILED
COMPLETE       | Do cleanup                                 | -        | -
FAILED         | Do cleanup                                 | -        | -
CANCELED       | Do cleanup                                 | -        | -
ERROR          | Do cleanup                                 | -        | -
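The timeout column of this table amounts to a small state machine, which can be sketched as a lookup table. This is a simplified illustration, not the grid-control.php logic: the names are hypothetical, the reference point from which elapsed time is measured is an assumption (the text does not specify it), and the periodic status and data requests in the "Actions taken" column are collapsed into a single "poll" result.

```python
from datetime import timedelta

# status -> (timeout, status to switch to on timeout), transcribed from the table.
TIMEOUTS = {
    "SUBMITTED":      (timedelta(hours=24), "SUBMIT_TIMEOUT"),
    "SUBMIT_TIMEOUT": (timedelta(hours=24), "FAILED"),
    "RUNNING":        (timedelta(hours=24), "RUN_TIMEOUT"),
    "RUN_TIMEOUT":    (timedelta(hours=48), "FAILED"),
    "DATA":           (timedelta(hours=1),  "DATA_TIMEOUT"),
    "DATA_TIMEOUT":   (timedelta(hours=24), "FAILED"),
}

# Statuses for which the table lists only "Do cleanup".
CLEANUP_STATES = {"COMPLETE", "FAILED", "CANCELED", "ERROR"}


def next_action(status: str, elapsed: timedelta) -> str:
    """Return "cleanup", the new status after a timeout, or "poll"."""
    if status in CLEANUP_STATES:
        return "cleanup"
    timeout, on_timeout = TIMEOUTS[status]
    if elapsed > timeout:
        return on_timeout
    return "poll"
```

For example, `next_action("DATA", timedelta(minutes=61))` yields `"DATA_TIMEOUT"`, while any job already in FAILED goes straight to cleanup on the next cron run.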

Cleanup performs several functions, and is called by the grid control program if appropriate. Cleanup delivers the analysis results tar file into the common data directory, expands it, and places any model files and noise files into the appropriate LIMS database. Cleanup also copies the stdout content, the stderr content, and all queue messages to the queue message file in the common data directory. Finally, cleanup deletes the appropriate analysis record from the gfac database and informs the user of final status by email.
