= HPC Communications =

 * [wiki:LIMS3SchemaOverview Overview]
 * [wiki:LIMS3SchemaDataflow1 Data Flow, step 1]
 * [wiki:LIMS3SchemaDataflow2 Data Flow, step 2]
 * [wiki:LIMS3SchemaDataflow3 Data Flow, step 3]
 * [wiki:LIMS3SchemaDataflow4 Data Flow, step 4]
 * [wiki:LIMS3SchemaDataflowAsync Data Flow, Asynchronous Step]

!UltraScan's communication with the High Performance Computer (HPC) or Grid Cluster is implemented according to the above drawings. The tasks are accomplished as described below. The original !OpenOffice document is attached to this page.

== Laboratory Information Management System (LIMS) ==

LIMS is the user-facing part of the system. The user selects an analysis type, such as the Genetic Algorithm (GA) or Two-Dimensional Spectrum Analysis (2DSA), and supplies the parameters that the HPC system needs for the analysis. After the user specifies the needed data, LIMS packages it into a common data directory, uslims3.uthscsa.edu:/srv/www/htdocs/uslims3/uslims3_data.

For each analysis request, LIMS creates a new GUID, associates it with the analysis data, and creates a new record in the HPCAnalysisRequest database table. This record is the parent record for everything relating to this HPC analysis, and the GUID serves as the common identifier of the analysis. For instance, the GUID is the name of the subdirectory in the common data directory where all the files relating to this particular analysis are stored when the job is submitted to the HPC system. As another example, LIMS prepends the string "US3-" to the GUID to form the gfacID, which the GFAC system, the listen script, and the grid control program all use to identify the job. The grid control program uses the database name together with the gfacID to associate the HPC job with the analysis request record in the LIMS database.
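
The way the GUID ties the pieces together can be sketched in a few lines. This is only an illustration in Python, not the LIMS code itself; the function names are hypothetical, and the GUID format (a canonical UUID string) is an assumption, since this page does not say how LIMS generates it. Only the "US3-" prefix and the data directory path come from the text above.

```python
import uuid
from pathlib import PurePosixPath

# Common data directory on the LIMS server, as named in the text above.
DATA_ROOT = PurePosixPath("/srv/www/htdocs/uslims3/uslims3_data")


def new_request_guid() -> str:
    """Create a GUID for a new analysis request.

    Assumption: a random UUID in canonical string form.
    """
    return str(uuid.uuid4())


def gfac_id(request_guid: str) -> str:
    """Build the gfacID by prepending "US3-" to the request GUID."""
    return "US3-" + request_guid


def request_dir(request_guid: str) -> PurePosixPath:
    """Subdirectory of the common data directory that holds the job's files."""
    return DATA_ROOT / request_guid
```

For a GUID `g`, `gfac_id(g)` is the identifier seen by GFAC, listen, and grid control, while `request_dir(g)` is where the files for that request would be found.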
== Contents of the data directory ==

|| Contents of file || File name ||
|| Scan data itself || runID.dataType.cell.channel.wavelength.auc ||
|| Edit profile || runID.editedDataID.dataType.cell.channel.wavelength.xml ||
|| Analysis parameters || hpcrequest-server-database-requestID.xml ||
|| Noise files || random.ti_noise or random.ri_noise ||
|| Tar file with above content || hpcinput-server-database-requestID.tar ||
|| Job submission parameters || cluster_shortname-requestGUID-jobxmlfile.xml ||
|| Post-analysis job stats || job_statistics.xml ||
|| Queue messages, stdout, stderr || database-requestID-messages.txt ||
|| Analysis results, if any || analysis.tar ||

== The LIMS job submit process ==

In LIMS, job submission happens in stages. In the first stage, everything related to the data itself is collected and placed into a tar file. This includes the scan data, the edit profile, the analysis parameters, and any noise files that have been selected. In the second stage, the job submission parameters are placed into a job submit file and the job is submitted. It is submitted as an HTTP request, where the job submission XML file is the body of the request and the tar file is sent as a base64-encoded, chunk-split attachment. LIMS writes the parent HPCAnalysisRequest record in the LIMS database, as well as the analysis record for the job in the local gfac database, which contains the jobs from all the LIMS databases that are currently being processed.

== Supercomputer Queue ==

The queue is controlled by the supercomputer system. It is responsible for managing the jobs running on that system and for communicating with clients. Communication tasks include receiving tasks, returning job status, and informing the client when a task has been completed or aborted.

== MPI_Analysis (!UltraScan HPC Analysis Program) ==

The MPI_Analysis program reads the job submission XML file and uses it as a guide to read other data files as needed to populate internal data structures.
It then performs the analysis, writing any needed output to disk. At the beginning of the program, periodically during execution, and at the end of processing, MPI_Analysis writes a UDP status datagram to a listener on the host and port specified in the control.xml file. Each datagram consists of the analysisRequestGUID and a status (e.g. started, iteration number, finished). This is not reliable two-way communication, so it is the responsibility of the listener to follow up and manage any missed messages.

== listen ==

This PHP program runs as a daemon, receiving UDP packets from the MPI_Analysis program. It is responsible for updating the analysis table in the local gfac database and the HPCAnalysisResult table in the LIMS database with the current status. Current status possibilities include:

|| Current Status || Meaning ||
|| SUBMITTED || Job is queued, waiting to be run. ||
|| SUBMIT_TIMEOUT || Job is queued and waiting to be run, but it has been waiting for more than 24 hours. ||
|| RUNNING || Job is running. ||
|| RUN_TIMEOUT || Job is running, but it has been running for more than 24 hours. ||
|| DATA || Job has completed, but the data has not arrived. ||
|| DATA_TIMEOUT || Job has completed and we are waiting on the data, but we have been waiting for more than an hour. ||
|| COMPLETE || Job has completed and data has been delivered. ||
|| FAILED || Job has failed. ||
|| CANCELED || User canceled the job. ||
|| ERROR || Grid control or listen has encountered an undocumented error. ||

== grid-control ==

This PHP program is run periodically by cron. It checks the jobs in the local gfac analysis table (current jobs) and determines what actions to take, if any.
Current status possibilities and the actions that are taken:

|| Current Status || Actions taken || Timeout || On timeout, change status to ||
|| SUBMITTED || If > 10 mins, request status update || 24 hours || SUBMIT_TIMEOUT ||
|| SUBMIT_TIMEOUT || Request a status update || 24 hours || FAILED ||
|| RUNNING || If < 10 mins, request status update || 24 hours || RUN_TIMEOUT ||
|| RUN_TIMEOUT || Request a status update || 48 hours || FAILED ||
|| DATA || Request status; Request data every 5 mins || 1 hour || DATA_TIMEOUT ||
|| DATA_TIMEOUT || Request status; Request data every 15 mins || 24 hours || FAILED ||
|| COMPLETE || Do cleanup || || ||
|| FAILED || Do cleanup || || ||
|| CANCELED || Do cleanup || || ||
|| ERROR || Do cleanup || || ||

Cleanup performs several functions, and is called by the grid control program if appropriate. Cleanup delivers the analysis results tar file into the common data directory, expands it, and places any model files and noise files into the appropriate LIMS database. Cleanup also copies the stdout content, the stderr content, and all queue messages to the queue message file in the common data directory. Finally, cleanup deletes the appropriate analysis record from the gfac database and informs the user of final status by email.
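
The timeout columns of the grid-control status table can be read as a small lookup table. The following is a rough sketch in Python rather than the PHP of the actual program, and it rests on several assumptions: the function names are hypothetical, the elapsed time is taken to be measured the same way the table measures it, a timeout is taken to fire only once the limit is strictly exceeded, and the status polling and data request intervals are left out entirely.

```python
from datetime import timedelta

# status -> (timeout, status to change to on timeout), copied from the table.
TIMEOUTS = {
    "SUBMITTED": (timedelta(hours=24), "SUBMIT_TIMEOUT"),
    "SUBMIT_TIMEOUT": (timedelta(hours=24), "FAILED"),
    "RUNNING": (timedelta(hours=24), "RUN_TIMEOUT"),
    "RUN_TIMEOUT": (timedelta(hours=48), "FAILED"),
    "DATA": (timedelta(hours=1), "DATA_TIMEOUT"),
    "DATA_TIMEOUT": (timedelta(hours=24), "FAILED"),
}

# Statuses whose only listed action is "Do cleanup".
CLEANUP_STATUSES = {"COMPLETE", "FAILED", "CANCELED", "ERROR"}


def needs_cleanup(status: str) -> bool:
    """True if the table lists "Do cleanup" for this status."""
    return status in CLEANUP_STATUSES


def status_after_check(status: str, elapsed: timedelta) -> str:
    """Return the status a job should have after one timeout check.

    Assumption: the timeout fires only when `elapsed` strictly exceeds it.
    """
    if status in CLEANUP_STATUSES:
        return status  # no timeout applies; cleanup is handled separately
    if status not in TIMEOUTS:
        raise ValueError(f"unknown status: {status}")
    timeout, on_timeout = TIMEOUTS[status]
    return on_timeout if elapsed > timeout else status
```

Under these assumptions, a job that has been in SUBMITTED for 25 hours would move to SUBMIT_TIMEOUT on the next cron run, and a job in any of the four final statuses would be handed to cleanup.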