Changes between Version 6 and Version 7 of HpcInfo


Timestamp:
Jan 25, 2012, 11:45:56 PM (12 years ago)
Author:
dzollars
Comment:

--

  • HpcInfo

    v6 v7  
    1616== Contents of the data directory ==
    1717
    18 [wiki:Us3HpcDb US3 HPC Database Tables]
    19 
    20 == The LIMS job submit process ==
    21 
    22 In LIMS job submission is a two-step process. In the first step everything related to the data itself is collected and placed into a tar file. File names:
    23 
    2418|| Contents of file              || File name                                  ||
    2519|| Scan data itself              || runID.dataType.cell.channel.wavelength.auc ||
     
    2822|| Noise files                   || random.ti_noise or random.ri_noise         ||
    2923|| Tar file with above content   || hpcinput-server-database-requestID.tar     ||
     24|| Job submission parameters     || cluster_shortname-requestGUID-jobxmlfile.xml ||
     25|| Post-analysis job stats       || job_statistics.xml                         ||
     26|| Queue messages, stdout, stderr || database-requestID-messages.txt           ||
     27|| Analysis results, if any      || analysis.tar                               ||
     28
     29== The LIMS job submit process ==
     30
     31In LIMS, job submission happens in stages. In the first stage, everything related to the data itself is collected and placed into a tar file. This includes the scan data itself, the edit profile, analysis parameters, and any noise files that have been selected. In the second stage, the job submission parameters are placed into a job submit file and the job is submitted. It is submitted as an HTTP request, where the job submission XML file is the body of the request and the tar file is sent as a base64-encoded, chunk-split attachment. LIMS writes the parent HPCAnalysisRequest record in the LIMS database, as well as the analysis record for the job in the local gfac database, which contains the jobs from all the LIMS databases that are currently being processed.
    3032
    3133== Supercomputer Queue ==
    3234
    33 This task is controlled by the Supercomputer system.  It is responsible for controlling
    34 the jobs running on that system and communication with clients.
     35This task is controlled by the Supercomputer system.  It is responsible for controlling the jobs running on that system and communication with clients.
    3536
    36 Communication tasks include receiving tasks, returning job status, and informing the
    37 client when a task has been completed or aborted.
     37Communication tasks include receiving tasks, returning job status, and informing the client when a task has been completed or aborted.
    3838
    3939== MPI_Analysis (!UltraScan HPC Analysis Program) ==
    4040
    41 The MPI_Analysis program reads the control.xml file and uses that as a guide to read other data
    42 files as needed to populate internal data structures.  It then performs the analysis,
    43 writing any needed output to disk.
     41The MPI_Analysis program reads the job submission XML file and uses it as a guide to read other data files as needed to populate internal data structures.  It then performs the analysis, writing any needed output to disk.
    4442
    45 At the beginning of the program, periodically during execution, and at the end of
    46 processing, MPI_Analysis writes a UDP status datagram to a listener on the host and port specified
    47 in the control.xml file.  Each datagram will consist of the analysisRequestGUID and a
    48 status (e.g. started, iteration number, finished).  This is not a reliable two-way
    49 communication and it is the responsibility of the listener to follow up and manage any
    50 missed messages.
     43At the beginning of the program, periodically during execution, and at the end of processing, MPI_Analysis writes a UDP status datagram to a listener on the host and port specified in the control.xml file.  Each datagram consists of the analysisRequestGUID and a status (e.g. started, iteration number, finished).  This is not a reliable two-way communication, and it is the responsibility of the listener to follow up and manage any missed messages.
    5144
    52 == grid-timeout ==
     45== listen ==
    5346
    54 This program will either be scheduled periodically via cron, or run as a daemon.  It will
    55 check status of jobs in the mysql database and initiate a status query for jobs
    56 that have overdue status updates.  If a job has been aborted, it will notify the
    57 grid-listen program of that status.
     47This PHP program runs as a daemon, receiving UDP packets from the MPI_Analysis program. It is responsible for updating the analysis table in the local gfac database and the HPCAnalysisResult table in the LIMS database with the current status. Current status possibilities include:
    5848
    59 == grid-query ==
     49|| Current Status || Meaning ||
     50|| SUBMITTED      || Job is queued, waiting to be run. ||
     51|| SUBMIT_TIMEOUT || Job is queued, waiting to be run, but has been waiting for more than 24 hours. ||
     52|| RUNNING        || Job is running. ||
     53|| RUN_TIMEOUT    || Job is running, but has been running for more than 24 hours. ||
     54|| DATA           || Job has completed, but the data has not arrived. ||
     55|| DATA_TIMEOUT   || Job has completed and we are waiting on the data, but have been waiting for more than an hour. ||
     56|| COMPLETE       || Job has completed and data has been delivered. ||
     57|| FAILED         || Job has failed. ||
     58|| CANCELED       || User canceled the job. ||
     59|| ERROR          || grid-control or listen has encountered an undocumented error. ||
    6060
    61 This is a command line program that submits a status query to the Supercomputer Queue and
    62 returns the result.
     61== grid-control ==
    6362
    64 == grid-listen ==
     63This PHP program is scheduled periodically via cron.  It checks the jobs in the local gfac analysis table (current jobs) and determines what actions to take, if any. The current status possibilities and the actions taken are:
    6564
    66 This program runs as daemon receiving udp packets from the MPI_Analysis program or the grid-timeout
    67 program.  It is responsible for updating the mysql database table HPCAnalysisResult with current
    68 status and, upon completion or abort of an analysis, fetches needed files from
    69 the supercomputer cluster, sends an email to the user, and does any other cleanup necessary.
     65|| Current Status           || Actions taken                       || Timeout  || On timeout, change status to  ||
     66|| SUBMITTED                || If > 10 mins, request status update || 24 hours || SUBMIT_TIMEOUT    ||
     67|| SUBMIT_TIMEOUT           || Request a status update             || 24 hours || FAILED            ||
     68|| RUNNING                  || If < 10 mins, request status update || 24 hours || RUN_TIMEOUT       ||
     69|| RUN_TIMEOUT              || Request a status update             || 48 hours || FAILED            ||
     70|| DATA                     || Request status; Request data every 5 mins || 1 hour || DATA_TIMEOUT  ||
     71|| DATA_TIMEOUT             || Request status; Request data every 15 mins || 24 hours || FAILED     ||
     72|| COMPLETE                 || Do cleanup                          ||          ||                   ||
     73|| FAILED                   || Do cleanup                          ||          ||                   ||
     74|| CANCELED                 || Do cleanup                          ||          ||                   ||
     75|| ERROR                    || Do cleanup                          ||          ||                   ||
    7076
     77Cleanup performs several functions and is called by the grid-control program when appropriate. Cleanup delivers the analysis results tar file into the common data directory, expands it, and places any model files and noise files into the appropriate LIMS database. Cleanup also copies the stdout content, the stderr content, and all queue messages to the queue message file in the common data directory. Finally, cleanup deletes the appropriate analysis record from the gfac database and informs the user of the final status by email.
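
The timeout columns of the grid-control table in v7 can be read as a small state machine. The following is a minimal Python sketch of that reading, not the actual grid-control code (which the page describes as a PHP program). The names `TIMEOUTS`, `TERMINAL`, `next_status`, and `needs_cleanup` are hypothetical, and the page does not say what moment each timeout is measured from, so the `since` argument is an assumption left to the caller. The periodic "request status update" and "request data" actions are not modeled.

```python
from datetime import datetime, timedelta

# Timeout and "on timeout, change status to" values copied from the
# grid-control table. The dictionary itself is an illustrative structure.
TIMEOUTS = {
    "SUBMITTED": (timedelta(hours=24), "SUBMIT_TIMEOUT"),
    "SUBMIT_TIMEOUT": (timedelta(hours=24), "FAILED"),
    "RUNNING": (timedelta(hours=24), "RUN_TIMEOUT"),
    "RUN_TIMEOUT": (timedelta(hours=48), "FAILED"),
    "DATA": (timedelta(hours=1), "DATA_TIMEOUT"),
    "DATA_TIMEOUT": (timedelta(hours=24), "FAILED"),
}

# Statuses whose only listed action is "Do cleanup".
TERMINAL = {"COMPLETE", "FAILED", "CANCELED", "ERROR"}


def next_status(status: str, since: datetime, now: datetime) -> str:
    """Return the status after applying the timeout rule for `status`.

    `since` is the reference time the timeout is measured from; what that
    reference is in the real system is an assumption of this sketch.
    """
    if status in TERMINAL:
        return status  # no timeout listed; cleanup handles these
    if status not in TIMEOUTS:
        raise ValueError(f"unknown status: {status}")
    timeout, on_timeout = TIMEOUTS[status]
    return on_timeout if now - since > timeout else status


def needs_cleanup(status: str) -> bool:
    """True for the statuses whose listed action is 'Do cleanup'."""
    return status in TERMINAL
```

For example, a job that has been SUBMITTED for 25 hours moves to SUBMIT_TIMEOUT under this sketch, while a job in DATA for 30 minutes stays in DATA.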