
Restarting a failed job from a checkpoint

Only GA-MC jobs are currently supported for checkpoint restarts!

  • Example
    Emre, Jeremy:
    
    Not sure why this job crashed on laredo, I also got a message saying it was completed,
    but it didn't have the results attached.
    
    The logs seem to suggest a possible error on node 18 (which also went down a 
    couple of days ago):
    
    files missing:
    ----tail us_job110105093215.stderr --- from laredo.uthscsa.edu ---
    [compute-0-18:03219] [16] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a9561d8b4]
    [compute-0-18:03219] [17]
    /home/tigre/ultrascan/bin64/us_fe_nnls_t_mpi(__gxx_personality_v0+0x449) [0x41b6b9]
    [compute-0-18:03219] *** End of error message ***
    /home/tigre/.globus/efc682a0-18e0-11e0-b337-bbe9e59ed75b/scheduler_pbs_cmd_script:
    line 26:  3219 Aborted                 /home/tigre/ultrascan/bin64/us_fe_nnls_t_mpi
    "/home/tigre/tmp/110105093215/experiments110105093215.dat"
    "/home/tigre/tmp/11105093215/solutes110105093215.dat" "990"
    
    Jeremy, please look into this node to find out what the problem is with it.
    
    Emre: Can you please send me a set of instructions for restarting a GA-MC
    run from the last checkpoint? I want to be able to do this myself without 
    having to bug you each time.
    
    You can also put the instructions on the US2 wiki.
    
    Thanks, -Borries
    Forwarded message:
    > > From gridcontrol@ultrascan.uthscsa.edu  Wed Jan  5 23:02:45 2011
    > > X-Original-To: demeler@biochem.uthscsa.edu
    > > Delivered-To: demeler@biochem.uthscsa.edu
    > > X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on biochem.uthscsa.edu
    > > X-Spam-Level: *
    > > X-Spam-Status: No, score=1.5 required=5.0 tests=MSGID_FROM_MTA_HEADER
    > > 	autolearn=no version=3.2.5
    > > Message-Id: <37c250$7cruq4@mta2-hsc.uthscsa.edu>
    > > X-IronPort-AV: E=Sophos;i="4.60,281,1291615200"; 
    > >    d="scan'208";a="248380228"
    > > MIME-Version: 1.0
    > > Content-Disposition: inline
    > > Content-Transfer-Encoding: binary
    > > Content-Type: text/plain
    > > X-Mailer: MIME::Lite 3.023 (F2.74; A2.04; B3.07; Q3.07)
    > > Date: Wed, 5 Jan 2011 23:02:44 -0600
    > > From: gridcontrol@ultrascan.uthscsa.edu
    > > To: demeler@biochem.uthscsa.edu
    > > Cc: emre@biochem.uthscsa.edu, jeremy@biochem.uthscsa.edu, demeler@biochem.uthscsa.edu
    > > Subject: Your TIGRE job running on laredo.uthscsa.edu has failed
    > > X-Folder: Bulk
    > > X-UID: 10039                                        
    > > 
    > > 
    > > The TIGRE GA_RA-MC-50 job using experiment Yue_M5-1_TD40_0018v2chnA.veloc.31 you recently submitted to laredo.uthscsa.edu has failed.  
    > > Our staff has been informed of this incident and we will be looking into the cause of this error.
    > > 
    > > Some possibilities for this error are -
    > > 1. The job ran longer than expected and timed out
    > > 2. The computer(s) on the selected TIGRE cluster had a systems failure
    > > 3. An unexpected error occured in the analysis software
    > > 
    > > If you are in a hurry to get results - we suggest you resubmit your job to a 
    > > different TIGRE cluster.
    > > 
    > > We apologize for any inconvience this may have caused.
    > > 
    > > jid 110105093215
    > > 
    
  • Find the target cluster and the job ID (laredo and 110105093215 in this case).
  • Package up the experiments, solutes, and checkpoint .dat files. Of the checkpoint files, take the one with the highest trailing number (the -#.dat suffix).
    emre@bcf:~$ ssh laredo
    emre@laredo's password:
    Last login: Wed Jan  5 10:02:15 2011 from bcf.biochemistry.uthscsa.edu
    Rocks 5.1 (V.I)
    Profile built 15:16 26-Mar-2009
    
    Kickstarted 10:55 26-Mar-2009
    [emre@structure ~]$ sudo su - tigre
    [tigre@structure ~]$ cd tmp/110105093215
    [tigre@structure 110105093215]$ ls exp* sol* check*
    checkpoint-110105093206-10.dat  checkpoint-110105093206-2.dat  checkpoint-110105093206-4.dat  checkpoint-110105093206-6.dat  checkpoint-110105093206-8.dat  experiments110105093215.dat
    checkpoint-110105093206-1.dat   checkpoint-110105093206-3.dat  checkpoint-110105093206-5.dat  checkpoint-110105093206-7.dat  checkpoint-110105093206-9.dat  solutes110105093215.dat
    [tigre@structure 110105093215]$ tar zcf j110105093215.gz experiments110105093215.dat solutes110105093215.dat checkpoint-110105093206-10.dat
    [tigre@structure 110105093215]$ scp j110105093215.gz emre@bcf.uthscsa.edu:
    [tigre@structure 110105093215]$ exit
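
Picking the highest-numbered checkpoint by eye gets error-prone once a directory holds files from more than one run. The following is a minimal sketch of that selection, assuming the checkpoint-<timestamp>-<n>.dat naming seen in the listing above. The function name latest_checkpoint is hypothetical and not part of the UltraScan tools, and breaking ties in favor of the later timestamp is an assumption.

```shell
# Hypothetical helper: print the checkpoint file with the highest trailing
# number in a directory (default: the current directory).
# Assumes names of the form checkpoint-<timestamp>-<n>.dat.
latest_checkpoint() {
    dir="${1:-.}"
    ls "$dir" \
        | grep -E '^checkpoint-[0-9]+-[0-9]+\.dat$' \
        | sed -E 's/^checkpoint-([0-9]+)-([0-9]+)\.dat$/\2 \1 &/' \
        | sort -k1,1n -k2,2n \
        | tail -n 1 \
        | cut -d' ' -f3
}
```

Run in the job directory, `latest_checkpoint .` prints the file name to pass to tar. A plain alphabetical listing is not enough here because -10 sorts before -2.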
    
  • Put the job files on ranger. This has to be done as user emre on bcf, which is linked to the appropriate account on ranger.
    • Once on ranger, you can install your own bcf public key and then use your own bcf account for access.
    • Never ssh to ranger as user root! This will trip the firewall, and we will lose all access to TACC until it is reset.
      emre@bcf:~$ scp j110105093215.gz ranger.tacc.utexas.edu:
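
Since a single root login to ranger is costly, it may help to put a guard in front of any ssh or scp command. This is only a sketch: the function name not_root is hypothetical, and the optional argument exists purely so the check can be exercised without actually being root.

```shell
# Hypothetical guard: succeed only when the user ID is not 0 (root).
# With no argument it checks the current user via `id -u`.
not_root() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        echo "refusing to contact ranger as root" >&2
        return 1
    fi
    return 0
}

# Example use (not run here):
#   not_root && scp j110105093215.gz ranger.tacc.utexas.edu:
```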
      
  • Log in to ranger and set up the job
    emre@bcf:~$ ranger
    --------------------- Project balances for user tg457210 ----------------------
    | Name           Avail SUs     Expires | Name           Avail SUs     Expires |
    | TG-MCB070039N    2892175  2011-03-31 | ULTRASCAN         393048  2011-03-31 |
    ------------------------ Disk quotas for user tg457210 ------------------------
    | Disk         Usage (GB)     Limit    %Used   File Usage       Limit   %Used |
    | /share              4.1         6    68.46        63198      100000   63.20 |
    | /work              94.2       350    26.92       257960     2000000   12.90 |
    -------------------------------------------------------------------------------
    login4% bash
    
    tg457210@login3:~$ cd $WORK/rerun
    tg457210@login3:/work/00451/tg457210/rerun$ ls
    bjob  done  j10  j11  j12  j13  j2870  s.gz  specjob
    tg457210@login3:/work/00451/tg457210/rerun$ mkdir j110105093215
    tg457210@login3:/work/00451/tg457210/rerun$ cd j110105093215/
    tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ tar zxf ~/j110105093215.gz
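
Before building the job, it can be worth confirming that the archive really delivered all three kinds of file. The sketch below assumes the experiments*.dat, solutes*.dat, and checkpoint-*.dat names used in this example. The function name check_job_files is hypothetical.

```shell
# Hypothetical check: report any of the three required file kinds that is
# missing from a job directory (default: the current directory).
check_job_files() {
    dir="${1:-.}"
    status=0
    for pattern in 'experiments*.dat' 'solutes*.dat' 'checkpoint-*.dat'; do
        # $pattern is left unquoted on purpose so the glob expands in $dir.
        set -- "$dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $pattern" >&2
            status=1
        fi
    done
    return $status
}
```

Calling `check_job_files .` after the tar step returns non-zero and names the missing pattern if something was left out of the package.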
    
  • Make the job
    • Pass any email address as the argument; a name without an @ is taken to be at biochem.uthscsa.edu
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ mkjob.pl emre
      summary
      jid   j110105093215
      email emre@biochem.uthscsa.edu
      exp   experiments110105093215.dat
      sol   solutes110105093215.dat
      check checkpoint-110108113122-10.dat
      >bjob
      to submit the job: "$ qsub bjob"
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$
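
The email rule above (no @ implies biochem.uthscsa.edu) can be illustrated in a few lines. This is not the code inside mkjob.pl, which is not shown on this page; it is a sketch of the stated rule, and the function name expand_email is hypothetical.

```shell
# Hypothetical illustration of the defaulting rule: an argument without
# an @ gets the biochem.uthscsa.edu domain appended, anything else is
# passed through unchanged.
expand_email() {
    case "$1" in
        *@*) printf '%s\n' "$1" ;;
        *)   printf '%s@biochem.uthscsa.edu\n' "$1" ;;
    esac
}
```

For instance, `expand_email emre` gives the same address that appears on the email line of the summary above.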
      
  • submit the job
    tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ qsub bjob
    -------------------------------------------------------------------------
    ------- Welcome to TACC's Ranger System, an NSF TeraGrid Resource -------
    -------------------------------------------------------------------------
    --> Checking that you specified -V...
    --> Checking that you specified a time limit...
    --> Checking that you specified a queue...
    --> Setting project...
    --> Checking that you specified a parallel environment...
    --> Checking that you specified a valid parallel environment name...
    --> Checking that the minimum and maximum PE counts are the same...
    --> Checking that the number of PEs requested is valid...
    --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
    --> Requesting valid memory configuration (mt=31.3G)...
    --> Verifying WORK file-system availability...
    --> Verifying HOME file-system availability...
    --> Verifying SCRATCH file-system availability...
    --> Checking ssh setup...
    --> Checking that you didn't request more cores than the maximum...
    --> Checking that you don't already have the maximum number of jobs...
    --> Checking that your time limit isn't over the maximum...
    --> Checking available allocation...
    --> Submitting job...
    
    Your job 1755984 ("jrerun_j110105093215") has been submitted
    
  • monitor the job
    tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ showq -u
    ACTIVE JOBS--------------------------
    JOBID     JOBNAME    USERNAME      STATE   CORE  REMAINING  STARTTIME
    ================================================================================
    
         0 active jobs :    0 of 3930 hosts (  0.00 %)
    
    WAITING JOBS------------------------
    JOBID     JOBNAME    USERNAME      STATE   CORE  WCLIMIT    QUEUETIME
    ================================================================================
    1755984   jrerun_j11 tg457210      Waiting 48     48:00:00  Thu Jan  6 09:32:21
    
    WAITING JOBS WITH JOB DEPENDENCIES---
    JOBID     JOBNAME    USERNAME      STATE   CORE  WCLIMIT    QUEUETIME
    ================================================================================
    
    UNSCHEDULED JOBS---------------------
    JOBID     JOBNAME    USERNAME      STATE   CORE  WCLIMIT    QUEUETIME
    ================================================================================
    
    Total jobs: 1     Active Jobs: 0     Waiting Jobs: 1     Dep/Unsched Jobs: 0
    tg457210@login3:/work/00451/tg457210/rerun/j110105093215$
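
To check on one job without reading the whole listing, the showq output can be filtered by job ID. The sketch below assumes the column layout shown above, where the job ID is the first field and STATE is the fourth. The function name job_state is hypothetical.

```shell
# Hypothetical filter: read showq output on stdin and print the STATE
# column for the given job ID. Exits with status 1 if the job is not listed.
job_state() {
    awk -v id="$1" '
        $1 == id { print $4; found = 1; exit }
        END      { if (!found) exit 1 }
    '
}

# Example use (not run here):
#   showq -u | job_state 1755984
```

A job that no longer appears in the listing has left the queue, but that alone does not say whether it finished or was aborted; the emails described next are what tell the two apart.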
    
  • At this point, wait for the job to complete.
  • Job started email
    Subject: Job 1755984 (jrerun_j110105093215) Started
    Job 1755984 (jrerun_j110105093215) Started
     User       = tg457210
     Queue      = long
     Host       = i112-311.ranger.tacc.utexas.edu
     Start Time = 01/06/2011 09:53:00
    
  • Job aborted email: the job ran out of its 48-hour time limit
    Subject: Job 1755984 (jrerun_j110105093215) Aborted
    Job 1755984 (jrerun_j110105093215) Aborted
     Exit Status      = 0
     Signal           = KILL
     User             = tg457210
     Queue            = long@i112-311.ranger.tacc.utexas.edu
     Host             = i112-311.ranger.tacc.utexas.edu
     Start Time       = 01/06/2011 09:53:02
     End Time         = 01/08/2011 09:54:26
     CPU              = 00:01:57
     Max vmem         = 170.262M
    failed assumedly after job because:
    job 1755984.1 died through signal KILL (9)
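
In the email above, the Signal = KILL line together with an end time just over 48 hours after the start is what points to the time limit. The sketch below checks a saved copy of such an email for the KILL signal only, assuming the field layout shown above. The function name killed_by_signal is hypothetical, and a KILL on its own does not prove a timeout, so the start and end times still need a look.

```shell
# Hypothetical check: read an abort email on stdin and succeed if it
# reports that the job was ended by signal KILL.
killed_by_signal() {
    grep -Eq '^[[:space:]]*Signal[[:space:]]*=[[:space:]]*KILL[[:space:]]*$'
}

# Example use (not run here), with the email saved to a file:
#   killed_by_signal < aborted.txt && echo "restart from the last checkpoint"
```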
    
  • Restart the job from the last checkpoint:
    • Find the latest checkpoint (on ranger):
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ ls check*
      checkpoint-110105093206-10.dat  checkpoint-110106095339-13.dat  checkpoint-110106095339-17.dat  checkpoint-110106095339-21.dat  checkpoint-110106095339-25.dat  checkpoint-110106095339-29.dat  checkpoint-110106095339-33.dat
      checkpoint-110106095339-10.dat  checkpoint-110106095339-14.dat  checkpoint-110106095339-18.dat  checkpoint-110106095339-22.dat  checkpoint-110106095339-26.dat  checkpoint-110106095339-30.dat  checkpoint-110106095339-34.dat
      checkpoint-110106095339-11.dat  checkpoint-110106095339-15.dat  checkpoint-110106095339-19.dat  checkpoint-110106095339-23.dat  checkpoint-110106095339-27.dat  checkpoint-110106095339-31.dat  checkpoint-110106095339-35.dat
      checkpoint-110106095339-12.dat  checkpoint-110106095339-16.dat  checkpoint-110106095339-20.dat  checkpoint-110106095339-24.dat  checkpoint-110106095339-28.dat  checkpoint-110106095339-32.dat  checkpoint-110106095339-36.dat
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$
      
    • make a new bjob
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ mkjob.pl emre
      summary
      jid   j110105093215
      email emre@biochem.uthscsa.edu
      exp   experiments110105093215.dat
      sol   solutes110105093215.dat
      check checkpoint-110108113122-36.dat
      >bjob
      to submit the job: "$ qsub bjob"
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$
      
    • resubmit
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$ qsub bjob
      -------------------------------------------------------------------------
      ------- Welcome to TACC's Ranger System, an NSF TeraGrid Resource -------
      -------------------------------------------------------------------------
      --> Checking that you specified -V...
      --> Checking that you specified a time limit...
      --> Checking that you specified a queue...
      --> Setting project...
      --> Checking that you specified a parallel environment...
      --> Checking that you specified a valid parallel environment name...
      --> Checking that the minimum and maximum PE counts are the same...
      --> Checking that the number of PEs requested is valid...
      --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
      --> Requesting valid memory configuration (mt=31.3G)...
      --> Verifying WORK file-system availability...
      --> Verifying HOME file-system availability...
      --> Verifying SCRATCH file-system availability...
      --> Checking ssh setup...
      --> Checking that you didn't request more cores than the maximum...
      --> Checking that you don't already have the maximum number of jobs...
      --> Checking that your time limit isn't over the maximum...
      --> Checking available allocation...
      --> Submitting job...
      
      Your job 1759622 ("jrerun_j110105093215") has been submitted
      tg457210@login3:/work/00451/tg457210/rerun/j110105093215$
      
  • Repeat until the job finishes.
  • Resubmitted job started; email received:
    Subject: Job 1759622 (jrerun_j110105093215) Started
    Job 1759622 (jrerun_j110105093215) Started
     User       = tg457210
     Queue      = long
     Host       = i144-309.ranger.tacc.utexas.edu
     Start Time = 01/08/2011 11:31:09
    
Last modified on Jan 28, 2011 6:20:00 PM