Mechanism to Estimate HPC Job CPUTime Based on Job Histories.

I. Big Picture

  • A periodic (once a day? once a week?) cron task accumulates statistics on every job run from the LIMS server and adds/updates entries in a job_metrics table.
  • The job_metrics table holds time ranges and statistics for job types identified by fields such as
    • job type (2DSA, ..., DMGA) + (_Simp, _Refi, _MFit, _MC, _TrMC, ...)
    • extended job type (job type plus detailed characteristics)
    • iterations
    • dataset count
    • total points (scans times radial points)
    • computed noises (0, 1, 2)
    • masters group count (1,...,32; GA-MC or DMGA-MC only)
    • cluster
    • cputime statistics (min, max, average, median, standard deviation)
    • walltime statistics (min, max, average, median, standard deviation)
    • over-estimate time statistics (min, max, ...)
  • During job submission, the current job is compared to job_metrics entries.
    • If the extended job type and cluster match an entry, its max cputime is used.
    • If no match on extended job type and cluster is found, a max cputime is interpolated or computed.
    • The estimated CPUTime is the max cputime plus a pad (10%?), rounded up to the next multiple of 5 minutes.
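
The pad-and-round step above can be sketched as follows. This is only a minimal illustration: the function name, the use of integer seconds as the time unit, and the defaults are assumptions, and the 10% pad is still an open value in the text. A value already on a 5-minute boundary after padding is assumed to stay unchanged.

```python
def padded_estimate(max_cputime_s: int, pad_percent: int = 10, step_s: int = 300) -> int:
    """Return max cputime plus a pad, rounded up to a multiple of step_s seconds.

    Integer arithmetic is used so that exact multiples are not pushed up
    a step by floating-point error.
    """
    padded_x100 = max_cputime_s * (100 + pad_percent)  # padded time, scaled by 100
    steps = -(-padded_x100 // (100 * step_s))          # ceiling division
    return steps * step_s
```

For example, padded_estimate(3600) pads one hour to 3960 seconds and rounds it up to 4200 seconds (70 minutes).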

II. Details on job_metrics Table

  • The "jobtype" field is a concatenation of base type of analysis (2DSA, ...) with an analysis subtype:
    • "_Simp" == simple analysis with no iterations of any type
    • "_Refi" == refinement iterations
    • "_MFit" == meniscus fit
    • "_MC" == Monte Carlo iterations
    • "_Tr" == Tikhonov regularization
    • "_TrMC" == both Tr and MC
  • The "jobetype" field concatenates the jobtype string with additional substrings identifying numeric characteristics of the job:
    • computed noises == "_cn00", "_cn01", "_cn02"
    • datasets count == "_ds01", "_ds02", ..., "_ds50"
    • iterations == "_it001", "_it002", ..., "_it100", ...
    • points == "_p20000", ...
  • The "cnoises" field holds the number of computed noises (0, 1, 2).
  • The "datasets" field holds an integer number of datasets (1, ..., 50).
  • The "iterations" field holds an integer number of iterations (refinement, meniscus fit, MC).
  • The "points" field holds the dataset size in total number of points, the product of scans and radial points. If datasets>1, this is the maximum total points of any single dataset.
  • The "mgroupcount" field for GA-MC or DMGA-MC holds the masters group count.
  • The "cluster" field holds the actual cluster name ("alamo", "comet", ...) or "any".
  • A set of job time fields holds statistics (minimum, maximum, average, median, standard deviation) for CPUTime, WallTime, and Over-Estimate (the difference between a job's estimate and its actual CPUTime). The specific fields are as follows.
    • "cputime_max"
    • "cputime_min"
    • "cputime_avg"
    • "cputime_med"
    • "cputime_dev"
    • "walltime_max"
    • "walltime_min"
    • "walltime_avg"
    • "walltime_med"
    • "walltime_dev"
    • "overest_max"
    • "overest_min"
    • "overest_avg"
    • "overest_med"
    • "overest_dev"
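
As an illustration of the fields described above, the sketch below builds a jobetype string and fills one group of statistics fields. The function names are hypothetical. The zero-padding widths are inferred from the sample substrings in the list, and the unpadded points value and the use of the population standard deviation are assumptions, since the text does not specify either.

```python
import statistics

def build_jobetype(jobtype, cnoises, datasets, iterations, points):
    """Concatenate jobtype with the numeric substrings (_cnNN, _dsNN, _itNNN, _pN)."""
    return f"{jobtype}_cn{cnoises:02d}_ds{datasets:02d}_it{iterations:03d}_p{points}"

def summarize(prefix, values):
    """Return the five statistics fields for one time category.

    prefix is "cputime", "walltime", or "overest"; values is a non-empty
    list of times for one jobetype/cluster combination.
    """
    return {
        f"{prefix}_max": max(values),
        f"{prefix}_min": min(values),
        f"{prefix}_avg": statistics.mean(values),
        f"{prefix}_med": statistics.median(values),
        f"{prefix}_dev": statistics.pstdev(values),  # population std. dev. (assumed)
    }
```

With these assumptions, build_jobetype("2DSA_Simp", 2, 48, 1, 200000) returns "2DSA_Simp_cn02_ds48_it001_p200000".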

III. Details of job_metrics Creation and Use in Estimating

  • Every job run in the history of a LIMS server is evaluated to determine job characteristics, job times, and cluster.
  • An extended job type (jobetype) string is determined.
  • The job's time values are appended to the arrays of time values for its jobetype, once under the job's actual cluster and once under cluster="any".
  • At completion of the jobs scan, a job_metrics table entry is added or updated for each unique jobetype, both for each matching cluster and for cluster="any".
  • When a job is being submitted, its jobetype and cluster are used to search for a matching entry in the job_metrics table.
  • If a match to both jobetype and cluster is found, its cputime_max with a pad is used as the time estimate.
  • If only a jobetype match is found, the cputime_max is taken from an entry with matching jobetype and either (1) cluster="any" or (2) a cluster value with clear affinity to the job's cluster (e.g., "comet" and "jureca").
  • If no jobetype match is found, a time value is interpolated from job_metrics entries with jobtype match and an appropriate cluster match or affinity. The interpolation will use appropriate numeric noises/datasets/iterations/points values as the "X" value(s).
  • Note that the most common instance of this last case will likely be a jobtype match with matching noises/datasets/iterations, in which case the interpolation is based on data size (points).
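
The match order above might be sketched as follows for the most common fallback, interpolation on points. Everything here is a simplified assumption rather than the intended implementation: entries are plain dictionaries with "jobetype", "cluster", "points", and "cputime_max" keys, the interpolation is piecewise linear with the end segments extended outside the sampled range, and cluster affinity and interpolation over noises/datasets/iterations are not handled.

```python
def interpolate(curve, x):
    """Piecewise-linear interpolation over sorted (x, y) pairs.

    The first or last segment is extended linearly when x lies outside
    the sampled range.
    """
    if len(curve) < 2:
        raise ValueError("need at least two points")
    idx = 1
    while idx < len(curve) - 1 and curve[idx][0] < x:
        idx += 1
    (x0, y0), (x1, y1) = curve[idx - 1], curve[idx]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def lookup_cputime_max(entries, jobetype, cluster):
    """Return (cputime_max, how) using the match order described above."""
    by_key = {(e["jobetype"], e["cluster"]): e["cputime_max"] for e in entries}
    # 1. jobetype and cluster match; 2. jobetype match with cluster="any".
    for cl, how in ((cluster, "exact"), ("any", "any-cluster")):
        if (jobetype, cl) in by_key:
            return by_key[(jobetype, cl)], how
    # 3. No jobetype match: interpolate on points among entries that
    #    share everything before the "_p..." substring.
    prefix, _, pts = jobetype.rpartition("_p")
    for cl in (cluster, "any"):
        curve = sorted(
            (e["points"], e["cputime_max"])
            for e in entries
            if e["cluster"] == cl and e["jobetype"].startswith(prefix + "_p")
        )
        if len(curve) >= 2:
            return interpolate(curve, int(pts)), "interpolated"
    return None, "no-data"
```

The second element of the returned pair records which rule produced the value, which may be useful when deciding how large a pad to apply.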

IV. Example 1 of Job Time Estimate

  • A 2DSA simple analysis with both TI and RI noise computations is submitted to cluster "comet".
  • The job is a composite one with 48 datasets where the maximum size is 200,000 points (200 scans, 1000 radial points).
  • The jobtype value is "2DSA_Simp".
  • The jobetype value is "2DSA_Simp_cn02_ds48_it001_p200000".
  • The cluster value is "comet".
  • If a match to both jobetype and cluster is found, its cputime_max is used.
  • If matches to cluster and jobetype LIKE '2DSA_Simp_cn02_ds48_it001%' are found, the estimate will interpolate based on the points curve.
  • If matches to jobetype LIKE '2DSA_Simp_cn02_ds48_it001%' are found for clusters other than "comet", the estimate will interpolate based on the points curves and relative clusters factor.
  • Otherwise, matches to jobtype "2DSA_Simp" are found and the estimate will interpolate based on any relevant noises/datasets/iterations/points values.
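
The LIKE search in Example 1 could look like the sketch below. It uses an in-memory SQLite database purely as a stand-in for the LIMS database, a reduced table with only four columns, and invented cputime_max values, so all of these are assumptions for illustration. One detail worth noting is that "_" is itself a single-character wildcard in SQL LIKE, so the sketch escapes it to match the jobetype prefix literally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_metrics "
    "(jobetype TEXT, cluster TEXT, points INTEGER, cputime_max INTEGER)"
)
# Illustrative rows only; the times are made up for this sketch.
conn.executemany(
    "INSERT INTO job_metrics VALUES (?, ?, ?, ?)",
    [
        ("2DSA_Simp_cn02_ds48_it001_p100000", "comet", 100000, 5400),
        ("2DSA_Simp_cn02_ds48_it001_p300000", "comet", 300000, 16200),
        ("2DSA_Simp_cn02_ds48_it001_p200000", "alamo", 200000, 12000),
    ],
)

def points_curve(conn, prefix, cluster):
    """Return (points, cputime_max) rows whose jobetype starts with prefix."""
    # Escape LIKE wildcards so "_" and "%" in the prefix match literally.
    pattern = (
        prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_") + "%"
    )
    cur = conn.execute(
        "SELECT points, cputime_max FROM job_metrics "
        "WHERE jobetype LIKE ? ESCAPE '\\' AND cluster = ? ORDER BY points",
        (pattern, cluster),
    )
    return cur.fetchall()

curve = points_curve(conn, "2DSA_Simp_cn02_ds48_it001", "comet")
```

Here curve holds the two "comet" rows ordered by points, which bracket the 200,000-point job and can serve as the points curve for interpolation. Calling points_curve with another cluster name gives the data for the cross-cluster case.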

V. Example 2 of Job Time Estimate

  • A DMGA-MC analysis is submitted to cluster "lonestar".
  • The number of MC iterations is 100.
  • The single dataset has a size of 120,000 total points.
  • The mgroupcount value is 16.
  • The jobtype value is "DMGA_MC".
  • The jobetype value is "DMGA_MC_cn00_ds01_it100_p120000".
  • The cluster value is "lonestar".
  • If a match to all of jobetype, mgroupcount, and cluster is found, its cputime_max is used.
  • If matches to cluster and jobetype LIKE 'DMGA_MC_cn00_ds01_it100%' and mgroupcount=16 are found, the estimate will interpolate based on the points curve.
  • If matches to jobetype LIKE 'DMGA_MC_cn00_ds01_it100%' are found for clusters other than "lonestar", the estimate will interpolate based on the points curves, mgroupcount, and relative clusters factor.
  • Otherwise, matches to jobtype "DMGA_MC" are found and the estimate will interpolate based on any relevant noises/datasets/iterations/points/mgroupcount values and (possibly) relative clusters factor.

VI. Open Questions

  • Once the cluster has been chosen, should a LIMS option to plot statistics and manually select time estimate be allowed?
  • Should there be a standalone GUI program to plot job statistics?
  • Should the estimate apply some kind of date-time weighting to statistics (i.e., should more recent jobs bias statistics more than older ones)?
Last modified on Oct 1, 2015 5:37:43 PM