# Mechanism to Estimate HPC Job CPUTime Based on Job Histories

## I. Big Picture

- A periodic (once a day? once a week?) cron task accumulates statistics on every job run from the LIMS server and adds/updates entries in a job_metrics table.

- The job_metrics table holds time ranges and statistics for job types identified by fields such as:
  - job type (2DSA, ..., DMGA) + (_Simp, _Refi, _MFit, _MC, _TrMC, ...)
  - extended job type (job type plus detailed characteristics)
  - iterations
  - dataset count
  - total points (scans times radius points)
  - computed noises (0, 1, 2)
  - masters group count (1,...,32; GA-MC or DMGA-MC only)
  - cluster
  - cputime statistics (min, max, average, median, standard deviation)
  - walltime statistics (min, max, average, median, standard deviation)
  - over-estimate time statistics (min, max, ...)

- During job submission, the current job is compared to job_metrics entries.
  - If the extended job type and cluster match an entry, its max cputime is taken as the base value.
  - If no extended job type and cluster match is found, max cputime is interpolated or computed.
  - The estimated CPUtime is max cputime plus a pad (10%?), rounded up to the next multiple of 5 minutes.
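
The padding and rounding step can be sketched as follows. This is a minimal illustration, not the LIMS implementation: the function name, the use of integer seconds as the unit, and the choice to leave a value that is already an exact multiple of 5 minutes unchanged are all assumptions. The 10% pad is the tentative value mentioned above.

```python
def estimate_cputime_seconds(max_cputime_s, pad_percent=10, step_s=300):
    """Pad a max cputime and round up to a multiple of step_s (5 minutes).

    Integer arithmetic is used throughout so that, for example, a padded
    value of exactly 110 minutes is not pushed to 115 by float error.
    Units (seconds) and the default pad are assumptions for illustration.
    """
    padded_times_100 = max_cputime_s * (100 + pad_percent)
    # Ceiling division by (100 * step_s) without going through floats.
    steps = -(-padded_times_100 // (100 * step_s))
    return steps * step_s


# 100 minutes + 10% = 110 minutes, already a multiple of 5 minutes.
print(estimate_cputime_seconds(6000))
# 60 minutes + 10% = 66 minutes, rounded up to 70 minutes.
print(estimate_cputime_seconds(3600))
```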

## II. Details on job_metrics Table

- The "jobtype" field is a concatenation of the base type of analysis (2DSA, ...) with an analysis subtype:
  - "_Simp" == simple analysis with no iterations of any type
  - "_Refi" == refinement iterations
  - "_MFit" == meniscus fit
  - "_MC" == Monte Carlo iterations
  - "_Tr" == Tikhonov regularization
  - "_TrMC" == both Tr and MC

- The "jobetype" field concatenates the jobtype string with additional substrings identifying numeric characteristics of the job:
  - computed noises == "_cn00", "_cn01", "_cn02"
  - datasets count == "_ds01", "_ds02", ..., "_ds50"
  - iterations == "_it001", "_it002", ..., "_it100", ...
  - points == "_p20000", ...

- The "cnoises" field holds the number of computed noises (0, 1, 2).

- The "datasets" field holds an integer number of datasets (1, ..., 50).

- The "iterations" field holds an integer number of iterations (refinement, meniscus fit, MC).

- The "points" field holds the dataset size in total number of points, the product of scans times radial points. If datasets > 1, this is the largest total points value among the datasets.

- The "mgroupcount" field for GA-MC or DMGA-MC holds the masters group count.

- The "cluster" field holds the actual cluster name ("alamo", "comet", ...) or "any".

- A set of job time fields holds statistics (minimum, maximum, average, median, standard deviation) for CPUTime, WallTime, and Over-Estimate (the difference between a job's estimate and its actual CPUTime). The specific fields are as follows:
  - "cputime_max"
  - "cputime_min"
  - "cputime_avg"
  - "cputime_med"
  - "cputime_dev"
  - "walltime_max"
  - "walltime_min"
  - "walltime_avg"
  - "walltime_med"
  - "walltime_dev"
  - "overest_max"
  - "overest_min"
  - "overest_avg"
  - "overest_med"
  - "overest_dev"
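
The jobetype layout described above can be sketched as a pair of helper functions. The function names and the regular expression are hypothetical. The field widths (two digits for noises and datasets, three for iterations, no padding for points) are inferred from the sample substrings listed above rather than from a formal specification.

```python
import re


def build_jobetype(jobtype, cnoises, datasets, iterations, points):
    """Concatenate jobtype with the numeric-characteristic substrings."""
    return f"{jobtype}_cn{cnoises:02d}_ds{datasets:02d}_it{iterations:03d}_p{points}"


# Inverse of build_jobetype; the pattern itself is an assumption.
_JOBETYPE_RE = re.compile(
    r"^(?P<jobtype>.+)_cn(?P<cnoises>\d+)_ds(?P<datasets>\d+)"
    r"_it(?P<iterations>\d+)_p(?P<points>\d+)$"
)


def parse_jobetype(jobetype):
    """Split a jobetype string back into jobtype and integer fields."""
    match = _JOBETYPE_RE.match(jobetype)
    if match is None:
        raise ValueError(f"not a jobetype string: {jobetype!r}")
    fields = match.groupdict()
    return {k: (v if k == "jobtype" else int(v)) for k, v in fields.items()}


print(build_jobetype("2DSA_Simp", 2, 48, 1, 200000))
print(parse_jobetype("DMGA_MC_cn00_ds01_it100_p120000"))
```

Having the numeric fields recoverable from the string is what makes them usable as "X" values when an estimate has to be interpolated.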

## III. Details of job_metrics Creation and Use in Estimating

- Every job run in the history of a LIMS server is evaluated to determine job characteristics, job times, and cluster.

- An extended job type (jobetype) string is determined.

- Time values are appended to the arrays of time values for the matching jobetype, once under the job's actual cluster and once under cluster="any".

- At completion of the jobs scan, a job_metrics table entry is added or updated for each unique jobetype, both for the actual cluster and for cluster="any".

- When a job is being submitted, its jobetype and cluster are used to search for a matching entry in the job_metrics table.

- If a match to both jobetype and cluster is found, its cputime_max with a pad is used as the time estimate.

- If only a jobetype match is found, the cputime_max is taken from the entry with matching jobetype and either (1) cluster="any" or (2) an entry with a cluster value with clear affinity to the job's cluster (e.g., "comet" and "jureca").

- If no jobetype match is found, a time value is interpolated from job_metrics entries with jobtype match and an appropriate cluster match or affinity. The interpolation will use appropriate numeric noises/datasets/iterations/points values as the "X" value(s).

- Note that the most common instance of this last case will likely be a jobtype match with matching noises/datasets/iterations. In that situation the interpolation is based on data size (points).
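
The accumulation pass described in this section might look like the sketch below. The job records are plain dictionaries with hypothetical keys ("jobetype", "cluster", "cputime"), and only the cputime statistics are shown; walltime and over-estimate would follow the same pattern. Using the population standard deviation is an assumption, chosen here because it is defined even when a jobetype has been seen only once.

```python
from collections import defaultdict
import statistics


def accumulate_cputimes(jobs):
    """Group CPU times by (jobetype, cluster), also filing each job under "any"."""
    times = defaultdict(list)
    for job in jobs:
        for cluster in (job["cluster"], "any"):
            times[(job["jobetype"], cluster)].append(job["cputime"])
    return times


def summarize(values, prefix="cputime"):
    """Return the five statistics for one array of time values."""
    return {
        f"{prefix}_max": max(values),
        f"{prefix}_min": min(values),
        f"{prefix}_avg": statistics.fmean(values),
        f"{prefix}_med": statistics.median(values),
        # Population standard deviation (an assumption, see lead-in).
        f"{prefix}_dev": statistics.pstdev(values),
    }


def build_job_metrics(jobs):
    """One entry per unique (jobetype, cluster), including cluster="any"."""
    return {key: summarize(vals) for key, vals in accumulate_cputimes(jobs).items()}


# Made-up history, for illustration only.
history = [
    {"jobetype": "2DSA_Simp_cn02_ds01_it001_p100000", "cluster": "comet", "cputime": 100},
    {"jobetype": "2DSA_Simp_cn02_ds01_it001_p100000", "cluster": "alamo", "cputime": 300},
]
for key, entry in build_job_metrics(history).items():
    print(key, entry["cputime_max"], entry["cputime_avg"])
```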

## IV. Example 1 of Job Time Estimate

- A 2DSA simple analysis with both TI and RI noise computations is submitted to cluster "comet".

- The job is a composite one with 48 datasets where the maximum size is 200,000 points (200 scans, 1000 radial points).

- The jobtype value is "2DSA_Simp".

- The jobetype value is "2DSA_Simp_cn02_ds48_it001_p200000".

- The cluster value is "comet".

- If a match to both jobetype and cluster is found, its cputime_max is used.

- If matches to cluster and jobetype LIKE '2DSA_Simp_cn02_ds48_it001%' are found, the estimate will interpolate based on the points curve.

- If matches to jobetype LIKE '2DSA_Simp_cn02_ds48_it001%' are found for clusters other than "comet", the estimate will interpolate based on the points curves and relative clusters factor.

- Otherwise, matches to jobtype "2DSA_Simp" are found and the estimate will interpolate based on any relevant noises/datasets/iterations/points values.
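
The first two steps of this example (exact match, then interpolation along the points curve for the same cluster) can be sketched as below. The function names and the in-memory dictionary standing in for the job_metrics table are hypothetical. The interpolation rule is also an assumption: piecewise linear between known points, extended linearly outside the known range, and scaled proportionally to points when only one entry exists. The cross-cluster and jobtype-only fallbacks are left out because the relative clusters factor is not defined here.

```python
from bisect import bisect_left


def split_points(jobetype):
    """Split "..._it001_p200000" into ("..._it001", 200000)."""
    prefix, _, pts = jobetype.rpartition("_p")
    return prefix, int(pts)


def interpolate(curve, x):
    """curve: sorted list of (points, cputime_max) pairs with distinct points."""
    if len(curve) == 1:
        x0, y0 = curve[0]
        return y0 * x / x0  # proportional scaling (assumption)
    xs = [p for p, _ in curve]
    # Pick the bracketing segment, or the nearest end segment if x is outside.
    i = min(max(bisect_left(xs, x), 1), len(curve) - 1)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def lookup_cputime_max(metrics, jobetype, cluster):
    """metrics maps (jobetype, cluster) -> cputime_max."""
    if (jobetype, cluster) in metrics:
        return metrics[(jobetype, cluster)]
    prefix, points = split_points(jobetype)
    curve = sorted(
        (split_points(je)[1], t)
        for (je, cl), t in metrics.items()
        if cl == cluster and split_points(je)[0] == prefix
    )
    if curve:
        return interpolate(curve, points)
    return None  # later fallbacks not covered by this sketch


# Made-up entries, for illustration only.
metrics = {
    ("2DSA_Simp_cn02_ds48_it001_p100000", "comet"): 1000,
    ("2DSA_Simp_cn02_ds48_it001_p300000", "comet"): 3000,
}
print(lookup_cputime_max(metrics, "2DSA_Simp_cn02_ds48_it001_p200000", "comet"))
```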

## V. Example 2 of Job Time Estimate

- A DMGA-MC analysis is submitted to cluster "lonestar".

- The number of MC iterations is 100.

- The single dataset has a size of 120,000 total points.

- The mgroupcount value is 16.

- The jobtype value is "DMGA_MC".

- The jobetype value is "DMGA_MC_cn00_ds01_it100_p120000".

- The cluster value is "lonestar".

- If a match to all of jobetype, mgroupcount, and cluster is found, its cputime_max is used.

- If matches to cluster and jobetype LIKE 'DMGA_MC_cn00_ds01_it100%' and mgroupcount=16 are found, the estimate will interpolate based on the points curve.

- If matches to jobetype LIKE 'DMGA_MC_cn00_ds01_it100%' are found for clusters other than "lonestar", the estimate will interpolate based on the points curves, mgroupcount, and relative clusters factor.

- Otherwise, matches to jobtype "DMGA_MC" are found and the estimate will interpolate based on any relevant noises/datasets/iterations/points/mgroupcount values and (possibly) relative clusters factor.
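
The LIKE-based search with an mgroupcount filter can be tried out against a throwaway table. The sketch below uses Python's built-in sqlite3 module as a stand-in for the LIMS database, with a reduced, hypothetical table layout and made-up rows. One detail is worth noting: in SQL LIKE patterns, "_" matches any single character, so the underscores in a jobetype prefix act as wildcards unless they are escaped. The sketch escapes them, and it also extends the prefix through "_p" so that "it100" does not match "it1000"; both refinements are suggestions, not part of the design above.

```python
import sqlite3


def like_prefix(prefix):
    """Escape LIKE wildcards in prefix and append a trailing %."""
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escaped + "%"


def points_curve(conn, prefix, cluster, mgroupcount):
    """Return sorted (points, cputime_max) pairs for matching entries."""
    rows = conn.execute(
        "SELECT jobetype, cputime_max FROM job_metrics "
        "WHERE cluster = ? AND mgroupcount = ? "
        "AND jobetype LIKE ? ESCAPE '\\'",
        (cluster, mgroupcount, like_prefix(prefix)),
    ).fetchall()
    return sorted((int(je.rpartition("_p")[2]), t) for je, t in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_metrics "
    "(jobetype TEXT, cluster TEXT, mgroupcount INTEGER, cputime_max INTEGER)"
)
conn.executemany(
    "INSERT INTO job_metrics VALUES (?, ?, ?, ?)",
    [
        ("DMGA_MC_cn00_ds01_it100_p100000", "lonestar", 16, 5000),
        ("DMGA_MC_cn00_ds01_it100_p150000", "lonestar", 16, 7500),
        ("DMGA_MC_cn00_ds01_it100_p100000", "lonestar", 8, 9000),   # other mgroupcount
        ("DMGA_MC_cn00_ds01_it100_p100000", "comet", 16, 4000),     # other cluster
        ("DMGA_MC_cn00_ds01_it1000_p100000", "lonestar", 16, 50000),  # other iterations
    ],
)
curve = points_curve(conn, "DMGA_MC_cn00_ds01_it100_p", "lonestar", 16)
print(curve)
```

The resulting list of (points, cputime_max) pairs is the "points curve" along which the 120,000-point job of this example would be interpolated.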

## VI. Open Questions

- Once the cluster has been chosen, should LIMS offer an option to plot statistics and manually select the time estimate?

- Should there be a standalone GUI program to plot job statistics?

- Should the estimate apply some kind of date-time weighting to statistics (i.e., should more recent jobs bias statistics more than older ones)?

Last modified on Oct 1, 2015 5:37:43 PM