Difference between revisions of "Held Job Troubleshooting"

Latest revision as of 23:09, 15 April 2017

Overview

Various administrative scripts run automatically on the cluster. One of them examines running jobs and holds any that meet certain conditions. If you find that your job has been held, shown by a status of 'H', use the following guidelines to understand why.

Viewing Job ClassAds

When Condor holds a job, the job's 'HoldReason' ClassAd can be set to explain the cause. To list a job's ClassAds and show only those mentioning the hold reason, use the command

condor_q -l <jobid> | grep HoldReason
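
If the output of the command above is captured by a script, the hold reason can be pulled out programmatically. The following is a minimal sketch, not part of the cluster's tooling: the function name `extract_hold_reason` and the sample text (user, job id, and counts) are made up for illustration, and it assumes the long listing prints attributes as `Name = "value"` lines. Matching `HoldReason` followed by `=` skips related attributes such as `HoldReasonCode`, which a plain `grep HoldReason` would also show.

```python
import re

def extract_hold_reason(classad_text):
    """Return the HoldReason string from `condor_q -l` style text, or None."""
    for line in classad_text.splitlines():
        # Require '=' right after the name so HoldReasonCode is not matched.
        match = re.match(r'\s*HoldReason\s*=\s*"(.*)"\s*$', line)
        if match:
            return match.group(1)
    return None

# Hypothetical sample output; real values will differ.
sample = (
    'HoldReasonCode = 26\n'
    'HoldReason = "alice job 123.0 removed because its RunCount 100 > 99"\n'
)
print(extract_hold_reason(sample))
```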

Hold Reasons

Over Maximum Run Count

The ClassAd HoldReason states

<user> job <jobid> removed because its RunCount # > 99

This means that your job has started # times, which is more than the maximum number of starts allowed. Typically, this indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails.

Used More Memory Than Requested

The ClassAd HoldReason states

<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <RequestedMemory> * 1.2

This means that your job used more memory than the default minimum memory (1200 in the message above) and also exceeded the requested memory scaled by a factor of 1.2. If a user does not explicitly request memory, the requested value is calculated by a formula in Condor.

The user should either

  1. Request memory slightly larger than the used memory OR
  2. Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps
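
The two-part condition in the hold message above can be sketched as a small check. This is only an illustration of the arithmetic, assuming the message reads as "usage above 1200 and usage above 1.2 times the request"; the function name and the example numbers are hypothetical, and all values are taken to be in the same units as the message.

```python
def exceeds_memory_request(memory_usage, requested_memory,
                           floor=1200, factor=1.2):
    """True if usage is above both the floor and factor * requested memory."""
    return memory_usage > floor and memory_usage > requested_memory * factor

# Hypothetical jobs: (usage, requested)
print(exceeds_memory_request(1500, 1000))  # above 1200 and above 1000 * 1.2
print(exceeds_memory_request(1100, 500))   # above the request, below the floor
print(exceeds_memory_request(1300, 2000))  # above the floor, within the request
```

Under this reading, a job is only held when both comparisons are true, so raising the requested memory above the observed usage is enough to avoid this particular hold.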

Used More Memory Than Slot Memory Allocation

The ClassAd HoldReason states

<user> job <jobid> put on hold because its MemoryUsage # > <SlotMemory> * 1.2 + 500 (by user condor)

This means that your job used more memory than the slot's allocated memory scaled by a factor of 1.2, plus an additional 500.

The user should either

  1. Request memory slightly larger than the used memory OR
  2. Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps
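
As a rough illustration of the slot limit in the message above, the threshold can be computed as follows. This sketch assumes ordinary operator precedence, that is, slot memory times 1.2 and then plus 500; the function names and the slot size of 2048 are hypothetical.

```python
def slot_memory_limit(slot_memory, factor=1.2, margin=500):
    """Largest usage tolerated for a slot: slot_memory * factor + margin."""
    return slot_memory * factor + margin

def exceeds_slot_memory(memory_usage, slot_memory):
    """True if usage is above the computed slot limit."""
    return memory_usage > slot_memory_limit(slot_memory)

# Hypothetical slot with 2048 allocated: the limit is roughly 2957.6
print(slot_memory_limit(2048))
print(exceeds_slot_memory(3000, 2048))
print(exceeds_slot_memory(2900, 2048))
```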

Used More Disk Than Requested

The ClassAd HoldReason states

<user> job <jobid> removed because its RequestDisk # > 12000000 and # > <RequestedDisk> * 1.2

This means that your job used more disk than the default minimum disk space (12000000 in the message above) and also exceeded the requested disk scaled by a factor of 1.2. If a user does not explicitly request disk, the requested value is calculated by a formula in Condor.

The user should either

  1. Request disk slightly larger than the used disk OR
  2. Alter the code to use less disk space.
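
The disk condition follows the same pattern as the memory request check. The sketch below only restates the comparison from the hold message, assuming it reads as "usage above 12000000 and usage above 1.2 times the request"; the function name and example numbers are hypothetical, and all values are taken to be in the same units as the message.

```python
def exceeds_disk_request(disk_usage, requested_disk,
                         floor=12_000_000, factor=1.2):
    """True if usage is above both the floor and factor * requested disk."""
    return disk_usage > floor and disk_usage > requested_disk * factor

# Hypothetical jobs: (usage, requested)
print(exceeds_disk_request(15_000_000, 10_000_000))  # above both thresholds
print(exceeds_disk_request(11_000_000, 1_000_000))   # below the floor
print(exceeds_disk_request(13_000_000, 20_000_000))  # within the request
```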