Difference between revisions of "Held Job Troubleshooting"

From Statistics Cluster

Revision as of 20:34, 5 April 2017

Overview

Various administrative scripts run automatically on the cluster. If you find that your job has been held (shown with the status 'H'), use the following guidelines to understand why.

Viewing Job ClassAds

When Condor holds a job, the ClassAd attribute 'HoldReason' can be set to explain the cause. To see the HoldReason of a job, use the command

condor_q -l <jobid> | grep HoldReason
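If you save the long-form output to a file or capture it in a script, the HoldReason line can also be picked out programmatically. The following is a minimal sketch, assuming the output consists of `Name = value` lines with string values in double quotes; the function name and the sample text are made up for illustration.

```python
import re

def extract_hold_reason(classad_text):
    """Return the HoldReason string from long-form ClassAd text, or None if absent."""
    for line in classad_text.splitlines():
        # Match only the HoldReason attribute itself, not e.g. HoldReasonCode.
        match = re.match(r'\s*HoldReason\s*=\s*"(.*)"\s*$', line)
        if match:
            return match.group(1)
    return None

# Made-up sample in the assumed `Name = value` layout.
sample = '''ClusterId = 1234
HoldReason = "alice job 1234.0 removed because its RunCount 100 > 99"
HoldReasonCode = 26
JobStatus = 5'''

print(extract_hold_reason(sample))
```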

Hold Reasons

Over Maximum Run Count

The ClassAd HoldReason states

<user> job <jobid> removed because its RunCount # > 99

This means that your job has started # times, which is more than the maximum allowed number of restarts (99). Typically, this indicates a problem with the job, and the job should be removed. Examine the code to find out why it continually fails.

Used More Memory Than Requested

The ClassAd HoldReason states

<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <RequestedMemory> * 1.2

This means that your job used more memory than the default minimum memory (1200) and also exceeded the requested memory multiplied by 1.2. If a user does not explicitly request memory, the requested value is calculated by a formula in Condor.

The user should either

  1. Request memory slightly larger than the used memory, OR
  2. Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps.
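The condition in the hold message can be checked by hand before resubmitting. The sketch below is an illustration only: the function and parameter names are hypothetical, and it assumes both values are in the same units that the hold message reports.

```python
def exceeds_requested_memory(memory_usage, requested_memory,
                             floor=1200, factor=1.2):
    """True if usage is above the floor AND above requested memory times the factor."""
    return memory_usage > floor and memory_usage > requested_memory * factor

# Usage 1500 with a request of 1000: above 1200 and above 1000 * 1.2, so held.
print(exceeds_requested_memory(1500, 1000))
# Usage 1100 never triggers the hold because it is below the 1200 floor.
print(exceeds_requested_memory(1100, 500))
# Usage 1300 with a request of 1200 stays under 1200 * 1.2, so not held.
print(exceeds_requested_memory(1300, 1200))
```

Both comparisons must be true for the hold to apply, so a job with a small footprint is not held for this reason even if it requested very little memory.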

Used More Memory Than Slot Memory Allocation

The ClassAd HoldReason states

<user> job <jobid> removed because its MemoryUsage # > 1200 and # > <SlotMemory> * 1.2 + 500

This means that your job used more memory than the default minimum memory (1200) and also exceeded the allocated slot memory multiplied by 1.2, plus an additional 500.

The user should either

  1. Request memory slightly larger than the used memory, OR
  2. Alter the code to produce a smaller memory footprint. This might involve breaking the code into smaller steps.
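To get a feel for how much headroom a slot gives, the two comparisons can be folded into a single limit. This is a rough sketch, assuming the message is read as usage > (slot memory * 1.2) + 500; the function name is hypothetical and the units are whatever the hold message uses.

```python
def slot_memory_limit(slot_memory, floor=1200, factor=1.2, margin=500):
    """Largest memory usage that would NOT trigger the slot-memory hold."""
    # The hold needs usage above both values, so the larger one is the limit.
    return max(floor, slot_memory * factor + margin)

# A slot with 2000 allocated tolerates usage up to about 2900.
print(slot_memory_limit(2000))
# For a small slot the 1200 floor is the effective limit.
print(slot_memory_limit(500))
```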

Used More Disk Than Requested

The ClassAd HoldReason states

<user> job <jobid> removed because its RequestDisk # > 12000000 and # > <RequestedDisk> * 1.2

This means that your job used more disk than the default minimum disk space (12000000) and also exceeded the requested disk multiplied by 1.2. If a user does not explicitly request disk, the requested value is calculated by a formula in Condor.

The user should either

  1. Request disk slightly larger than the used disk OR
  2. Alter the code to use less disk space.
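As a worked illustration of the disk condition, the sketch below computes the smallest disk request that would keep a given usage from triggering this hold. The function name is hypothetical, the units are assumed to match the hold message, and the factor 1.2 is written as the fraction 6/5 so the arithmetic stays in integers. In practice, requesting more than the observed usage, as recommended above, leaves a safer margin than this lower bound.

```python
def min_request_disk_to_avoid_hold(disk_usage, floor=12_000_000):
    """Smallest integer request for which disk_usage <= request * 1.2."""
    if disk_usage <= floor:
        # Usage at or below the floor never triggers this hold.
        return 0
    # Ceiling of disk_usage * 5 / 6, computed with integer floor division.
    return -(-disk_usage * 5 // 6)

print(min_request_disk_to_avoid_hold(15_000_000))
print(min_request_disk_to_avoid_hold(10_000_000))
```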