8.3 Development Release Series 7.7
This is the development release series of Condor.
The details of each version are described below.
Version 7.7.3
Release Notes:
- Condor version 7.7.3 not yet released.
New Features:
- condor_userprio supports two new flags. The -grouporder flag displays submitter
entries for accounting groups at the top of the list, in breadth-first order by group
hierarchy. The -grouprollup flag reports accounting statistics for groups rolled up
by group hierarchy.
(Ticket #1926).
- condor_collector now avoids performance problems that happened
previously when clients initiated communication with the collector but
then delayed sending input.
(Ticket #2506).
- When using versions of glexec that create a copy of the proxy for use
by the job, Condor now ensures that this copy of the proxy is cleaned up
when the job is done.
(Ticket #2501).
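The ordering and rollup described for the -grouporder and -grouprollup flags of condor_userprio can be pictured with a small Python sketch. This is an illustration only, not Condor's implementation: the group names, the usage numbers, and the convention that a dotted name encodes the hierarchy are all assumptions made for the example.

```python
from collections import deque

# Hypothetical accounting data: group name -> usage accrued directly by that group.
# Assumption: dotted names encode the hierarchy ("a.b" is a child of "a").
usage = {
    "physics": 10.0,
    "physics.lhc": 4.0,
    "physics.lhc.atlas": 1.5,
    "chem": 3.0,
    "chem.lab1": 2.0,
}

def children(group, names):
    """Return the direct children of `group` (None means top level), sorted by name."""
    depth = 0 if group is None else group.count(".") + 1
    prefix = "" if group is None else group + "."
    return sorted(n for n in names if n.count(".") == depth and n.startswith(prefix))

def breadth_first_order(names):
    """List groups level by level: top-level groups first, then their children."""
    order, queue = [], deque(children(None, names))
    while queue:
        group = queue.popleft()
        order.append(group)
        queue.extend(children(group, names))
    return order

def rollup(usage):
    """Add each group's usage into every ancestor, giving per-subtree totals."""
    totals = dict(usage)
    for name, value in usage.items():
        parts = name.split(".")
        for i in range(1, len(parts)):
            ancestor = ".".join(parts[:i])
            totals[ancestor] = totals.get(ancestor, 0.0) + value
    return totals

order = breadth_first_order(usage)
totals = rollup(usage)
print(order)
print(totals)
```

In this sketch a parent group's rolled-up figure is its own usage plus that of all of its descendants, while the breadth-first listing shows every group at one level before any group at the next.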
Configuration Variable and ClassAd Attribute Additions and Changes:
- A new configuration variable NEGOTIATOR_SLOT_CONSTRAINT
can contain an expression which is passed from the Negotiator to the
Collector when it fetches ClassAds for the negotiation cycle.
(Ticket #2277).
- A new configuration variable NEGOTIATOR_SLOT_POOLSIZE_CONSTRAINT
replaces GROUP_DYNAMIC_MACH_CONSTRAINT , which has been retained as an
alternate name. The pool size resulting from applying this constraint is used
to determine quotas both for dynamic groups and when there are no groups.
(Ticket #2277).
- The configuration variable NEGOTIATOR_STARTD_CONSTRAINT_REMOVE
was made obsolete by NEGOTIATOR_SLOT_CONSTRAINT so it has been removed.
(Ticket #2277).
- The configuration variables IGNORE_NFS_LOCK_ERRORS
and BIND_ALL_INTERFACES no longer support the undocumented values
'Y' or 'y' to mean True.
Bugs Fixed:
Known Bugs:
Additions and Changes to the Manual:
Version 7.7.2
Release Notes:
- Condor version 7.7.2 not yet released.
This release contains all features and bug fixes from Condor version 7.6.4
as are currently documented (section 8.4) in this manual.
New Features:
- There is command line support to both suspend and continue jobs.
The new tools condor_suspend and condor_continue will
suspend and continue running jobs.
(Ticket #2368).
- The EC2 GAHP now supports X.509 for connecting to and authenticating
with EC2 services. See section 5.3.7 for details
on using the X.509 protocol.
(Ticket #2084).
- Previously, the dedicated scheduler attempted to change the
Scheduler attribute on all parallel job processes in a durable fashion,
resulting in an fsync() for each process.
This has been changed to be not durable,
thereby improving the scalability by reducing the
number of fsync() calls without impacting correctness.
(Ticket #2367).
- In PrivSep mode, the error message produced when Condor fails to
switch to the user account chosen for running the job
has been improved to make debugging easier.
The error message now distinguishes between safety check failures
for the UID, tracking group ID, primary group ID, and supplementary group IDs.
(Ticket #2364).
- The name of the user used to execute the job is now logged in
the condor_starter log, except when using glexec.
(Ticket #2268).
- condor_dagman now defaults to writing a partial DAG file
for a Rescue DAG,
as opposed to a full DAG file.
The Rescue DAG file is parsed in combination with the original DAG file,
meaning that any
changes to the original DAG input file take effect when running a Rescue DAG.
(Ticket #2165).
- The behavior of DAGMan is changed, such that, by default,
POST scripts will be run regardless of the return value from
the PRE script of the same node as described in section 2.10.2.
The previous behavior of not running the POST script can be restored
either by adding the -AlwaysRunPost option to the condor_submit_dag
command line,
or by setting the new configuration variable
DAGMAN_ALWAYS_RUN_POST to False,
as defined in section 3.3.26.
(Ticket #2057).
- A matchmaking optimization has significantly improved the speed
of matching,
when there are machines with many slots.
(Ticket #2403).
- When the condor_schedd is starting up and it encounters corruption
in its job transaction log, the error message in the log file now reports
the offset within the file at which the error occurred.
(Ticket #2450).
- DAGMan will now copy PRIORITY values from the DAG input file to
the JobPrio attribute in the job ClassAd.
Furthermore, the PRIORITY values are propagated to child nodes and SUBDAGs,
so that each child node always has a priority at least as high as
the maximum of the priorities of its parents.
This has been a cause of confusion for DAGMan users.
(Ticket #2167).
- The condor_startd now logs a clear message if it rejects a job
because no valid starters were detected.
(Ticket #2470).
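The PRIORITY propagation rule described above for DAGMan (a child node ends up with a priority at least as high as the maximum of its parents' priorities) can be sketched in a few lines of Python. This is a model of the stated rule, not DAGMan's code. The node names and values are hypothetical, and treating a node with no PRIORITY value as priority 0 is an assumption of the sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: node name -> set of parent node names.
parents = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

# Hypothetical PRIORITY values from a DAG input file.
# Assumption: a node with no PRIORITY value is treated as 0.
declared = {"A": 5, "C": 20, "D": 1}

def effective_priorities(parents, declared):
    """Give each node max(its own value, the effective values of its parents)."""
    effective = {}
    # static_order() yields every node after all of its parents.
    for node in TopologicalSorter(parents).static_order():
        own = declared.get(node, 0)
        inherited = [effective[p] for p in parents[node]]
        effective[node] = max([own] + inherited)
    return effective

job_prio = effective_priorities(parents, declared)
print(job_prio)
```

Here node D declares a priority of 1 but ends up with 20, because its parent C has priority 20; node B declares nothing and inherits 5 from A.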
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new job ClassAd attribute PreserveRelativeExecutable,
when set to True, will prevent the condor_starter from
prepending Iwd to the command executable Cmd,
when Cmd is a relative path name and TransferExecutable
is False.
(Ticket #2460).
- Attributes have been added to all daemons to publish statistics
about the number of timer, signal, socket, and pipe messages
that have been handled, as well as the amount of time spent handling them.
Statistics attributes for DaemonCore
have names that begin with DC or RecentDC.
(Ticket #2354).
- The default value of OpSys on Windows machines has been changed
to "WINDOWS", and a new attribute OpSysVer has been added
that contains the version number of the operating system.
This behavior is controlled by a new configuration variable
ENABLE_VERSIONED_OPSYS which defaults to False on Windows
and to True on other platforms.
The new machine ClassAd attribute OpSys_And_Ver will always contain
the versioned operating system.
Note that this change could cause problems with mixed pools,
because Condor version 7.7.2 condor_submit may add OpSys="WINDOWS",
but machines running Condor versions prior to 7.7.2 will be publishing
a versioned OpSys value,
unless there is an override in the configuration.
(Ticket #2366).
- Configuration variable COLLECTOR_ADDRESS_FILE is now set
in the example configuration,
similar to MASTER_ADDRESS_FILE.
This configuration variable is required when COLLECTOR_HOST
has the port set to 0, which means to select any available port.
In other environments, it should have no visible impact.
(Ticket #2375).
- Attributes have been added to the condor_schedd
to publish aggregate statistics
about jobs that are running and have completed, as well as counts of various
failures.
(Ticket #2197).
- The new configuration variable DAGMAN_WRITE_PARTIAL_RESCUE
enables the new feature of writing a partial DAG file, instead of a full
DAG input file, as a Rescue DAG.
See section 3.3.26 for a definition.
Also, the configuration variable
DAGMAN_OLD_RESCUE no longer exists,
as it is incompatible with the implementation of partial Rescue DAGs.
(Ticket #2165).
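The effect of the PreserveRelativeExecutable attribute described above can be summarized as a small decision function. The Python below is a sketch of the rule as stated in these notes, not the condor_starter's actual logic. The function name, the example paths, and the defaults used for missing attributes are assumptions, and the case where TransferExecutable is True is deliberately not modelled.

```python
import posixpath

def resolve_executable(ad):
    """Return the executable path for a job ad (a plain dict in this sketch)."""
    cmd = ad["Cmd"]
    is_relative = not posixpath.isabs(cmd)
    # Assumed defaults for this sketch: TransferExecutable True,
    # PreserveRelativeExecutable False.
    if is_relative and not ad.get("TransferExecutable", True):
        if ad.get("PreserveRelativeExecutable", False):
            return cmd                          # leave the relative path alone
        return posixpath.join(ad["Iwd"], cmd)   # prepend Iwd
    return cmd  # absolute paths and transferred executables: not modelled here

# Hypothetical job ads.
base = {"Cmd": "bin/run.sh", "Iwd": "/home/alice/job", "TransferExecutable": False}
default_path = resolve_executable(base)
preserved_path = resolve_executable({**base, "PreserveRelativeExecutable": True})
print(default_path)
print(preserved_path)
```

With the attribute unset, the relative Cmd is resolved against Iwd; with it set to True, the relative name is passed through unchanged.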
Bugs Fixed:
- Fixed a bug introduced in Condor version 7.7.1,
in the standard universe,
where the getdirentries() call failed during remote I/O situations.
(Ticket #2467).
- Fixed a bug in the condor_startd that was preventing dynamic slots
from being properly instantiated from partitionable slots.
(Ticket #2507).
- Fixed a bug introduced in Condor version 7.7.0,
in which the condor_startd may erroneously report
Can't find hostname of client machine.
In cases where Condor was unable to identify the host name,
the ClientMachine
attribute in the machine ClassAd would have gone unset.
(Ticket #2382).
- Fixed a bug existing since April 2001,
in which on start up of the condor_schedd, with parallel universe jobs,
the job queue sanity checking code would change the Scheduler
attribute on jobs,
only to have the attribute changed later by the dedicated scheduler.
(Ticket #2367).
- Machine ClassAds with the Offline attribute set to True,
and with neither MyType nor TargetType
attributes defined caused
the condor_collector to fail to start when it was next restarted.
(Ticket #2417).
- Fixed a file descriptor leak in the EC2 GAHP,
which would cause grid-type ec2 jobs to become held.
The HoldReason for most such jobs would be
Unable to read from accesskey file.
(Ticket #2447).
- Fixed a bug that could cause a job's standard output and error to
be written to the wrong location when should_transfer_files was
set to IF_NEEDED,
and the job ran on a machine where file transfer was not needed.
If the standard output or error file names contained any path information,
the output would be written to _condor_stdout or
_condor_stderr in the job's initial working directory.
(Ticket #1811).
- Fixed a bug introduced in Condor version 7.7.1
that could cause the condor_schedd daemon to crash after
failing to expand a
$$ macro in the job ClassAd.
(Ticket #2491).
Known Bugs:
- In Condor version 7.7.2,
the Condor daemons on Linux platforms rely on shared libraries.
A bug in Condor version 7.7.1 and all previous versions of Condor
prevents a 7.7.1 condor_master from starting 7.7.2 or later daemons.
This also means that a 7.7.1 condor_master cannot upgrade itself to
version 7.7.2.
If a 7.7.1 condor_master binary is replaced with
a 7.7.2 condor_master binary,
Condor will shut down and will need to be restarted by hand.
Additions and Changes to the Manual:
Version 7.7.1
Release Notes:
- Condor version 7.7.1 released on September 12, 2011.
This developer release contains all bug fixes from Condor version 7.6.3.
New Features:
- Condor now dynamically links with the OpenSSL and Kerberos security
libraries, and Condor will use the operating system's version of these
libraries, when they are available.
The tarball release of Condor on Linux platforms includes
a copy of these libraries.
If the operating system's version is incompatible with Condor,
Condor will use its own copy instead.
Condor's copy of these libraries is located under lib/condor/.
To prevent Condor from considering using them, delete these libraries.
(Ticket #1874).
- The ClassAd language now has an unparse() function.
It converts an expression into a string,
which is handy with the new eval() function.
(Ticket #1613).
- The new job ClassAd attribute KeepClaimIdle is defined with an integer
number of seconds in the job submit description file, as in this example:
+KeepClaimIdle = 300
If set, then when the job exits,
if there are no other jobs immediately ready to run for this user,
the condor_schedd daemon,
instead of relinquishing the claim back to the condor_negotiator,
will keep the claim for the specified number of seconds.
This is useful if another job will be arriving soon,
which can happen with linear DAGs.
The condor_startd slot
will go to the Claimed Idle state for at least that many seconds until
either a new job arrives or the timeout occurs.
See the unnumbered Appendix A for a complete definition of this
job ClassAd attribute.
(Ticket #2094).
- The new PRE_SKIP key word in DAGMan changes the
behavior of DAG node execution such that the node's job and POST script
may be skipped based on the exit value of the PRE script.
See section 2.10.2 for details.
(Ticket #2122).
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable
NEGOTIATOR_STARTD_CONSTRAINT_REMOVE defaults to False.
When True, any ClassAds not satisfying the expression
in GROUP_DYNAMIC_MACH_CONSTRAINT are removed from the
list of condor_startd ClassAds considered for negotiation.
(Ticket #2232).
- The new configuration variable
NEGOTIATOR_UPDATE_AFTER_CYCLE defaults to False.
When True, it forces the condor_negotiator daemon
to update the negotiator ClassAd in the condor_collector daemon
at the end of every negotiation cycle.
This is handy for monitoring and debugging activities.
(Ticket #2373).
Bugs Fixed:
- Expressions for periodic policies such as
PERIODIC_HOLD and PERIODIC_RELEASE
could inadvertently cause a claim to be released,
if the condor_shadow exited without waiting for the final update from the
condor_starter.
(Ticket #2329).
- condor_submit previously could incorrectly detect references
in the requirements expression to special attributes such as
Memory when the name of the attribute happened to appear in a
string literal or as part of the name of some other attribute.
The detection of references to various special attributes influences the
automatic requirements which are appended to the job requirements.
(Ticket #2350).
- In rare cases, CCB requests could cause the server to hang for
20 seconds while waiting for all of the request to arrive.
(Ticket #2360).
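The condor_submit fix above concerns false positives when looking for an attribute name such as Memory inside a requirements expression. The Python sketch below contrasts a naive substring search with a token-based check that skips string literals and compares whole identifiers. It only illustrates the class of bug; it is not how condor_submit parses expressions, and the function names and example expressions are hypothetical.

```python
import re

# Matches either a double-quoted string literal (allowing backslash escapes)
# or an identifier. Literals are listed first so their contents are consumed
# as one token and never examined as identifiers.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_][A-Za-z0-9_]*')

def naive_references(expr, attr):
    """Substring search: also fires on string literals and longer names."""
    return attr.lower() in expr.lower()

def references(expr, attr):
    """True only if `attr` appears as a whole identifier outside string literals."""
    for token in TOKEN.findall(expr):
        # ClassAd attribute names are case-insensitive.
        if not token.startswith('"') and token.lower() == attr.lower():
            return True
    return False

# Hypothetical requirements expressions.
misleading = '(TotalMemory > 1024) && (Name != "Memory")'
genuine = "(Memory >= 512) && (Arch == \"X86_64\")"

print(naive_references(misleading, "Memory"), references(misleading, "Memory"))
print(naive_references(genuine, "Memory"), references(genuine, "Memory"))
```

The first expression mentions Memory only inside another attribute's name and inside a string literal, so the token-based check reports no reference, whereas the substring search reports one.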
Known Bugs:
Additions and Changes to the Manual:
Version 7.7.0
Release Notes:
- Condor version 7.7.0 released on July 29, 2011.
This developer release contains all bug fixes from Condor version 7.6.2.
New Features:
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable NEGOTIATOR_UPDATE_AFTER_CYCLE
defaults to False.
If set to True, it will force the condor_negotiator daemon
to publish an update ClassAd to the condor_collector at the end of
every negotiation cycle.
This is useful when monitoring cycle-based statistics.
- The new configuration variable SHADOW_RUN_UNKNOWN_USER_JOBS
defaults to False.
When True, it allows the condor_shadow daemon to run jobs remotely
submitted from users not in the local password file.
- The configuration variables for security
DENY_CLIENT and HOSTDENY_CLIENT
now also look for the prefixes TOOL and SUBMIT.
- CONDOR_VIEW_HOST is now a comma and/or white space separated
list of hosts, in order to support more than one CondorView host.
- For a job with an X.509 proxy credential, the new job ClassAd
attribute X509UserProxyEmail is the email address extracted
from the proxy.
- On Linux execute machines with kernel version more recent than 2.6.27,
the proportional set size (PSS) in Kbytes summed across all
processes in the job is now reported in the attribute
ProportionalSetSizeKb. If the execute machine does not
support monitoring of PSS or PSS has not yet been measured, this
attribute will be undefined. PSS differs from ImageSize in
how memory shared between processes is accounted for. The PSS for one
process is the sum of that process's memory pages divided by the
number of processes sharing each of the pages. ImageSize is
the same, except there is no division by the number of processes
sharing the pages.
- The new configuration variable DAGMAN_USE_STRICT
turns warnings into errors, as defined in section 3.3.26.
- The condor_schedd now publishes performance-related statistics.
Appendix A contains
definitions for these new attributes:
- DetectedMemory
- DetectedCpus
- UpdateInterval
- WindowedStatWidth
- ExitCode<N>
- ExitCodeCumulative<N>
- JobsSubmitted
- JobsSubmittedCumulative
- JobsStarted
- JobsStartedCumulative
- JobsCompleted
- JobsCompletedCumulative
- JobsExited
- JobsExitedCumulative
- ShadowExceptions
- ShadowExceptionsCumulative
- JobSubmissionRate
- JobStartRate
- JobCompletionRate
- MeanTimeToStart
- MeanTimeToStartCumulative
- MeanRunningTime
- MeanRunningTimeCumulative
- SumTimeToStartCumulative
- SumRunningTimeCumulative
- For Windows platforms, the condor_startd now publishes the
ClassAd attribute DotNetVersions,
containing a comma separated list of installed .NET versions.
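The difference between proportional set size and an undivided page count, as described above for ProportionalSetSizeKb, can be checked with a short calculation. The Python sketch below uses a made-up snapshot of which processes map which pages; the page identifiers, process IDs, and the 4 KB page size are assumptions for illustration, and nothing here reads real kernel data.

```python
PAGE_KB = 4  # assumed page size in Kbytes

# Hypothetical snapshot: page id -> set of process ids that map the page.
page_sharers = {
    "p1": {101},
    "p2": {101, 102},
    "p3": {101, 102},
    "p4": {102},
}

def pss_kb(pid, page_sharers):
    """Each page counts for PAGE_KB divided by the number of processes sharing it."""
    return sum(PAGE_KB / len(pids) for pids in page_sharers.values() if pid in pids)

def undivided_kb(pid, page_sharers):
    """Same sum, but with no division by the number of sharing processes."""
    return sum(PAGE_KB for pids in page_sharers.values() if pid in pids)

job_pids = [101, 102]
job_pss = sum(pss_kb(pid, page_sharers) for pid in job_pids)
job_undivided = sum(undivided_kb(pid, page_sharers) for pid in job_pids)
print(job_pss, job_undivided)
```

In this snapshot the four pages occupy 16 Kbytes in total. Summing PSS across the two processes gives exactly that figure, while the undivided sum counts the two shared pages twice and reports 24 Kbytes.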
Bugs Fixed:
- Fixed a bug in which the condor_startd daemon could get stuck in a
loop trying to execute an invalid
(that is, nonexistent) Daemon ClassAd Hook job.
- Fixed a bug that would cause the condor_startd daemon to incorrectly
report Benchmarking activity instead of Idle activity
when there was a problem launching the benchmarking programs.
- On Windows only, fixed a rare bug that could cause
a sporadic access violation when a Condor daemon spawned another process.
- Fixed a bug introduced in Condor version 7.5.5,
which caused the condor_schedd to die managing parallel jobs.
- The condor_startd daemon now looks up the condor_kbdd daemon address
on every update.
This fixed problems if the condor_kbdd daemon is restarted
during the condor_startd lifespan.
- Fixed a bug in condor_hold that occurred if the hold
reason contained a double quote character.
- Fixed a bug introduced in Condor version 7.5.6 that
caused any Daemon ClassAd hook job with non-empty value for
STARTD_CRON_<JobName>_ARGS,
SCHEDD_CRON_<JobName>_ARGS
or BENCHMARKS_<JobName>_ARGS to fail.
Also, the specification of
STARTD_CRON_<JobName>_ENV,
SCHEDD_CRON_<JobName>_ENV,
or BENCHMARKS_<JobName>_ENV for these jobs was ignored.
- Fixed a bug in the RPM init script.
A status request would always report Condor as inactive,
and a shutdown request would not report failure if there was a
timeout shutting down Condor.
- File transfer plug-ins now have a correctly set environment.
- Fixed a problem with detecting IBM Java Virtual Machines whose
version strings have embedded newline characters.
- condor_q -analyze now works with ClassAd built-in functions.
- Fixed a bug in condor_q -run, such that it now displays
the host name correctly for local and scheduler universe jobs.
- Standalone checkpointing now works with compressed checkpoints again.
This had been broken in Condor version 7.5.4.
- On Windows, net stop condor would sometimes cause the
condor_master daemon to crash. This is now fixed.
- JobUniverse was effectively a required attribute for
jobs created via the Fetch Work hook,
due to the need to set the IS_VALID_CHECKPOINT_PLATFORM
expression, such that it would not evaluate to Undefined.
Now the default IS_VALID_CHECKPOINT_PLATFORM expression
evaluates to True when JobUniverse is not defined.
- When there are multiple cpus but only one slot, the slot name no
longer begins with slot1@.
- The tool condor_advertise performed more host name lookups than
necessary. It now performs only the minimum number of
lookups required.
Known Bugs:
Additions and Changes to the Manual: