8.4 Development Release Series 7.3
This is the development release series of Condor.
The details of each version are described below.
Version 7.3.2
Release Notes:
- The format of the output from condor_status with the -grid option
has been changed to provide more useful information.
- Removed the newline appended to the end of condor_status
-format output.
Therefore, code which parses the output of this command should now
be careful when trimming the last line.
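As an illustration of the parsing concern in the release note above, here is a minimal Python sketch of a parser that does not depend on whether the output ends with a newline. It assumes the format string given to -format ends each record with a newline; the function name is hypothetical and is not part of Condor.

```python
from typing import List


def split_format_records(output: str) -> List[str]:
    """Split condor_status -format output into one record per line.

    The result is the same whether or not the output ends with a
    trailing newline, so no special trimming of the last line is needed.
    """
    # splitlines() yields no empty element for a final newline;
    # any remaining empty lines are dropped explicitly.
    return [line for line in output.splitlines() if line]
```

A parser written this way gives the same records for output produced before and after this change.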
New Features:
- condor_fetchlog may now fetch the history files of a condor_schedd
daemon. In addition, the history file kept by the condor_schedd daemon
may now be rotated daily or monthly.
- The condor_ckpt_server will automatically clean up stale
checkpoint files. The configuration variables which control this
behavior are described below.
- The condor_ckpt_server (either the 32-bit or 64-bit) executable
will now communicate correctly between 32-bit and 64-bit submit nodes.
If by some chance bit width issues arise in the checkpoint protocol
(for example, with file sizes),
clear error messages are logged in the checkpoint server logs.
- The new condor_ssh_to_job tool allows interactive debugging of running
jobs. See the condor_ssh_to_job manual page
for details.
- The condor_status command is now substantially faster,
especially with the -format option.
- Grid universe grid type gt5 has been added for submission to
the new Globus GRAM5 service. When a GRAM service is identified as
gt5, jobmanager throttling and the Grid Monitor are not used.
See section 5.3.2 for details.
- Grid universe grid type cream has been added for submission
to the CREAM job service of gLite.
See section 5.3.8 for details.
- When low on file descriptors for creating new network sockets,
the condor_schedd daemon now avoids the unlimited stacking up of
messages that it sends periodically to the condor_negotiator
and condor_startd.
- The performance and failure handling of the Grid Monitor have been
improved.
- For grid type nordugrid in the grid universe,
job status information
is now obtained using Nordugrid ARC's LDAP server, which should greatly
improve performance. Also, Condor can now tell when these jobs are running.
- The new -valgrind option to condor_submit_dag
causes condor_submit_dag to generate a submit description file that
uses valgrind on condor_dagman, instead of the condor_dagman
binary as its executable.
- condor_dagman now lazily evaluates and opens node job log files.
Instead of parsing all submit description files and
immediately opening their specified log files at start up,
condor_dagman now parses
the submit description files just before each job is submitted,
and has each log file open only when relevant jobs are in the queue
or executing POST scripts.
In addition, condor_dagman now automatically generates a default user log
file for any node job that does not specify one.
- Both the support and documentation for the MPI universe have been removed.
MPI applications are supported through the use of the parallel universe.
- When the condor_startd daemon's test of virtual machine software fails
(for machines configured as capable of running virtual machines),
the condor_startd will periodically retry the test until it succeeds.
- The nordugrid_gahp now limits the number of connections
made to each NorduGrid ARC server and reuses connections when possible.
- Added the ClassAd function eval(), which takes a string
argument and evaluates the contents of the string as a ClassAd
expression. A policy example where this is useful is described in
section 3.5.9 on job suspension.
- The new condor_q option -attributes limits the
attributes which are displayed when using the -xml or -long
options.
Limiting the number of attributes also increases the efficiency of the query.
- Condor's power management capabilities are now implemented as a
plug-in. In particular, the condor_startd now runs an
external program, as specified by the configuration variable
HIBERNATION_PLUGIN,
to perform the detection of available low power states and the
switching to these low power states.
- The new Condor daemon condor_rooster has been added to wake up
hibernating machines when the expression defined by the configuration variable
UNHIBERNATE becomes True.
The configuration variables relating to condor_rooster
are described in section 3.3.35.
- Added the ability to extract information from the user event log
reader's state buffer to the user log reader. This is implemented
through a new ReadUserLogStateAccess C++ class
as defined in read_user_log.h.
- Changes to the value of the configuration variable
CERTIFICATE_MAPFILE or the contents
of the file to which it refers no longer require a full restart of Condor.
Instead, the command condor_reconfig will cause the changes to be utilized.
- The condor_master daemon will now print the path and arguments
to any daemons it starts if D_FULLDEBUG is enabled. Previously,
there was no way to get it to display the arguments with which it
was starting a daemon.
- The condor_had daemon now has the ability to control daemons
other than the condor_negotiator. This is controlled via the
HAD_CONTROLLEE macro.
- Condor now recognizes VOMS extensions in X.509 proxies.
The VOMS attributes are encoded in the job ClassAd attribute
X509UserProxySubject.
- The condor_startd can now clean up stranded virtual machines,
following a crash of Condor or its host operating system.
- Following a crash, the condor_gridmanager no longer restarts all
of the jobmanagers for gt2 jobs. This should improve recovery time.
- Condor works better with the ClassAds categorized as generic
in the condor_collector daemon.
Various daemons that register themselves with generic ClassAds
can now have tools which use the -subsystem option manipulate
their ClassAds properly.
- Condor now provides a mechanism to enforce strict resource limiting for
some universes of running jobs.
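The condor_q -attributes option described in the list above can be combined with -long from a script. The following Python sketch builds such a command line and parses the resulting text. It assumes that -attributes takes a single comma-separated list of attribute names and that -long prints one "Name = value" pair per line with a blank line between jobs; the helper names are hypothetical.

```python
from typing import Dict, List


def condor_q_long_args(attributes: List[str]) -> List[str]:
    """Build an argument list for condor_q -long limited to some attributes."""
    # Assumption: -attributes accepts one comma-separated list of names.
    return ["condor_q", "-long", "-attributes", ",".join(attributes)]


def parse_long_output(text: str) -> List[Dict[str, str]]:
    """Parse 'Name = value' lines; a blank line ends one job ClassAd."""
    ads: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the ClassAd collected so far, if any.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads
```

The argument list can be passed to subprocess.run, and the captured standard output to parse_long_output. Values are kept as raw ClassAd text, so strings retain their quotation marks.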
Configuration Variable Additions and Changes:
- The new configuration variable EMAIL_SIGNATURE specifies
a custom signature to be appended to e-mail sent by the Condor system.
If defined, then this custom signature replaces the
default one specified internally.
- The new configuration variable CKPT_SERVER_CLIENT_TIMEOUT
specifies how long, in seconds, the condor_schedd will wait
when trying to contact a condor_ckpt_server process before declaring the
condor_ckpt_server down.
See section 3.3.11 for the complete description.
- The new configuration variable
CKPT_SERVER_CLIENT_TIMEOUT_RETRY specifies how many seconds
must pass, once a condor_ckpt_server has been marked as down,
before the condor_schedd will try to communicate with the
condor_ckpt_server again.
See section 3.3.11
for the complete description.
- The new configuration variable
CKPT_SERVER_REMOVE_STALE_CKPT_INTERVAL informs the
condor_ckpt_server to begin removal of stale checkpoints at the specified
interval in seconds.
See section 3.3.8
for the complete description.
- The new configuration variable
CKPT_SERVER_STALE_CKPT_AGE_CUTOFF informs the
condor_ckpt_server how old a checkpoint file's access time must be
in order to be considered stale. The access time is compared against the
current time
when the checkpoint server checks the checkpoint image file.
See section 3.3.8
for the complete description.
- The new configuration variable SlotWeight may be used to
give a slot greater weight when calculating usage, computing fair
shares, and enforcing group quotas.
See 3.3.10 for the complete description.
- The new configuration variable MAX_PERIODIC_EXPR_INTERVAL
implements a ceiling on the time between evaluation of periodic expressions,
due to the adaptive timing implied by the configuration variable
PERIODIC_EXPR_TIMESLICE.
See 3.3.11 for the complete description.
- The new configuration variable GRIDMANAGER_SELECTION_EXPR
can be used to control how many condor_gridmanager processes will be
spawned to manage grid universe jobs. As a part of this change, removed
the configuration variable and supporting code for
GRIDMANAGER_PER_JOB since the new configuration variable
supersedes it.
See 3.3.11 for the complete description.
- The configuration variable
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE and the
corresponding throttle GRIDMANAGER_MAX_PENDING_SUBMITS
have been removed.
- The new configuration variable GRID_MONITOR_DISABLE_TIME
controls how long the condor_gridmanager will wait after encountering
an error before attempting to restart a Grid Monitor job.
See 3.3.24 for the complete description.
- The new pre-defined configuration macro DETECTED_MEMORY
indicates the amount of physical memory (RAM) detected by Condor.
The value is given in Mbytes.
- The new pre-defined configuration macro DETECTED_CORES
indicates the number of CPU cores detected by Condor.
- The new configuration variable
DELEGATE_FULL_JOB_GSI_CREDENTIALS
controls whether a full or limited X.509 proxy is delegated for grid type
gt2 grid universe jobs.
See 3.3.27
for the complete description.
- The new configuration variable UNHIBERNATE is used by
the condor_startd to advertise in its ClassAd a boolean expression
specifying when the machine should be woken up,
for example by condor_rooster.
See 3.3.10 for the complete description.
- The new configuration variable HIBERNATION_PLUGIN specifies the
path to the plug-in which the condor_startd uses both to detect
the low power state capabilities of a machine and to switch the
machine to a low power state.
See 3.3.10 for the complete description.
- The new configuration variable HIBERNATION_PLUGIN_ARGS
specifies additional command line arguments which the
condor_startd will pass to the plug-in when invoking it to
switch the machine to a low power state.
See 3.3.10 for the complete description.
- The new configuration variable HIBERNATION_OVERRIDE_WOL can be
used to direct the condor_startd to ignore Wake On LAN (WOL)
capabilities of the machine's network interface, and to switch to a
low power state even if the interface does not support WOL, or if
WOL is disabled on it.
See 3.3.10 for the complete description.
- The new configuration variable DAGMAN_USER_LOG_SCAN_INTERVAL
controls how long condor_dagman waits between checking job log files
for status updates.
See 3.3.26 for the complete description.
- The new configuration variable DAGMAN_DEFAULT_NODE_LOG sets
the default log file name for the new condor_dagman
default node log file feature.
See 3.3.26
for the complete description.
- Removed the configuration variable
DAGMAN_DELETE_OLD_LOGS; new log file reading code makes it
obsolete.
- The new configuration variable HAD_CONTROLLEE is used
to specify the name of the daemon which the condor_had controls.
This name should match the daemon name in the condor_master's
DAEMON_LIST.
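The staleness test described above for CKPT_SERVER_STALE_CKPT_AGE_CUTOFF can be pictured with a short Python sketch. This is not the checkpoint server's code; it only illustrates comparing a file's access time against the current time. It assumes the cutoff is in seconds and that a file is stale once its age strictly exceeds the cutoff; the function names are hypothetical.

```python
import os
import time
from typing import Optional


def is_stale(access_time: float, now: float, age_cutoff: float) -> bool:
    """Return True if the time since last access exceeds the cutoff."""
    return (now - access_time) > age_cutoff


def is_stale_checkpoint(path: str, age_cutoff: float,
                        now: Optional[float] = None) -> bool:
    """Apply the same test to a file on disk, using its access time."""
    if now is None:
        now = time.time()
    return is_stale(os.stat(path).st_atime, now, age_cutoff)
```

A periodic cleanup pass, run at the interval given by CKPT_SERVER_REMOVE_STALE_CKPT_INTERVAL, would apply such a test to each checkpoint image file.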
Bugs Fixed:
- Fixed a bug in ClassAd functions where arguments which should have been
correctly coerced into strings instead evaluated to ERROR.
- Fixed a confusing diagnostic message with the JobRouter, which happened
when a job was removed within 5 minutes of being submitted.
- Fixed a bug in which the use of dynamic slots
(see section 3.13.9)
caused the machine ClassAd attribute SLOT<N>_STARTD_ATTRS
to disappear from the ClassAd for some slots.
- Fixed a Windows platform bug in which the window belonging to
a Condor job does not receive a paint message.
- Fixed a bug causing condor_q -analyze to crash when there was no
condor_schedd daemon ClassAd file.
- Fixed a condor_procd crash caused when the environment of
a monitored process exceeded 1MByte in /proc.
- Fixed a Windows platform bug which could cause the condor_credd
to crash if a requested credential is not in the password store.
- Fixed a bug that was causing the job event log rotation lock to be
created with incorrect permissions.
- Fixed a bug in the rotation of the job event log which could cause it
never to be rotated in the Windows port of Condor.
- Fixed a potential race condition in the job event log initialization.
- Fixed a race condition which could cause a crash of the condor_collector
and condor_schedd on shutdown.
- Fixed a bug in which the condor_master would sometimes die and produce
a dprintf_failure.MASTER file when either restarting due to new
binary timestamps or when started initially.
- Fixed a memory leak related to SOAP configuration variables
that occurred when Condor was reconfigured.
- Fixed a bug in which the submit description file command
cron_day_of_week was erroneously ignored.
- Fixed a bug in which the configuration variables
MAX_JOB_QUEUE_LOG_ROTATIONS and GRIDMANAGER_SELECTION_EXPR
would not work properly at start up; they only worked after a condor_reconfig.
- Fixed a bug in which SOAP operations were being incorrectly authorized
with the peer IP <0.0.0.0>.
- Fixed a Windows platform bug in which not all Condor daemons were trusted
by the Windows Firewall
(previously known as Internet Connection Firewall or ICF).
- Fixed a shutdown race condition in the condor_master with respect to
high availability daemons.
- Fixed a bug in which a Condor daemon incorrectly determined it had
run out of socket descriptors.
- Fixed a bug where the condor_schedd would block for very long
periods of time while trying to connect to a down checkpoint server. Now
the condor_schedd will do a blocking connect with a timeout to the
checkpoint server for a configurable number of seconds. If the connect
fails, the condor_schedd will put a moratorium on connecting to the
checkpoint server until the configurable moratorium period passes. The
configuration variables that control this behavior are described
above.
- Changed the check that condor_dagman does for other
condor_dagman instances
running the same DAG, if it finds a lock file at startup.
Now, if condor_dagman is not sure whether the other DAGMan is alive,
it continues, rather than exiting.
- Fixed a major file descriptor leak in the Stork daemon.
- Fixed a bug in which successful Stork transfers were marked as failed.
- Fixed an uncommon memory leak in the user event log file reading code
when reading badly formatted events.
- Fixed a bug in which multiple machine ClassAds in the
condor_collector with the same Name,
but different StartdIPAddr attribute values,
would cause the condor_negotiator to exit with an error.
This is unusual and should not happen in a typical Condor installation.
The most likely cause is using condor_advertise
to advertise custom ClassAds for grid matchmaking.
- Fixed a bug that caused condor_dagman to core dump if all
submit attempts failed on a DAG node having a POST script.
This bug has existed since Condor version 7.1.4.
- Fixed a memory leak in the condor_schedd, which occurred when
the configuration variable NEGOTIATOR_MATCH_EXPRS was used.
- Fixed a bug in the Windows platform code that treats scripts as
executables.
Unknown file extensions were treated as an error,
rather than as a Windows executable.
- The condor_job_router now correctly sets the ClassAd attribute
EnteredCurrentStatus to the current time when creating a new routed job.
Previously, it copied this attribute from the original job.
- The condor_job_router emits a more friendly log message when it
observes that the routed copy of the job was removed.
- A fix has been made for a problem seen in 7.3.1 in which Condor daemons
using CCB to connect to other Condor daemons would sometimes consume
large amounts of CPU time for no good reason.
- Fixed a rare failure case bug in which attempts to connect via
CCB could stay in a pending state indefinitely.
- A Unix only bug caused Condor daemons to fail to start if
MAX_FILE_DESCRIPTORS was configured higher
than the current hard limit inherited by Condor. If Condor is running
as root, this is no longer the case.
- The condor_gridmanager now advertises grid ClassAds properly when there
are multiple condor_collector daemons.
- When using condor_q -xml and -format together to
limit the number of ClassAd attributes returned in the query, the XML
<classads> container tag was not generated. This is fixed, but
now the preferred way to limit the returned attributes is to
use condor_q option -attributes.
- Fixed a bug in which the Unix condor_master failed
when trying to restart itself,
if the configuration variable MASTER_LOCK was defined,
or if the condor_master was invoked with the -t option.
This bug has existed since the 7.0 series,
and likely has existed much longer than that.
- Fixed a significant memory leak in the gahp_server. This
leak was only present in previous Condor 7.3.x releases.
- Fixed a bug that can cause a removed job that is held and then
released to return to idle status.
- The Globus jar files distributed with the x86-64 RHEL 5 RPMs were
damaged, causing gt4 grid type jobs to fail. This has been fixed.
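The checkpoint server fix in the list above (a connect with a timeout, followed by a moratorium on further attempts) can be sketched as a small state tracker. This Python sketch is illustrative only and is not the condor_schedd implementation; the class and method names are hypothetical, and retry_seconds stands in for the configured moratorium period.

```python
from typing import Optional


class CheckpointServerGate:
    """Tracks whether a connection attempt is currently allowed."""

    def __init__(self, retry_seconds: float) -> None:
        self.retry_seconds = retry_seconds
        # Time at which the server was marked down, or None if it is up.
        self.down_since: Optional[float] = None

    def record_failure(self, now: float) -> None:
        """Mark the server as down after a connect attempt timed out."""
        self.down_since = now

    def may_connect(self, now: float) -> bool:
        """Return True unless the moratorium period is still running."""
        if self.down_since is None:
            return True
        if now - self.down_since >= self.retry_seconds:
            self.down_since = None  # moratorium has expired
            return True
        return False
```

The caller would check may_connect before each attempt, make the attempt with a bounded timeout, and call record_failure if it times out, so that a down server costs at most one timeout per moratorium period.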
Known Bugs:
- The version 7.3.2 condor_dagman binary sometimes has problems running
rescue DAGs. Probably the best workaround for this problem is to use
version 7.3.1 rather than 7.3.2 condor_dagman and condor_submit_dag
binaries, even if using version 7.3.2 for the rest of the Condor
installation.
Additions and Changes to the Manual:
Version 7.3.1
Release Notes:
New Features:
- Added the STARTD_HISTORY configuration parameter. If set, this
is a pathname to a history file, just like the condor_schedd maintains,
but only for jobs run on that startd.
- Added the JavaSpecificationVersion attribute to startds which
support Java. This allows users to request machines which support
a particular major version of Java, without specifying the exact
version. For example, Java versions 1.6.0_01, 1.6.1_02 and 1.6.2_03
all advertise a JavaSpecificationVersion of 1.6.
- Implemented a performance increase to condor_dagman which can
decrease the parsing times of DAG input files by up to 60 times.
This performance increase works for certain common DAG geometries.
This will help in submission and recovery
time for DAGs whose nodes have a very large number of dependency edges
associated with them.
- condor_q -analyze and -better-analyze now emit warnings
if the condor_schedd will not run jobs when it is out of swap space or
has hit the limit imposed by the configuration variable
MAX_JOBS_RUNNING.
- When matching Condor-G jobs to resources, if multiple jobs
match multiple resources, and every job has identical job rank, the
matchmaker would always fill up one particular resource first. Now,
the resources will be matched in a round robin fashion. This can be
overridden by setting job rank appropriately.
- Made the condor_schedd more efficient in how it stores
information about
$$() expansions in the job ClassAd.
Also made the condor_schedd more efficient in how it contacts
the condor_negotiator to submit reschedule requests.
- Improved the Job Router's heuristic for site throttle adjustment. It
is now quicker to release the throttle when the failure rate drops
below the configured threshold.
- Made the Job Router more efficient on startup by improving the way it
reads the job queue log file.
- Added an accessor class to the user log reader API to allow the
application to query about reader state, including the
difference in the event numbers and log position of two states. This
can be used by the application to determine the number of events
missed when missed events are detected.
- Added the ability to throttle the rate at which jobs are
stopped via condor_rm, condor_hold, condor_vacate_job,
and during a graceful shutdown of the condor_schedd daemon.
- In the configuration file, Condor now accepts expressions for
the values of configuration variables that are required to be
numeric literals or boolean constants.
Note that this does not imply that the
expressions may freely reference ClassAd values in places where they
could not before.
See section 3.3.1 for an example with
further explanation.
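The relationship between a full Java version and the JavaSpecificationVersion attribute described above can be shown with a short Python sketch. It only mirrors the example given in the text (the first two components of the version string); how Condor actually obtains the value is not stated here, and the function name is hypothetical.

```python
def java_specification_version(java_version: str) -> str:
    """Return the first two dot-separated components of a Java version."""
    parts = java_version.split(".")
    return ".".join(parts[:2])
```

All three versions listed in the text map to the same value, which is what lets a job request the specification version without naming an exact release.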
Configuration Variable Additions and Changes:
- Added the STARTD_HISTORY configuration parameter. If set, this
is a pathname to a history file, just like the condor_schedd maintains,
but only for jobs run on that startd.
- The new configuration variable UPDATE_OFFSET
causes the condor_startd to
delay the initial (and all further) updates that it sends to the
condor_collector. See 3.3.10 for more details.
- The new configuration variables
JOB_STOP_COUNT and JOB_STOP_DELAY
limit the rate at which jobs are stopped via condor_rm,
condor_hold, condor_vacate_job, and during a graceful shutdown of
the condor_schedd daemon.
See 3.3.11
for full definitions.
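The rate limit imposed by JOB_STOP_COUNT and JOB_STOP_DELAY can be pictured as stopping jobs in batches. The Python sketch below assumes that JOB_STOP_COUNT jobs are stopped at a time, with a pause of JOB_STOP_DELAY seconds before the next batch; see the referenced definitions for the authoritative semantics. The function and its parameters are hypothetical, and a real daemon would use timers rather than a blocking sleep.

```python
import time
from typing import Callable, Iterable


def stop_jobs_throttled(job_ids: Iterable[int],
                        stop_job: Callable[[int], None],
                        job_stop_count: int,
                        job_stop_delay: float,
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Stop jobs in batches of job_stop_count, pausing between batches."""
    jobs = list(job_ids)
    for index, job_id in enumerate(jobs, start=1):
        stop_job(job_id)
        batch_finished = index % job_stop_count == 0
        # Pause only if a full batch just finished and jobs remain.
        if batch_finished and index < len(jobs):
            sleep(job_stop_delay)
```

Passing a different sleep function, as the default argument allows, makes the pacing easy to observe without waiting.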
Bugs Fixed:
- Fixed a problem with job removal in the local universe that
would cause spurious error messages to be written to the log of the
condor_schedd daemon.
- The condor_schedd was failing to send reschedule commands to
flocked negotiators, so unless some other schedd in the negotiator's
pool sent it a reschedule command, negotiation cycles would only
happen every NEGOTIATOR_INTERVAL.
Known Bugs:
- When using CCB to connect to other Condor daemons, Condor 7.3.1
daemons can sometimes consume large amounts of CPU, potentially
causing performance problems. Condor 7.3.0 did not suffer from this
problem.
Additions and Changes to the Manual:
Version 7.3.0
Release Notes:
- This release is incompatible when communicating with
previous versions of Condor if CCB is enabled or if
PRIVATE_NETWORK_NAME is configured.
- Updated the DRMAA version.
This new version is compliant with GFD.133,
the DRMAA 1.0 grid recommendation standard.
Three new functions were added to meet the specification's requirements,
and several bugs were fixed.
New Features:
- Added support for using any recognized script as an executable
in a submit file on Windows. For more information, see
section 6.2.6.
- Improved support for private networks:
Added CCB, the Condor Connection Broker. It is similar in
functionality to GCB, the Generic Connection Broker, but it has
several advantages, including ease of use and working on Windows as
well as Unix platforms.
GCB continues to work, but we may remove
it some time in the 7.3 development series. The main missing feature
in CCB at the moment that prevents it from replacing GCB,
is support for connectivity from one private network to another.
CCB only works
when connecting from a public network to a private one. For example,
jobs may be sent from a condor_schedd on the public Internet to
condor_startd daemons on a
private network, if the condor_startd daemons are configured
to use a CCB server that is accessible to the condor_schedd daemon.
However, if the condor_schedd daemon is on one private
network and the condor_startd daemons are on a different private network,
CCB does not help. For more information on CCB, see section 3.7.4.
- Added support for CPU affinity on both Windows and Linux platforms.
- Added support for the condor_q -better-analyze option on Windows.
- Added WANT_HOLD. When PREEMPT becomes
true, if WANT_HOLD is true, the job is put on hold for the
reason (optionally) specified by WANT_HOLD_REASON and
WANT_HOLD_SUBCODE. These policy expressions are evaluated
by the execute machine. As usual, the job owner may specify
periodic_release and/or periodic_remove
expressions to react to specific hold states automatically.
- Added the ClassAd function debug().
See section 4.1.2 for the details of this function.
- The condor_schedd can now use MD5 check sums to avoid storing
multiple copies of the same executable in its SPOOL directory.
Note that this feature only affects executables sent to the
condor_schedd via the copy_to_spool command within
a submit description file.
- Reduced the number of sleeps condor_dagman does to maintain log
file consistency when a DAG uses multiple user logs for node jobs.
DAGMan now does one sleep per submit cycle,
instead of one sleep for each submit.
- Added the -import_env command-line flag to
condor_submit_dag. This explicitly puts the submitter's environment
into the .condor.sub file.
- Optimized the removal of large numbers of jobs.
Previously, removal of tens of thousands of jobs caused the
condor_schedd daemon to consume
a lot of CPU time for several minutes.
- Reduced memory usage by the condor_shadow daemon. Since there is one
condor_shadow process per running job, this helps increase the
number of running jobs that a submit machine can handle. Under Linux 2.6,
we found that running 10,000 jobs from a single submit machine
requires about 10GBytes of system RAM. We also found in this case that to
run more than 10,000 simultaneous jobs requires a 64-bit submit
machine. On a 32-bit Linux platform, kernel memory is exhausted,
regardless of how much additional RAM the system has.
- Reduced the memory usage of the condor_collector daemon,
when UPDATE_COLLECTOR_WITH_TCP = True.
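The idea behind using MD5 check sums to avoid storing duplicate executables, described in the list above, can be sketched in Python. This is a minimal illustration of checksum-based deduplication, not the condor_schedd's actual spool layout; the class name and the stored names are hypothetical.

```python
import hashlib
from typing import Dict


class SpoolIndex:
    """Maps the MD5 digest of an executable to one stored copy."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, str] = {}

    def store(self, contents: bytes, proposed_name: str) -> str:
        """Return the name under which these contents are stored.

        If identical contents were stored before, the earlier name is
        returned and no second copy is recorded.
        """
        digest = hashlib.md5(contents).hexdigest()
        # setdefault keeps the first name registered for this digest.
        return self._by_digest.setdefault(digest, proposed_name)
```

Two submissions of the same executable resolve to the same stored name, while different contents get their own entry.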
Configuration Variable Additions and Changes:
- The new configuration variable OPEN_VERB_FOR_<EXT>_FILES
allows the default interpreter for scripts with an extension EXT to
be changed. For more information, see
section 6.2.6.
- The new configuration variable CCB_ADDRESS
configures a daemon to use one or more
CCB servers to allow communication with Condor components outside of
the private network.
- The new configuration variable MAX_FILE_DESCRIPTORS
(on Unix platforms only) specifies the
required file descriptor limit for a Condor daemon. File descriptors
are a system resource used for open files and for network connections.
Condor daemons that make many simultaneous network connections may
require an increased number of file descriptors. For example, see
the documentation of CCB for information on its file descriptor
requirements.
- The new configuration variables ENFORCE_CPU_AFFINITY and
SLOT<N>_CPU_AFFINITY on Linux platforms allow for
Condor to lock slots to given CPUs.
Definitions for these variables are at 3.3.13.
- The new configuration variable DEBUG_TIME_FORMAT
allows a custom specification for the format of the time
printed at the start of each line in a daemon's log file.
See 3.3.4 for the complete definition of
this variable.
- The new configuration variable SHARE_SPOOLED_EXECUTABLES
is a boolean value that determines whether the condor_schedd daemon will
use MD5 check sums to avoid storing multiple copies of the same
executable in the SPOOL directory. The default setting is
True.
- The new boolean configuration variable
EVENT_LOG_FSYNC provides control of the behavior of
Condor when writing events to the event log. Previously,
the behavior was as if this parameter were set to False.
See 3.3.4 for the complete definition of
this variable.
- The new boolean configuration variable
EVENT_LOG_LOCKING provides control of the behavior of
Condor when writing events to the event log. Previously,
the behavior was controlled by ENABLE_USERLOG_LOCKING.
See 3.3.4 for the complete definition of
this variable.
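To give a feel for what a custom time format such as DEBUG_TIME_FORMAT does to the start of a log line, the following Python sketch formats a fixed timestamp. It assumes the value is a strftime-style format string, which the text above does not state; consult the complete definition in 3.3.4 before relying on this. The function name and the sample format are illustrative.

```python
import time


def log_line_prefix(time_format: str, when: time.struct_time) -> str:
    """Format a timestamp as it might appear at the start of a log line."""
    return time.strftime(time_format, when)
```

For instance, a format of "%m/%d %H:%M:%S " applied to 17 August 2009, 14:05:09 yields the prefix "08/17 14:05:09 ".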
Bugs Fixed:
Known Bugs:
Additions and Changes to the Manual: