8.6 Development Release Series 7.1
This is the development release series of Condor.
The details of each version are described below.
Version 7.1.4
Release Notes:
- The owner of the log file for the condor_vm-gahp
has changed to the condor user.
In Condor 7.1.2 and previous versions, it was owned by the
user that the virtual machine is started under.
Therefore, the owner of and permissions on an existing log file
are likely to be incorrect.
To correct the problem, an administrator may modify file
permissions such that the condor user may read and
write the log file.
Alternatively, an administrator may delete the file, and
Condor will create a new file with the expected owner and
permissions.
In addition, the definition for VM_GAHP_LOG
in the condor_config.generic file has changed for
Condor 7.1.3.
- The vm universe no longer supports the use of
the xm
command for running Xen virtual machines. The virsh tool
should be used instead.
- Condor no longer supports the standard universe feature in its
ports to Solaris. We may resurrect this feature in the future if demand
for it on this port grows again to sufficient levels.
New Features:
- Local entries in the configuration file may now be specified
by prepending a local name and a period to the normal name.
Local settings take precedence over the other settings.
The local name can be specified on the command line to all daemons via
the new -local-name command line option.
See section 3.3.1
for more details on how the local name will be used in the configuration,
and section 3.9.2
for more details on the command line parameters.
- Dynamic Startd Provisioning: New configuration options allow slots
to be broken into job-sized pieces. This feature is still under
development, but it is already useful enough in its present form to
bring value to some installations.
- condor_submit_dag is now automatically run recursively on
nested DAGs (unless the new -no_recurse option is specified).
- Added the new SUBDAG EXTERNAL keyword (for specifying nested
DAGs) to condor_dagman.
- It is now possible to have multiple rotations of the "event
log" file, such as "EventLog", "EventLog.1", "EventLog.2", and so on.
- The VM universe can now run VMware virtual machines on machines using
privilege separation without requiring the condor_vm-gahp binary to be
setuid root. Running the condor_vm-gahp as setuid root is no longer
supported for VMware or Xen.
- Condor now supports the ability for the condor_master to run a
program as it shuts down. This can be particularly useful for doing
a graceful shutdown followed by a reboot. This is
accomplished through the new MASTER_SHUTDOWN_<Name>
configuration variable.
See the definition of the MASTER_SHUTDOWN_<Name> configuration variable
and the manual page for condor_set_shutdown for details.
- The condor_lease_manager is a new daemon. It
provides a mechanism for managing leases to resources described by
Condor's ClassAd mechanism. Both the resources and the leases are
persistent.
- VM universe now works with privilege separation (PrivSep)
for VMware jobs. Xen is still not supported in PrivSep mode.
- Added the DIR directive for the SPLICE keyword in
the DAGMan language.
Please read section 2.10.6 for more information.
- For gt4 type grid jobs (that is, WS GRAM), Condor now includes a request to retry
failed attempts at file clean-up in the RSL job description.
- Improved the scalability of some algorithms used by the
condor_schedd and condor_negotiator when dealing with large
numbers of startds.
- Added the ability for the condor_master (actually, any
DaemonCore process with children) to kill child
processes that have quit responding with SIGABRT instead of SIGKILL.
This is for debugging purposes, and is controlled by
the new NOT_RESPONDING_WANT_CORE configuration
parameter. If the child process is configured with
CREATE_CORE_FILES enabled, the child process will then
generate a core dump.
This feature is currently implemented only on UNIX systems.
See the definitions of NOT_RESPONDING_WANT_CORE,
NOT_RESPONDING_TIMEOUT, and CREATE_CORE_FILES for more details.
- Condor can now be configured to keep a backup of the job queue
log on a local file system in case condor_schedd operations
involving writes, flushes, or syncs to the job queue log fail. This
is most likely to happen when the job queue log is stored on a
network file system like NFS. Such a backup enables an administrator
to see that a job failed to submit, but does not perform any
automatic recovery. See below for these configuration parameters.
- Added preliminary support for "Green Computing". This is
supported only on Linux and Windows.
See section 3.16 on "Power Management" for more details.
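As an illustration of the local-name rule in the New Features list above, the following Python sketch looks up a setting and prefers the entry written as the local name, a period, and the normal name. It is only a sketch under stated assumptions, not Condor's implementation: the function lookup(), the dictionary representation of the configuration, the local name alt, and the sample values are all hypothetical, and names are matched here by exact string comparison.

```python
def lookup(config, name, local_name=None):
    """Return the value for `name`, preferring the local entry if one exists.

    `config` is a plain dict standing in for the parsed configuration
    (a hypothetical representation chosen for this sketch).
    """
    if local_name is not None:
        local_key = f"{local_name}.{name}"
        if local_key in config:
            # Local settings take precedence over the other settings.
            return config[local_key]
    return config.get(name)


# Hypothetical configuration contents, for illustration only.
config = {
    "STARTD_LOG": "/var/log/condor/StartLog",
    "alt.STARTD_LOG": "/var/log/condor/StartLog.alt",
}

with_local = lookup(config, "STARTD_LOG", local_name="alt")
without_local = lookup(config, "STARTD_LOG")
```

A daemon given the local name alt would see the second value, while a daemon started without a local name would see the first.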
Configuration Variable Additions and Changes:
Bugs Fixed:
- In some rare cases, the condor_startd failed to fully preempt jobs.
The job itself was killed, but the condor_starter process watching over
it would not be killed. The slot would then stay in the Preempting state
indefinitely.
- condor_q performed poorly when querying a remote pool, using
-pool. It was using an older latency-bound protocol even when
the remote condor_schedd was new enough to use the improved protocol
that first appeared in version 6.9.3.
- When using USE_VISIBLE_DESKTOP, the user's (slot or owner)
access-control entry is now removed from the Desktop's access-control list. This
fixes the previous behavior, where users were added and never removed,
resulting in an overflow of the access-control list, which can only contain
a fixed number of access-control entries.
- Fixed a bug where if log line caching was enabled in condor_dagman
and condor_dagman failed during the recovery process, the cache would
stay active. Now the cache is disabled in all cases at the end of recovery.
- Fixed a couple of bugs relevant only to the GLEXEC_STARTER
mode of operation. One bug would result in the SPOOL directory being
deleted if local universe jobs (which are not supported in
GLEXEC_STARTER mode) were submitted. The other bug prevented
COD jobs from running. Neither of these is a problem for the newer
recommended GLEXEC_JOB mode.
- Fixed a bug that could cause the condor_procd to crash, depending
on the timing of its process snapshots.
- Fixed a bug that caused job status notifications from WS GRAM 4.2
servers to be lost.
- Fixed a file descriptor leak in the condor_vm-gahp.
- Jobs now go on hold with a clear hold reason if a path to a
directory is put in the transfer files list. Previously, the attempt
to run the job would simply fail and return to the idle state.
- If MAX_EVENT_LOG is set to 0, the event log is now allowed to grow without
bound. Previously this behavior was broken, and setting
MAX_EVENT_LOG to 0 resulted in the log rotating with every
event. Now it works as documented.
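The corrected MAX_EVENT_LOG behavior from the list above can be pictured with a short Python sketch. This is an assumption-laden illustration rather than Condor's code: the function name should_rotate(), the use of bytes as the unit, and the choice of a greater-than-or-equal comparison are all hypothetical. The only point taken from the release note is that a value of 0 means the log is never rotated.

```python
def should_rotate(current_size, max_event_log):
    """Decide whether the event log should be rotated.

    A max_event_log of 0 means "no limit", so the log is never rotated.
    The broken behavior described above is what a plain size comparison
    would give for 0: every event would trigger a rotation.
    """
    if max_event_log == 0:
        return False
    return current_size >= max_event_log
```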
Known Bugs:
- When fixing the USE_VISIBLE_DESKTOP bug, a new one was
inadvertently introduced. The bug manifests irrespective of the definition
of USE_VISIBLE_DESKTOP: the new code attempts to remove the current
user's access-control entry from the Desktop's access-control list even when
it was not added by Condor. This has the effect of inhibiting the creation
of new processes for the logged-on user.
Additions and Changes to the Manual:
- The extra space character injected into the names of Condor
daemons and programs has been removed.
- Previously undocumented Condor Perl module subroutines have
been documented.
Version 7.1.3
Release Notes:
- This developer release includes the majority of the bug fixes released
in stable version 7.0.5, including the security patches documented in that
release. See section 8.7 below.
- Updated the version of Globus Toolkit: The Condor binaries are now
linked against Globus v4.2.0.
- Updated the version of OpenSSL: The Condor binaries are now linked
against OpenSSL 0.9.8h.
- Updated the version of GCB: The Condor binaries are now linked
against GCB 1.5.6.
- Changes to the ALLOW_* and DENY_* configuration
variables no longer require the use of the -full option to
condor_reconfig upon reconfiguration.
New Features:
- Added a new mechanism termed Concurrency Limits. This
mechanism allows the Condor pool administrator to define an arbitrary
number of consumable resources in the configuration file of the
matchmaker. The availability of these consumable resources will be taken
into account during the matchmaking process. Individual jobs can specify
how many of each type of consumable resource is required.
Typical applications of Concurrency Limits could include management of
software licenses, database connections, or any other consumable resource
that is external to Condor. NOTE: Documentation on this feature is still
being written.
See section 3.13.14 for documentation.
- Added support for Condor to manage serial high throughput computing
workloads on the IBM Blue Gene supercomputer. The IBM Blue Gene/P is now
a supported platform.
- Extended Job Hooks (see section 4.4) to allow for
alternate transformation and/or monitoring engines for the Job Router (see
section 5.6). Routing is still controlled by the Job
Router, but if Job Router Hooks are configured, then external programs or
scripts can be used to transform and monitor the job instead of Condor's
internal engine.
- Added support for the new protocol for WS GRAM introduced in Globus
4.2. For each WS GRAM resource, Condor automatically determines whether it is
speaking the 4.0 or 4.2 version of the protocol and responds appropriately.
When setting grid_resource in the submit file, use
gt4 for both WS GRAM 4.0 and 4.2.
- Added the ability for Windows slot users to load and run their jobs
within the context of their profile.
This includes the My Documents directory
hierarchy, its monikers, and the user's registry hive.
To use the profile, add a load_profile command to the
submit description file. A current restriction prevents the use of
load_profile
in conjunction with run_as_owner. Please refer to
section 6.2.5 for further details.
- The StarterLog file for local universe jobs now displays the job id
in each line in the file, so that interleaved messages relevant to
different jobs running concurrently can be identified.
- Added the -AllowVersionMismatch command line option to
condor_submit_dag and condor_dagman to (if absolutely necessary)
allow a version mismatch between condor_dagman and the
.condor.sub file used to submit it.
This permits a Condor version mismatch between
condor_submit_dag and condor_dagman.
- Streamlined the protocol between submit and execute machines; in some
instances, fewer messages will be exchanged over the network.
- When network requests are denied because of the authorization
policy, Condor now logs an explanation in the daemon log that denied
the request. This helps the administrator understand why the policy
denied the request, in case it is not obvious. A similar explanation
may be logged for requests that are accepted. This is only generated
if D_SECURITY is added to the daemon's debug options.
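The Concurrency Limits idea from the list above (a fixed number of consumable resources, with each job stating how many of each it needs) can be sketched in Python as follows. This is a minimal model under stated assumptions and does not show Condor's matchmaking code or configuration syntax: the function try_match(), the dictionaries, and the resource name matlab_license are hypothetical, and a resource with no defined limit is treated here as having zero units available.

```python
def try_match(job_requests, limits, in_use):
    """Claim the requested units of every consumable resource, or none.

    job_requests: resource name -> units the job requires
    limits:       resource name -> total units defined by the administrator
    in_use:       resource name -> units currently claimed (updated in place)
    """
    # First pass: check availability without changing any state.
    for resource, count in job_requests.items():
        available = limits.get(resource, 0) - in_use.get(resource, 0)
        if count > available:
            return False
    # Second pass: all requests fit, so claim them.
    for resource, count in job_requests.items():
        in_use[resource] = in_use.get(resource, 0) + count
    return True


limits = {"matlab_license": 2}   # hypothetical pool-wide limit
in_use = {}
results = [try_match({"matlab_license": 1}, limits, in_use) for _ in range(3)]
```

With two licenses defined, the first two jobs are matched and the third must wait until a license is released.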
Configuration Variable Additions and Changes:
- Added the new configuration variable
MAX_PENDING_STARTD_CONTACTS. This limits the
number of simultaneous connection attempts by the condor_schedd when
it is requesting claims from the condor_startds. The intention is
to protect the condor_schedd from being overloaded by authentication
operations. The default is 0, which indicates no limit.
- Added the new configuration variable
SEC_INVALIDATE_SESSIONS_VIA_TCP, which
defaults to True. Previously, attempts to use an invalid security
session resulted in a UDP rather than a TCP response. In networks with
different firewall rules for UDP and TCP, the filtering of the session
invalidation messages was easily overlooked, since it would not
typically happen during the initial vetting of the pool. If these
packets were filtered out, then at the subsequent condor_collector
restart, no daemons would be able to advertise themselves to the
pool until their existing security sessions expired. The old behavior
can be achieved by setting this configuration parameter to False.
- Added the new configuration variable
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION.
This is a special authentication mechanism designed to minimize
overhead in the condor_schedd when communicating with the execute
machine. Essentially, matchmaking results in a secret being shared
between the condor_schedd and condor_startd, and this is used to
establish a strong security session between the execute and submit
daemons without going through the usual security negotiation protocol.
This is especially important when operating at large scale over high
latency networks, as in a glidein pool with one submit machine and thousands of
execute machines on a network with 0.1 second round trip times.
- Added configuration entry GLEXEC_JOB which replaces the
functionality previously encapsulated in GLEXEC_STARTER. Using
GLEXEC_JOB enables privilege separation in Condor via glexec in a
manner much more consistent with how Condor's own privilege separation
mechanism works. Specifically, the user identity switching will now occur
between the condor_starter and the actual user job.
- Added configuration parameter AMAZON_GAHP_WORKER_MAX_NUM
to specify a ceiling on the number of threads spawned on the submit
machine to support jobs running on Amazon EC2. Defaults to 5.
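To see why skipping the usual security negotiation matters at the scale mentioned above for SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION, here is a back-of-the-envelope calculation in Python. The 0.1 second round trip time comes from the text; everything else is an assumption made for illustration: the function name, the figure of 5000 execute machines, the round trip counts of 4 and 1, and the simplification that the exchanges happen one after another. None of these numbers are measurements of Condor.

```python
def total_wait_seconds(execute_machines, round_trips_per_machine, rtt_ms):
    """Total network wait, assuming strictly sequential exchanges.

    Integer milliseconds are used to keep the arithmetic exact.
    """
    total_ms = execute_machines * round_trips_per_machine * rtt_ms
    return total_ms // 1000


RTT_MS = 100  # 0.1 second round trip time

# Hypothetical round trip counts for a longer and a shorter handshake.
longer_handshake = total_wait_seconds(5000, 4, RTT_MS)
shorter_handshake = total_wait_seconds(5000, 1, RTT_MS)
```

Under these assumptions, each round trip removed from the handshake saves the submit machine 500 seconds of cumulative waiting across the pool.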
Bugs Fixed:
- Includes bug fixes from Condor v7.0.5, including the security fixes.
See section 8.7.
- Fixed a bug in the condor_schedd that would cause it to
except if a crontab entry was incorrectly formatted.
- Fixed a bug in the CondorView server (collector) that caused it
to except (crash) when it received a machine ClassAd without a valid state.
It now logs this under level D_ALWAYS and ignores the ClassAd.
- Fixed a bug from Condor version 7.1.2 that would cause
Condor daemons to start
consuming a lot of CPU time after rare types of communication failures
during security negotiation.
- Fixed a bug from Condor version 7.1.2 that in rare cases could cause
Condor to fail to recognize when a call to exec() fails on Unix
platforms.
- Fixed problems with configuration parameter
JOB_INHERITS_STARTER_ENVIRONMENT when using PrivSep.
- Improved the deletion of Amazon EC2 jobs when the server is
unreachable.
- Fixed problems with Condor parallel universe jobs when recovering from
a reboot of the submit machine.
Known Bugs:
Additions and Changes to the Manual:
Version 7.1.2
Release Notes:
New Features:
- Added formatTime(), a built-in ClassAd function to create a
formatted representation of the time. A detailed description of this
function is available in section 4.1.2, which
documents all of the available built-in ClassAd functions.
- Improved Condor's authentication handshake, so that daemons such
as the condor_schedd, which initiate connections to other daemons,
spend less time waiting for responses.
Authentication over high latency
networks is still rather expensive in Condor, so it still may be
necessary to scale up by running more condor_schedd and condor_collector
daemons than one would need for equivalent workloads on a low latency network.
Additional improvements in this area are planned.
Configuration Variable Additions and Changes:
Bugs Fixed:
- Fixed a memory leak, introduced in Condor version 7.1.1, which caused the
condor_startd daemon to grow without bound.
- Fixed a bug in condor_dagman that caused the user log file of
the first node job in a DAG to get created with 0600 permissions,
regardless of the user's umask. Note that this fix involved removing
the -condorlog and -storklog command-line arguments from
condor_submit_dag and condor_dagman.
- Fixed a problem from Condor version 7.1.1 that in some cases caused the
condor_starter to stop sending updates about the job status or
to send updates too frequently.
Known Bugs:
Additions and Changes to the Manual:
Version 7.1.1
Release Notes:
New Features:
Configuration Variable Additions and Changes:
- Added DAGMAN_DEBUG_CACHE_ENABLE and
DAGMAN_DEBUG_CACHE_SIZE which allow DAGMan to maintain a
cache of log lines and write out the cache as one open/write/close
sequence. DAGMAN_DEBUG_CACHE_ENABLE is a boolean
which turns on the ability for caching and defaults to False.
DAGMAN_DEBUG_CACHE_SIZE is a positive integer and represents
the size of the cache in bytes and defaults to 5 Megabytes.
- The existing BIND_ALL_INTERFACES configuration variable
now defaults to True.
- Added the HIBERNATE expression, which, when evaluated in
the context of each slot, determines if a machine should enter
a low power state.
- Added the HIBERNATE_CHECK_INTERVAL configuration variable,
which, if set to a non-zero value, enables the condor_startd to place the
machine in a low power state based on the evaluation of the
HIBERNATE expression.
- The existing VALID_SPOOL_FILES configuration variable
now automatically includes SCHEDD.lock,
the lock file used for high availability condor_schedd failover.
Other high availability lock files are not currently included.
- Added the SEC_DEFAULT_AUTHENTICATION_TIMEOUT configuration
variable, where DEFAULT may be replaced
by the usual list of contexts for security settings
(for example, CLIENT, READ, and WRITE).
This specifies the number of seconds that Condor should
allow for the authentication of network connections to complete.
Previously, GSI authentication was hard-coded to allow 5 minutes
for authentication.
Now it uses the same default as all other methods: 20 seconds.
- Added the STARTER_UPDATE_INTERVAL_TIMESLICE configuration
variable, which
specifies the highest fraction of time that the condor_starter should spend
collecting monitoring information about the job, such as disk usage.
It defaults to 0.1. If checking the disk usage of the job takes a
long time, the condor_starter will monitor less frequently than
specified by STARTER_UPDATE_INTERVAL.
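One way to read the STARTER_UPDATE_INTERVAL_TIMESLICE description above is as a cap on the fraction of wall-clock time spent monitoring: if a monitoring pass takes d seconds and the cap is f, passes must start at least d / f seconds apart. The Python sketch below works through that reading. It is an interpretation offered for illustration, not Condor's actual scheduling logic; the function name and the 300 second interval used in the examples are hypothetical.

```python
def next_update_period(update_interval, last_check_duration, timeslice=0.1):
    """Seconds between the starts of consecutive monitoring passes.

    The configured interval is used unless that would make monitoring
    consume more than `timeslice` of the time, in which case the period
    is stretched to last_check_duration / timeslice.
    """
    min_period = last_check_duration / timeslice
    return max(update_interval, min_period)


# A quick check fits within the configured interval.
quick = next_update_period(300, 3)
# A slow disk usage check (60 s) stretches the period to roughly 600 s.
slow = next_update_period(300, 60)
```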
Bugs Fixed:
- Fixed a bug introduced in 7.1.0 affecting configurations in
which authentication of all communication between the condor_shadow
and condor_schedd is required. This caused failure in the final update
after the job had finished running. The result was that the job would return
to the idle state to run again.
- Fixed a bug in Java universe where each slot would be told to
potentially use all the memory on the machine. Now, each JVM
receives the physical memory divided by the number of slots.
- On Windows, slot users would sometimes show up in the Windows Welcome
Screen. This has now been resolved.
Existing slot users need to be manually
removed for the fix to take effect, and the machine may need to be rebooted for
the setting to be honored.
- Fixed a bug in the ClassAd string() function.
The function now properly converts integers and floats
to their string representation.
- The Windows Installer is now completely internationalized: it will no
longer fail to install because of a missing "Users" group; instead, it
will use the regionally appropriate group.
- Interoperability with Samba (as a PDC) has been improved. Condor
uses a fast form of login during credential validation. Unfortunately,
this login procedure fails under Samba, even if the credentials are
valid. The new behavior is to attempt the fast login, and on failure,
fall back to the slower form.
- Windows slot users no longer have the Batch Privilege added, nor
does Condor first attempt a Batch login for slot users. This was
causing permission problems on hardened versions of Windows, such
as Windows Server 2003, in that non-interactive users lacked the
permission to run batch files (via the cmd.exe tool). This affected
any user submitting jobs that used batch files as the executable.
- If the IWD is not defined in a job ClassAd
that was either fetched by the condor_startd via job hooks, or
pushed to the condor_startd via COD, the condor_starter no
longer treats this as a fatal error, and instead uses the temporary
job execution sandbox as the initial working directory.
- Made some fixes to the new-style rescue DAG feature:
- condor_submit_dag no longer needs the -force flag if a rescue
DAG will be run, even if the files generated by condor_submit_dag
already exist.
- condor_submit_dag with the -force flag now renames any
existing new-style rescue DAG files, and therefore runs the original DAG.
- Fixed a problem that caused new-style rescue DAGs to fail when
condor_submit_dag is invoked with the -usedagdir flag.
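The Java universe fix in the list above amounts to a simple division of physical memory among slots. A tiny Python sketch of that arithmetic follows; the function name, the megabyte unit, the use of integer division, and the sample figures are assumptions made for illustration and are not taken from Condor's source.

```python
def jvm_memory_per_slot(physical_memory_mb, num_slots):
    """Memory each JVM is told it may use: an equal share per slot."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return physical_memory_mb // num_slots


# Hypothetical machine: 8192 MB of physical memory and 4 slots.
share = jvm_memory_per_slot(8192, 4)
```

Before the fix, each of the four JVMs in this example could have been told to use up to the full 8192 MB.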
Known Bugs:
Additions and Changes to the Manual:
- The manual now contains Windows installation instructions for
controlling the configuration for the vm universe.
Version 7.1.0
Release Notes:
- Upgrading to 7.1.0 from previous versions of Condor will make
existing Standard Universe jobs that have already run fail to match to
machines running Condor 7.1.0 unless the job previously ran on a
machine using the Red Hat 5.0 release of Condor. This is because the
value of the CheckpointPlatform attribute of the machine
ClassAd has changed in order to better represent checkpoint
compatibility. If this affects you, you can use condor_qedit to
change the LastCheckpointPlatform attribute of existing
Standard Universe jobs to match the new CheckpointPlatform
advertised by the machine ClassAd where the job last ran.
- Condor no longer supports root configuration files
(for example, /etc/condor/condor_config.root,
~condor/condor_config.root, and
the file defined by the configuration variable
LOCAL_ROOT_CONFIG_FILE). This feature was intended to
give limited powers to a Unix administrator to configure some aspects
of Condor without gaining root powers. However, given the flexibility
of the configuration system, we decided that this was not practical.
As long as Condor is started up as root, it should be clearly
understood that whoever has the ability to edit the Condor
configuration files can effectively run arbitrary programs as root.
New Features:
- In the past, Condor has always sent work to the execute machines
by pushing jobs to the condor_startd, either from the
condor_schedd or via condor_cod.
As of version 7.1.0, the condor_startd now has the ability to pull
work by fetching jobs via a system of plug-ins or hooks.
Additional hooks are invoked by the condor_starter to help manage
work (especially for fetched jobs, but the condor_starter hooks
can be defined and invoked for other kinds of jobs as well).
For a complete description of the new hook system, read
section 4.4.
- Added the capability to insert commands into the .condor.sub
file produced by condor_submit_dag with the -append and
-insert_sub_file command-line arguments to condor_submit_dag and
the DAGMAN_INSERT_SUB_FILE configuration variable.
See the condor_submit_dag manual page
and the configuration variable definition
for more information.
- For platforms running a Windows operating system, the Arch
machine ClassAd attribute more correctly reflects the architectures
supported. Instead of values "INTEL" and "UNDEFINED",
the values will now be: "INTEL" for x86,
"IA64" for Intel Itanium,
and "X86_64" for both AMD and Intel 64-bit processors.
These values are listed in the unnumbered subsection labeled
Machine ClassAd Attributes.
- The Windows MSI installer now supports extended vm universe
options. These new options include: the ability to set the
networking type, how much memory the vm universe can use
on a host, and
the ability to set the version of VMware installed on the host.
- The condor_status and condor_q command line tools now have a
version option which prints the version of those specific tools. This
can be useful when multiple versions of Condor are installed on the
same machine.
- The configuration variable CONDOR_VIEW_HOST may now
contain a port number and may (if desired) refer to a
condor_collector daemon running on the same host as the
condor_collector that is forwarding ads. It is also now possible to
use the forwarded ads for matchmaking purposes. For example, several
collectors could forward ads to a single aggregating collector which
a condor_negotiator then uses as its source of information for
matchmaking.
- condor_dagman deals with rescue DAGs in a more sophisticated
way; this is especially helpful for nested DAGs.
See the rescue DAG subsection
of the condor_dagman
manual section for more information.
- Added logging details for unusual error cases to help
identify problems.
- A new (optional) daemon named condor_job_router has been
added, so far only on Unix. It may be configured to transform vanilla
universe jobs into grid universe jobs, for example to send excess jobs
to other sites via Condor-C or Condor-G.
- Previously, condor_q -better-analyze was supported on most
but not all versions of Linux. It is now supported on all Unix platforms
but not yet on Windows.
Configuration Variable Additions and Changes:
- Added new configuration variables
ALLOW_CLIENT and DENY_CLIENT as
client-side authorization controls.
When using a mutual authentication method (such as GSI, SSL, or Kerberos),
these variables allow the specification of
which authenticated servers the Condor tools and daemons should
trust when they form a connection to the server.
Because of the addition of these variables,
the GSI-specific, client-side authorization configuration variable
GSI_DAEMON_NAME is retired, and no longer valid.
- Added the DAGMAN_INSERT_SUB_FILE variable, which allows a file
of commands to be inserted into .condor.sub files generated
by condor_submit_dag.
- The semantics of CLAIM_WORKLIFE were previously not
clearly defined before the start of the first job. A delay between
the condor_schedd claiming a slot and the condor_shadow starting a
job could be caused by the submit machine being very busy or by
JOB_START_DELAY. Previously, such a delay would
unpredictably result in the first job being rejected if
CLAIM_WORKLIFE expired during that time. Now,
CLAIM_WORKLIFE is defined to apply only after the first job
has started. Therefore, setting it to zero has the effect of allowing
exactly one job per claim to run. The default is still the special
value -1, which places no limit on how long the slot may continue
accepting new jobs from the condor_schedd that claimed it.
- Added the DAGMAN_OLD_RESCUE variable, which controls whether
condor_dagman writes rescue DAGs in the old way.
- Added the DAGMAN_AUTO_RESCUE variable, which controls
whether condor_dagman automatically runs an existing rescue DAG.
- Added the DAGMAN_MAX_RESCUE_NUM variable, which
controls the maximum "new-style" rescue DAG number written or
automatically run by condor_dagman.
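The clarified CLAIM_WORKLIFE semantics described above (the limit applies only after the first job has started, 0 allows exactly one job per claim, and -1 places no limit) can be captured in a short Python sketch. This is an illustrative model, not Condor's implementation: the function name, the argument names, the use of plain numeric timestamps, and the strict less-than comparison are assumptions.

```python
def claim_accepts_new_job(claim_worklife, first_job_start, now):
    """Decide whether a claimed slot may accept another job.

    claim_worklife:  seconds, or -1 for no limit
    first_job_start: time the first job started, or None if none has yet
    now:             current time, in the same units
    """
    if first_job_start is None:
        # The worklife clock does not run before the first job starts,
        # so a delay between claiming and starting never rejects the job.
        return True
    if claim_worklife < 0:
        return True
    return (now - first_job_start) < claim_worklife
```

With a value of 0, the first job is always accepted because no job has started yet, and every later request is refused, which gives exactly one job per claim.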
Bugs Fixed:
- The Condor Build ID is now printed by condor_version and placed
in the logs for machines running a Windows operating system.
- condor_quill and the condor_dbmsd correctly register
themselves with the Windows firewall.
- condor_submit_dag now avoids possibly running off the end
of the argument list if an argument requiring a value does not have one.
- The condor_submit_dag -debug argument now must be
specified with at least -de to avoid conflict with the
-dagman argument.
- Added missing information about the -config argument to
condor_submit_dag's usage message.
- condor_dagman no longer considers duplicate edges in a DAG a
fatal error (it is now a warning).
Known Bugs:
- No hook is invoked if a fetched job does not contain enough data
to be spawned by a condor_starter or if other errors prevent the
job from being run after the condor_startd agrees to accept the
work.
This limitation will be addressed in a future version of Condor,
most likely via the addition of a new hook invoked whenever the
condor_starter fails to spawn a job.
For more information about the new hook system included in Condor
version 7.1.0, read section 4.4.
Additions and Changes to the Manual:
- Added "WINNT60" for the Vista operating system to
the documented list of possible values for the machine ClassAd
attribute OpSys.
condor-admin@cs.wisc.edu