8.7 Stable Release Series 7.0
This is a stable release series of Condor.
It is based on the 6.9 development series.
All new features added or bugs fixed in the 6.9 series are available
in the 7.0 series.
As usual, only bug fixes (and potentially, ports to new platforms)
will be provided in future 7.0.x releases.
New features will be added in the 7.1.x development series.
On backwards compatibility:
we believe that Condor 7.0.x and 6.8.x are wire-compatible,
and can be freely mixed between computers in a Condor pool.
However, we do not regularly test this compatibility and cannot guarantee it,
so we recommend using a single release of Condor when possible.
Please note that although you can mix Condor 7.0.x and 6.8.x in a pool,
you cannot mix them on a single computer.
That is, a condor_master daemon running 6.8.x cannot run Condor daemons
from version 7.0.x, or vice versa.
The details of each version are described below.
Version 7.0.6
Release Notes:
New Features:
Configuration Variable Additions and Changes:
Bugs Fixed:
- In some rare cases, the condor_startd failed to fully preempt jobs.
The job itself was killed, but the condor_starter process watching over
it would not be killed. The slot would then stay in the Preempting state
indefinitely.
Known Bugs:
Additions and Changes to the Manual:
Version 7.0.5
Release Notes:
This release contains many bug fixes and some improvements to error handling
of Local Universe jobs. Note that some of the bug fixes are
security-related; therefore, we recommend sites either upgrade Condor, or
restrict permissions on who is allowed to submit Condor jobs to trusted
users. Bug fixes that are security related are clearly marked in the Bugs
Fixed section below along with a description of the potential security
impact. The Condor Project believes in the full disclosure of information,
and therefore complete vulnerability details can be found at
http://www.cs.wisc.edu/condor/security/. However, in order to give an
adequate upgrade window for production installations, we will delay posting
the full vulnerability details fixed in this release for 30 days (until the
week of November 3rd 2008).
New Features:
- Local universe jobs now go on hold for the same specific reasons that
vanilla jobs may go on hold. Examples are missing input or executable files.
Previously, when local universe jobs failed in this manner,
the jobs returned to the idle state in the job queue,
repetitively attempting to run,
and failing over and over until the job was removed.
- Local universe jobs now have the ClassAd attribute NumShadowStarts.
Although local universe jobs do not have a condor_shadow process,
this attribute is introduced to keep management of the local universe
as similar to the vanilla universe as possible. For local universe jobs,
this attribute is identical to the attribute JobRunCount,
which indicates how many times a
local condor_starter process has been created to run the job.
Configuration Variable Additions and Changes:
Bugs Fixed:
Known Bugs:
Additions and Changes to the Manual:
- Descriptions of previously undocumented Condor Perl module subroutines
have been added to the manual. See section 4.5.6.
Version 7.0.4
Release Notes:
New Features:
- Added functionality to periodically update timestamps on lock files.
This prevents administrative programs from deleting in-use lock files and
causing undefined behavior.
- When the configuration variable SCHEDD_NAME ends in the @ symbol,
Condor will no longer append the fully qualified
host name to the value.
This makes it possible to configure a high availability
job queue that works with the remote submission of jobs.
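The lock file timestamp update in the first item above can be pictured with a short sketch. This is only an illustration in Python, not Condor's implementation; the function name touch_lock_file and the temporary file standing in for a lock file are hypothetical.

```python
import os
import tempfile
import time


def touch_lock_file(path):
    """Set the file's access and modification times to the current time."""
    os.utime(path, None)
    return os.stat(path).st_mtime


# A throwaway file stands in for an in-use lock file (hypothetical path).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    lock_path = handle.name

# Pretend the lock file was last modified a day ago.
one_day_ago = time.time() - 86400
os.utime(lock_path, (one_day_ago, one_day_ago))
stale_mtime = os.stat(lock_path).st_mtime

# Refreshing the timestamp makes the file look recently used again.
fresh_mtime = touch_lock_file(lock_path)
os.remove(lock_path)
```

An administrative cleanup program that deletes files by age would see the refreshed modification time and leave the file alone, which is the effect the feature aims for.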
Configuration Variable Additions and Changes:
- Added the configuration variable LOCK_FILE_UPDATE_INTERVAL.
Please see the definition of this variable elsewhere in the manual
for a complete description.
- Changed the default value of configuration variable
SEC_DEFAULT_SESSION_DURATION from 8640000 seconds (100 days)
to 86400 seconds (1 day).
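The two defaults of SEC_DEFAULT_SESSION_DURATION given above are easier to compare in days. A quick check of the arithmetic, written in Python purely for illustration (the variable names are not Condor identifiers):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

old_default = 8640000  # previous default, in seconds
new_default = 86400    # new default, in seconds

# Convert both defaults from seconds to days.
old_days = old_default / SECONDS_PER_DAY
new_days = new_default / SECONDS_PER_DAY
print(old_days, new_days)  # prints: 100.0 1.0
```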
Bugs Fixed:
Known Bugs:
- A bug in 7.0.4 affects jobs using Condor file transfer on submit
machines that are configured to deny write access from execute
machines. The result is that output from jobs may fail to be copied
back to the submit machine. The problem may or may not affect jobs
that run for less than eight hours, but it definitely will affect jobs
that run for more than eight hours. An example of a configuration
vulnerable to this problem is one where DAEMON level access is allowed
to all execute nodes but WRITE level access is not. When the problem
happens, the condor_shadow log will contain a line like the following:
DaemonCore: PERMISSION DENIED to unknown user from host ...
for command 61001 (FILETRANS_DOWNLOAD), access level WRITE
The workaround for this problem is to allow WRITE access from the
execute nodes. If the existing configuration requires WRITE access to
be authenticated, then simply add WRITE access by the authenticated
condor identities associated with all execute nodes. If WRITE access
is not currently required to be authenticated, then allow
unauthenticated WRITE access from all worker nodes. Note that this
does not imply that execute nodes will be able to modify the
job queue without authenticating. Remote commands that modify the job
queue (for example, condor_submit or condor_qedit) always require that the
user be authenticated, no matter what configuration options are used;
if no method of remote authentication can succeed in the pool for
WRITE operations, then commands that modify the job queue can only run
on the submit machine.
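To check whether a pool is affected, the condor_shadow log can be searched for the message quoted above. The following is a minimal sketch in Python, assuming the message appears on a single line of the log; the function name and the sample lines are hypothetical, and the elided host in the sample is left as it appears in the message above.

```python
import re

# Assumption: the whole message is written on one line of the shadow log.
DENIED_PATTERN = re.compile(
    r"PERMISSION DENIED.*\(FILETRANS_DOWNLOAD\), access level WRITE"
)


def find_denied_transfers(log_lines):
    """Return the log lines that match the symptom described above."""
    return [line for line in log_lines if DENIED_PATTERN.search(line)]


# Hypothetical log excerpt: one matching line and one unrelated line.
sample_log = [
    "DaemonCore: PERMISSION DENIED to unknown user from host ... "
    "for command 61001 (FILETRANS_DOWNLOAD), access level WRITE",
    "an unrelated line",
]
matches = find_denied_transfers(sample_log)
```

In practice the list of lines would come from reading the shadow log file on the submit machine; any match suggests that the WRITE access workaround above is needed.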
Additions and Changes to the Manual:
Version 7.0.3
Release Notes:
- This is a bug fix release. A bug in Condor version 7.0.2 sometimes caused
the condor_schedd to become unresponsive for 20 seconds when starting
the condor_shadow to run a job.
Therefore, anyone running 7.0.2 is strongly encouraged to upgrade.
New Features:
Configuration Variable Additions and Changes:
- The configuration variable VALID_SPOOL_FILES now automatically
includes SCHEDD.lock,
the lock file used for high availability condor_schedd fail over. Other
high availability lock files are not currently included.
Bugs Fixed:
- Fixed a problem sometimes causing minutes or more of lag between
the time of job suspension or unsuspension and the corresponding entries
in the job user log.
- Fixed a problem in condor_q -better-analyze handling
requirements expressions containing the expression =!= UNDEFINED.
- Configuration variable GRIDMANAGER_GAHP_CALL_TIMEOUT
is now recognized for nordugrid grid universe jobs.
- Fixed a bug that could cause the condor_schedd daemon to abort
and restart some time after a graceful restart,
when jobs to which the condor_schedd daemon reconnected were preempted.
- Fixed a bug causing failure to reconnect to jobs which use
$$([expression])
in their ClassAds. The jobs would go on
hold with the hold reason:
"Cannot expand $$([expression])."
- Fixed a bug in Condor version 7.0.2 that sometimes caused
the condor_schedd daemon to become
unresponsive for 20 seconds when starting the condor_shadow daemon
to run a job.
Known Bugs:
Additions and Changes to the Manual:
- See section 4.5.1
for documentation on finding the port number the condor_schedd daemon
is listening on for use with the web service API.
Version 7.0.2
Release Notes:
- On Unix, Condor no longer requires its EXECUTE directory to
be world-writable, as long as it is not on a root-squashed NFS mount and is
owned by the user given in the CONDOR_IDS setting (or by Condor's
real UID, if not started as root). Condor will automatically remove
world-writability from existing EXECUTE directories where possible.
Note: The EXECUTE directory has never been required to be
world-writable on Windows.
- With this release, a binary package for IA64 SUSE Linux Enterprise 8
will no longer be made available.
New Features:
- A clipped port to FreeBSD 7.0 x86 and x86_64 is available, but at this
time, it is not available for download as a binary package.
- Previously, condor_q -better-analyze was supported on most
but not all versions of Linux. It is now supported on all Unix platforms,
but not yet on Windows platforms.
Configuration Variable Additions and Changes:
- The new configuration variable
GRIDMANAGER_MAX_WS_DESTROYS_PER_RESOURCE limits the number
of simultaneous WS destroy commands issued to a given server for grid
universe jobs of type gt4. The default value is 5.
Bugs Fixed:
- Fixed a bug in the standard universe where if a Linux machine was
configured to use the Network Service Cache Daemon (nscd), taking
a checkpoint would be deferred indefinitely.
- Fixed a bug that caused the Quill daemon to crash.
- Fixed a bug that prevented Quill, when running on a
Windows host, from successfully updating the database.
- Fixed a bug that prevented Quill's condor_dbmsd daemon from properly
shutting down upon request when running on Windows platforms.
- Fixed a bug that caused Stork to be completely broken.
- As a back port from Condor versions 7.1,
the Windows Installer is now completely
internationalized: it will no longer fail to install because of a
missing "Users" group; instead, it will use the regionally appropriate
group.
- As a back port from Condor versions 7.1,
interoperability with Samba (as a PDC) has been improved.
Condor uses a fast form of login during credential validation.
Unfortunately, this login procedure fails under Samba,
even if the credentials are valid. The new behavior is to attempt
the fast login, and on failure, fall back to the slower form.
- As a back port from Condor versions 7.1,
Windows slot users no longer have the
Batch Privilege added, nor does Condor first attempt a Batch login
for slot users. This was causing permission problems on hardened
versions of Windows, such as Windows Server 2003, in that
non-interactive users lacked the permission to run batch files
(via the cmd.exe tool).
This affected any user submitting jobs that used
batch files as the executable.
- Fixed a bug that could sometimes cause the condor_schedd
to either EXCEPT or crash shortly after a user issues a condor_rm
command with the -forcex option.
- condor_history in a Quill environment,
when given the -constraint option,
would ignore attributes from the vertical schema. This has been fixed.
- In Unix, when started as root,
the condor_master now changes the
effective user id back to root (instead of condor)
when restarting itself.
This occurs for example due to the command condor_restart.
This makes no difference unless the condor_master is wrapped
with a script, and the script expects to be run as root
not only on initial start up, but on restart as well.
- The dedicated scheduler would sometimes take two negotiation cycles
to acquire all the machines it needed to run a job.
This has now been fixed.
- condor_dagman no longer prints "Argument added" and
"Retry Abort Value" diagnostic messages at the default verbosity,
to reduce the size of the dagman.out file and the start up time
for very large DAGs.
- condor_dagman now prints a few fatal parse errors at lower
verbosity settings than it did previously.
- condor_preen no longer deletes MyProxy password files in the
Condor spool directory.
- When using TCP updates (UDP updates are the default), the
condor_collector would sometimes freeze for 20 seconds when
receiving an invalidation notice.
The notice is received when Condor is being turned off
on a machine in the pool.
- Fixed a case in which the condor_schedd's job queue log file
could get corrupted when encountering errors writing to the disk, such
as "out of space". This type of corruption was detected by the
condor_schedd the next time it restarted and read the file to
restore the job queue, so you would only have been affected by this
problem if your condor_schedd refused to start up until you fixed or
removed the job queue log file. This bug has existed in all versions
of Condor, but it became more likely to occur in 6.9.4.
- The configuration setting JAVA may now contain spaces.
Previously, this did not work.
- Fixed a problem that caused occasional failure to detect hung
Condor daemons.
- Fixed a file descriptor leak in the negotiator. The leak happened
whenever the negotiator failed to initiate the NEGOTIATE command to
a condor_schedd, for example if security negotiation failed
with the condor_schedd.
Under Unix, this would eventually cause the condor_negotiator to run out of
file descriptors, exit, and restart. This bug affected all previous
versions of Condor.
- Fixed several bugs in the user log reader that caused it to
generate an invalid persisted state if no events had been read in.
When read back in, this persisted state would cause the reader to
segfault during initialization.
- Fixed a bug causing communication problems if different portions
of a Condor pool were configured with different values of
SEC_DEFAULT_SESSION_DURATION. This bug affects all
previous versions of Condor. The client side of the connection was
always using its own security session duration, even if the server's
duration was shorter. Among other potential problems, this was
observed to cause file transfer failures when the starter was
configured with a longer session duration than the shadow.
- Fixed a bug in the user log writer that was causing the writing
of events to the global event log to fail under some conditions.
- In the grid universe, submission of nordugrid jobs is now properly
throttled by configuration parameters
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE and
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE.
- The NorduGrid GAHP server can now properly extract job execution
information from newer NorduGrid servers. Previously, the GAHP could crash
when talking to newer servers.
- Fixed a bug that caused condor_config_val -set or
-rset to fail if security negotiation was turned off.
This happens, for example, if
SEC_DEFAULT_NEGOTIATION = NEVER.
This bug was introduced in Condor 7.0.0.
- Fixed a bug that could cause incorrect IP addresses to be advertised
when the condor_collector was on a multi-homed host.
- Fixed a problem where unexpected ownership and permissions on files
inside a job's working directory could cause the condor_starter to EXCEPT.
- Improved the speed at which the condor_startd can handle claim
requests, particularly when the condor_startd manages a large number
of slots.
- Fixed an error in the way the condor_procd calculates image size for
jobs that involve multiple processes. Previously the maximum image size for
any single process was being used. Now the image size sum across all
processes is used.
- The condor_procd no longer truncates its log file on start up.
Enabling a log file for the condor_procd is only recommended for
debugging, since it is not rotated to conserve disk space.
- Fixed a problem present in Condor 7.0.1 and 7.1.0 where the
condor_startd would crash upon deactivating or releasing a COD claim.
- Condor on Windows can now correctly handle job image size when
processes are created that allocate more than 2GB of address space.
- The JOB_INHERITS_STARTER_ENVIRONMENT setting now works when
the GLEXEC_STARTER feature is in use.
- Fixed a problem causing condor_schedd to perform poorly when
handling large job queues in which there are any idle local or
scheduler universe jobs (for example, Condor cron jobs).
- Sped up condor_schedd graceful shutdown when disconnecting
from running jobs that have job leases. Previously, it would only
disconnect from one such job at a time, so if there were a lot of jobs
running, condor_schedd could take so long to shut down that job leases
expired before it had a chance to restart and reconnect to the jobs.
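The condor_procd image size fix listed above changes which aggregate is reported for a job with several processes. The difference can be sketched as follows; this is an illustration in Python, not the condor_procd code, and the function names, the units, and the per-process sizes are all hypothetical.

```python
def image_size_old(process_sizes):
    """Earlier behavior: report the largest single-process image size."""
    return max(process_sizes, default=0)


def image_size_new(process_sizes):
    """Fixed behavior: report the sum over all processes of the job."""
    return sum(process_sizes)


# Hypothetical image sizes of three processes belonging to one job.
sizes = [40000, 25000, 10000]
old_value = image_size_old(sizes)  # 40000
new_value = image_size_new(sizes)  # 75000
```

For a job with a single process the two calculations agree; they diverge only when the job involves multiple processes, which is the case the fix addresses.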
Known Bugs:
Additions and Changes to the Manual:
Version 7.0.1
Release Notes:
- Fixed a bug in Condor's authorization policy reader. The bug
affects cases where the policy (ALLOW/DENY and
HOSTALLOW/HOSTDENY settings) mixes host-based
authorizations with authorizations that refer to the authenticated
user name. In some cases, this bug would result in host-based
settings not being applied to authenticated users.
New Features:
- Support for Backfill Jobs is now available on Windows platforms.
For more information on this, please see
section 3.13.11.
- Condor has been ported to Red Hat Enterprise Linux
5.0 running on the 32-bit x86 architecture and on the 64-bit x86_64
architecture.
- The command email_attributes in a job submit
description file defines a set of job ClassAd attributes whose values
should be included in the e-mail notification of job completion.
- The configuration variable CONDOR_VIEW_HOST may now
contain a port number, and may refer to a
condor_collector daemon running on the same host as the
condor_collector that is forwarding ClassAds. It is also now possible to
use the forwarded ClassAds for matchmaking purposes. For example, several
condor_collector daemons could forward ClassAds to
a single aggregating condor_collector daemon which
a condor_negotiator then uses as its source of information for
matchmaking.
- condor_configure and condor_install now detect missing
shared libraries (such as libstdc++.so.5 on Linux), and print
messages and exit if missing libraries are detected. The new command
line option -ignore-missing-libs causes them not to exit
after the messages have been printed, and to proceed with the
installation.
- Added a -force command line option to condor_configure
(and condor_install) which will turn on -overwrite and
-ignore-missing-libs.
- condor_configure now writes simple sh and csh shell scripts
which can be sourced by their respective shells to set the user's
PATH and CONDOR_CONFIG environment variables. By default, these
are created in the root of the Condor installation, but this can be
changed via the -env-scripts-dir command line option. Also,
the creation of these scripts can be disabled with the
-no-env-scripts command line option.
Configuration Variable Additions and Changes:
- The new configuration variables PREEMPTION_REQUIREMENTS_STABLE
and PREEMPTION_RANK_STABLE are boolean values to
identify whether or not attributes used within the definition of
PREEMPTION_REQUIREMENTS and PREEMPTION_RANK remain
unchanged during a negotiation cycle.
See section 3.3.17 for complete definitions.
- The configuration variable STARTER_UPLOAD_TIMEOUT
changed its default value to 300 seconds.
- The new configuration variable CKPT_PROBE
specifies an executable, internal to Condor,
which determines information about how a process is laid out
in memory, in addition to other information. This executable is not yet
available on Windows platforms.
- The new configuration variable
CKPT_SERVER_CHECK_PARENT_INTERVAL sets an interval
of time between checks by the checkpoint server to see if
its parent, the condor_master daemon, has gone away unexpectedly.
The checkpoint server shuts itself down if this happens.
The default interval for checking is 120 seconds.
Setting this parameter to 0 disables the check.
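The behavior controlled by CKPT_SERVER_CHECK_PARENT_INTERVAL, as described in the last item above, amounts to a periodic liveness check on the parent. A minimal sketch of that pattern in Python follows; it is not the checkpoint server's code, and the function watch_parent, its parameters, and the simulated parent are hypothetical. The liveness test and the sleep function are passed in so that the example runs without waiting.

```python
def watch_parent(interval_seconds, parent_alive, sleep, max_checks):
    """Poll parent_alive() once per interval.

    Returns "disabled" if the interval is 0, "shutdown" as soon as the
    parent is found to be gone, and "running" if max_checks polls pass
    with the parent still alive.
    """
    if interval_seconds == 0:
        return "disabled"
    for _ in range(max_checks):
        sleep(interval_seconds)
        if not parent_alive():
            return "shutdown"
    return "running"


# Simulated run: the parent disappears before the third check.
answers = iter([True, True, False])
slept = []
result = watch_parent(120, lambda: next(answers), slept.append, max_checks=5)
```

Here result is "shutdown" after three simulated waits of 120 seconds each, and calling watch_parent with an interval of 0 returns "disabled" without checking at all.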
Bugs Fixed:
Known Bugs:
- When using condor_compile with the RHEL5 x86 port of Condor to
produce a standard universe executable, one will see a warning message
about how linking with dynamic libraries is not portable. This warning
is erroneous and should be ignored. It will be fixed in a future version
of Condor.
Additions and Changes to the Manual:
- The existing configuration variables
SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_RELEASE, and
SYSTEM_PERIODIC_REMOVE now have documented definitions.
See section 3.3.11 for definitions.
- A manual page for condor_load_history has been added.
Version 7.0.0
Release Notes:
- PVM support has been dropped.
- The time zone for the PostgreSQL 8.2 database
used with Quill on Windows machines must be explicitly set
to use an abbreviation.
The Windows environment variable to set is TZ.
Proper abbreviations for the value of this variable may be found
within the PostgreSQL installation in a file,
share/timezonesets/<continent>.txt, where
<continent> is replaced by the continent of the
desired time zone.
New Features:
Configuration Variable Additions and Changes:
- The new configuration variable
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME defines the maximum
time in seconds that a daemon will fail over from a primary
condor_collector to a secondary condor_collector.
See section 3.3.3 for a complete definition.
Bugs Fixed:
Known Bugs:
Additions and Changes to the Manual: