8.7 Stable Release Series 7.0

This is a stable release series of Condor. It is based on the 6.9 development series. All new features added or bugs fixed in the 6.9 series are available in the 7.0 series. As usual, only bug fixes (and potentially, ports to new platforms) will be provided in future 7.0.x releases. New features will be added in the 7.1.x development series.

On backwards compatibility: we believe that Condor 7.0.x and 6.8.x are wire-compatible, and can be freely mixed between computers in a Condor pool. However, we do not regularly test this compatibility and cannot guarantee it, so we recommend using a single release of Condor when possible. Please note that although you can mix Condor 7.0.x and 6.8.x in a pool, you cannot mix them on a single computer. That is, a condor_master daemon running 6.8.x cannot run Condor daemons from version 7.0.x, or vice-versa.

The details of each version are described below.


Version 7.0.6

Release Notes:

  • None.

New Features:

  • None.

Configuration Variable Additions and Changes:

  • None.

Bugs Fixed:

  • In some rare cases, the condor_startd failed to fully preempt jobs. The job itself was killed, but the condor_starter process watching over it would not be killed. The slot would then stay in the Preempting state indefinitely.

Known Bugs:

  • None.

Additions and Changes to the Manual:

  • None.


Version 7.0.5

Release Notes:

This release contains many bug fixes and some improvements to error handling of Local Universe jobs. Note that some of the bug fixes are security-related; therefore, we recommend sites either upgrade Condor, or restrict permissions on who is allowed to submit Condor jobs to trusted users. Bug fixes that are security related are clearly marked in the Bugs Fixed section below along with a description of the potential security impact. The Condor Project believes in the full disclosure of information, and therefore complete vulnerability details can be found at http://www.cs.wisc.edu/condor/security/. However, in order to give an adequate upgrade window for production installations, we will delay posting the full vulnerability details fixed in this release for 30 days (until the week of November 3rd 2008).

New Features:

  • Local universe jobs now go on hold for the same specific reasons that vanilla jobs may go on hold. Examples are missing input or executable files. Previously, when local universe jobs failed in this manner, the jobs returned to the idle state in the job queue, repeatedly attempting to run and failing until the job was removed.

  • Local universe jobs now have the ClassAd attribute NumShadowStarts. Although local universe jobs do not have a condor_shadow process, this attribute is introduced to keep management of local universe as similar to vanilla universe as possible. For local universe jobs, this attribute is identical to the attribute JobRunCount, which indicates how many times a local condor_starter process has been created to run the job.

Configuration Variable Additions and Changes:

  • None.

Bugs Fixed:

  • Security Item: A flaw was found and fixed in the way Condor processes user submitted jobs. It was possible for a user who had permissions to submit jobs into Condor to do so in a way that could cause that job to run as any other non-root user. We have not had any reported incidents exploiting this flaw. (CVE-2008-3826)

  • Security Item: A stack-based buffer overflow flaw was found and fixed in the condor_schedd daemon. A user who had permissions to submit a job could do so in a manner that could cause the condor_schedd to crash, or potentially, execute arbitrary code on the submit machine with the condor_schedd's identity. We are not aware of any known exploits for this flaw. We have not had any reported incidents exploiting this flaw. (CVE-2008-3828)

  • Security Item: A denial-of-service flaw was found and fixed in the condor_schedd daemon. A user who had permissions to submit a job could have done so in a manner that would cause condor_schedd to crash. We have not had any reported incidents exploiting this flaw. (CVE-2008-3829)

  • Security Item: A flaw was found and fixed in the way Condor processes allow and deny net masks for access control. If Condor's configuration file contained overlapping net masks in the allow or deny rules, it could have caused those rules to be ignored, potentially allowing unintended access to users in Condor's deny authorization lists. We have not had any reported incidents exploiting this flaw. (CVE-2008-3830)

  • Fixed a segmentation fault bug with condor_submit -dump when universe=grid or x509userproxy=<anything>.

  • Fixed a stack overflow bug in the condor_negotiator daemon.

  • Fixed condor_submit -dump such that it would function with the standard universe.

  • Fixed a memory leak in the condor_startd, which occurred during the handling of a condor_reconfig command.

  • When the configuration variable NEGOTIATOR_CONSIDER_PREEMPTION is defined to be False, this no longer results in machines in the Owner state being ignored during matchmaking. Previously, even if START was True, machines in the Owner state were disregarded.

  • Setting JobLeaseDuration to be less than 15 minutes caused the condor_schedd daemon to abort and restart the next time a condor_reconfig command was executed. The error message in the condor_schedd log appeared as:

    ALIVE_INTERVAL in the condor configuration is too high (300).
    

  • Fixed a slow memory leak affecting the condor_startd, condor_schedd, and condor_collector daemons. This leak would probably require many months of continuous operation before causing noticeable problems.

  • Fixed a bug that caused a condor_schedd daemon crash. The crash occurred during a fast shut down of the condor_schedd daemon as it dealt with local universe jobs or with any job that required reconnection when the condor_schedd daemon started up.

  • Local and scheduler universe jobs were failing to increment the JobRunCount attribute in the job ClassAd when an attempt to run the job was made. This problem was introduced in 6.9.5.

  • Some rare types of failures during file transfer caused the Condor daemon conducting the transfer to hang indefinitely. For example, if the file transfer process created by the condor_schedd was killed by an administrator or crashed due to an internal error, the condor_schedd would become unresponsive.

  • GCB was updated, fixing minor bugs with GCB temporary files (typically the file(s) /tmp/gcb-inherit-*). These bugs did not impact GCB functionality. Earlier versions would leave temporary files behind. Temporary files would have permissions of 000. With the fix, under normal operations the files should be deleted, and the condor user should have read and write access to the files.

  • Evaluation of the configuration variable STARTD_AD_REEVAL_EXPR did not work for many types of expressions. The problem resulted in the following message in the condor_negotiator daemon log:

    Can't evaluate STARTD_AD_REEVAL_EXPR  ...
    

  • Reconnecting to parallel universe jobs after a restart of the condor_schedd daemon would sometimes fail. The failure was caused by the condor_shadow trying to connect to the address of the previous instance of the condor_schedd rather than the address of the current instance.

  • Made the condor_gridmanager less aggressive in forwarding refreshed proxies for gt2 grid universe jobs. Now, the refreshed proxy will not be forwarded until the old proxy has less than six hours of life until expiration.

  • Fixed a bug in the condor_gridmanager that could result in job status updates from the Grid Monitor being ignored.

  • The Grid Monitor no longer changes the last-modified time of GRAM state files whose job's status is FAILED. This should make it easier for file cleaners to remove the GRAM state files of old, abandoned jobs.

  • Fixed a problem that could cause flocked jobs to fail due to authorization errors in the condor_starter. Such failures were more likely to occur for long-running jobs or if the condor_schedd were issued a full reconfig during the job's execution.

  • Fixed a condor_gridmanager crash on Windows. This crash only appeared if GRIDMANAGER_DEBUG were set to a higher level than the default.

  • In PrivSep mode, a job would previously fail if it created a symlink in its sandbox pointing to a file owned by a UID other than that used to run the job. This behavior has been fixed.
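
The net mask fix above (CVE-2008-3830) concerned overlapping entries in allow and deny rules. As a rough aid for sites auditing their own configuration, the following Python sketch flags entries whose address ranges overlap. It is an illustration only: it uses Python's ipaddress module rather than Condor's own parser, it assumes each entry is written as address/netmask or in CIDR form, and the function name and sample addresses are hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_masks(entries):
    """Return pairs of entries whose address ranges overlap.

    Each entry is assumed to look like "192.168.0.0/255.255.0.0"
    or "10.0.0.0/8"; this is not Condor's own parsing logic.
    """
    nets = [(e, ipaddress.ip_network(e, strict=False)) for e in entries]
    pairs = []
    for (a, net_a), (b, net_b) in combinations(nets, 2):
        # Only compare networks of the same IP version.
        if net_a.version == net_b.version and net_a.overlaps(net_b):
            pairs.append((a, b))
    return pairs


# Hypothetical entries taken from an allow or deny list.
sample = ["192.168.0.0/255.255.0.0", "192.168.1.0/255.255.255.0", "10.0.0.0/8"]
for first, second in overlapping_masks(sample):
    print("overlap:", first, "and", second)
```

Any pair reported by such a check is a candidate for review on versions earlier than 7.0.5.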

Known Bugs:

  • None.

Additions and Changes to the Manual:

  • Descriptions of previously undocumented Condor Perl module subroutines have been added to the manual. See section 4.5.6.


Version 7.0.4

Release Notes:

  • This release fixes a problem causing possible incorrect handling of wild cards in authorization lists. Examples of the configuration variables that specify authorization lists are
      ALLOW_WRITE
      DENY_WRITE
      HOSTALLOW_WRITE
      HOSTDENY_WRITE
    
    If the asterisk character (*) is used in any configuration variable that specifies the authorization policy, it is advisable to upgrade. This is especially true for the use of wild cards in any DENY list, since this problem could result in access being allowed when it should have been denied. This issue affects all previous versions of Condor.

  • The default daemon-to-daemon security session duration has been changed from 100 days to 1 day. This should reduce memory usage in the condor_collector in pools with short-lived condor_startds (e.g. glidein pools or pools whose machines are rebooted every night).

New Features:

  • Added functionality to periodically update timestamps on lock files. This prevents administrative programs from deleting in-use lock files and causing undefined behavior.

  • When the configuration variable SCHEDD_NAME ends in the @ symbol, Condor will no longer append the fully qualified host name to the value. This makes it possible to configure a high availability job queue that works with the remote submission of jobs.
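
The SCHEDD_NAME rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not Condor's implementation: the function name is hypothetical, the "@" separator used when the host name is appended is an assumption, names containing "@" in the middle are not modeled, and the sample names are made up.

```python
def effective_schedd_name(schedd_name, fqdn):
    """Model how a configured SCHEDD_NAME becomes the advertised name."""
    if schedd_name.endswith("@"):
        # As of 7.0.4: a trailing "@" means the value is used as configured.
        return schedd_name
    # Otherwise the fully qualified host name is appended
    # (the "@" separator here is an assumption).
    return schedd_name + "@" + fqdn


# Hypothetical values for illustration.
print(effective_schedd_name("had_schedd@", "submit.example.org"))
print(effective_schedd_name("myschedd", "submit.example.org"))
```

Because the first form does not depend on the host name, the same name can be used no matter which machine is currently running the condor_schedd, which is what a high availability job queue needs.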

Configuration Variable Additions and Changes:

  • Added configuration variable LOCK_FILE_UPDATE_INTERVAL. Please see the definition of this variable elsewhere in the manual for a complete description.

  • Changed the default value of configuration variable SEC_DEFAULT_SESSION_DURATION from 8640000 seconds (100 days) to 86400 seconds (1 day).

Bugs Fixed:

  • Fixed a bug in the condor_c-gahp that caused it to fail repeatedly on Windows, if more than two Condor-C jobs were submitted at the same time.

  • Fixed a problem that caused the condor_collector's memory usage to increase dramatically, if condor_findhost was run repeatedly.

  • Fixed a bug where Windows jobs suspended by Condor would never be continued, despite log files indicating successful continuation. This problem has existed since the 6.9.2 release of Condor.

  • Fixed a problem that could cause condor_dagman to core dump if straced, especially if the dagman.out file is on a shared file system.

  • Fixed a problem introduced in 7.0.1 that could cause the condor_schedd daemon to crash when starting parallel or MPI universe jobs. In some cases, the problem would result in the following log message:

    ERROR "Assertion ERROR on (mrec->request_claim_sock == sock)" \
      at line 1361 in file dedicated_scheduler.C
    

  • The condor_procd daemon now periodically updates the timestamps on the named pipe file system objects that it uses for communication. This prevents these objects from being cleaned up by programs like tmpwatch, which would result in Condor daemon exceptions.

  • Fixed a problem introduced in Condor 7.0.2 that would cause daemons to fail on start up on Windows 2000.

  • Fixed a problem where standard universe jobs would fail to start when using PrivSep, if the PROCD_ADDRESS configuration variable was not defined.

  • If the X509 proxy of a vanilla universe job has been refreshed, the updated file will no longer be returned when the job completes.

  • If ClassAd attributes StreamOut or StreamErr are missing from the job ClassAd of a grid universe job, the default value for these attributes is now False.

Known Bugs:

  • A bug in 7.0.4 affects jobs using Condor file transfer on submit machines that are configured to deny write access from execute machines. The result is that output from jobs may fail to be copied back to the submit machine. The problem may or may not affect jobs that run for less than eight hours, but it definitely will affect jobs that run for more than eight hours. An example of a configuration vulnerable to this problem is one where DAEMON level access is allowed to all execute nodes but WRITE level access is not. When the problem happens, the condor_shadow log will contain a line like the following:

    DaemonCore: PERMISSION DENIED to unknown user from host ...
    for command 61001 (FILETRANS_DOWNLOAD), access level WRITE
    

    The workaround for this problem is to allow WRITE access from the execute nodes. If the existing configuration requires WRITE access to be authenticated, then simply add WRITE access by the authenticated condor identities associated with all execute nodes. If WRITE access is not currently required to be authenticated, then allow unauthenticated WRITE access from all worker nodes. Note that this does not imply that execute nodes will be able to modify the job queue without authenticating. Remote commands that modify the job queue (for example, condor_submit or condor_qedit) always require that the user be authenticated, no matter what configuration options are used; if no method of remote authentication can succeed in the pool for WRITE operations, then commands that modify the job queue can only run on the submit machine.
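
To check whether a submit machine is affected by the known bug above, the condor_shadow log can be searched for the message shown. The following Python sketch is one way to do that. It is not a Condor tool: the function name is hypothetical, and it matches only on the command name and access level quoted above, so it works whether or not the log line is wrapped.

```python
def find_filetrans_denials(lines):
    """Return 1-based line numbers that report a denied FILETRANS_DOWNLOAD."""
    hits = []
    for number, line in enumerate(lines, 1):
        # Match the fragment quoted in the known bug description.
        if "(FILETRANS_DOWNLOAD)" in line and "access level WRITE" in line:
            hits.append(number)
    return hits


# Example input built from the message quoted above; the host is elided
# in the manual and is left elided here.
sample = [
    "DaemonCore: PERMISSION DENIED to unknown user from host ... "
    "for command 61001 (FILETRANS_DOWNLOAD), access level WRITE",
    "some unrelated log line",
]
print(find_filetrans_denials(sample))
```

In practice the lines would be read from the condor_shadow log file of the submit machine.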

Additions and Changes to the Manual:

  • None.


Version 7.0.3

Release Notes:

  • This is a bug fix release. A bug in Condor version 7.0.2 sometimes caused the condor_schedd to become unresponsive for 20 seconds when starting the condor_shadow to run a job. Therefore, anyone running 7.0.2 is strongly encouraged to upgrade.

New Features:

  • None.

Configuration Variable Additions and Changes:

  • The configuration variable VALID_SPOOL_FILES now automatically includes SCHEDD.lock, the lock file used for high availability condor_schedd fail over. Other high availability lock files are not currently included.

Bugs Fixed:

  • Fixed a problem sometimes causing minutes or more of lag between the time of job suspension or unsuspension and the corresponding entries in the job user log.

  • Fixed a problem in condor_q -better-analyze handling requirements expressions containing the expression =!= UNDEFINED.

  • Configuration variable GRIDMANAGER_GAHP_CALL_TIMEOUT is now recognized for nordugrid grid universe jobs.

  • Fixed a bug that could cause the condor_schedd daemon to abort and restart some time after a graceful restart, when jobs to which the condor_schedd daemon reconnected were preempted.

  • Fixed a bug causing failure to reconnect to jobs which use $$([expression]) in their ClassAds. The jobs would go on hold with the hold reason: "Cannot expand $$([expression])."

  • Fixed a bug in Condor version 7.0.2 that sometimes caused the condor_schedd daemon to become unresponsive for 20 seconds when starting the condor_shadow daemon to run a job.

Known Bugs:

  • None.

Additions and Changes to the Manual:

  • See section 4.5.1 for documentation on finding the port number the condor_schedd daemon is listening on for use with the web service API.


Version 7.0.2

Release Notes:

  • On Unix, Condor no longer requires its EXECUTE directory to be world-writable, as long as it is not on a root-squashed NFS mount and is owned by the user given in the CONDOR_IDS setting (or by Condor's real UID, if not started as root). Condor will automatically remove world-writability from existing EXECUTE directories where possible. Note: The EXECUTE directory has never been required to be world-writable on Windows.

  • With this release, a binary package for IA64 SUSE Linux Enterprise 8 will no longer be made available.
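
A site that wants to confirm whether an existing EXECUTE directory is still world-writable could use a check like the following Python sketch. It is a minimal illustration and not part of Condor; the function name is hypothetical, and the path to test is whatever the local EXECUTE setting points to.

```python
import os
import stat


def is_world_writable(path):
    """Return True if the 'other' write bit is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IWOTH)


# Example: check the current directory; substitute the EXECUTE directory.
print(is_world_writable("."))
```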

New Features:

  • A clipped port to FreeBSD 7.0 x86 and x86_64 is available, but at this time, it is not available for download as a binary package.

  • Previously, condor_q -better-analyze was supported on most but not all versions of Linux. It is now supported on all Unix platforms, but not yet on Windows platforms.

Configuration Variable Additions and Changes:

  • The new configuration variable GRIDMANAGER_MAX_WS_DESTROYS_PER_RESOURCE limits the number of simultaneous WS destroy commands issued to a given server for grid universe jobs of type gt4. The default value is 5.

Bugs Fixed:

  • Fixed a bug in the standard universe where if a Linux machine was configured to use the Network Service Cache Daemon (nscd), taking a checkpoint would be deferred indefinitely.

  • Fixed a bug that caused the Quill daemon to crash.

  • Fixed a bug that prevented Quill, when running on a Windows host, from successfully updating the database.

  • Fixed a bug that prevented Quill's condor_dbmsd daemon from properly shutting down upon request when running on Windows platforms.

  • Fixed a bug that caused Stork to be completely broken.

  • As a back port from Condor versions 7.1, the Windows Installer is now completely internationalized: it will no longer fail to install because of a missing "Users" group; instead, it will use the regionally appropriate group.

  • As a back port from Condor versions 7.1, interoperability with Samba (as a PDC) has been improved. Condor uses a fast form of login during credential validation. Unfortunately, this login procedure fails under Samba, even if the credentials are valid. The new behavior is to attempt the fast login, and on failure, fall back to the slower form.

  • As a back port from Condor versions 7.1, Windows slot users no longer have the Batch Privilege added, nor does Condor first attempt a Batch login for slot users. This was causing permission problems on hardened versions of Windows, such as Windows Server 2003, in that non-interactive users lacked the permission to run batch files (via the cmd.exe tool). This affected any user submitting jobs that used batch files as the executable.

  • Fixed a bug that could sometimes cause the condor_schedd to either EXCEPT or crash shortly after a user issues a condor_rm command with the -forcex option.

  • condor_history in a Quill environment, when given the -constraint option, would ignore attributes from the vertical schema. This has been fixed.

  • In Unix, when started as root, the condor_master now changes the effective user id back to root (instead of condor) when restarting itself. This occurs for example due to the command condor_restart. This makes no difference unless the condor_master is wrapped with a script, and the script expects to be run as root not only on initial start up, but on restart as well.

  • The dedicated scheduler would sometimes take two negotiation cycles to acquire all the machines it needed to run a job. This has now been fixed.

  • condor_dagman no longer prints "Argument added" and "Retry Abort Value" diagnostic messages at the default verbosity, to reduce the size of the dagman.out file and the start up time for very large DAGs.

  • condor_dagman now prints a few fatal parse errors at lower verbosity settings than it did previously.

  • condor_preen no longer deletes MyProxy password files in the Condor spool directory.

  • When using TCP updates (UDP updates are the default), the condor_collector would sometimes freeze for 20 seconds when receiving an invalidation notice. The notice is received when Condor is being turned off on a machine in the pool.

  • Fixed a case in which the condor_schedd's job queue log file could get corrupted when encountering errors writing to the disk such as "out of space". This type of corruption was detected by the condor_schedd the next time it restarted and read the file to restore the job queue, so you would only have been affected by this problem if your condor_schedd refused to start up until you fixed or removed the job queue log file. This bug has existed in all versions of Condor, but it became more likely to occur in 6.9.4.

  • The configuration setting JAVA may now contain spaces. Previously, this did not work.

  • Fixed a problem that caused occasional failure to detect hung Condor daemons.

  • Fixed a file descriptor leak in the negotiator. The leak happened whenever the negotiator failed to initiate the NEGOTIATE command to a condor_schedd, for example if security negotiation failed with the condor_schedd. Under Unix, this would eventually cause the condor_negotiator to run out of file descriptors, exit, and restart. This bug affected all previous versions of Condor.

  • Fixed several bugs in the user log reader that caused it to generate an invalid persisted state if no events had been read in. When read back in, this persisted state would cause the reader to segfault during initialization.

  • Fixed a bug causing communication problems if different portions of a Condor pool were configured with different values of SEC_DEFAULT_SESSION_DURATION. This bug affects all previous versions of Condor. The client side of the connection was always using its own security session duration, even if the server's duration was shorter. Among other potential problems, this was observed to cause file transfer failures when the starter was configured with a longer session duration than the shadow.

  • Fixed a bug in the user log writer that was causing the writing of events to the global event log to fail in some conditions.

  • In the grid universe, submission of nordugrid jobs is now properly throttled by configuration parameters GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE and GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE.

  • The NorduGrid GAHP server can now properly extract job execution information from newer NorduGrid servers. Previously, the GAHP could crash when talking to newer servers.

  • Fixed a bug that caused condor_config_val -set or -rset to fail if security negotiation was turned off. This happens, for example, if SEC_DEFAULT_NEGOTIATION = NEVER. This bug was introduced in Condor 7.0.0.

  • Fixed a problem where unexpected ownership and permissions on files inside a job's working directory could cause the condor_starter to EXCEPT.

  • Improved the speed at which the condor_startd can handle claim requests, particularly when the condor_startd manages a large number of slots.

  • Fixed an error in the way the condor_procd calculates image size for jobs that involve multiple processes. Previously the maximum image size for any single process was being used. Now the image size sum across all processes is used.

  • The condor_procd no longer truncates its log file on start up. Enabling a log file for the condor_procd is only recommended for debugging, since it is not rotated to conserve disk space.

  • Fixed a problem present in Condor 7.0.1 and 7.1.0 where the condor_startd would crash upon deactivating or releasing a COD claim.

  • Condor on Windows can now correctly handle job image size when processes are created that allocate more than 2GB of address space.

  • The JOB_INHERITS_STARTER_ENVIRONMENT setting now works when the GLEXEC_STARTER feature is in use.

  • Fixed a problem causing condor_schedd to perform poorly when handling large job queues in which there are any idle local or scheduler universe jobs (for example, Condor cron jobs).

  • Sped up condor_schedd graceful shutdown when disconnecting from running jobs that have job leases. Previously, it would only disconnect from one such job at a time, so if there were a lot of jobs running, condor_schedd could take so long to shut down that job leases expired before it had a chance to restart and reconnect to the jobs.

  • Fixed a bug that could cause incorrect IP addresses to be advertised when the condor_collector was on a multi-homed host.
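
The condor_procd image size fix described in the list above changes the calculation from a maximum to a sum. The difference can be illustrated with a short Python sketch. The function names and the per-process sizes are hypothetical and are not taken from Condor's source.

```python
def image_size_before_fix(process_sizes):
    """Old behavior: the largest image size of any single process."""
    return max(process_sizes, default=0)


def image_size_after_fix(process_sizes):
    """New behavior: the sum of image sizes across all processes."""
    return sum(process_sizes)


# Hypothetical image sizes (in KB) for a job made up of three processes.
sizes = [1000, 2500, 400]
print(image_size_before_fix(sizes))  # 2500
print(image_size_after_fix(sizes))   # 3900
```

For a job with a single process the two calculations agree, so only jobs that involve multiple processes see a different reported image size.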

Known Bugs:

  • None.

Additions and Changes to the Manual:

  • None.


Version 7.0.1

Release Notes:

  • Fixed a bug in Condor's authorization policy reader. The bug affects cases where the policy (ALLOW/DENY and HOSTALLOW/HOSTDENY settings) mixes host-based authorizations with authorizations that refer to the authenticated user name. In some cases, this bug would result in host-based settings not being applied to authenticated users.

New Features:

  • Support for Backfill Jobs is now available on Windows platforms. For more information on this, please see section 3.13.11.

  • Condor has been ported to Red Hat Enterprise Linux 5.0 running on the 32-bit x86 architecture and on the 64-bit x86_64 architecture.

  • The command email_attributes in a job submit description file defines a set of job ClassAd attributes whose values should be included in the e-mail notification of job completion.

  • The configuration variable CONDOR_VIEW_HOST may now contain a port number, and may refer to a condor_collector daemon running on the same host as the condor_collector that is forwarding ClassAds. It is also now possible to use the forwarded ClassAds for matchmaking purposes. For example, several condor_collector daemons could forward ClassAds to a single aggregating condor_collector daemon which a condor_negotiator then uses as its source of information for matchmaking.

  • condor_configure and condor_install now detect missing shared libraries (such as libstdc++.so.5 on Linux), and print messages and exit if missing libraries are detected. The new command line option -ignore-missing-libs causes it not to exit after the messages have been printed, and to proceed with the installation.

  • Added a -force command line option to condor_configure (and condor_install) which will turn on -overwrite and -ignore-missing-libs.

  • condor_configure now writes simple sh and csh shell scripts which can be sourced by their respective shells to set the user's PATH and CONDOR_CONFIG environment variables. By default, these are created in the root of the Condor installation, but this can be changed via the -env-scripts-dir command line option. Also, the creation of these scripts can be disabled with the -no-env-scripts command line option.

Configuration Variable Additions and Changes:

  • The new configuration variables PREEMPTION_REQUIREMENTS_STABLE and PREEMPTION_RANK_STABLE are boolean values to identify whether or not attributes used within the definition of PREEMPTION_REQUIREMENTS and PREEMPTION_RANK remain unchanged during a negotiation cycle. See section 3.3.17 for complete definitions.

  • The configuration variable STARTER_UPLOAD_TIMEOUT changed its default value to 300 seconds.

  • The new configuration variable CKPT_PROBE specifies an executable, internal to Condor, which determines information about how a process is laid out in memory, in addition to other information. This executable is not yet available on Windows platforms.

  • The new configuration variable CKPT_SERVER_CHECK_PARENT_INTERVAL sets an interval of time between checks by the checkpoint server to see if its parent, the condor_master daemon, has gone away unexpectedly. The checkpoint server shuts itself down if this happens. The default interval for checking is 120 seconds. Setting this parameter to 0 disables the check.

Bugs Fixed:

  • Upgrade from PCRE v5.0 to PCRE v7.6, due to security vulnerabilities found in PCRE v5.0.

  • Fixed file descriptor leak in the condor_schedd when using the SOAP interface.

  • Fixed a bug that primarily affected pools with MaxJobRetirementTime (0 by default) set larger than REQUEST_CLAIM_TIMEOUT (30 minutes by default). Since 6.9.3, when the condor_schedd timed out requesting a claim to a slot, the condor_startd was not made aware of the canceled request. This resulted in some wasted time (up to ALIVE_INTERVAL) in which the condor_startd would wait for a job to run.

  • A problem with condor_history in a Quill environment incorrectly interpreting the -name option has been fixed.

  • A memory leak that prevented condor_load_history from running with large history files has been fixed.

  • A bug in condor_history when running in a Quill environment has been fixed. This bug would cause the command to crash in some situations.

  • The job ClassAd attribute EmailAttributes now works for grid universe jobs.

  • On 32-bit Linux platforms, the job queue database file may now exceed 2GB. Previously, the condor_schedd would halt with an error when trying to write past the 2GB mark.

  • On 32-bit Linux platforms, condor_history can now read from history files larger than 2GB except when using the -backwards option.

  • Local universe jobs are now scheduled to run more promptly. Previously, new local universe jobs would sometimes take up to SCHEDD_INTERVAL (default 5 minutes) to be considered for running.

  • The memory usage of the condor_collector used to grow over time if daemons with new names kept joining and then leaving the pool (for example, in a Glidein pool). This was due to statistics on dropped updates that accumulated for all daemons that ever advertised themselves to the condor_collector. These statistics are now periodically purged of information about daemons which have not reported in a long time. How long is controlled by COLLECTOR_STATS_SWEEP, which defaults to 2 days.

  • Condor daemons would die when trying to send ClassAd advertisements to a host name that could not be resolved by DNS.

  • Since 6.9.5, file transfer errors for vanilla, java, or parallel jobs would sometimes not result in the job going on hold as it should. This was most likely for very small files that failed to be written for some reason.

  • The ImageSize reported for jobs on AIX was too big by a factor of 1024.

  • Since 6.9.5, condor_glidein failed in the set up stage, due to the change in syntax of quoting rules in the Condor submit description file for gt2 argument strings.

  • Fixed a bug in the condor_gridmanager that could prevent refreshed X509 proxies from being forwarded to the remote machine for grid universe jobs of type gt4.

  • Fixed a bug in Condor's authorization policy reader. The bug affects cases where the policy (ALLOW/DENY and HOSTALLOW/HOSTDENY settings) mixes host-based authorizations with authorizations that refer to the authenticated user name. In some cases, this bug would result in host-based settings not being applied to authenticated users.

  • Fixed a bug in condor_history which caused a crash when condor_quill was enabled.

  • Fixed a problem affecting the GSI and SSL authentication methods. When these methods successfully authenticated the user but failed to find a mapping of the X509 name to a condor user id, they were setting the authenticated name to gsi and ssl respectively. However, these names contain no domain, so they could not be referred to in the authorization policy. Now these anonymous mappings are gsi@unmappeduser and ssl@unmappeduser. Therefore, configuration to deny access by users who are not explicitly mapped in the map file appears as:

    DENY_READ = *@unmappeduser
    DENY_WRITE = *@unmappeduser
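
The effect of the renamed anonymous mappings can be illustrated with a small Python sketch. It uses Python's fnmatch module as a stand-in for wild card matching, which is an assumption and not Condor's own matching code; the function name and the example user alice@example.org are hypothetical.

```python
from fnmatch import fnmatchcase


def is_denied(user, deny_patterns):
    """Return True if user matches any pattern in the deny list."""
    return any(fnmatchcase(user, pattern) for pattern in deny_patterns)


deny = ["*@unmappeduser"]

# New anonymous names carry a domain, so the deny entry matches them.
print(is_denied("gsi@unmappeduser", deny))
print(is_denied("ssl@unmappeduser", deny))

# The old names had no domain and could not be matched by such an entry.
print(is_denied("gsi", deny))

# An explicitly mapped user (hypothetical) is unaffected.
print(is_denied("alice@example.org", deny))
```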
    

Known Bugs:

  • When using condor_compile with the RHEL5 x86 port of Condor to produce a standard universe executable, one will see a warning message about how linking with dynamic libraries is not portable. This warning is erroneous and should be ignored. It will be fixed in a future version of Condor.

Additions and Changes to the Manual:

  • The existing configuration variables SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_RELEASE, and SYSTEM_PERIODIC_REMOVE now have documented definitions. See section 3.3.11 for definitions.

  • A manual page for condor_load_history has been added.


Version 7.0.0

Release Notes:

  • PVM support has been dropped.

  • The time zone for the PostgreSQL 8.2 database used with Quill on Windows machines must be explicitly set to use an abbreviation. This Windows environment variable is TZ. Proper abbreviations for the value of this variable may be found within the PostgreSQL installation in a file, share/timezonesets/<continent>.txt, where <continent> is replaced by the continent of the desired time zone.

New Features:

  • The Windows MSI installer now supports VM Universe.

  • Eliminated the ``tarball in a tarball'' in our distribution. The contents of release.tar from the distribution tarball (for example, condor-6.9.6-linux-x86-centos45-dynamic.tar.gz) are now included directly in the distribution tarball.

  • Updated condor_configure to match the above change. The -install option now takes a directory path as its parameter, for example -install=/path/to/release. It previously took the path to the release.tar tarball.

  • Added condor_install, which is a symlink to condor_configure. Invoking
        condor_install
    
    is identical to running
        condor_configure --install=.
    

  • Added the option -prefix=dir to condor_configure and condor_install. This is an alias for -install-dir=dir.

  • Added the -backup option to condor_configure and condor_install. This option renames the target sbin directory, if the condor_master daemon exists in the target sbin directory. Previous versions of condor_configure did this by default.

  • Changed the default behavior of condor_install to exit with a warning if the target sbin directory exists, the condor_master daemon is in the sbin directory, and neither the -backup nor -overwrite options are specified. This prevents condor_install from improperly moving an sbin directory out of the way. For example,
        condor_install --prefix=/usr
    
    will not move /usr/sbin out of the way unless the -backup option is also specified.

  • Updated the usage summary of condor_configure and condor_install to be much more readable.

Configuration Variable Additions and Changes:

  • The new configuration variable DEAD_COLLECTOR_MAX_AVOIDANCE_TIME defines the maximum time in seconds that a daemon will fail over from a primary condor_collector to a secondary condor_collector. See section 3.3.3 for a complete definition.
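
As a rough illustration of the bounded failover that this variable controls, the Python sketch below models a daemon that avoids a failed primary collector only until the maximum avoidance time has elapsed. It is not Condor's implementation: the class CollectorChooser and its methods are hypothetical, and the value 3600 is only an example setting.

```python
MAX_AVOIDANCE_TIME = 3600  # seconds; stands in for DEAD_COLLECTOR_MAX_AVOIDANCE_TIME

class CollectorChooser:
    """Hypothetical model of time-limited failover between two collectors."""

    def __init__(self, max_avoidance=MAX_AVOIDANCE_TIME):
        self.max_avoidance = max_avoidance
        self.primary_failed_at = None  # time of the last observed primary failure

    def report_primary_failure(self, now):
        self.primary_failed_at = now

    def report_primary_success(self):
        self.primary_failed_at = None

    def choose(self, now):
        """Return which collector to contact at time `now` (in seconds)."""
        if self.primary_failed_at is None:
            return "primary"
        if now - self.primary_failed_at >= self.max_avoidance:
            return "primary"  # avoidance window has expired: retry the primary
        return "secondary"

chooser = CollectorChooser()
chooser.report_primary_failure(now=0)
print(chooser.choose(now=10))    # still inside the avoidance window
print(chooser.choose(now=3600))  # window has elapsed, so the primary is retried
```

The point of the cap is that a daemon never stays on the secondary collector indefinitely; once the window passes, the primary is tried again.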

Bugs Fixed:

  • Fixed a memory leak in the condor_procd daemon on Windows.

  • Fixed a problem that could cause Condor daemons to crash if a failure occurred when communicating with the condor_procd.

  • Fixed a couple of problems that were preventing the condor_startd from properly removing per-job directories when running with PrivSep.

  • The condor_startd will no longer fail to initialize, claiming the EXECUTE directory has improper permissions, when PrivSep is enabled.

  • Lookups of the ClassAd attribute CurrentTime are now case-insensitive, just like lookups of all other attributes.

  • Fixed problems causing the following error message in the log file:

    ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=1). Closing it now.
    

  • The existence of the executable given in the submit file is now enforced (when transferring the executable and not using VM universe).

  • The copy of condor_dagman that ships with Condor is now automatically added to the list of trusted programs in the Windows Firewall.

  • Removed remove_kill_sig from the submission file generated by condor_submit_dag on Windows.

  • Fixed an algorithm in the condor_negotiator daemon that, with large numbers of machine ClassAds (for example, 10,000), was causing long delays at the beginning of each negotiation cycle.

  • Use of MAX_CONCURRENT_UPLOADS was resulting in a connection attempt from the condor_shadow to the condor_schedd with a fixed 10 second timeout, which is sometimes too small. This timeout has been increased to be the same as other connection timeouts between the condor_shadow and the condor_schedd, and it now respects SHADOW_TIMEOUT_MULTIPLIER, so it can be adjusted if necessary.

  • Fixed a problem with MAX_CONCURRENT_UPLOADS and MAX_CONCURRENT_DOWNLOADS, which was sometimes allowing more than the configured number of concurrent transfers to happen.

  • Fixed a bug in the condor_schedd that could cause it to crash due to file descriptor exhaustion when trying to send messages to hundreds of condor_startds simultaneously.

  • Fixed a 6.9.4 bug in the condor_startd that would cause it to crash when a BOINC backfill job exited.

  • Since 6.9.4, when using glExec, configuring SLOT<N>_EXECUTE would cause condor_starter to fail when starting the job.

  • Fixed a bug from 6.9.5 which caused authentication failure for the pool password authentication method.

  • Fixed a bug that caused Condor daemons to crash when encountering some types of invalid ClassAd expressions.

  • Fixed a bug under Linux that could cause multi-process daemons lacking a log lock file to crash while rotating logs that have reached their maximum configured size.

  • Fixed a bug under Windows that sometimes caused connection attempts between Condor daemons to fail with Windows error number 10056.

  • Fixed a problem in pools that run multiple condor_collector daemons for fault tolerance. If the primary condor_collector failed, the condor_negotiator would fail over to the secondary condor_collector indefinitely (or until the secondary condor_collector also failed or the administrator ran condor_reconfig). This was a problem for users flocking jobs to the pool, because flocking currently only works with the primary condor_collector. Now, the condor_negotiator will fail over for a restricted amount of time, up to DEAD_COLLECTOR_MAX_AVOIDANCE_TIME seconds. The default is one hour, but if querying the dead primary condor_collector takes very little time to fail, the condor_negotiator may retry more frequently in order to remain responsive to flocked users.

  • Fixed a problem preventing the use of condor_q -analyze with the -pool option.

  • Fixed a problem in the condor_negotiator in which machines went unassigned when user priorities resulted in the machines getting split into shares that were rounded down to 0. For example, if there are 10 machines and 100 equal priority submitters, then each submitter was getting 0.1 machines, which got rounded down to 0, so no machines were assigned to anybody. The message in the condor_negotiator log in this case was this:

    Over submitter resource limit (0) ... only consider startd ranks
    

  • Fixed a problem introduced in 6.9.3 that would cause daemons to run out of file descriptors if they create sub-processes and are configured to use a lock file for the debug log.

  • Standard universe jobs now work properly when using PrivSep.

  • Fixed a problem with PrivSep mode where a job that dumps core would not get the core file transferred back to the submit host if the transfer_output_files submit option was used.

  • Fixed a bug that caused the condor_starter to crash if a job called condor_chirp with the get_job_attr option.
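
The share-rounding problem described in the condor_negotiator fix above comes down to simple arithmetic. The following Python sketch reproduces the numbers from that example (10 machines, 100 equal priority submitters). The function names are hypothetical, and the code does not reflect the negotiator's actual algorithm; it only shows why rounding each share down can leave every machine unassigned.

```python
import math

def fair_share(machines, submitters):
    """Exact per-submitter share for equal priority submitters."""
    return machines / submitters

def rounded_down_limit(machines, submitters):
    """Per-submitter limit when the share is rounded down to an integer."""
    return math.floor(fair_share(machines, submitters))

machines, submitters = 10, 100
limit = rounded_down_limit(machines, submitters)
print(fair_share(machines, submitters))  # 0.1
print(limit)                             # 0
print(limit * submitters)                # 0 machines handed out in total
```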

Known Bugs:

  • None.

Additions and Changes to the Manual:

  • None.

condor-admin@cs.wisc.edu