[cups.development] [RFE] STR #3591: Socket backend will loop "forever"

Thu May 27 14:04:19 PDT 2010

DO NOT REPLY TO THIS MESSAGE.  INSTEAD, POST ANY RESPONSES TO THE LINK BELOW.

[STR New]

First, we use RHEL as our base, so we're running the latest CUPS release
available for RHEL 5.5. However, this behavior appears to exist in
version 1.4.3 as well (although not confirmed). I have filed this bug
against SW version 1.3.8, but that may be the wrong place, and I
apologize if it is.

In our environment, we've identified a particular issue with regard to
printing to HP printers, or any printer that uses the socket backend.
When printing from our application in the US to a printer located on a
network in Asia or similar, we frequently hit situations where a printer
becomes non-responsive. The socket backend attempts to connect to the
dead printer, and will continue to do so until the printer comes back
online.

Human behavior, however, is at the root of this bug. It turns out that
when the 5:00 PM bell rings in Asia, they turn off their printers and
print servers and go home. When this happens, socket.c identifies that
the host is down, and then loops with errors like this in error_log:

W [26/May/2010:15:48:21 +0000] [Job 2329127] recoverable: Network host
'172.16.56.27' is busy; will retry in 30 seconds...

This loop continues until the printer becomes responsive again. There
are situations where we want to keep retrying the job (e.g., the printer
is only down for a few minutes and comes back online on its own), but
sometimes we just want to give up, in particular when an entire office is
closed for the weekend (or longer) and CPU utilization goes up from this
loop.

Therefore, I'd recommend an enhancement which adds two configuration
settings for this:

1) Connection backoff time:
   - Rather than delaying for the hard-coded 30 seconds, allow for a
     user-controlled delay period (e.g. 90 seconds), which will
     help with some of the high CPU usage we're getting on the servers
2) Connection give-up time:
   - After a certain number of retries on a recoverable failure, just 
     give up and assume the printer won't come back again and disable
     the printer
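
As a rough sketch of how the two settings above could fit together (this
is not the actual socket.c code, and neither setting exists in CUPS
today), the retry loop could take its delay and retry limit from a small
policy struct. The names retry_policy_t, connect_with_policy, and the
fake_* stand-ins are made up for illustration; the real backend would
call connect() and sleep() in their place.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical policy: neither field is an existing CUPS setting. */
typedef struct
{
  int retry_delay;  /* seconds between attempts (hard-coded to 30 today) */
  int max_retries;  /* give up after this many failures; 0 = retry forever */
} retry_policy_t;

typedef int  (*connect_cb_t)(void *ctx);  /* returns 0 on success */
typedef void (*sleep_cb_t)(int seconds);

/*
 * Try to connect until it succeeds or the retry limit is reached.
 * Returns 0 once connected, -1 after max_retries failed attempts.
 */
static int
connect_with_policy(const retry_policy_t *policy,
                    connect_cb_t         try_connect,
                    void                 *ctx,
                    sleep_cb_t           do_sleep)
{
  int failures = 0;

  assert(policy != NULL && try_connect != NULL && do_sleep != NULL);

  while (try_connect(ctx) != 0)
  {
    failures ++;

    if (policy->max_retries > 0 && failures >= policy->max_retries)
    {
      fprintf(stderr, "ERROR: giving up after %d failed attempts\n",
              failures);
      return (-1);
    }

    fprintf(stderr, "WARNING: host is busy; will retry in %d seconds...\n",
            policy->retry_delay);
    do_sleep(policy->retry_delay);
  }

  return (0);
}

/* Stand-ins so the sketch can be exercised without a network. */
typedef struct
{
  int failures_left;  /* attempts that fail before the host "comes back" */
  int attempts;       /* total connection attempts made */
} fake_host_t;

static int
fake_connect(void *ctx)
{
  fake_host_t *host = (fake_host_t *)ctx;

  host->attempts ++;

  if (host->failures_left > 0)
  {
    host->failures_left --;
    return (-1);
  }

  return (0);
}

static int seconds_slept = 0;

static void
fake_sleep(int seconds)
{
  seconds_slept += seconds;  /* record the delay instead of waiting */
}
```

With a policy of { 90, 3 }, a host that never comes back is abandoned on
the third failed attempt, at which point the caller could disable the
printer. Setting max_retries to 0 keeps the current retry-forever
behavior.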

The relevant loop in backend/socket.c appears to start at line 276 of this
version: 
 * "$Id: socket.c 8896 2009-11-20 01:27:57Z mike $"

In our environment, we have a little over 300 printers configured in our
US application stack, and that number is expected to double by the end of
the year. Even with this quad Intel Xeon 3.0 GHz server with 4 GB of RAM,
we get hit by high CPU usage from this module, and we're forced to take
invasive action by killing the socket process via a cron job. Although
this generally works OK, it would still be ideal if the backend itself
could identify and recover from this situation normally.

Link: http://www.cups.org/str.php?L3591
Version:  -feature