Printing delay

Tue Jul 3 16:07:45 PDT 2007

> > > So please, once more (some of them):
> > >
> > >    * How many clients altogether?
> >
> > We have 22 clients. Following is their cupsd.conf file:
> > LogLevel info
>
> I seem to remember that you had "LogLevel debug" on the *server*
> cupsd.conf. Is that right? If so, it serves to increase the load for the
> server unnecessarily... (Unless your server can "afford" it, I'd stick
> with LogLevel info until you really need to debug a printing problem.)
>
> > BrowseProtocols cups
> > BrowseInterval 3600
> > BrowsePoll cupsserver1
> > BrowsePoll cupsserver2
>
> This looks reasonable (polling only once per hour); but it still means,
> that the server will have to send 22x2800+ printer announcement packages
> per hour, which is about 17 per second (and which each contain the
> respective current status info for each printer, and which each need to
> be updated/verified internally by the server each time...)
>
> > BrowsePort 631
> > KeepAliveTimeout 3600
>
> I don't see a reason to "keep a persistent HTTP connection open after the
> last request" for an hour. Why not stick to the default, 60 seconds? Or
> even go down to 30? After all, you do have a lot of connections already
> happening (and a part of them will be "persistent HTTP", which you keep
> open unnecessarily).
>
> > Timeout 3600
>
> And why wait for a full hour "before an active HTTP or IPP request times
> out"? The default timeout is 300 seconds and should be OK...
>
> > BrowseTimeout 172800
>
> This is 48 hours... OK
>
> > ImplicitClasses Yes
> > Printcap /etc/opt/TWWfsw/cups124/printcap
> > PrintcapFormat BSD
> >
> >
> > >    * What is you typical concurrent jobs figure ("lpstat -o"), and
> > >      what is happing at peak printing hours?
> >
> > at 8:03am: lpstat -o |wc
> > 270
>
> Is that on one of the 22 clients, or is that on one of the 2 servers?
>
> > Yesterday, this one client processed 1773 print jobs
> > # grep -- "\[Job" errlog|grep "02/Jul"|wc
> >     1773   21236  141457
>
> And what was the same figure for each of the two servers?
>
>   [ BTW, I do not see how this command would give you the number of
>     print jobs; there are a lot of chances that "[job" is appearing
>     in the error_log without indicating successful completion, esp. if
>     you have LogLevel debug. Assuming "errlog" is meant to be
>     "error_log", I'd grep for "File 0 is complete" for an indication of
>     successfully completed jobs.
>
>     Anyways, the canonical file to look into for the number of jobs is
>     "page_log"....]
>
>
> > Peak hours occurs between 8am through 5pm daily. We have users login
> > to the app servers nationwide. I'm not the SAP admin so I don't really
> > know what they do.. :)
> >
> >
> > >    * Did you ever consider (and test) what I nicknamed "browse push"?
> >
> > We can only poll since the client and servers are on different
> > networks.
>
> Ah, OK. (So *each* client is on a different network from each other?)
>
> BrowsePolling by 22 clients is 22x the load for the CUPS server if you
> compare it to normal browsing.
>
> If you really want to explore *each* possibility to reduce that load,
> you should seriously consider to enable "cross gateway/router broadcasting"
>
> Broadcasts by default are blocked by routers and gateways. But I know
> there *IS* a way to let them pass through (don't ask me for specifics --
> I'm not a low-level TCP/IP packet scrutinizing guy, and it depends on
> your hardware used, etc.)....
>
> In your seat, I'd talk to my networking guys if it was important for the
> printing performance. They are able to do this, and it does not load more
> network traffic overall upon their pipes -- quite the contrary: one packet
> per printer addressing everybody instead of 22 different client requests
> triggering one packet per printer for each request...)
>
> They just need to be carefull when they set it up. (Configuration mistakes
> in the network setup can cause "broadcast storms", which is one reason why
> a general enabling of cross-gateway broacasts is frowned upon even in
> neighbouring LANs...)
>
> [...]
> > At the moment we have 22 clients polling and they poll every hour.
> >
> > I'm not sure what you mean by mixing BrowsePoll and "browse push" but
> > I'm guessing you mean a BrowsePoll on the clients and Browsing on the
> > servers.
>
> Yes. As I said in a previous post, I used it as a short way to mean
> "'Browsing On' + 'BrowseAddress $something' on a CUPS server causing it
> to actively send out broadcasts announing its available printqueues"
>
> > Well, my understanding is that the servers has to be set to
> > Browsing on, otherwise, clients who BrowsePoll will not be able to
> > retrieve updates.
>
> I'm not even sure about the "Browsing On" requirement on the server if
> all the clients do BrowsePoll. But the "BrowseAddress" surely isn't
> required, and causes to send out broadcasts (on top of responding to
> each BrowsePoll request). "Browsing On" with no "BrowseAddress" line
> does not cause broadcast announcements.
>
> > > Your 11 seconds very likely are not "processing time" (after which the
> > > job is completed), but "response time" (after which the job is fully
> > > submitted).
> >
> > So, on a client, when I submit a job, cupsd grabs that request and
> > sends it to the CUPS servers. It is that process of cupsd processing
> > job requests taking longer; is that a "response time" from the cupsd
> > then? If so, what can be done to improve the response time?
> >
> > job process flow:
> >
> > application(or lp/lpr) --(11sec)> cupsd --(1-2sec)-> cupsservers --(1-2sec)> printer
>
> I would not assume it is your *local* cupsd on the client that takes 11
> seconds to respond. Try to test this, on a client, you can try a few
> different commands for testing. For a given $printername, do f.e. this:
>
>   timex lpstat -h localhost -p printername
>   timex lpstat -h cupssserver1 -p printername
>
>   timex lpstat -h localhost -o printername
>   timex lpstat -h cupssserver1 -o printername
>
>   timex lpstat -h localhost -v printername
>   timex lpstat -h cupssserver1 -v printername
>
>   timex lpstat -h localhost -v
>   timex lpstat -h cupssserver1 -v
>
>
> Cheers,
> Kurt
>
> --
> Kurt Pfeifle

Kurt, thanks for your responses and explanations. I'll certainly look
into them for further tuning.

However, I seem to have narrowed down the root cause of the delay. It
turned out to be an open file descriptor shortage, as I have initially
thought it was.

Here's some before and after ulimit changes(clients only):

># timex lpstat -p|wc
real 21.32    <-- point of interest - long response time
user 3.54
sys  0.01

    2683   32196  177126

># timex lpr -Pabnull access_log
real 11.53    <-- point of interest - long response time
user 2.26
sys  0.05

Now, with the CUPS file descriptor increased, we have better performance:

># timex lpstat -p|wc
real 7.94  <--- much better
user 3.67
sys  0.08

    2683   32196  177126

># timex lpr -Pabnull access_log
real 4.49  <-- much, much, better; though could be lower
user 2.25
sys  0.07

Again, thanks for your feedback; certainly appreciated.

Thanks,
Angel