A few days ago I was looking over some new code for an Apache httpd module, and I was afraid that the design would lead to daemons like the mod_cgid daemon inheriting the module’s pipe, inadvertently keeping the write end open, and thereby breaking part of what the module used the pipe for. That led to a desire to summarize Apache httpd file descriptors by which processes had them open. This morning I set out to write a script for that, but with too little sleep+caffeine I stared at a mostly-empty Emacs buffer long enough that I decided to set a timer for one hour to force the issue.
The timer went off and I was still wading through far too much output; the same process id was listed multiple times for a given descriptor, and I couldn’t find the reason in the code. I wasted a bit of time messing with the code but finally went back to a normal lsof display in the shell and discovered what was going on: when -g NNN is used to select by process group id, lsof displays the same process multiple times, as in this snippet from the repeated displays of all the fds for one of the httpd processes:
$ lsof -P -g 38239 -a -d ^txt,^rtd,^cwd,^mem,^DEL | grep '38243.*4u'
lsof: WARNING: compiled for FreeBSD release 9.0-RC2; this is 9.0-RELEASE.
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
httpd 38243 38239 trawick 4u IPv4 0xfffffe00271a57a0 0t0 TCP *:* (CLOSED)
I don’t see the same behavior on Linux. I ended up adding filtering to deal with the repetition. My script is now able to display a more manageable summary of “files” by process:
$ apfds.py 38239
fd 0 type VCHR name /dev/null
  38239 38240 38241 38242 38243
fd 1 type VCHR name /dev/null
  38239 38240 38241 38242 38243
fd 2 type VREG name /usr/home/trawick/inst/24-64/logs/error_log
  38239 38240 38241 38242 38243
fd 3 type IPv6 dev 0xfffffe002706c000 name *:8080
  38239 38241 38242 38243
fd 4 type IPv4 dev 0xfffffe00271a57a0 name *:*
  38239 38240 38241 38242 38243
fd 5 type IPv6 dev 0xfffffe0027099b70 name *:10080
  38239 38241 38242 38243
fd 6 type IPv4 dev 0xfffffe00271a73d0 name *:*
  38239 38240 38241 38242 38243
fd 7 type PIPE dev 0xfffffe0002729000 name ->0xfffffe0002729158
  38239 38240 38241 38242 38243
fd 8 type PIPE dev 0xfffffe0002729158 name ->0xfffffe0002729000
  38239 38240 38241 38242 38243
fd 9 type VREG name /usr/home/trawick/inst/24-64/logs/access_log
  38239 38240 38241 38242 38243
fd 10 type VREG name /usr/home/trawick/inst/24-64/logs/rewrite-map.38239
  38239 38241 38242 38243
fd 3 type unix dev 0xfffffe00273ad2a8 name /home/trawick/inst/24-64/logs/cgisock.38239
  38240
fd 5 name 0xfffffe0027171960 file struct, ty=0, op=0xffffffff81079180
  38240
fd 11 type VREG name /usr/home/trawick/inst/24-64/logs/rewrite-map.38239
  38241 38242 38243
fd 12 type KQUEUE dev 0xfffffe00131d4000 name count=0, state=0x2
  38241
fd 12 type KQUEUE dev 0xfffffe0018b9c600 name count=0, state=0x2
  38242
fd 12 type KQUEUE dev 0xfffffe002775e300 name count=0, state=0x2
  38243
(In some cases it may not be correct to require the fds to match in order for two files to be the same; OTOH, the heuristics might be unmanageable, and it may help to see the separate listings for distinct fds anyway.)
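For the curious, here is a rough sketch of the filter-then-group idea. It is not the actual apfds.py code. It assumes the lsof output has already been parsed into (pid, fd, type, dev, name) tuples, and the names summarize and format_summary are made up for illustration. Repeated records are dropped first, then pids are collected per "file", where two files count as the same only if fd, type, device, and name all match.

```python
from collections import defaultdict


def summarize(records):
    """Group records by file and list the pids that have each file open.

    `records` is an iterable of (pid, fd, ftype, dev, name) tuples; this
    shape is an assumption for the sketch, not lsof's own format.
    """
    seen = set()
    pids_by_file = defaultdict(list)
    for record in records:
        if record in seen:
            # Skip exact repeats, such as those produced by `lsof -g` on FreeBSD.
            continue
        seen.add(record)
        pid, fd, ftype, dev, name = record
        # The fd is part of the key, so the same file on different fds
        # is listed separately.
        pids_by_file[(fd, ftype, dev, name)].append(pid)
    return {key: sorted(pids) for key, pids in pids_by_file.items()}


def format_summary(summary):
    """Render one line per file followed by an indented line of pids."""
    lines = []
    for (fd, ftype, dev, name), pids in summary.items():
        parts = ["fd", str(fd)]
        if ftype:
            parts += ["type", ftype]
        if dev:
            parts += ["dev", dev]
        parts += ["name", name]
        lines.append(" ".join(parts))
        lines.append("  " + " ".join(str(pid) for pid in pids))
    return "\n".join(lines)
```

Keeping the fd in the grouping key is the simple choice described above; dropping it from the key would merge the same file opened on different descriptors, at the cost of the heuristics getting murkier.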
Later
More FreeBSD fun: Given the Mac OS X issue, I reimplemented the lsof group selection to use -p pid1,pid2,pid3, with the list built internally via ps. But that doesn’t work at all on FreeBSD. Instead, it lists files for only the last pid in the list and exits with status 1. So the latest version of the script uses -g pgid (along with code to filter out the over-reporting) on FreeBSD and -p pid1,pid2,pid3 elsewhere. (I dare not try it on Solaris today.)
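The per-platform selection could look roughly like the sketch below. Again, this is an illustration and not the script's real code: the function names are invented, and it assumes that `ps -A -o pid= -o pgid=` prints one "pid pgid" pair per line with no header and that platform.system() reports "FreeBSD" on that system.

```python
import platform
import subprocess


def pids_in_group(ps_output, pgid):
    """Pick out the pids whose process group id is `pgid`.

    `ps_output` is expected to be the text printed by
    `ps -A -o pid= -o pgid=` (two whitespace-separated columns).
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == str(pgid):
            pids.append(int(fields[0]))
    return pids


def lsof_selection_args(system, pgid, pids):
    """Choose lsof's selection option for the given platform name."""
    if system == "FreeBSD":
        # -p pid1,pid2,... misbehaved here, so select by process group
        # and filter the repeated records afterwards.
        return ["-g", str(pgid)]
    if not pids:
        raise ValueError("no processes found in process group %s" % pgid)
    return ["-p", ",".join(str(pid) for pid in pids)]


def lsof_command(pgid):
    """Build the lsof argument list for the current platform."""
    ps_output = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "pgid="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = pids_in_group(ps_output, pgid)
    return ["lsof", "-P"] + lsof_selection_args(platform.system(), pgid, pids)
```

Filtering the ps output in Python sidesteps the differences between ps implementations in how they select by process group.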
Later still…
I built lsof for Solaris 10 and didn’t see any glitches with -pLIST or -gPGID. Also, I was able to create a fix for the FreeBSD glitches and send it to the lsof author.