Comparing performance of SCGI and uwsgi protocols

I’ve been playing with httpd and nginx using different protocols to route to a Python application, and one of the questions that has arisen is whether or not I should try uwsgi (the protocol, not the uWSGI application) with httpd. The third-party module mod_proxy_uwsgi needs some debugging first, so it isn’t as simple as modifying one of my existing .conf snippets and seeing how fast it goes.

For the purposes of a reverse proxy behind httpd or nginx, uwsgi is essentially SCGI with a more efficient encoding of the lengths of the parameters passed over: each string is preceded by a binary length, instead of being sent bare and then having to be traversed to find the terminating binary zero. I don’t think the work in the web server is significantly different either way, though I think SCGI is slightly cheaper for the web server: when copying over HTTP headers and other data it does a strlen() on the variable name and the value, then a memcpy() for each, including the terminating ‘\0’ so that the backend knows the extent. uwsgi requires copying both strings (without the ‘\0’) and building a two-byte binary length for each in a particular byte order (i.e., shifts and masks). This extra work for uwsgi more than offsets what it saves on a couple of other lengths, for which SCGI has to build a printable string.
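
To make the difference concrete, here is a rough sketch of what appending a single name/value pair looks like under each protocol. This is not the actual mod_proxy_scgi or mod_proxy_uwsgi code; the function names are made up and the caller is assumed to have already sized the buffer.

#include <string.h>

/* SCGI: copy the name and the value verbatim, including each
 * terminating '\0', so the backend can find the end of each string. */
static char *scgi_add_param(char *cp, const char *name, const char *value)
{
    size_t len;

    len = strlen(name);
    memcpy(cp, name, len + 1);            /* '\0' included */
    cp += len + 1;

    len = strlen(value);
    memcpy(cp, value, len + 1);
    cp += len + 1;

    return cp;
}

/* uwsgi: precede each string with a two-byte little-endian length
 * (shifts and masks) and omit the '\0'. */
static char *uwsgi_add_param(char *cp, const char *name, const char *value)
{
    size_t len;

    len = strlen(name);
    *cp++ = (char)(len & 0xff);
    *cp++ = (char)((len >> 8) & 0xff);
    memcpy(cp, name, len);                /* no '\0' */
    cp += len;

    len = strlen(value);
    *cp++ = (char)(len & 0xff);
    *cp++ = (char)((len >> 8) & 0xff);
    memcpy(cp, value, len);
    cp += len;

    return cp;
}

Note that the uwsgi side still needs a strlen() per string to know how much to copy, so the web server doesn't avoid that work; the binary lengths only spare the backend from scanning for the terminating zeros.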

So how can I approximate the speed-up without actually debugging the uwsgi support for httpd? First, it seems worthwhile to look at nginx. Here are some numbers where each request is a POST with a 71,453-byte body, sent to a very simple WSGI application running under uWSGI that simply echoes it back. (8 runs with ab, 10,000 requests each, concurrency 100, throwing out the high and low runs.)

Scenario                             Requests/sec
nginx, SCGI over a Unix socket              4,332
nginx, uwsgi over a Unix socket             4,363

So over a lot of runs and a lot of requests per run, uwsgi overall (web server processing + uWSGI processing) has an edge of a little less than one percent (4,363 vs. 4,332 requests/sec, about 0.7 percent).

I came up with a lame experiment with mod_proxy_scgi to see how an improvement in the efficiency of dealing with the parameters might help. In the following patch, I simply skip the second pass over the parameters in mod_proxy_scgi (the one that touches and copies them into the header block) by saving the block built for the first request and replaying it for later requests. Of course this only works if all requests are exactly the same, as in my little benchmark :)

--- mod_proxy_scgi.c 2014-05-05 13:21:39.253193636 -0400
+++ no-header-building 2014-05-05 13:15:33.785184159 -0400
@@ -247,6 +247,8 @@
     return OK;
 }

+static char *saved_headers;
+static apr_size_t saved_headers_size;

 /*
  * Send SCGI header block
@@ -292,6 +294,11 @@
     ns_len = apr_psprintf(r->pool, "%" APR_SIZE_T_FMT ":", headerlen);
     len = strlen(ns_len);
     headerlen += len + 1; /* 1 == , */
+
+    if (getenv("foo") && saved_headers) {
+        return sendall(conn, saved_headers, saved_headers_size, r);
+    }
+
     cp = buf = apr_palloc(r->pool, headerlen);
     memcpy(cp, ns_len, len);
     cp += len;
@@ -320,6 +327,15 @@
     }
     *cp++ = ',';

+    if (getenv("foo") && !saved_headers) {
+        char *tmp = malloc(headerlen);
+        memcpy(tmp, buf, headerlen);
+        saved_headers_size = headerlen;
+        saved_headers = tmp;
+        ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, r,
+                      "Saved headers...");
+    }
+
     return sendall(conn, buf, headerlen, r);
 }

It doesn’t exactly match the extra work uWSGI has to do on its side when the protocol is SCGI, but it does knock off some processing in the web server.

Scenario                                             Requests/sec
httpd, SCGI over a Unix socket, no optimization             9,997
httpd, SCGI over a Unix socket, optimization               10,014

(Again, these are the averages of multiple runs after throwing out highs and lows.)

Hmmm… Even getting rid of a lot of strlen() and memcpy() calls (my rough attempt to trade cycles in httpd for the cycles that would have been saved in uWSGI if we used uwsgi) resulted in an improvement of much less than one percent (about 0.2 percent here). I think I’ll stick with SCGI for now, and I don’t even think it is worthwhile to change httpd’s SCGI implementation to build the header in a single pass, which would get back only some of the cycles saved by the benchmark-specific optimization shown above. (And I don’t think httpd is suffering by not having a bundled, reliable implementation of uwsgi.)
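
For what it's worth, building the header in a single pass would look roughly like the sketch below: append each "name\0value\0" pair once into a growable buffer that reserves some room at the front, then splice the printable "length:" prefix in just ahead of the payload. This is only a standalone illustration with made-up names, not how mod_proxy_scgi (which works in terms of APR pools and tables) is actually structured.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PREFIX_ROOM 24              /* plenty for "digits:" */

struct hdr_buf {                    /* zero-initialize before first use */
    char *base;                     /* allocation */
    size_t cap;                     /* bytes allocated */
    size_t used;                    /* payload bytes after the prefix area */
};

/* Append one string, keeping its terminating '\0'; grow the buffer as
 * needed.  Each CGI variable is visited exactly once: one call for the
 * name, one for the value. */
static int hdr_append(struct hdr_buf *hb, const char *s)
{
    size_t len = strlen(s) + 1;

    if (PREFIX_ROOM + hb->used + len + 1 > hb->cap) {   /* +1 for ',' */
        size_t newcap = hb->cap * 2 + len + 64;
        char *p = realloc(hb->base, newcap);
        if (!p)
            return -1;
        hb->base = p;
        hb->cap = newcap;
    }
    memcpy(hb->base + PREFIX_ROOM + hb->used, s, len);
    hb->used += len;
    return 0;
}

/* Write "length:" immediately before the payload and ',' after it;
 * return a pointer to the start of the complete netstring.  Assumes at
 * least one successful hdr_append() call. */
static char *hdr_finish(struct hdr_buf *hb, size_t *out_len)
{
    char prefix[PREFIX_ROOM];
    int plen = snprintf(prefix, sizeof(prefix), "%lu:",
                        (unsigned long)hb->used);
    char *start = hb->base + PREFIX_ROOM - plen;

    memcpy(start, prefix, (size_t)plen);
    hb->base[PREFIX_ROOM + hb->used] = ',';
    *out_len = (size_t)plen + hb->used + 1;
    return start;
}

Even so, the numbers above suggest the payoff would be too small to bother with.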