Web page compression is not a new technology, but it has just recently gained higher recognition in the minds of IT administrators and managers because of the rapid ROI it generates. Compression extensions exist for most of the major Web server platforms, but in this article I will focus on the Apache and mod_gzip solution.
The idea behind GZIP-encoding documents is very straightforward. Take a file that is to be transmitted to a Web client, and send a compressed version of the data, rather than the raw file as it exists on the filesystem. Depending on the size of the file, the compressed version can run anywhere from 50% to 20% of the original file size.
In Apache, this can be achieved using a couple of different methods. Content Negotiation, which requires that two separate sets of HTML files be generated — one for clients that can handle GZIP-encoding, and one for those who can’t — is one method. The problem with this solution should be readily apparent: there is no provision in this methodology for GZIP-encoding dynamically-generated pages.
The more graceful solution for administrators who want to add GZIP-encoding to Apache is the use of mod_gzip. I consider it one of the overlooked gems for designing a high-performance Web server. Using this module, configured file types — based on file extension or MIME type — will be compressed using GZIP-encoding after they have been processed by all of Apache’s other modules, and before they are sent to the client. The compressed data that is generated reduces the number of bytes transferred to the client, without any loss in the structure or content of the original, uncompressed document.
mod_gzip can be compiled into Apache as either a static or dynamic module; I have chosen to compile it as a dynamic module in my own server. The advantage of using mod_gzip is that this method requires that nothing be done on the client side to make it work. All current browsers — Mozilla, Opera, and even Internet Explorer — understand and can process GZIP-encoded text content.
On the server side, all the server or site administrator has to do is compile the module, edit the appropriate configuration directives that were added to the httpd.conf file, enable the module in the httpd.conf file, and restart the server. In less than 10 minutes, you can be serving static and dynamic content using GZIP-encoding without the need to maintain multiple codebases for clients that can or cannot accept GZIP-encoded documents.
When a request is received from a client, Apache determines if mod_gzip should be invoked by noting if the “Accept-Encoding: gzip” HTTP request header has been sent by the client. If the client sends the header, mod_gzip will automatically compress the output of all configured file types when sending them to the client.
This client header announces to Apache that the client will understand files that have been GZIP-encoded. mod_gzip then processes the outgoing content and includes the following server response headers.
Content-Type: text/html Content-Encoding: gzip
These server response headers announce that the content returned from the server is GZIP-encoded, but that when the content is expanded by the client application, it should be treated as a standard HTML file. Not only is this successful for static HTML files, but this can be applied to pages that contain dynamic elements, such as those produced by Server-Side Includes (SSI), PHP, and other dynamic page generation methods. You can also use it to compress your Cascading Stylesheets (CSS) and plain text files. As well, a whole range of application file types can be compressed and sent to clients. My httpd.conf file sets the following configuration for the file types handled by mod_gzip:
This allows Microsoft Office and Postscript files to be GZIP-encoded, while not affecting PDF files. PDF files should not be GZIP-encoded, as they are already compressed in their native format, and compressing them leads to issues when attempting to display the files in Adobe Acrobat Reader. For the paranoid system administrator, you may want to explicitly exclude PDF files.
mod_gzip_item_exclude mime ^application/pdf$
Another side-note is that nothing needs to be done to allow the GZIP-encoding of OpenOffice (and presumably, StarOffice) documents. Their MIME-type is already set to text-plain, allowing them to be covered by one of the default rules.
How beneficial is sending GZIP-encoded content? In some simple tests I ran on my Web server using WGET, GZIP-encoded documents showed that even on a small Web server, there is the potential to produce a substantial savings in bandwidth usage.
|http://www.pierzchala.com/bio.html||Uncompressed File Size: 3122 bytes|
|http://www.pierzchala.com/bio.html||Compressed File Size: 1578 bytes|
|http://www.pierzchala.com/compress/homepage2.html||Uncompressed File Size: 56279 bytes|
|http://www.pierzchala.com/compress/homepage2.html||Compressed File Size: 16286 bytes|
Server administrators may be concerned that mod_gzip will place a heavy burden on their systems as files are compressed on the fly. I argue against that, pointing out that this does not seem to concern the administrators of Slashdot, one of the busiest Web servers on the Internet, who use mod_gzip in their very high-traffic environment.
 From http://www.15seconds.com/issue/020314.htm:
“Both Internet Explorer 5.5 and Internet Explorer 6.0 have a bug with decompression that affects some users. This bug is documented in: the Microsoft knowledge Base articles, Q312496 is for IE 6.0 â€¦ , the Q313712 is for IE 5.5. Basically Internet Explorer doesn’t decompress the response before it sends it to plug-ins like Adobe Photoshop.”