Copyright © 1999-2000 Tarma Software Research Ltd.
All rights reserved.
Please read the license terms below.
Cleans and shrinks HTML source files by removing superfluous white space, tags, and quotes. Typically, this reduces file sizes by 15-20%, thereby reducing download times and improving the responsiveness of your web pages.
Optionally, HCLEAN also removes images, scripts, styles, and active content. Wildcard file specifications and (recursive) subdirectory processing are supported. In combination with the ability to place the output files in a different directory tree altogether and only process files that are new or have changed, this allows quick and convenient processing of entire web sites.
HCLEAN [options] file_or_directory_names...
Processes one or more input files or directories (specified with any mixture of fully qualified and wildcard names) according to the options. The output files are placed in a separate output directory; by default, this directory is called '.output'. If one or more filenames are given, these are processed (optionally recursing into subdirectories to search for further matches); if one or more directory names are given, all files in these directories are processed. In all cases, processing is subject to the restrictions noted under File processing.
HCLEAN logs its actions to a file '.hclean.log' in the directory from which it was started. A summary is also displayed on the console while HCLEAN is running. The logfile contains details about the way HCLEAN treated each file (including the reason that files are excluded), warnings and fixes it applies, and diagnostic messages.
HCLEAN processes each file according to its name, type (which is inferred from the file name extension) and attributes. It uses the following rules (in the order given):
As a result, only files in category (4) are subject to HTML clean up and checking; all other files are either ignored or copied without change. Directories are treated similarly:
Unless otherwise noted, options are not case sensitive. They may be introduced by either a '/' or a '-' character. Therefore, both:
HCLEAN /z *.htm
and:
HCLEAN -Z *.htm
are correct (and equivalent). Furthermore, multiple options may be combined as long as only the last option in a combination is a multi-letter one. For example, the following combinations are allowed and unambiguous:
HCLEAN /suz /o:\Deploy *.htm
HCLEAN /suo:\Deploy /z *.htm
whereas these are not:
HCLEAN /zus *.htm
HCLEAN /zo:\Deploy *.htm
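The combination rule can be illustrated with a small sketch. It is not HCLEAN's actual parser: it assumes, from the notation in the option list, that /S and /U are the only single-letter options and that /F, /K, /O, /R and /Z take everything that follows them as their argument. The function name split_options is hypothetical.

```python
# Assumption: inferred from the option list, not taken from HCLEAN's source.
SINGLE_LETTER = {"S", "U"}                # options without sub-letters
TAKES_REST = {"F", "K", "O", "R", "Z"}    # options that swallow the rest


def split_options(token):
    """Split one combined option such as '/suz' into (letter, argument) pairs."""
    if len(token) < 2 or token[0] not in "/-":
        raise ValueError("not an option: %r" % token)
    body = token[1:]
    result = []
    i = 0
    while i < len(body):
        letter = body[i].upper()          # option letters are not case sensitive
        if letter in SINGLE_LETTER:
            result.append((letter, ""))
            i += 1
        elif letter in TAKES_REST:
            # A multi-letter option consumes the remainder of the token,
            # which is why it must come last in a combination.
            result.append((letter, body[i + 1:]))
            break
        else:
            raise ValueError("unknown option letter: %r" % body[i])
    return result


if __name__ == "__main__":
    print(split_options("/suz"))   # S, U, then Z with no sub-letters
    print(split_options("/zus"))   # Z swallows 'us' as its sub-letters
```

Under these assumptions, '/zus' is read as a single /Z option with the sub-letters 'us' rather than as /Z, /U and /S, which is why such a combination does not do what it appears to.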
/Fx...
/Kx...
/O:dir
/R[y|n]
/S
/U
/Zx...
This is an early test version. Use with caution to prevent accidental data loss.
If no options are specified, HCLEAN processes the input files verbatim (and does not even squash white space).
Output files are given the same modified date and time as their input files, even though they are obviously modified later. This feature is included to make the /U (update only) option more accurate.
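To show why copying the timestamp helps the /U option, here is a minimal sketch of an update check. How HCLEAN actually decides that a file is "new or changed" is not documented here; the sketch assumes a plain comparison of modification times with a two-second tolerance (for file systems with coarse timestamps). The function names and the tolerance are hypothetical.

```python
import os
import shutil

TOLERANCE = 2.0  # seconds; assumption, allows for coarse file system timestamps


def needs_update(src, dst):
    """Return True if dst is missing or older than src."""
    if not os.path.exists(dst):
        return True
    return os.path.getmtime(src) - os.path.getmtime(dst) > TOLERANCE


def copy_with_source_time(src, dst):
    """Copy src to dst and give dst the modification time of src."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    shutil.copyfile(src, dst)
    st = os.stat(src)
    # Without this call, dst would carry the time of the copy and every
    # later comparison against src would be off by the processing delay.
    os.utime(dst, (st.st_atime, st.st_mtime))
```

Because the output file carries the input file's timestamp, a later run can tell that the input has changed simply by finding it newer than its output.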
Regardless of the /zb option, white space zapping does not occur inside comments, nor inside <PRE> sections.
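The following sketch illustrates the idea of protecting comments and <PRE> sections while squashing white space elsewhere. It is only an approximation: it assumes that "zapping" means collapsing each run of white space into a single space, which may differ from what HCLEAN really does, and the function name is hypothetical.

```python
import re

# Protected regions: <PRE>...</PRE> sections and <!-- ... --> comments.
PROTECTED = re.compile(r"(<pre\b.*?</pre\s*>|<!--.*?--\s*>)",
                       re.IGNORECASE | re.DOTALL)


def squash_white_space(html):
    """Collapse white space runs, except inside PRE sections and comments."""
    parts = PROTECTED.split(html)
    for i in range(0, len(parts), 2):
        # Even indexes hold text outside the protected regions;
        # odd indexes hold the protected regions and are left untouched.
        parts[i] = re.sub(r"\s+", " ", parts[i])
    return "".join(parts)
```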
Comment zapping (option /zc) removes both inline comments: <!-- comment text --> and block comments: <COMMENT>comment text</COMMENT>, but not inside SCRIPT or STYLE elements (unless those are zapped too). In some pathological situations, the output document without the comments may be different from what browsers would otherwise render. For example, the following comment causes trouble with a lot of browsers:
<!-- comment1 -- text > -- comment2 -- >
The comment really only ends with the second '>' character (after comment2), but most browsers get in trouble and stop rendering considerable amounts of text after the comment (Opera is a notable exception; it handles this situation correctly). HCLEAN does the right thing, but the resulting output may yield a different (but more correct) result than the original input.
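As an illustration, the sketch below removes comments using the termination rule described in the version history: a comment that starts with '<!--' ends at the first '--', optional white space, '>'. It is a simplified approximation in Python, not HCLEAN's implementation; in particular it does not skip SCRIPT or STYLE elements, and the function name is hypothetical.

```python
import re

# '<!--' up to the first '--' [white space] '>'
INLINE_COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)
# <COMMENT> ... </COMMENT> blocks, in any letter case
BLOCK_COMMENT = re.compile(r"<comment\b[^>]*>.*?</comment\s*>",
                           re.IGNORECASE | re.DOTALL)


def zap_comments(html):
    """Remove inline and block comments (SCRIPT/STYLE are not special-cased)."""
    html = INLINE_COMMENT.sub("", html)
    return BLOCK_COMMENT.sub("", html)
```

With this rule the troublesome comment shown above is removed as a whole, because the first '>' is not directly preceded by '--' and optional white space.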
By default, HCLEAN does not remove </TD> tags, although strictly speaking these are optional. The reason is that Netscape Navigator 3 (and possibly earlier versions) has a bug that causes it to collapse multiple cells into one if the </TD> tag is missing from nested tables. If you also specify /z+ (aggressive zapping), </TD> tags are removed.
Removes all image maps <MAP>...</MAP> and replaces all contained <AREA> specifications with links that use the AREA's ALT text and HREF. If either or both are absent, reasonable substitutes are chosen.
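The replacement of <AREA> specifications can be sketched as follows. This is not HCLEAN's code: the substitutes used when ALT or HREF is missing are not documented here, so the sketch assumes '#' for a missing HREF and the HREF itself for missing ALT text. The class and function names are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class AreaCollector(HTMLParser):
    """Collect one replacement link per AREA element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "area":          # the parser reports tag names in lower case
            return
        attributes = dict(attrs)
        href = attributes.get("href") or "#"    # assumed substitute
        text = attributes.get("alt") or href    # assumed substitute
        self.links.append('<A HREF="%s">%s</A>'
                          % (escape(href, quote=True), escape(text)))


def area_links(map_html):
    """Return the links that would replace the AREAs of an image map."""
    collector = AreaCollector()
    collector.feed(map_html)
    collector.close()
    return collector.links
```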
Style zapping (option /zy) currently leaves LINKs that refer to a style sheet intact.
Script zapping (option /zs) hasn't been tested at all yet.
HCLEAN can be used for many purposes. Below are a number of examples with hints about possible applications.
HCLEAN /zbceq *.htm
HCLEAN /z *.htm
HCLEAN /z /s /u /o:C:\Deploy *.htm
HCLEAN /suz /o:C:\Deploy *.htm
HCLEAN /suz /o:\\Enigma\pub\www *.htm
HCLEAN /zi /o:test *.htm
HCLEAN /zz index.htm
HCLEAN /zz+ index.htm
HCLEAN is developed by Tarma Software Research Ltd. (http://www.tarma.com).
Documentation and software are copyright © 1999-2000 Tarma Software Research Ltd. All rights reserved. You are hereby granted a license to use HCLEAN for your own personal and private use. You are allowed to redistribute the software and its documentation, but the redistribution should include the entire software program and the complete documentation, including all copyright notices and license terms.
The following are specifically not allowed under the terms of your license:
Please contact Tarma Software Research <sales@tarma.com> if you require different terms.
If you find HCLEAN useful, a link from your web pages to our web site at http://www.tarma.com would be appreciated. Send your comments, suggestions, etc. to support@tarma.com
Important: the HCLEAN software is provided "as-is". Tarma Software Research Ltd. makes no warranties, either express or implied, as to the quality or fitness for any particular purpose of the software. Tarma Software Research shall have no liability to you, or to any other person or entity for any direct, indirect, incidental, special, or consequential damages whatsoever, including, but not limited to the loss of revenue or profit, the loss of data, or other commercial or economic loss, even if Tarma Software Research has been advised of the possibility of such losses.
14 October 2000 - More internal reorganisations. HCLEAN now accepts a mixture of filenames and directory names. Use of an output directory is now mandatory (defaults to '.output'); removed /R (restore files) option accordingly. Added new rules for excluded file and directory names and attributes; revised the lists of excluded filename extensions and HTML-type extensions. File/directory enumeration has been made more efficient in preparation for an FTP-enabled edition. Wildcard matching has been improved. Various small changes to the console and logfile output format.
28 May 2000 - Added /Fx (fix external anchor links).
April-May 2000 - Internal reorganisations and small updates. Implementation uses new directory traversal classes in preparation for an FTP-compatible version. Files other than HTML source files (for example, image files) are now also processed by simply copying them if they are new or changed. Read-only output files are now silently overwritten by HTMLCleaner.
March 2000 - Added various /F (fix) options. Added separate log file output.
28 February 2000 - Changed output format slightly.
15 February 2000 - Added subdirectory exclusion for directories whose names start with '.'
9 December 1999 - Made loading of SHELL32.DLL dynamic to improve application start-up time.
1 December 1999 - Corrected bug that was introduced in 1.00.005, causing comments not to be zapped if scripts or styles were zapped.
Released 29 November 1999: Changed comment and directive processing to make it more robust. It will now terminate a comment (started with '<!--') after seeing any sequence '--' [white_space] '>', rather than taking the more strict HTML/SGML interpretation that comments are surrounded by pairs of '--'. Therefore, the following will now be seen as a properly terminated comment:
<!-- -- -->
although technically, it's a comment and a half and not yet terminated. The change was prompted by tools like Macromedia Fireworks that have a tendency to insert comments like:
<!------------ END OF JAVASCRIPT CODE ------------>
and all bets are off in those cases. The new approach doesn't choke on these comments. For general declarations (starting with '<!name', such as '<!DOCTYPE'), the end of the declaration is now considered to be the first non-quoted '>', regardless of comments appearing in the declaration. As a result, the following will now be seen as a properly terminated declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" -->
although technically, it ends with an open comment and is not yet terminated when the '>' is seen. This is a rare situation anyway.
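The "first non-quoted '>'" rule for declarations can be sketched as a simple scan. This is an illustration in Python under the assumption that both single and double quotes delimit quoted text; it is not taken from HCLEAN's source, and the function name is hypothetical.

```python
def declaration_end(text, start=0):
    """Return the index of the first '>' outside quotes, or -1 if none."""
    quote = None                      # the quote character we are inside, if any
    for i in range(start, len(text)):
        ch = text[i]
        if quote is not None:
            if ch == quote:           # closing quote found
                quote = None
        elif ch in "\"'":
            quote = ch                # opening quote found
        elif ch == ">":
            return i                  # comments inside the declaration are ignored
    return -1
```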
Released 26 November 1999: Implemented /U (update only), /S (process subdirectories), and /R (restore files).
Released 24 November 1999: Improved handling of comments (<!--...-->), declarations (<!name...> etc.), marked sections (<![name[...]]>) and processing instructions (<?...>). Made removal of </TD> tags optional (only under aggressive zapping) to cater for NN3 bug. Removed /Z+ from the /Zz option list (must now be specified separately). Added TITLE attribute as possible source for IMG replacement text (ALT is still preferred source). Added /Z- (non-aggressive zap), although I'm not sure that it serves any useful purpose.
Released 18 November 1999: Some small modification to the documentation (i.e., this file), plus smarter handling of white space within STYLE and SCRIPT blocks, resulting in space gains (read: smaller files). Also, /O:dir (use output directory) now creates the output path if it doesn't already exist.
Released 18 November 1999: First semi-public version. Implements core functionality and some embryonic forms of more advanced options.