HCLEAN - an HTML cleaner

Copyright © 1999-2000 Tarma Software Research Ltd. All rights reserved.
Please read the license terms below.

Cleans and shrinks HTML source files by removing superfluous white space, tags, and quotes. Typically, this reduces file sizes by 15-20%, thereby reducing download times and improving the responsiveness of your web pages.

Optionally, HCLEAN also removes images, scripts, styles, and active content. Wildcard file specifications and (recursive) subdirectory processing are supported. In combination with the ability to place the output files in a different directory tree altogether and only process files that are new or have changed, this allows quick and convenient processing of entire web sites.


Synopsis

HCLEAN [options] file_or_directory_names...

Processes one or more input files or directories (specified with any mixture of fully qualified and wildcard names) according to the options. The output files are placed in a separate output directory; by default, this directory is called '.output'. If one or more filenames are given, these are processed (optionally recursing into subdirectories to search for further matches); if one or more directory names are given, all files in these directories are processed. In all cases, processing is subject to the restrictions noted under File processing.

HCLEAN logs its actions to a file '.hclean.log' in the directory from which it was started. A summary is also displayed on the console while HCLEAN is running. The logfile contains details about the way HCLEAN treated each file (including the reason that files are excluded), warnings and fixes it applies, and diagnostic messages.


File processing

HCLEAN processes each file according to its name, type (which is inferred from the file name extension) and attributes. It uses the following rules (in the order given):

  1. File names that start with '.' (a period) are ignored. This is similar to the Unix convention for hidden files and provides a convenient way to exclude files from processing, although these files are still shown in directory listings on Win32 systems.
  2. Files that have their Hidden attribute set are also ignored. These files are typically not shown in directory listings.
  3. Files with the following extensions are ignored: .$$$ .bak .bat .btm .dwt .idb .ilk .lbi .jbf .lnk .obj .pch .pdb .psp .res .tmp .val  These extensions are typically used by the system or by applications for scratch or other special purposes, and it is normally undesirable to process or copy them within the context of a website.
  4. Files with the following extensions are processed as HTML source files and treated accordingly: .asp .cfm .cfml .css .htm .html .jhtml .js .php .shtm .shtml .xml
  5. All other files are copied verbatim to the output directory.

As a result, only files in category (4) are subject to HTML clean up and checking; all other files are either ignored or copied without change. Directories are treated similarly:

  1. Directory names that start with '.' (a period) are ignored. This is similar to the Unix convention for hidden files and provides a convenient way to exclude directories from processing, although these directories are still shown in directory listings on Win32 systems.

    Note: The directories '.' and '..' (aliases for the current and parent directory, respectively) are processed if passed in on the command line.
     
  2. Directories that have their Hidden attribute set are also ignored. These directories are typically not shown in directory listings.
  3. All other directories are processed normally, either by applying the current file search mask (if one was given as a wildcard specification on the command line) or by enumerating all files in the directory.
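
The file rules above can be sketched in a few lines of Python. This is only an illustration of the documented order of checks, not HCLEAN's actual implementation: the function name classify_file, the 'hidden' parameter (standing in for the Win32 Hidden attribute), and the case-insensitive extension comparison are all assumptions.

```python
import ntpath  # Windows-style path handling that works on any platform

# Extension lists copied from rules 3 and 4 above.
IGNORED_EXTENSIONS = {
    ".$$$", ".bak", ".bat", ".btm", ".dwt", ".idb", ".ilk", ".lbi", ".jbf",
    ".lnk", ".obj", ".pch", ".pdb", ".psp", ".res", ".tmp", ".val",
}
HTML_EXTENSIONS = {
    ".asp", ".cfm", ".cfml", ".css", ".htm", ".html", ".jhtml", ".js",
    ".php", ".shtm", ".shtml", ".xml",
}


def classify_file(name, hidden=False):
    """Return 'ignore', 'html' or 'copy' for a file name (hypothetical helper)."""
    base = ntpath.basename(name)
    if base.startswith("."):           # rule 1: leading period
        return "ignore"
    if hidden:                         # rule 2: Hidden attribute set
        return "ignore"
    # Assumption: extensions are compared without regard to case.
    extension = ntpath.splitext(base)[1].lower()
    if extension in IGNORED_EXTENSIONS:    # rule 3: scratch/special files
        return "ignore"
    if extension in HTML_EXTENSIONS:       # rule 4: HTML source files
        return "html"
    return "copy"                          # rule 5: copied verbatim
```

For instance, classify_file("index.htm") yields 'html', classify_file("logo.gif") yields 'copy', and classify_file(".notes.htm") yields 'ignore'.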


Options

Unless otherwise noted, options are not case sensitive. They may be introduced by either a '/' or a '-' character. Therefore, both:

    HCLEAN /z *.htm

and:

    HCLEAN -Z *.htm

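are correct (and equivalent). Furthermore, multiple options may be combined behind a single '/' or '-' as long as only the last option in the combination is a multi-letter one (that is, an option that is followed by further letters or by an argument, such as /Z or /O:dir). For example, the following combinations are allowed and unambiguous: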

    HCLEAN /suz /o:\Deploy *.htm
    HCLEAN /suo:\Deploy /z *.htm

whereas these are not:

    HCLEAN /zus *.htm
    HCLEAN /zo:\Deploy *.htm
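
One way to read the combination rule is sketched below in Python. It is an assumption-laden illustration, not HCLEAN's parser: the function name parse_option is hypothetical, the obsolete /R option is left out, and the sets of valid sub-letters are taken from the option list that follows. Single-letter options (/S, /U) are consumed one at a time; the first multi-letter option (/F, /K, /Z, /O:dir) claims the rest of the argument.

```python
SIMPLE_OPTIONS = set("su")             # options without sub-letters
SUB_LETTERS = {                        # options followed by sub-letters
    "f": set("axz"),
    "k": set("lz"),
    "z": set("+abcefinqsyz"),
}


def parse_option(argument):
    """Split one command-line argument into (option, value) pairs."""
    if len(argument) < 2 or argument[0] not in "/-":
        raise ValueError(f"not an option: {argument!r}")
    body = argument[1:]
    parsed = []
    index = 0
    while index < len(body):
        letter = body[index].lower()   # option letters are not case sensitive
        rest = body[index + 1:]
        if letter in SIMPLE_OPTIONS:
            parsed.append((letter, ""))
            index += 1
        elif letter == "o":
            # /O:dir takes the remainder as the directory, case preserved.
            if not rest.startswith(":") or rest == ":":
                raise ValueError("/O requires ':dir'")
            parsed.append(("o", rest[1:]))
            break
        elif letter in SUB_LETTERS:
            # A multi-letter option consumes everything that follows it.
            subs = rest.lower()
            invalid = [c for c in subs if c not in SUB_LETTERS[letter]]
            if invalid:
                raise ValueError(f"invalid letters after /{letter}: {invalid}")
            parsed.append((letter, subs))
            break
        else:
            raise ValueError(f"unknown option letter: {letter!r}")
    return parsed
```

Under these assumptions, '/suz' splits into the three options s, u and z, whereas '/zus' is rejected because 'u' is not a valid zap letter.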

Available options

/Fx...
    Fix 'x', where 'x' can be one or more of the following:
        a - missing ALT attributes in <IMG> tags
        x - missing TARGET attributes in <A> tags with an external HREF
        z - all of the above

    Note: /F options only work if at least one /Z option is also specified.

/Kx...
    Keep 'x', where 'x' can be one or more of the following:
        l - line breaks
        z - all of the above

/O:dir
    Use 'dir' as the directory in which to place output files. If necessary, this directory is created by HCLEAN. If subdirectories are also processed (option /S), then a matching subdirectory tree is built under 'dir'. If this option is not used, HCLEAN uses '.output' as the default output directory.

    Warning: HCLEAN does not check if the given output directory makes sense. For example, if you specify /O:. (to use the current directory), the results are unpredictable and may lead to data loss. You should make sure that the output directory does not interfere with HCLEAN's processing. Safe choices are any directory name that starts with a '.' (like '.output', but not a single '.' as this refers to the current directory), or any directory tree that does not intersect with the tree that is currently being processed.

/R[y|n]
    Obsolete. This option was useful in previous versions of HCLEAN and is still accepted in the current version for backward compatibility, but has no effect any more. In the current version of HCLEAN, output is always directed to a separate output directory (but see the warning under the /O:dir option).

/S
    Process subdirectories. After processing files relative to the current directory, HCLEAN will traverse all subdirectories and process matching files there as well.

    Excluded directories: as a special feature, HCLEAN will skip directories whose names start with '.' (similar to the UNIX convention for hidden file names). This allows you for example to use scrap or other work directories without having them processed by HCLEAN - just call them '.scrap' or something similar.

    Warning: if you combine /S with /O:dir (set output directory), you should ensure that the output directory 'dir' is not a subdirectory of any of the starting directories, or infinite recursion will occur as HCLEAN processes ever deeper copies of the output directory.

/U
    Update only. Processes only those files whose output version does not yet exist, or whose output version is older than the input version.

/Zx...
    Zap 'x', where 'x' can be one or more of the following:
        + - aggressive zap: zaps potentially dangerous elements (see Notes)
        a - active content: scripts (as '/zs'), APPLET, OBJECT (not implemented yet)
        b - blanks (white space, see Notes)
        c - comments (except within STYLE and SCRIPT sections)
        e - end tags, where optional (see Notes)
        f - frames: removes outer FRAMESET and retains inner FRAME contents (not implemented yet)
        i - images: IMG, MAP, AREA
        n - SPANs (but not the spanned contents)
        q - quotes around attribute values without embedded spaces
        s - scripts: SCRIPT blocks and inline event handlers (the latter not implemented yet)
        y - styles: STYLE blocks and inline style definitions
        z - all of the above except +; this must be specified separately
        (nothing) - /Z is equivalent to /Zbceq and is the recommended option set to reduce HTML file size without disturbing the page layout.

See the Notes section for details about the various zapping options.
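
Because HCLEAN does not validate the output directory itself (see the warnings under /O:dir and /S), it can be worth checking the choice before a run. The Python sketch below is a hypothetical pre-flight check, not part of HCLEAN: the function name output_dir_is_safe is an assumption, both paths are assumed to be absolute and already normalized, and only the two cases named in the warnings are flagged (the output directory being the start directory itself, or lying inside it without a '.'-prefixed component that HCLEAN would skip).

```python
from pathlib import PureWindowsPath


def output_dir_is_safe(start_dir, output_dir):
    """Return False if output_dir would be picked up again as input."""
    start = PureWindowsPath(start_dir)
    output = PureWindowsPath(output_dir)
    try:
        relative = output.relative_to(start)
    except ValueError:
        # The output directory is not located under the start directory.
        return True
    if not relative.parts:
        # Same directory as the input (the '/O:.' case).
        return False
    # Inside the input tree: safe only if some component starts with '.',
    # because such directories are excluded from processing.
    return any(part.startswith(".") for part in relative.parts)
```

With these assumptions, 'C:\site\.output' and 'C:\Deploy' pass for a start directory of 'C:\site', while 'C:\site\out' does not.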


Notes


Examples and suggestions for use

HCLEAN can be used for many purposes. Below are a number of examples with hints about possible applications.

File size reduction

    HCLEAN /zbceq *.htm

- or -

    HCLEAN /z *.htm

Processes all .htm files in the current directory. Output files have the same names as the input files and are placed in the '.output' directory. During processing, white space is minimized and comments, optional end tags and attribute quotes are removed.

Note: /z is equivalent to /zbceq and is the recommended option set to reduce HTML file size without disturbing the page layout.

    HCLEAN /z /s /u /o:C:\Deploy *.htm

- or -

    HCLEAN /suz /o:C:\Deploy *.htm

Processes all .htm files in the current directory and its subdirectories. The output files are placed in the directory 'C:\Deploy' in a subdirectory tree that mirrors the original one. Output files have the same names as the input files, and matching files are only processed if the input file is newer than the output file (or the output file doesn't exist yet). During processing, white space is minimized and comments, optional end tags and attribute quotes are removed.

This option set is useful as a final clean-up step before a web site is published. It minimizes the HTML file sizes without altering the layout of the documents and adds a limited level of obfuscation. Also, because only new or changed files are processed, total processing time is minimized.

If you publish your web site on an intranet, this step can take care of your web site publishing as well: simply specify the (UNC) name of the web server and be done with it. For example (assuming your web server is called 'Enigma', its web site root share is 'pub', and the web site subdirectory is 'www'):

    HCLEAN /suz /o:\\Enigma\pub\www *.htm
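
The update-only test that /U applies in the examples above amounts to a simple comparison. The Python sketch below shows one plausible reading; the function name needs_processing is hypothetical, and the use of file modification times to decide which version is "older" is an assumption, since this document does not say how HCLEAN compares the two files.

```python
import os


def needs_processing(input_path, output_path):
    """Return True if the output file is missing or older than the input."""
    if not os.path.exists(output_path):
        return True                    # no output version yet
    # Assumption: "older" is judged by last-modification time.
    return os.path.getmtime(output_path) < os.path.getmtime(input_path)
```

A file is therefore skipped only when an output version exists and is at least as recent as its input.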

Usability testing

    HCLEAN /zi /o:test *.htm

Processes all .htm files in the current directory and places the output files in the subdirectory 'test'. This subdirectory is created automatically if it doesn't already exist. Output files have the same names as the input files. During processing, all images and related tags and attributes are removed and replaced by their ALT texts. This is a good way to assess the usability of the page when image downloads are disabled (although it is actually more generous with MAP/AREA image maps than most browsers are).
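
To give a feel for what replacing images by their ALT texts means, here is a rough Python sketch. It is not HCLEAN's algorithm: the function names are hypothetical, only IMG tags are handled (MAP and AREA are left alone), the fallback to the TITLE attribute follows the note for version 1.00.003 in the version history, and the regular expressions assume that attribute values contain no '>' and no text that looks like another attribute.

```python
import re

# Matches a whole <IMG ...> tag; assumes no '>' inside attribute values.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def attribute_value(tag, name):
    """Return the value of attribute 'name' in 'tag', or None if absent."""
    pattern = (r"(?<![\w-])" + re.escape(name) +
               r"\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s>]+))")
    match = re.search(pattern, tag, re.IGNORECASE)
    if match is None:
        return None
    # Exactly one of the three alternatives (double-quoted, single-quoted,
    # unquoted) has matched.
    return next(group for group in match.groups() if group is not None)


def replace_images(html):
    """Replace every IMG tag by its ALT text, else its TITLE, else nothing."""
    def substitute(match):
        tag = match.group(0)
        text = attribute_value(tag, "alt")
        if text is None:
            text = attribute_value(tag, "title")
        return text or ""
    return IMG_TAG.sub(substitute, html)
```

Applied to '<p><img src="a.gif" alt="Logo"> Hi</p>', this sketch produces '<p>Logo Hi</p>'.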

Removal of unwanted markup

    HCLEAN /zz index.htm

Processes the file 'index.htm' in the current directory and puts the result in '.output\index.htm'. As much non-essential HTML markup as possible is removed, while retaining a semblance of the original document.

Note: this option set can change the page layout of the HTML document quite dramatically if it contained images or styles, and may also change the behavior if it held active content.

    HCLEAN /zz+ index.htm

Does the same as the previous example, but even more aggressively.


Author

HCLEAN is developed by Tarma Software Research Ltd. (http://www.tarma.com).


License terms

Documentation and software are copyright © 1999-2000 Tarma Software Research Ltd. All rights reserved. You are hereby granted a license to use HCLEAN for your own personal and private use. You are allowed to redistribute the software and its documentation, but the redistribution should include the entire software program and the complete documentation, including all copyright notices and license terms.

The following are specifically not allowed under the terms of your license:

Please contact Tarma Software Research <sales@tarma.com> if you require different terms.

If you find HCLEAN useful, a link from your web pages to our web site at http://www.tarma.com would be appreciated. Send your comments, suggestions, etc. to support@tarma.com

Important: the HCLEAN software is provided "as-is". Tarma Software Research Ltd. makes no warranties, either express or implied, as to the quality or fitness for any particular purpose of the software. Tarma Software Research shall have no liability to you, or to any other person or entity for any direct, indirect, incidental, special, or consequential damages whatsoever, including, but not limited to the loss of revenue or profit, the loss of data, or other commercial or economic loss, even if Tarma Software Research has been advised of the possibility of such losses.


Version history

1.20.015

14 October 2000 - More internal reorganisations. HCLEAN now accepts a mixture of filenames and directory names. Use of an output directory is now mandatory (defaults to '.output'); removed /R (restore files) option accordingly. Added new rules for excluded file and directory names and attributes; revised the lists of excluded filename extensions and HTML-type extensions. File/directory enumeration has been made more efficient in preparation for an FTP-enabled edition. Wildcard matching has been improved. Various small changes to the console and logfile output format.

1.20.014

28 May 2000 - Added /Fx (fix external anchor links).

1.20.011-013

April-May 2000 - Internal reorganisations and small updates. Implementation uses new directory traversal classes in preparation for an FTP-compatible version. Files other than HTML source files (for example, image files) are now also processed by simply copying them if they are new or changed. Read-only output files are now silently overwritten by HTMLCleaner.

1.10.010

March 2000 - Added various /F (fix) options. Added separate log file output.

1.00.009

28 February 2000 - Changed output format slightly.

1.00.008

15 February 2000 - Added subdirectory exclusion for directories whose names start with '.'

1.00.007

9 December 1999 - Made loading of SHELL32.DLL dynamic to improve application start-up time.

1.00.006

1 December 1999 - Corrected bug that was introduced in 1.00.005, causing comments not to be zapped if scripts or styles were zapped.

1.00.005

Released 29 November 1999: Changed comment and directive processing to make it more robust. It will now terminate a comment (started with '<!--') after seeing any sequence '--' [white_space] '>', rather than following the stricter HTML/SGML interpretation that comments are surrounded by pairs of '--'. Therefore, the following will now be seen as a properly terminated comment:

<!-- -- -->

although technically, it's a comment and a half and not yet terminated. The change was prompted by tools like Macromedia Fireworks that have a tendency to insert comments like:

<!------------ END OF JAVASCRIPT CODE ------------>

and all bets are off in those cases. The new approach doesn't choke on these comments. For general declarations (starting with '<!name', such as '<!DOCTYPE'), the end of the declaration is now considered to be the first non-quoted '>', regardless of comments appearing in the declaration. As a result, the following will now be seen as a properly terminated declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" -->

although technically, it ends with an open comment and is not yet terminated when the '>' is seen. This is a rare situation anyway.
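
The two termination rules introduced in this version can be illustrated with a short Python sketch. This is a reconstruction from the description above, not HCLEAN's source code: the function names are hypothetical, the search for the comment terminator is assumed to start just after the opening '<!--', and both single and double quotes are assumed to count as quoting characters in declarations.

```python
import re

# '--', optional white space, then '>' ends a comment.
COMMENT_CLOSE = re.compile(r"--\s*>")


def comment_end(text, start):
    """Return the index just past the comment opening at 'start', or -1."""
    if not text.startswith("<!--", start):
        return -1
    match = COMMENT_CLOSE.search(text, start + 4)
    return match.end() if match else -1


def declaration_end(text, start):
    """Return the index just past the first non-quoted '>', or -1."""
    quote = None
    for index in range(start, len(text)):
        char = text[index]
        if quote is not None:
            if char == quote:          # closing quote found
                quote = None
        elif char in "\"'":
            quote = char               # opening quote
        elif char == ">":
            return index + 1
    return -1
```

Both example strings shown above, the '<!-- -- -->' comment and the DOCTYPE declaration ending in '-->', are accepted in full by these functions.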

1.00.004

Released 26 November 1999: Implemented /U (update only), /S (process subdirectories), and /R (restore files).

1.00.003

Released 24 November 1999: Improved handling of comments (<!--...-->), declarations (<!name...> etc.), marked sections (<![name[...]]>) and processing instructions (<?...>). Made removal of </TD> tags optional (only under aggressive zapping) to cater for an NN3 bug. Removed /Z+ from the /Zz option list (must now be specified separately). Added TITLE attribute as possible source for IMG replacement text (ALT is still the preferred source). Added /Z- (non-aggressive zap), although I'm not sure that it serves any useful purpose.

1.00.002

Released 18 November 1999: Some small modifications to the documentation (i.e., this file), plus smarter handling of white space within STYLE and SCRIPT blocks, resulting in space gains (read: smaller files). Also, /O:dir (use output directory) now creates the output path if it doesn't already exist.

1.00.001

Released 18 November 1999: First semi-public version. Implements core functionality and some embryonic forms of more advanced options.


Documentation and software copyright © 1999-2000 Tarma Software Research Ltd. All rights reserved.
Web site: http://www.tarma.com. Support: support@tarma.com. Last modified: 3-09-06 12:12