SGML2TeX

SGML2TeX is a prototype program to convert the SGML tags in a document to control sequences using the conventions of TeX. This document reflects v0.97β of the program.

Environment

The program is currently written in PCL, a language developed explicitly for the Intel 80*86 chips, because of the speed of execution possible and the availability of a royalty-free run-time execute module (included with the distribution of this program). The program is therefore restricted in its current state to execution under MS-DOS or in an MS-DOS task window of DesqView, DesqView/X or MS-Windows. A future version will be written in a more portable language system, probably CWEB.

SGML is the Standard Generalised Markup Language (ISO 8879), the international standard for text markup. It is a language for defining other document structures, rather than a specification in itself of any particular document. An `SGML file' in the terms of the SGML2TeX program means an SGML `document instance': that is, the user text of an SGML document, without any DTD (Document Type Definition) or other markup declarations. SGML2TeX does not perform any parsing or validating of the SGML file, as public-domain parsers are freely available for this purpose. It is therefore your responsibility to ensure that only a valid conformant orthogonal SGML file without minimisation is processed by this program, with the syntax inherent in the SGML Reference Concrete Syntax. No responsibility can be taken for the results of processing any other form of SGML text, but suggestions for improvement are always welcomed. (In plain language this means the program will process a normalised SGML file with no DTD or SGML declaration attached, but nothing else.) A good introduction to the concepts of SGML is published by the Text Encoding Initiative (TEI).

TeX here means the typesetting system devised by Donald Knuth, together with add-on variants such as LaTeX. The code output by the SGML2TeX program includes empty definitions for all elements, attributes and entities encountered, in a dummy style file. It is your responsibility to implement the style file in order to achieve the desired result. SGML2TeX provides a configuration file option for the predefinition of known tags. The TeX system is available for almost all platforms in a choice of public-domain/shareware/freeware or commercial implementations: contact the TeX Users Group, Box 21041, Santa Barbara, CA 93121-1041, USA for further details (phone: [+1] (805) 963 1338; fax: [+1] (805) 963 8358; email tug@tug.org).

This is beta-release software. The program appears to perform as indicated below, but you are asked to report any bugs encountered to the author. The program is copyright of the author but unrestricted redistribution is permitted provided no modifications are made.

Installation

The program is distributed as a .zip file (available by anonymous ftp from www.ucc.ie as pub/sgml2tex.zip) so it must be copied to the root directory of your disk and unwrapped with the command

pkunzip -d -o sgml2tex

A copy of the pkunzip program is available in the same ftp directory if you do not already have one and none is available locally. Unwrapping the .zip file as described creates a directory called \pcl, containing the program and the PCL runtime module. A batch file, sgml2tex.bat, for running the program is also created in the root directory of the disk where the unzipping took place. This batch file can be moved to wherever in the DOS path you keep your batch utilities.

The batch file performs all necessary path-setting for execution, and sets your path back afterwards, so no modification to config.sys or autoexec.bat is necessary, provided the batch file is kept in a directory referred to in your DOS path. If you are not familiar with the path command on your machine, consult a local DOS expert and show them this section.

The sgml2tex.htm file (this documentation) and its conversions are unwrapped into a directory called \sgml. Details of the HTML format (an application of SGML) are available online.

The TeX eplain macros are used in the default translation, and a copy of eplain.tex is included in the .zip file: this is unwrapped into the \emtex\texinput directory, so if you are using one of the other versions of TeX for the PC rather than emTeX, you should move this file to wherever you keep your TeX macro files.

A preprocessed copy of this document is included as file sgml2tex.ps for printing on PostScript printers, but you can generate your own version as a test by moving to the \sgml directory and typing

sgml2tex /d html sgml2tex
tex sgml2tex
and then printing it using whatever print driver you normally use for TeX.

Running the program

If the sgml2tex.bat file is used, the command to run the program is

sgml2tex [/option [filename] ... ] sgmlfile [texfile [stylefile]]

If the batch file is not used, the full command is

pcl run sgml2tex

with options and arguments as before. In this case it is your responsibility to ensure that the \pcl directory is accessible to the DOS path.

Processing

During processing, a percentage bar indicator shows how much of the file has been processed. Counters are displayed for lines, characters and words processed. Execution can be interrupted at any stage with Ctrl-Break, after which the command quit can be used to leave the program. After execution, control is returned to the DOS prompt.

SGML elements in the file are converted to a TeX-compatible form. The program represents the SGML of the source file by

All multiple spaces, tabs and line ends are considered equivalent to a single space unless they are the content of an element defined in the configuration file as `special' (see details of the special keyword in the section on configuration files below).
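The whitespace rule can be sketched in a few lines of Python. This is only an illustration, not the program's PCL code: the function name collapse_whitespace is invented here, and the exemption for `special' elements is deliberately left out.

```python
import re

def collapse_whitespace(text):
    """Replace every run of spaces, tabs and line ends with one space.

    Hypothetical helper for illustration only; content of elements
    declared `special' in the configuration file is NOT exempted here.
    """
    return re.sub(r"[ \t\r\n]+", " ", text)

print(collapse_whitespace("storm   imagery\tcan\r\nbe  summarised"))
# prints: storm imagery can be summarised
```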

Example

For an example, the SGML fragment

<P>Goethe's use of storm imagery can be summarised in the last lines of <BOOKTITLE REF="43">Torquato Tasso</BOOKTITLE>: here we find the phrase <QUOTE>berstend rei&szlig;t / Der Boden unter meinen F&uuml;&szlig;en auf.</QUOTE></P>

is converted to

\startP{}Goethe's use of storm imagery can be summarised in the last lines of \startBOOKTITLE{}\REF{43}Torquato Tasso\finishBOOKTITLE{}: here we find the phrase \startQUOTE{}berstend rei\szlig{}t / Der Boden unter meinen F\uuml{}\szlig{}en auf.\finishQUOTE{}\finishP{}
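The default translation shown in this example can be approximated with a short Python sketch. It is not the program's own code: the function name convert and the regular expressions are assumptions made for illustration, attribute values are assumed to be double-quoted (as in an unminimised file), and names are passed through exactly as written in the input.

```python
import re

# A start-tag (with optional attributes), an end-tag, or an entity reference.
TOKEN = re.compile(
    r"<(?P<start>[A-Za-z][\w.-]*)(?P<attrs>[^>]*)>"
    r"|</(?P<end>[A-Za-z][\w.-]*)>"
    r"|&(?P<entity>[A-Za-z][\w.-]*);"
)
ATTR = re.compile(r'([A-Za-z][\w.-]*)\s*=\s*"([^"]*)"')

def convert(sgml):
    """Rewrite tags and entities as TeX control sequences (default rules only)."""
    def replace(match):
        if match.group("start"):
            out = "\\start" + match.group("start") + "{}"
            # Each attribute becomes \NAME{value} after the start sequence.
            for name, value in ATTR.findall(match.group("attrs")):
                out += "\\" + name + "{" + value + "}"
            return out
        if match.group("end"):
            return "\\finish" + match.group("end") + "{}"
        return "\\" + match.group("entity") + "{}"
    return TOKEN.sub(replace, sgml)

print(convert("<QUOTE>rei&szlig;t</QUOTE>"))
# prints: \startQUOTE{}rei\szlig{}t\finishQUOTE{}
```

The sketch ignores the configuration file entirely, so every tag receives the \start and \finish treatment.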

The style file output with the example above would contain the following entries:

\def\startP{}
\def\finishP{}
\def\startBOOKTITLE{}
\def\finishBOOKTITLE{}
\def\startQUOTE{}
\def\finishQUOTE{}
\def\REF#1{}
\def\szlig{}
\def\uuml{}

It is the user's responsibility to define these adequately, in a way which is meaningful for the visual appearance required, for example:

\def\startP{}
\def\finishP{\par}
\def\startBOOKTITLE{\it}
\def\finishBOOKTITLE{\footnote*{\the\fntext{}}\rm}
\def\startQUOTE{``}
\def\finishQUOTE{''}
\newtoks\fntext
\def\REF#1{\fntext={#1}}
\def\szlig{\ss}
\def\uuml{\"u}
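How the empty definitions in a dummy style file might be collected can be sketched in Python as follows. The function name style_stubs is hypothetical, and the ordering (elements, then attributes, then entities, each in order of first appearance) is inferred from the listing above rather than from the program itself.

```python
import re

TAG_OR_ENTITY = re.compile(r"<([A-Za-z][\w.-]*)([^>]*)>|&([A-Za-z][\w.-]*);")
ATTR_NAME = re.compile(r'([A-Za-z][\w.-]*)\s*=\s*"[^"]*"')

def style_stubs(sgml):
    """Return empty \\def lines for every element, attribute and entity seen."""
    elements, attributes, entities = [], [], []

    def remember(seen, name):
        if name not in seen:          # keep order of first appearance
            seen.append(name)

    for match in TAG_OR_ENTITY.finditer(sgml):
        if match.group(1):            # a start-tag; end-tags never match
            remember(elements, match.group(1))
            for attr in ATTR_NAME.findall(match.group(2)):
                remember(attributes, attr)
        else:                         # an entity reference
            remember(entities, match.group(3))

    lines = []
    for name in elements:
        lines += [f"\\def\\start{name}{{}}", f"\\def\\finish{name}{{}}"]
    lines += [f"\\def\\{name}#1{{}}" for name in attributes]
    lines += [f"\\def\\{name}{{}}" for name in entities]
    return lines

print("\n".join(style_stubs('<P><BOOKTITLE REF="43">Tasso</BOOKTITLE> F&uuml;&szlig;en</P>')))
```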

The configuration file

The default configuration file name for a given SGML file is taken from the file type of the SGML file, but with its own file type of .cfg (for example, processing thesis.doc will assume the existence of a doc.cfg file). This can be overridden with the /d option. If the /d option is not used, the default file sgml2tex.cfg is used, if present.
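One possible reading of these rules is sketched below in Python. The function names are invented for illustration, and two points are assumptions rather than documented behaviour: that the /d argument has .cfg appended when it lacks it, and that sgml2tex.cfg is consulted only when no file-type configuration file exists.

```python
import os

def default_config_name(sgml_file):
    """thesis.doc -> doc.cfg: the config is named after the SGML file's type."""
    file_type = os.path.splitext(sgml_file)[1].lstrip(".")
    return file_type + ".cfg" if file_type else None

def choose_config(sgml_file, d_option=None, exists=os.path.exists):
    """Pick a configuration file name; the order shown is an assumption."""
    if d_option:                                   # /d overrides everything
        if d_option.lower().endswith(".cfg"):
            return d_option
        return d_option + ".cfg"
    candidate = default_config_name(sgml_file)
    if candidate and exists(candidate):            # e.g. doc.cfg
        return candidate
    if exists("sgml2tex.cfg"):                     # general fallback
        return "sgml2tex.cfg"
    return None                                    # no configuration file

print(default_config_name("thesis.doc"))
# prints: doc.cfg
```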

The configuration file can establish predefined equivalences for element names, attribute names and character entities, so avoiding the need to hand-edit a style file, and allowing further files using the same configuration file to be processed with reference to an existing style file.

The following statements can be put in the configuration file (a worked example is at the end of this document). The delimiter between tokens is one or more spaces. For this reason, space characters are not currently permitted within the TeX-strings.

element name TeX-start-string TeX-end-string
The keyword `element' is required, followed by
  1. the name of the element in the SGML file (without STAGO (<) or TAGC (>) delimiters);
  2. some TeX code to use when the start-tag is encountered;
  3. some TeX code to use when the end-tag is encountered.
If either TeX-string is a hyphen (-), then replacement will not be used, and the default action will be performed (translation using \start or \finish respectively, as described above). If either TeX-string is a caret (^), then that start-tag or end-tag respectively will be omitted from the output entirely.
The two TeX-strings can include the following substitution parameters:

attribute name TeX-pre-string TeX-post-string
The keyword `attribute' is required, followed by
  1. the name of the attribute in the SGML file;
  2. some TeX code to use when the attribute is encountered (to precede the attribute value);
  3. some TeX code to use to follow the attribute value.

entity name TeX-string
The keyword `entity' is required, followed by the entity name (without ERO (&) or REFC (;) delimiters), and some TeX code to substitute when it is encountered.

special name keyword
The keyword `special' is required, followed by an SGML element name and one of the following settings:

style name
The keyword `style' is required, followed by a filename of a style file to use in place of the default. This name is included in the preamble to the output .tex file using an \input statement.

map char string [string]
The keyword `map' is required, followed by a single character to remap, and a string to remap it to. An optional second string provides an alternate mapping for use in attribute values, where normal string mapping may be undesirable.
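As an illustration of the statement syntax, here is a minimal Python sketch of a reader for such a file. It is not the program's parser: the function name parse_config and the dictionary layout are invented, and folding element, attribute and special names to upper case (while leaving entity names untouched) is an assumption.

```python
def parse_config(text):
    """Read configuration statements into a dictionary (illustrative only)."""
    config = {"element": {}, "attribute": {}, "entity": {},
              "special": {}, "map": {}, "style": None}
    for line in text.splitlines():
        tokens = line.split()          # tokens are separated by whitespace
        if not tokens:
            continue                   # skip blank lines
        keyword, args = tokens[0], tokens[1:]
        if keyword in ("element", "attribute") and len(args) == 3:
            # name -> (start/pre string, end/post string); name folding assumed
            config[keyword][args[0].upper()] = (args[1], args[2])
        elif keyword == "entity" and len(args) == 2:
            config["entity"][args[0]] = args[1]
        elif keyword == "special" and len(args) == 2:
            config["special"][args[0].upper()] = args[1]
        elif keyword == "style" and len(args) == 1:
            config["style"] = args[0]
        elif keyword == "map" and len(args) in (2, 3):
            # char -> (normal string,) or (normal string, attribute-value string)
            config["map"][args[0]] = tuple(args[1:])
        else:
            raise ValueError("unrecognised configuration line: " + line)
    return config

cfg = parse_config("\n".join([r"element quote `` ''", r"entity szlig \ss{}"]))
print(cfg["element"]["QUOTE"], cfg["entity"]["szlig"])
```

Because tokens are split on whitespace, a TeX-string containing a space would be read as two tokens and rejected, which mirrors the restriction stated above.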

Example

Here is an example, suitable for the Goethe text quoted earlier:

element p ^ \par{}
element booktitle \it{} \footnote*{\the\fntext{}}\rm{}
element quote `` ''
attribute ref \fntext={ }}
entity szlig \ss{}
entity uuml \"u
style german.sty
map \ \\

The file german.sty in this example is the user's responsibility, and would be presumed to contain any further TeX code required.
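The effect of the hyphen and caret conventions on the element statements of this example can be sketched in Python. The function name tag_output and the dictionary form of the element table are hypothetical, and substitution parameters are not handled.

```python
def tag_output(name, is_end, elements):
    """Return the TeX text emitted for one start-tag or end-tag."""
    default = ("\\finish" if is_end else "\\start") + name + "{}"
    pair = elements.get(name)
    if pair is None:
        return default                 # element not mentioned in the config
    replacement = pair[1] if is_end else pair[0]
    if replacement == "-":
        return default                 # hyphen: keep the default translation
    if replacement == "^":
        return ""                      # caret: omit the tag from the output
    return replacement

# Element table corresponding to two of the statements in the example.
elements = {"P": ("^", "\\par{}"), "QUOTE": ("``", "''")}

print(repr(tag_output("P", False, elements)))      # start-tag of P is dropped
print(tag_output("P", True, elements))             # end-tag becomes \par{}
print(tag_output("BOOKTITLE", False, elements))    # unconfigured here: default
```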

A sample configuration file and style file, html.cfg and html.sty, are provided. The program can be tested with this present file, sgml2tex.htm, in order to convert, process and print the documentation.

Futures

There are two other methods of converting from SGML to TeX: by redefining TeX's escape character (the backslash) to be the STAGO character (<) and writing a set of macros with capability to detect optional arguments (for attributes); or by performing a complete parse and validation of the SGML file and mapping the contents of the DTD more strictly onto, for example, LaTeX control sequences. Both methods have their advantages and disadvantages, but the present solution seems to offer a halfway house.

Plans for v1.0, the first full release of SGML2TeX, include a rewrite into CWEB, so that the code can be made immediately available on all platforms. The possibility is being investigated of using a code-generator for GUIs, so that X-Windows, MS-Windows, Macintosh, DOS and VT100 versions can be produced, but it is not clear that a native-mode GUI version makes any kind of sense in what is basically a batch-oriented conversion process, apart from using OLE-style drag-and-drop to start the program on a specific file.

Probably more useful would be a feeder file capability, which would allow SGML2TeX to run against a file containing a list of files to convert: this could come from a UNIX pipe or be explicit by prefixing the input filename with an @-sign. This would enable large-scale unattended operation.
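Since this capability does not exist yet, the following Python sketch is purely hypothetical. It shows one way the @-prefix convention could expand into a list of files to convert; the function name expand_inputs and the one-filename-per-line format of the feeder file are assumptions.

```python
def expand_inputs(arguments, read_lines=None):
    """Expand @feeder arguments into the filenames they list (hypothetical)."""
    def from_file(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read().splitlines()

    read_lines = read_lines or from_file   # injectable for testing
    files = []
    for argument in arguments:
        if argument.startswith("@"):
            # Assumed format: one filename per line, blank lines ignored.
            for line in read_lines(argument[1:]):
                if line.strip():
                    files.append(line.strip())
        else:
            files.append(argument)
    return files

print(expand_inputs(["thesis.doc"]))
# prints: ['thesis.doc']
```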

The restriction on TeX-strings in the configuration file not containing spaces could be removed by requiring the field separator to be a TAB character. The logic of the positioning of the replacement strings in the output could also be used to determine whether the strings need to be followed by a space, thus obviating the need for terminating double-curly-braces and removing the intrusive space which gets in TeX's way after such control sequences.

The inclusion of images, a frequent occurrence in HTML files, could be handled by requiring the user to ensure that PostScript versions of the GIFs or JPEGs were present before running TeX on a converted file. I am not aware of any TeX output driver which will handle these graphical formats direct.

One significant problem remains in the processing of malformed, invalid or non-compliant files, especially HTML files, in that many of the tags in HTML can be used for purposes other than those for which they were intended: one common trick used by people who think Mosaic is the only browser in existence is to use the <h4> tag to implement small bold type.

If you have suggestions and comments, please mail them to me: pflynn@curia.ucc.ie
