ERCIM News No.26 - July 1996 - INRIA
Computational Linguistics is needed to check Typographic Conventions
by Jacques André and Hélène Richy
Little attention is given to the quality of electronic documents
(especially to those using HTML) in terms of typography (eg abiding by the
rules of the Chicago Manual of Style). The specification of 'typographic
sheets' can help in checking typographic correctness in structured documents.
However, such a checker requires tools from computational linguistics.
Today, thanks to electronic documents and worldwide networks, authors and
readers communicate directly. Alas, this is quite often done without the
help or the savoir faire underlying the traditional activities of typographers,
editors, proofreaders, and printers. For most people, typography concerns
visual aspects such as font and character design and the layout of pages.
Even if it is related to legibility, another aspect of typography is more
concerned with the text itself rather than with its appearance: there are
typographic conventions such as the rules given in the 'Chicago Manual of
Style' or in the French 'Code typographique'. These rules refer not only
to spacing before or after punctuation, but to capitalization, use of italics,
use of acronyms and abbreviations, composition of numbers, etc. While spelling
checkers and even syntax checkers are increasingly built into formatters,
very little is done to check typographic conventions (apart from naive
tests such as balancing of parentheses). We are now developing
such a typographic checker.
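To make the kind of 'naive test' mentioned above concrete, here is a minimal
Python sketch of two surface-level checks: balanced parentheses and stray
spaces before punctuation. The function names are hypothetical, and the
spacing rule assumes English-style conventions (French conventions differ),
so this is an illustration only, not part of the checker described in this
article.

```python
import re

def parentheses_balanced(text):
    """Return True if every ')' closes an earlier '(' and none stay open."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing parenthesis with no opener
                return False
    return depth == 0

# Assumption: English-style spacing, ie no white space before , ; : ! ?
SPACE_BEFORE_PUNCT = re.compile(r"\s+[,;:!?]")

def spacing_errors(text):
    """Return the offsets where white space precedes a punctuation mark."""
    return [m.start() for m in SPACE_BEFORE_PUNCT.finditer(text)]
```

Such tests look only at characters; they know nothing about where in the
document the text occurs.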
The purpose of a typographic corrector is to propose some corrections to
the author when errors are found. The problem is that linguistic parsers
analyse sentences with the assumption that the punctuation is correct, while
a typographic checker is supposed to detect punctuation errors (among others).
Our first approach is to use the logical structure of a document. Indeed,
many typographic rules are context dependent. For example, periods are omitted
at the end of centred headings, signatures or legends; capitals are allowed
in titles; in a bibliographic item, book titles are to be composed in italic,
etc. A typical, more complex rule is the one describing the punctuation
to be used at the end of a list item: it depends on the rank of the specific
item in the list and on the context of the list, ie whether or not it is within
a sentence. Structured documents allow the separation of different levels
of interest, for example separately defining the description of a document
type (eg SGML's DTD) and its physical description (DSSSL). A typographic
checker has been added to the Thot editor and works with typographic sheets,
based on the DTD. The word 'sheet' implicitly refers to (cascading) style
sheets, as they have the same spirit.
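The idea of a typographic sheet can be illustrated with a minimal Python
sketch. It is not Thot's actual sheet syntax: the element names, the rule
name final_period and the function check_element are all hypothetical, chosen
only to show how a rule can be attached to an element type of the logical
structure.

```python
# Hypothetical "typographic sheet": rules keyed by logical element type.
# Element and rule names are illustrative, not taken from Thot.
SHEET = {
    "heading": {"final_period": "forbidden"},
    "legend": {"final_period": "forbidden"},
    "signature": {"final_period": "forbidden"},
    "paragraph": {"final_period": "required"},
}

def check_element(kind, text, sheet=SHEET):
    """Return a list of warnings for one element of the logical structure."""
    rules = sheet.get(kind, {})
    stripped = text.rstrip()
    warnings = []
    policy = rules.get("final_period")
    if policy == "forbidden" and stripped.endswith("."):
        warnings.append(f"{kind}: omit the final period")
    elif policy == "required" and not stripped.endswith((".", "!", "?")):
        warnings.append(f"{kind}: missing terminal punctuation")
    return warnings

# A toy document given as (element type, text) pairs.
document = [
    ("heading", "Typographic conventions."),
    ("paragraph", "Many rules depend on context"),
    ("paragraph", "This one is fine."),
]
for kind, text in document:
    for warning in check_element(kind, text):
        print(warning)
```

Applied to the three sample elements, the sketch flags the heading for its
final period and the first paragraph for its missing terminal punctuation.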
However, this first approach presents limitations with respect to linguistic
structures. Let us take two examples. The Chicago Manual of Style says "Omit
the period after ... running heads ...". However, linguistic tools
(such as abbreviation dictionaries or morphological analysis) are needed to
decide whether a dot is a period or an abbreviation mark (eg after "etc.").
The same Manual of Style also says "The exclamation point should be
placed inside the quotation marks... when it is part of the quoted ... matter;
otherwise it should be placed outside." This implies that a typographic
checker must, for example, be able to analyze the semantics of the
following two sentences correctly:
- The woman cried, "Those men are beating that child!"
- Her husband replied - calmly - "It is no concern of mine"!
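The first example (period or abbreviation mark) can be sketched in Python as
follows. This is a minimal illustration under stated assumptions: the
abbreviation set is a tiny hypothetical stand-in for a real abbreviation
dictionary, and the names final_dot_kind and check_running_head are invented
for the example. The second example (placement of the exclamation point)
needs semantic analysis and is not attempted here.

```python
# Tiny, hypothetical abbreviation dictionary; a real checker would need a
# much larger, language-specific one.
ABBREVIATIONS = {"etc.", "e.g.", "i.e.", "cf.", "vol.", "no."}

def final_dot_kind(text):
    """Classify the trailing dot of `text`.

    Returns "none" if the text does not end with a dot, "abbreviation" if
    the last word is a known abbreviation, and "period" otherwise.
    """
    stripped = text.rstrip()
    if not stripped.endswith("."):
        return "none"
    last_word = stripped.split()[-1].lower()
    if last_word in ABBREVIATIONS:
        return "abbreviation"
    return "period"

def check_running_head(text):
    """Warn only when the running head ends with a true period."""
    if final_dot_kind(text) == "period":
        return ["running head: omit the final period"]
    return []
```

With this lookup, a running head such as "Fonts, layout, etc." is accepted,
while "Typographic conventions." is reported. A dictionary lookup alone
cannot tell whether a dot after an abbreviation also ends a sentence.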
Our research now consists in defining the lowest level of linguistic tools
needed for such a typographic checker.
Please contact:
Jacques André - Inria/Irisa
Tel: +33 99 84 73 50
E-mail: Jacques.Andre@irisa.fr
or Hélène Richy - CNRS/Irisa
Tel: +33 99 84 73 71
E-mail: Helene.Richy@irisa.fr