% -*- lang: icon -*-
\documentclass[11pt]{article}
\usepackage[letterpaper]{geometry}
\usepackage{noweb}
\usepackage{alltt}
\usepackage[hypertex,colorlinks=true,linkcolor=blue,extension=dvi]{hyperref}
\pagestyle{noweb}
\newcommand{\chunkref}[1]{$\langle$\subpageref{#1}$\rangle$}
\title{Mpp, a Multilingual Pretty-Printer for Noweb}
\author{Kostas N. Oikonomou \\ \textsf{oikonomou@att.com}}

\begin{document}
\maketitle
\tableofcontents

\section{Features and usage}

[[mpp]] is a pretty-printer, written in Icon, for the [[noweb]] system.  Its
main features are
\begin{itemize}
\item Any chunk can be written in any language, and the pretty-printer will
  switch among languages.  The language in which a chunk is written is
  indicated by writing its name in parentheses at the end of the chunk name,
  e.g.\ [[@<>]] or [[@<>]]\footnote{A file [[legit_lang_names]] lists the
  strings that are considered to name languages, so when [[mpp]] sees, for
  example, [[@<>]] it will not attempt to switch to language ``again''.
  Actually there is a more general mechanism: the options [[-d1]], [[-d2]]
  allow the user to specify two strings that delimit the language name;
  these default to parentheses.}.
\item If no language is specified, a chunk inherits the language of its
  parent.
\item A default language is specified when invoking [[mpp]], so for a
  single-language [[noweb]] file no chunk needs to specify a language.
\item Languages are described by external ASCII files.  Users can add their
  own, or modify some of the supplied language files, without touching
  [[mpp]]'s code.
\item [[mpp]] does not touch the user's indentation or line-breaking.
\item The spec file for language $L$ specifies $L$'s reserved words, which
  are typeset in bold; strings such as [[>=]], which are typeset in math
  mode and appear as $\ge$; and arbitrary translations of strings into
  \TeX\ code.
\item The strings that denote comments or quotes are customizable, read from
  the spec file for $L$.
  Comments are typeset in roman font, and \TeX's math mode is active in
  comments.
\end{itemize}

[[mpp]] is a [[noweb]] filter, invoked as
\begin{quote}
  [[mpp -lib]] $\langle$\textit{path}$\rangle$ [[-L]]
  $\langle$\textit{language}$\rangle$ [ [[-d1]] $\langle$\textit{s1}$\rangle$
  [[-d2]] $\langle$\textit{s2}$\rangle$ ]
\end{quote}
where the (full) library path specifies where the language spec files are to
be found, and the language is the initial or default one.

\section{The basic design}

The pretty-printer's design is based on the following two premises:
\begin{itemize}
\item It should be as independent of the target language as possible, and
\item We don't want to write a full-blown scanner for the target language.
\end{itemize}
Strings of characters of the target language which we want to typeset
specially are called ``interesting tokens''.  There are three categories of
interesting tokens:
\begin{enumerate}
\item Reserved words of the target language: we want to typeset them in
  bold, say.
\item Other strings that we want to typeset specially, usually in math
  mode: e.g.\ $\le$ for [[<=]].
\item Comment and quoting tokens (characters): we want what follows them,
  or what is enclosed by them, to be typeset literally.
\end{enumerate}
By reading the language spec file, a table [[trans]] is constructed that
defines a translation into \TeX\ code of every interesting token in the
target language.  Here is an excerpt from the language spec file for Icon;
lines beginning with [[#]] are comments.
\begin{alltt}
# Reserved words
+by
+break
+case
# Keywords
+&ascii
+&clock
# Translator directives
+$include
+$line
# Mathematical translations
$<=  \verb|\|le
$>=  \verb|\|ge
$>>  \verb|\|succ
# Arbitrary translations
.\verb|{|  \verb|\{|
.\verb|\|  \verb|\\|
.~==
\end{alltt}
Entries beginning with a ``$+$'' are typeset in bold, those beginning with a
``\$'' define math mode translations, and entries beginning with a ``.''
followed by a pair of strings substitute the second string for the first.

We use four sets of strings to define the tokens in categories 2 and 3:
\begin{center}
  [[special]], [[comment1]], [[comment2]], [[quote2]].
\end{center}
[[comment1]] is for unbalanced comment strings (e.g.\ [[%]] in Turing,
[[#]] in Icon, [[!]] in Fortran), [[comment2]] is for balanced comment
strings (e.g.\ [[/*]] and [[*/]] in C, or [[(*]] and [[*)]] in Mathematica),
and [[quote2]] is for literal quotes, such as [["]], which we assume to be
balanced.

Our approach to recognizing the interesting tokens while scanning a line is
to have a set of characters [[interesting]] (an Icon cset), containing all
the characters by which an interesting token may begin.  [[interesting]] is
the union of
\begin{itemize}
\item the cset defining the characters which may begin a reserved word, and
\item the cset containing the initial characters of all strings in the
  special, comment, and quote sets.
\end{itemize}
The basic idea is this: given a line of text, we scan up to a character in
[[interesting]], and, depending on what this character is, we may try to
complete the token by further scanning.  If this succeeds, we look up the
token in the [[trans]] table; if the token is found, we output its
translation, otherwise we output the token itself unchanged.  When comment
or quote tokens are recognized, further processing of the line may stop
altogether, or temporarily, until a matching (closing) token is found.
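The scan loop just described can be sketched in Python.  This is only an
illustrative model, not [[mpp]]'s code ([[mpp]] is written in Icon and uses
string scanning); the [[trans]] table and the token sets below are made-up
examples.

```python
# Illustrative model of the scan loop; trans and the sets are assumed examples.
trans = {"<=": r"\(\le\)", ">=": r"\(\ge\)", "while": r"{\ttb{}while}"}
res_word_chars = set("abcdefghijklmnopqrstuvwxyz")
special = [">=", "<="]              # longest tokens first when prefixes overlap
interesting = {t[0] for t in trans} # characters that may begin a token

def texify(line):
    out, i = [], 0
    while i < len(line):
        c = line[i]
        if c not in interesting:            # uninteresting: copy unchanged
            out.append(c); i += 1
        elif c in res_word_chars:           # try to complete a reserved word
            j = i
            while j < len(line) and line[j] in res_word_chars:
                j += 1
            token = line[i:j]
            out.append(trans.get(token, token))
            i = j
        else:                               # try to complete a special token
            for s in special:
                if line.startswith(s, i):
                    out.append(trans.get(s, s)); i += len(s)
                    break
            else:
                out.append(c); i += 1
    return "".join(out)
```

A token that is found in [[trans]] is replaced by its translation; anything
else passes through unchanged, so the user's spacing and line breaks survive.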
<<*>>=
link options, strings, fullimag
<>
global language, trans
global res_word_chars, special, comment1, comment2, quote2, interesting
global begin_res_word, begin_special, begin_comment1, begin_comment2, begin_quote2
global in_comment1, in_comment2, in_quote
global libpath, line_num
<>
<>
<>
<<[[main]] procedure>>
<>
@

@
<>=
record lang_spec(res_word_chars, special, comment1, comment2, quote2,
                 begin_res_word, begin_special, begin_comment1,
                 begin_comment2, begin_quote2, interesting, trans)
global known_langs       # table of \texttt{lang\_spec} indexed by language
global legit_lang_names  # set of strings that can be language names
@

@ [[main]] interacts with [[TeXify]] when it comes to comments and quotes.
\enlargethispage{2cm}
<<[[main]] procedure>>=
procedure main(args)
   local line, chunk_name, kind, keyword, rest, L, p0
   local opts, d1, d2, f
   legit_lang_names := set()  # strings that can be language names
   known_langs := table()     # languages whose specs have been loaded
   language := table()  # indexed by chunk name, gives the language used by the chunk
   <>
   <<{\TeX} definitions at the top of the output file>>
   line_num := 0    # line no. in the input file
   while line := read() do {
      line_num +:= 1
      line ? { keyword := tab(upto(' ')|0) &
               rest := if tab(match(" ")) then {p0 := &pos; tab(0)} else &null }
      case keyword of {
        "@begin" : { rest ?
                        kind := tab(many(&letters))
                     write(line) }
        "@defn"  : { if kind == "code" then { <> }
                     write(line) }
        "@use"   : { <>
                     write(line) }
        "@text"  : if \kind == "code" then TeXify(rest,L,p0) else write(line)
        "@nl"    : { if \in_comment1 then {  # unbalanced comment
                        write("@literal \\endcom{}"); in_comment1 := &null }
                     write(line) }
        "@index" | "@xref" : { <> }
        default  : write(line)
      }
   }
end
@

@ The pretty-printer must be called with two arguments:
<>=
opts := options(args, "-lib:-L:-d1:-d2:")
libpath := (\opts["lib"] || "/") |
   stop("mpp: you must specify the full path for the language spec files!")
L := \opts["L"] | stop("mpp: you must specify a language as an argument!")
f := open(libpath || "legit_lang_names") |
   stop("mpp: can't open `", libpath || "legit_lang_names", "'!")
while line := read(f) do
   every insert(legit_lang_names, words(line))
if member(legit_lang_names, L) then
   if /known_langs[L] then load_language_spec(L) else switch_to_language(L)
else
   stop("mpp: language `", L, "' is not in `legit_lang_names'!")
d1 := \opts["d1"] | "("
d2 := \opts["d2"] | ")"
@

@ To switch languages, the chunk name must end with the name of a legitimate
language in parentheses.  See \chunkref{c:getl}.
<>=
chunk_name := rest
if L := get_language(chunk_name,d1,d2) then {
   if /known_langs[L] then load_language_spec(L) else switch_to_language(L)
   write("@language ", map(L, &ucase, &lcase))
}
@

@
<>=
assert(\L)
chunk_name := rest
if \language[chunk_name] ~== L then
   stop("mpp: <", chunk_name, ">'s language already set to `",
        language[chunk_name], "'!")
else
   language[chunk_name] := L
@

@
\section{Language files, translation tables, and interesting tokens}
\label{sec:lang}

\subsection{Language specification files}

Language specification files are named [[Icon_pp_spec]], [[C_pp_spec]],
etc.  They can contain comments, which are lines beginning with [[#]], and
empty lines.  The file format is described in the following
table\footnote{To understand this better, look at one of the examples.}.
$s_i$ is a string, $c_i$ is a character, and $C_i$ is one of the standard
Icon csets (character sets), such as [[&letters]] or [[&digits]].
\begin{center}
\framebox{%
\begin{tabular}{ll}
[[res_word_chars]]: & $C_1$ $C_2$ \ldots\ $c_1$ $c_2$ \ldots \\
[[comment1]]: & $s_1$ $s_2$ \ldots \\
[[comment2]]: & $s_1$ $s'_1$ \quad $s_2$ $s'_2$ \ldots \\
[[quote2]]: & $c_1$ $c'_1$ \quad $c_2$ $c'_2$ \ldots \\
$\langle$translation table$\rangle$: & see \S\ref{sec:tt}
\end{tabular}}
\end{center}
<>=
procedure load_language_spec(L)
   local name, f, line, wlist, w, w1, e, I
   name := libpath || L || "_pp_spec"
   f := open(name, "r") | stop("mpp: can't open file `", name, "'!")
   <>
   w1 == "res_word_chars:" | stop("mpp: `res_word_chars:' expected!")
   res_word_chars := ''
   <>
   <>
   w1 == "comment1:" | stop("mpp: `comment1:' expected!")
   comment1 := wlist    # a list
   <>
   w1 == "comment2:" | stop("mpp: `comment2:' expected!")
   comment2 := []       # a list of pairs
   while put(comment2, [get(wlist),get(wlist)])
   <>
   w1 == "quote2:" | stop("mpp: `quote2:' expected!")
   quote2 := []         # a list of pairs
   while put(quote2, [get(wlist),get(wlist)])
   <>
   close(f)
   <>
   known_langs[L] := lang_spec(res_word_chars, special, comment1, comment2,
                               quote2, begin_res_word, begin_special,
                               begin_comment1, begin_comment2, begin_quote2,
                               interesting, trans)
end
@

@ Get the next non-comment, non-empty line and its words.
<>=
while line := read(f) do
   if line ~== "" & line[1] ~== "#" then break
wlist := get_words(line); w1 := get(wlist)
@

@ Can't use [[variable]] and [[name]] for cset keywords because they are
not variables.  So
<>=
every w := !wlist do {
   if *w = 1 then       # character
      res_word_chars ++:= w
   else {               # cset
      res_word_chars ++:= case w of {
         "&letters" : &letters
         "&digits"  : &digits
         "&lcase"   : &lcase
         "&ucase"   : &ucase
         "&ascii"   : &ascii
         "&cset"    : &cset
         default: stop("mpp: unknown cset in `res_word_chars' line!")
      }
   }
}
@

@ Rather nifty code, using Icon's [[variable]] and [[name]] constructs to
avoid a lot of assignments.
For every field of [[Lspec]] named $n$, assign its value to the global
variable named $n$.
<>=
procedure switch_to_language(L)
   local Lspec, n
   Lspec := \known_langs[L] | stop("mpp: `", L, "' should be known here!")
   every n := name(!Lspec) do {
      n ?:= {tab(find(".")+1) & tab(0)}
      variable(n) := Lspec[n]
   }
end
@

@
\subsection{Translation tables}
\label{sec:tt}

The translation table is now read from the language description file.  There
are four kinds of translations: make bold (entry starts with a ``$+$''),
make slanted (entry starts with a ``\texttt{\~}''), turn into a mathematical
symbol (entry starts with a ``\$''), or arbitrary substitution (line starts
with a ``.'', followed by a pair of strings, of which the 2nd may be empty).
As translations are read, we also create the list of [[special]] tokens, and
the cset [[begin_res_word]].
<>=
trans := table()     # global
special := []
begin_special := begin_res_word := ''
while line := read(f) do {
   if line[1] ~== "#" & line ~== "" then {
      wlist := get_words(line)
      w1 := wlist[1]
      case w1[1] of {
         "+" : w1 ? {    # make bold
                  move(1); w := tab(0)
                  trans[w] := "{\\ttb{}" || w || "}"
                  begin_res_word ++:= w[1]
               }
         "~" : w1 ? {    # make slanted
                  move(1); w := tab(0)
                  trans[w] := "{\\tts{}" || w || "}"
                  begin_res_word ++:= w[1]
               }
         "$" : w1 ? {    # math special token
                  move(1); w := tab(0)
                  # We don't use ``\$'' in the translation because ``\$''
                  # might be part of the language $L$.
                  trans[w] := "\\(" || wlist[2] || "\\)"
                  put(special, w)
                  begin_special ++:= w[1]
               }
         "." : w1 ? {    # arbitrary special translation, possibly empty
                  move(1); w := tab(0)
                  trans[w] := \wlist[2] | &null
                  put(special, w)
                  begin_special ++:= w[1]
               }
         default : { }
      }
   }
}
special := sort_by_length(special)
@

@
\section{Language-independent pretty-printing}
\label{sec:ind}

Find out what is the language of chunk [[chunk_name]].  It must be a
legitimate language name, otherwise the string in parentheses (more
generally, between the delimiters [[d1]] and [[d2]]) is ignored.
\nextchunklabel{c:getl}
<>=
procedure get_language(chunk_name,d1,d2)
   local L, i, n
   n := *chunk_name
   chunk_name ? {
      every i := find(" " || d1)    # get the last occurrence
      if \i > 0 then {
         move(i+(*d1)); L := tab(find(d2))
         if &pos = n then
            if member(legit_lang_names, L) then return L
      }
   }
end
@

@ First we set up the typewriter bold font [[\ttb]], corresponding to
pcrb8r, and the typewriter slanted font [[\tts]].  Then we define the
macros [[\begcom]] (begin comment) and [[\endcom]].  [[\begcom]]
\begin{itemize}
\item switches to [[\rmfamily]],
\item activates [[$]] by changing its catcode to 3,
\item makes the characters ``\texttt{\^{}}'' and ``[[_]]'' active for
  superscripts and subscripts,
\item changes the catcode of the space character to 10.  This way comments
  will be typeset normally, and not as if [[\obeyspaces]] were active.
\end{itemize}
<<{\TeX} definitions at the top of the output file>>=
write("@literal \\DeclareFontShape{OT1}{cmtt}{bx}{n}{ <-> pcrb8r }{}")
write("@nl")
write("@literal \\def\\ttb{\\bfseries}")
write("@nl")
write("@literal \\def\\tts{\\slshape}")
write("@nl")
write("@literal \\def\\begcom{\\begingroup\\rmfamily \\catcode`\\$=3 \\catcode`\\^=7 \\catcode`\\_=8 \\catcode`\\ =10}")
write("@nl")
write("@literal \\def\\endcom{\\endgroup}")
write("@nl")
@

@ Don't output spurious [[@index use]] or [[@xref]] lines when in a comment
or quote.  ([[@index]] is produced by [[finduses]] and [[@xref]] by
[[noidx]].)  However, we do want to output [[@index defn]] lines.  All of
this works only if the language filter is run \emph{before} [[noidx]].
\nextchunklabel{c:cqfix}
<>=
if (/in_comment1 & /in_comment2 & /in_quote) | match("defn", rest) then
   write(line)
@

@ For each interesting category define a cset containing the characters by
which a token in that category may begin, and set [[interesting]] to their
union.
\nextchunklabel{c:disjoint}
<>=
begin_comment1 := begin_comment2 := begin_quote2 := ''
every e := !comment1 do begin_comment1 ++:= cset(e[1])
every e := !comment2 do begin_comment2 ++:= cset(e[1][1])
every e := !quote2 do begin_quote2 ++:= cset(e[1])
@ The token recognition method used in procedure [[TeXify]] assumes that
the various subsets of [[interesting]] are mutually disjoint.  If this
assumption does not hold, the results are unpredictable.
<>=
I := begin_res_word ** begin_comment1 ** begin_comment2 ** begin_quote2 **
     begin_special
*I = 0 | stop("mpp: the characters in the set\n", image(I),
              "\n may begin tokens in more than one interesting category!")
interesting := begin_res_word ++ begin_comment1 ++ begin_comment2 ++
               begin_quote2 ++ begin_special
@

@
\subsection{Formatting a line}

This procedure formats [[@text]] lines in the [[noweb]] file.  Note that
every \TeX{}ified line is a ``literal'' in [[noweb]]'s sense.
<>=
procedure TeXify(line, L, p0)
   local token, emb, c, i, q, qs, c_open, q_open, closing
   static c_close, q_close, TeXspecial
   initial {TeXspecial := '\\${}&#^_%~'}  # The cset of characters treated specially by \TeX.
   writes("@literal ")
   while line ~== "" do
      line ? {
         if \in_comment1 then { <> }
         else if \in_comment2 then { <> }
         else if \in_quote then { <> }
         else { <> }
         line := tab(0)   # There may be more on the line!
      }
   write()
end
@

@
\nextchunklabel{c:notin}
<>=
while writes(tab(upto(interesting))) do
   # The \texttt{&pos+1} is because \texttt{any($C,s,i$)} will produce $i+1$
   # if $s[i]\in C$.
   case &pos+1 of {
      any(begin_res_word) : { <> }
      any(begin_special)  : { <> }
      any(begin_comment1) : { <> }
      any(begin_comment2) : { <> }
      any(begin_quote2)   : { <> }
      default : <>
   }
# Now write out the (uninteresting) rest of the line:
writes(tab(0))
@

@ Well, if we got here there's something wrong in the scanning algorithm.
[[p0]] is the position in the line of the source file where the argument
[[line]] of [[TeXify]] begins.
Note: the reported column is in the Emacs sense, i.e.\ the first character
is in column 0.
<>=
stop("\nmpp: error in procedure TeXify:\n language = ", L,
     ", input line ", line_num, ", column ", p0+&pos-2)
@

@
\subsection{Handling the interesting tokens}
\label{sec:it}

Check for the situation where we have an ``embedded'' reserved word.  E.g.\
suppose [[when]] is a reserved word and any letter can occur in reserved
words.  We don't want [[when]] matched in [[so_when]].
<>=
emb := any('_', &subject, &pos-1) | &null
token := tab(many(res_word_chars))
writes((/emb & \trans[token]) | token)
@

@ There are two issues here.  Suppose we want [[=]] and [[==]] to be
typeset specially, but not [[=-]].  So we put [[=]] and [[==]] in
[[special]].  Now what happens when we encounter [[=]]?  First, we have to
find out if this is really the string [[==]].  So (a) we must match the
{\em longest\/} token in [[special]], in case a special token is a prefix
of another special token.  (b) we must check that we do not have the string
[[=-]], because we do not want it to appear in the output as the
translation of [[=]] followed by ``[[-]]''.  (a) is easily ensured:
[[match(!special)]] will match the longest token if the list [[special]] is
arranged so that longest tokens come first, as specified in
\S\ref{sec:lang}.  (b) is a bigger pain.  We solve it as follows: in the
example given above, we \emph{do} put [[=-]] in [[special]], but
\emph{don't} define a translation for it.  So
<>=
if (token := tab(match(!special)) | pos(0)) then
   writes(\trans[token] | token)
else
   writes(move(1))
@

@
\subsection{Comments and quotes}
\label{sec:candq}

In principle, comments and quotes could be handled by Icon procedures such
as [[bal()]], or the more sophisticated ones in [[procs/scan.icn]].  What
precludes this easy solution is the fact that other filters in the
[[noweb]] pipeline may \emph{break up} comments and quotes that begin and
end on the same line into multiple lines.
For example, the [[finduses]] and [[noidx]] filters are
language-independent, and so can insert spurious [[@index]] and [[@xref]]
lines \emph{in the middle} of commented or quoted text of the target
language.  This greatly complicates the handling of balanced comments, and
especially of unbalanced comments and quotes.  In fact, proper handling of
unbalanced comments forces procedures [[filter]] and [[TeXify]] to
\emph{interact}, as [[TeXify]] cannot detect the end of an unbalanced
comment that has been broken up into multiple lines.  So [[filter]] and
[[TeXify]] interact via the variables [[in_comment]] and [[in_quote]] when
handling comments and quotes, and it is [[filter]] that detects the end of
an unbalanced comment when it encounters a [[@nl]] line.
@

@ If we match a token in [[comment1]], we output it and the rest of the
line as is, but in [[\rm]] font.  Within a comment, characters special to
\TeX\ are active, so \verb+$x^2$+ will produce $x^2$.  A problem with this
is that if you comment out the (C) line \verb+printf("Hi there!\n")+,
\TeX\ will complain that [[\n]] is an undefined control sequence.
<>=
if writes(tab(match(!comment1))) then {
   in_comment1 := "yes"
   writes("\\begcom{}" || tab(0))
   break   # We let \texttt{filter} detect the end of the comment.
}
else
   writes(move(1))  # The character wasn't the beginning of a comment token.
@

@ If we are at this point, it is not necessarily true that we have found a
comment.  For example, in \textsl{Mathematica} comments begin with a [[(]],
which may also appear in [[x+(y+z)]].  The additional complexity comes from
the fact that we have to handle comments extending over many lines.
<>=
every c := !comment2 do {
   c_open := &null
   writes(c_open := tab(match(c[1]))) & c_close := c[2] & break
}
if \c_open then {
   in_comment2 := "yes"; writes("\\begcom{}")
   <>
}
else
   writes(move(1))  # The character wasn't the beginning of a comment after all.
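@ The cross-line bookkeeping for balanced comments can be modeled by the
following hypothetical Python sketch (illustrative only: the
[[/*]]\ldots[[*/]] pair and the [[pending_close]] variable are assumed
examples; [[mpp]] keeps the analogous state in [[in_comment2]] and
[[c_close]]).  Once an opening token is seen, later lines are scanned only
for the matching closing token.

```python
# Hypothetical model of multi-line balanced-comment state; /* ... */ is an
# assumed example pair, not read from a spec file.
comment2 = [("/*", "*/")]
pending_close = None          # closing token we are waiting for, if any

def scan(line):
    global pending_close
    out, i = [], 0
    while i < len(line):
        if pending_close:                       # inside a balanced comment
            j = line.find(pending_close, i)
            if j < 0:                           # comment runs past this line
                out.append(line[i:]); break
            out.append(line[i:j] + "\\endcom{}" + pending_close)
            i = j + len(pending_close)
            pending_close = None
        else:
            for opener, closer in comment2:
                if line.startswith(opener, i):  # comment opens here
                    out.append(opener + "\\begcom{}")
                    pending_close = closer
                    i += len(opener)
                    break
            else:
                out.append(line[i]); i += 1
    return "".join(out)
```

Because the state survives between calls, a spurious line inserted between
the opening and closing tokens does not derail the recognition of the close.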
@

@ Quoted strings may extend over multiple lines, for the reasons mentioned
at the beginning of \S\ref{sec:candq}.  Except for the formatting, we
handle them like balanced comments.  The possibility of escaped quotation
marks inside the quoted string makes things more difficult.
\nextchunklabel{c:quotes}
<>=
every q := !quote2 do {
   writes(q_open := tab(match(q[1]))) & q_close := q[2] & break
}
if \q_open then {
   in_quote := "yes"
   <>
}
else
   writes(move(1))  # The character wasn't the beginning of a quoting token.
@

@
<>=
writes(tab(0))
@

@
<>=
if writes(tab(find(c_close))) then {   # Comment ends here
   writes("\\endcom{}" || move(*c_close))
   in_comment2 := &null
}
else   # Comment doesn't close on this line
   writes(tab(0))
@

@ After having encountered a quote we write literally, except that we
precede every character special to \TeX\ by a backslash and follow it by an
empty group\footnote{The empty group is necessary for the characters
``\~{}'' and ``\^{}''.}.  Detecting the end of a quoted string is tricky: a
[[q_close]] character doesn't end it if it is escaped by a backslash,
unless the backslash is itself escaped by another backslash!  Below,
[[qs]] is the string between quotes, or a piece of it (recall the beginning
of \S\ref{sec:candq}).
<>=
if qs := tab(find(q_close)) then closing := "yes" else qs := tab(0)
qs ? {
   while writes(tab(upto(TeXspecial))) do writes("\\" || move(1) || "{}")
   writes(tab(0))
}
# This took a while to get right.  Is there a simpler way to express it?
if \closing then {
   if \qs[-1] then {
      if qs[-1] == "\\" then {
         if \qs[-2] then
            if qs[-2] == "\\" then <>
      }
      else <>
   }
   else <>
}
@
<>=
{in_quote := closing := &null; writes(move(1))}   # \texttt{q\_close}
@

@
<>=
procedure get_words(s)
   # Also see \texttt{words} in the \texttt{strings} library.
   static it
   local i, L
   initial {it := &cset -- ' \t,'}  # words are separated by blanks, tabs, or commas
   L := []
   s ?
      while tab(upto(it)) do {i := tab(many(it)); put(L,i)}
   return L
end
@

@ Sort a list of strings (or other things with size) by the length of its
elements, longest first.
<>=
procedure sort_by_length(L)
   local L1, L2, s, T
   T := table()
   every s := !L do T[s] := -(*s)
   L2 := sort(T,4)
   L1 := []
   while put(L1, get(L2)) do get(L2)
   return L1
end
@

@
<>=
procedure print_language_spec(L,more)
   local n, tt, s
   s := \known_langs[L] | stop("mpp: `", L, "' is unknown!")
   write("res_word_chars: ", fullimage(s.res_word_chars))
   write("comment1: ", fullimage(s.comment1))
   write("comment2: ", fullimage(s.comment2))
   write("quote2: ", fullimage(s.quote2))
   write("special: ", fullimage(s.special))
   if \more then {
      write("begin_quote2: ", fullimage(s.begin_quote2))
   }
   if \more > 1 then {
      tt := sort(s.trans,1)
      every write(fullimage(!tt))
   }
end
@

@
\section{Unresolved issues}
\label{sec:todo}

\begin{enumerate}
\item Find a good way to handle indexing and cross-referencing when there
  are many languages.
\item There is a niggling unresolved issue, exemplified by Icon.  [[mpp]]
  translates the symbol ``\&'' as ``$\land$'', even though ``\&'' is {\em
  not\/} in Icon's [[special]].  This happens because ``\&'' is in Icon's
  [[res_word_chars]], and a translation for it is defined in
  [[known_langs[icon].trans]].  So when [[TeXify]] encounters it, it
  recognizes it as an Icon reserved word, and uses the translation defined
  for it.  Now if this translation is not wanted, remove ``\&'' from
  [[known_langs[icon].trans]] and don't bother me any more.  However, if
  this translation is ok, we have an inconsistency, in that ``\&'' is not
  in [[special]].  While this is not a real problem, achieving consistency
  (which may be needed in a more general case) is not so easy.  If we add
  ``\&'' to [[special]], the check in \chunkref{c:disjoint} will fail.  To
  fix this, we could
  \begin{enumerate}
  \item Add a constraint to the recognition of a reserved word: it has to
    be a token of length $>1$.
  \item Revise the [[case]] structure in \chunkref{c:notin}, as it will no
    longer work.
  \end{enumerate}
  We could also consider having a separate translation table for special
  tokens.
\end{enumerate}
@

@
\appendix
\section{Index}
\nowebindex

\end{document}