$Id: INTERNALS,v 1.1 1997/03/26 01:29:39 dps Exp $

Here is how the program works.


reader.cc (1.10)

read_character reads characters from a Word document, suitably
translated, including distinguishing between multiple and single ^Gs,
etc.

The output is fetched by chunk_reader::read_chunk_raw, which assembles
it into bits, ignoring inclusions. chunk_reader::read_chunk gets these
chunks and parcels them out with the inclusions separated out.

tok_seq::rd_token adds start and end tags for rows, fields, paragraphs
and all the rest, storing the tokens in a table on a separate queue
before transferring them all onto the main queue. tok_seq::rd_token also
keeps track of the size and detects the probable end of the table.

tok_seq::feed_token takes a token off the queue and requests a refill at the
appropriate time. At the end of the document it tests a flag and, if the flag
is not set, adds a document end entry (and then feeds it to the caller).
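
As a rough illustration of that refill-and-terminate behaviour, here is
a minimal sketch. It is not the code in reader.cc: Token, Feeder, the
refill callback and the "DOC_END" and "NONE" kind names are all
hypothetical, chosen only to show the shape of the logic.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical token; the real one carries more state.
struct Token {
    std::string kind;   // e.g. "PARAGRAPH", "DOC_END" (assumed names)
    std::string text;
};

class Feeder {
public:
    // refill appends tokens to the queue and returns false once the
    // document is exhausted (an assumed interface).
    explicit Feeder(std::function<bool(std::deque<Token>&)> refill)
        : refill_(std::move(refill)) {}

    // Hand out the next token, refilling the queue when it runs dry.
    Token feed() {
        while (queue_.empty()) {
            if (refill_(queue_)) continue;
            if (end_seen_) return Token{"NONE", ""};   // nothing left at all
            queue_.push_back(Token{"DOC_END", ""});    // add the missing end entry
        }
        Token t = queue_.front();
        queue_.pop_front();
        if (t.kind == "DOC_END") end_seen_ = true;     // the flag tested at the end
        return t;
    }

private:
    std::function<bool(std::deque<Token>&)> refill_;
    std::deque<Token> queue_;
    bool end_seen_ = false;
};
```

If the document supplies its own end entry the flag is already set when
the input runs out, so no second one is added.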

OK, so far? Now the fun begins!

If you look at the output now you see horrific stuff like
<PARAGRAPH>550 *<SPEC>eq \F(foo, bar)</SPEC><PARAGRAPH>= 42</PARAGRAPH>
so the input is further processed by tok_seq::math_collect().
math_collect() uses saved_tok as a one-token push-back mechanism and
will use this token before asking feed_token() for one. Non-paragraphs
and non-equations go straight through.
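
The push-back idea can be sketched as follows. TokenSource, next() and
save() are hypothetical stand-ins (tokens are plain strings here); only
the single saved slot mirrors what saved_tok does.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Stand-in for the token stream; next() plays the role of asking
// feed_token() for a token.
class TokenSource {
public:
    explicit TokenSource(std::deque<std::string> input)
        : input_(std::move(input)) {}

    // Return the saved token if there is one, otherwise the next input
    // token. Returns false when both are exhausted.
    bool next(std::string& out) {
        if (has_saved_) {
            out = std::move(saved_);
            has_saved_ = false;
            return true;
        }
        if (input_.empty()) return false;
        out = std::move(input_.front());
        input_.pop_front();
        return true;
    }

    // Stash one token so that the following next() call returns it.
    void save(std::string tok) {
        assert(!has_saved_);   // there is only one slot
        saved_ = std::move(tok);
        has_saved_ = true;
    }

private:
    std::deque<std::string> input_;
    std::string saved_;
    bool has_saved_ = false;
};
```

A caller that peeks at the item after a paragraph and decides not to
consume it simply calls save() on it and carries on.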

When math_collect sees a paragraph it peers at the next item. If this
is not an equation it just forwards the token and stashes the item it
got in saved_tok (saved_tok is definitely free: either it was used
or feed_token supplied something). If it sees an equation it calls
math_reverse_scan to work out whether there is any equation in the
string (guesswork, but it works quite nicely). If math_reverse_scan
decides it is all real text the token is just forwarded (with the
extra token still stashed in saved_tok).

Assuming math_reverse_scan found something to move, that material is
moved into the equation, and both ntok and the current token are
modified. saved_tok still points to ntok, so we use the same
structure but new strings. The reduced paragraph token is returned.
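
To make the idea concrete, here is a sketch of a backward scan. The
heuristic is an assumption made up for illustration (digits and a few
operator characters count as math); the real math_reverse_scan may
guess quite differently. reverse_scan, looks_like_math and
move_tail_into_equation are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumed heuristic: digits and these operator characters look like math.
static bool looks_like_math(unsigned char c) {
    static const std::string ops = "+-*/=()";
    return std::isdigit(c) != 0 ||
           ops.find(static_cast<char>(c)) != std::string::npos;
}

// Index at which the trailing math-like material of text starts, or
// text.size() when the text ends in ordinary prose.
std::size_t reverse_scan(const std::string& text) {
    std::size_t start = text.size();
    for (std::size_t i = text.size(); i > 0; --i) {
        unsigned char c = static_cast<unsigned char>(text[i - 1]);
        if (looks_like_math(c)) start = i - 1;
        else if (c != ' ') break;          // hit real text: stop scanning
    }
    return start;
}

// Move the trailing math (if any) from the paragraph to the front of
// the equation. Returns true when something was moved.
bool move_tail_into_equation(std::string& paragraph, std::string& equation) {
    std::size_t start = reverse_scan(paragraph);
    if (start == paragraph.size()) return false;
    equation.insert(0, paragraph.substr(start));
    paragraph.erase(start);
    return true;
}
```

With the paragraph "550 *" and the equation " \F(foo,bar)" the whole
paragraph text moves across, leaving a reduced (here empty) paragraph.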

-----

When the code sees an equation special (quite possibly saved_tok from
the paragraph process above) it asks feed_token() for the next two
tokens. The next token is the end token for the special; the one
after that is the interesting one and will be called T (the token
itself is *ntok in the code).

If T is an equation the end spec token is junked and the two equations
are joined. One of the equations is then junked. The end special is
pushed onto the start of the output for feed_token to find there;
saved_tok is pointed at the expanded equation. The code then returns
to the original "read a token" state so further aggregation can take
place.

If T is a paragraph then the code uses math_forward_scan to see how
much of it is consumed as part of the equation. If none, the end
special and paragraph tokens are pushed onto the front of the output
queue and saved_tok is invalidated. The code then returns the current
(equation special) token. The end special passes straight through and
then the accumulation can begin again.

If T (a paragraph) is partially consumed, the current equation and T
are adjusted, and the processing then continues as if the paragraph had
no formula contents.

If T (a paragraph) is entirely consumed, its contents are added onto
the text, the paragraph is junked and the end spec is pushed back.
saved_tok is pointed at the current, expanded equation. The code then
returns to the original "read a token" state so further aggregation
can take place.
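
The three paragraph cases (none, partly or entirely consumed) can be
sketched like this. As before the heuristic is invented for the
example and is not the real math_forward_scan; Consumed, mathish,
forward_scan and absorb are hypothetical names, and the queue handling
described above is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class Consumed { None, Partial, All };

// Assumed heuristic: digits and these operator characters look like math.
static bool mathish(unsigned char c) {
    static const std::string ops = "+-*/=()";
    return std::isdigit(c) != 0 ||
           ops.find(static_cast<char>(c)) != std::string::npos;
}

// Number of leading characters of text that look like part of the
// equation; 0 when the text starts with ordinary prose.
std::size_t forward_scan(const std::string& text) {
    std::size_t end = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (mathish(c)) end = i + 1;
        else if (c != ' ') return end;     // prose starts here
    }
    return end == 0 ? 0 : text.size();     // nothing but math and spaces
}

// Append whatever the scan claims to the equation and trim the
// paragraph; the result says which of the three cases applied.
Consumed absorb(std::string& equation, std::string& paragraph) {
    std::size_t n = forward_scan(paragraph);
    if (n == 0) return Consumed::None;
    equation += paragraph.substr(0, n);
    paragraph.erase(0, n);
    return paragraph.empty() ? Consumed::All : Consumed::Partial;
}
```

In the None and Partial cases the caller would push the end special and
the (possibly trimmed) paragraph back; in the All case the paragraph is
dropped and the loop goes round again.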

The output now contains nice stuff like

<PARAGRAPH><SPEC> 550 * \F(foo,bar) = 29</SPEC></PARAGRAPH> and even
horrors that Word Viewer renders as displayed equations like
<PARAGRAPH><SPEC> 550 * \F(foo,bar) = 29</SPEC><PARAGRAPH>.</PARAGRAPH>


This output is requested by tok_seq::read_token() which is the public
method. It is not devoid of tricks however. Anything other than the
start of a paragraph passes straight through.


When it sees a paragraph it pushes it onto a separate queue and
accumulates totals of the characters and specials it sees. The loop
exits when any of the following applies:

	The paragraph character total exceeds the (small, currently 3)
	threshold.

	The end of the paragraph is spotted.

	A non-special, non-paragraph, non-other character is seen (if this
	happens we add the threshold to the count to be sure of being >= it).


On exit from the loop, if the total is less than the critical value the
queue is reversed and inserted at the front of the output queue minus
the paragraph items. Since the tokens are inserted as the first
character of the output they appear in reverse order of insertion (hence
the reverse makes the elements appear in the original order on the output
queue). This deletes that extraneous and wrong full stop, for example.

Otherwise the elements are transferred to the front of the
output queue in the existing order (this actually just sets a couple
of pointers).

Either way the temporary queue is now empty and is deleted. The first
item dequeued is returned. (This is what rtest2 shows you).
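
The short-paragraph filter might be sketched as below. This is a
simplification under stated assumptions, not the real read_token: Kind,
Item, kThreshold and read_item are hypothetical names, the paragraph
text is assumed to live on the paragraph start item, only characters
are counted, and a std::deque stands in for the real queues (so the
"couple of pointers" trick becomes an ordinary insert).

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Structural stands for anything that is not a paragraph, special or
// "other" item (an assumed classification).
enum class Kind { ParaStart, ParaEnd, Special, Other, Structural };

struct Item {
    Kind kind;
    std::string text;   // paragraph text sits on ParaStart in this sketch
};

const std::size_t kThreshold = 3;   // "small, currently 3"

static bool is_para(const Item& it) {
    return it.kind == Kind::ParaStart || it.kind == Kind::ParaEnd;
}

// Pop the next item for the caller; returns false when out is empty.
bool read_item(std::deque<Item>& out, Item& result) {
    if (out.empty()) return false;
    Item first = out.front();
    out.pop_front();
    if (first.kind != Kind::ParaStart) {   // everything else passes through
        result = first;
        return true;
    }

    std::deque<Item> held;                 // the separate queue
    std::size_t chars = first.text.size();
    held.push_back(first);
    while (chars < kThreshold && !out.empty()) {
        Item t = out.front();
        out.pop_front();
        held.push_back(t);
        if (t.kind == Kind::ParaEnd) break;
        if (t.kind == Kind::ParaStart) chars += t.text.size();
        else if (t.kind == Kind::Structural) { chars += kThreshold; break; }
    }

    if (chars < kThreshold) {
        // Too short: walk the held queue backwards, pushing each
        // non-paragraph item on the front so the original order survives.
        for (auto it = held.rbegin(); it != held.rend(); ++it)
            if (!is_para(*it)) out.push_front(*it);
        return read_item(out, result);
    }
    // Long enough: put everything after the first item back, in order.
    out.insert(out.begin(), held.begin() + 1, held.end());
    result = first;
    return true;
}
```

Fed an empty paragraph start, an equation special, a paragraph holding
only "." and a paragraph end, this hands back just the special: the
paragraph items, and the stray full stop with them, are gone.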


Future development will include processing to stop lists and stuff like
that.... as you now know everything is very simple and plain.


Oh, yes, and the *TeX output format includes plenty of context queue
use too... There is also a bit in the ascii output. Overall this tends
towards my idea of a complex AI program using context queues to do the
right stuff about what Word throws at it!!

I hope this is now 100% clear.