# JOE: Hacker's Guide
[TOC]
## Character classes
The data structures for representing character classes such as \[a-z0-9\] or
\[\\w\\d\] are in [cclass.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/cclass.c) and [cclass.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/cclass.h). These are used by the syntax
highligher, the regular expression matcher, the keyboard mapping table and
for implementing JOE's ctype routines (such as joe_iswupper).
Structures are provided for sets (is a character in a particular set?), mappings
to integers (which integer is associated with this character?) and mappings to
pointers (which pointer is associated with this character?).
Two different sets of algorithms are provided: one set based on intervals
and another based on radix trees.
### Interval arrays
An interval is a simple range of character codes in the following structure:
struct interval {
int first, last;
};
If A-Z is be represented, then first is 65 and last is 90. If just the
number zero is to be represented, then first and last are both 48.
An array of these represents a set. If the array is sorted, you can use a
binary search to determine if a character belongs to a set. Two functions
supporting this are provided:
/* Sort an struct interval array */
void interval_sort(struct interval *array, ptrdiff_t size);
/* Test if character ch is in a sorted interval array using binary search.
Returns -1 if not found, or index to matching struct interval */
ptrdiff_t interval_test(struct interval *array, ptrdiff_t size, int ch);
These functions are no longer used in the editor since they have been
superseded by radix trees. However, facts about Unicode use arrays of
intervals, see [unicat.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicat.c). Unicat.c is generated by [util/uniproc.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/util/uniproc.c) from the
raw data provided on the [unicode.org](ftp://ftp.unicode.org/Public/UNIDATA/) website. Arrays of intervals provide
the source data for the radix trees. The struct Cclass includes an
interval array as the source data before it is optimized into a struct Rset.
### Interval lists
The next data structure in cclass.c is a linked list of intervals:
struct interval_list {
struct interval_list *next; /* Next item in list */
struct interval interval; /* Range of characters */
void *map;
};
This provides a way to create a mapping between characters and pointers
(void *map). For example, a key sequence mapping table (a KMAP) can be
implemented this way. Map could point to either a MACRO bound to a key, or
to a KMAP for a sub-map (for example a top-level map would have ^K bound to
a submap which has H bound to the help command).
The following functions are provided to handle interval lists:
/* Construct a struct interval_list */
struct interval_list *mkinterval(
struct interval_list *next,
int first, int last,
void *map);
/* Free an interval list (frees all items in the list) */
void rminterval(struct interval_list *item);
/* Add a single interval to an interval list */
struct interval_list *interval_add(
struct interval_list *interval_list,
int first, int last,
void *map);
/* Add an array of intervals (a character class) to an interval list */
struct interval_list *interval_set(
struct interval_list *list,
struct interval *array, int size,
void *map);
/* Look up single character in an interval list, return what it's mapped to */
void *interval_lookup(
struct interval_list *list,
void *dflt,
int ch);
/* Print interval list to the log buffer for debugging */
void interval_show(struct interval_list *list);
Interval_add is a complex funtion which adds a mapping to an interval list.
You pass it the address of the first item in the list (or NULL) and it
returns an edited list. Interval_add will override any mappings which
already exist in the given range. The resulting list is always sorted and
the items in it are non-overlapping. It might add, delete or modify items
to the list.
Interval_set is just like interval_add, but takes an array of intervals (as
previously discussed) instead of a single interval. For example, to create
a mapping from all Unicode numbers to a particular MACRO, you could pass the
Nd_table from unicat.c to it.
Interval_lookup returns the pointer mapped to the character, or if none is
mapped, it returns dflt.
### Radix trees
These sets of radix tree structures and associated methods are provided:
* Rset: for sets
* Rtree: for maps from characters to pointers
* Rmap: for maps from characters to integers
The best way to understand a radix tree is to study one of the lookup
functions (slightly simplified):
void *rtree_lookup_unopt(struct Rtree *r, int ch)
{
int a, b, c, d;
int idx;
a = (ch >> TOPSHIFT);
b = (SECONDMASK & (ch >> SECONDSHIFT));
c = (THIRDMASK & (ch >> THIRDSHIFT));
d = (LEAFMASK & (ch >> LEAFSHIFT));
idx = r->top.entry[a];
if (idx != -1) {
idx = r->second.table.b[idx].entry[b];
if (idx != -1) {
idx = r->third.table.c[idx].entry[c];
if (idx != -1)
return r->leaf.table.d[idx].entry[d];
}
}
return NULL;
}
The idea is to break the 21 bit character code into four parts: a (7 bits),
b (5 bits), c (5 bits) and d (4 bits) and use these to traverse a four-level
tree. The top-level table is embedded in the struct Rtree itself. Tables
from the other levels (the two mid levels and the leaf level) are allocated
out of struct Levels. The top and mid-level tables are arrays of 16-bit
shorts.
The number of levels and the size of each level was chosen as a compromise
between speed and size. We certainly want radix tree lookups to be faster
than interval array binary searches (otherwise why bother?), so we can't
have too many levels. On the other hand, radix trees should not use too
much memory. So for example a single level of 2^21 entries is out.
#### De-duplication
The leaf tables often contain the same data, so we de-duplicate them such
that many of the second to last level table entries index to the same leaf
table. This is such an effective optimization that the resulting radix tree
ends up being around the same size as an equivalent interval array (an
interval array with an array of the same number of entries containing
associated data (ints or pointers)).
The second to last level also has duplicate tables, but not many so it's not
de-duplicated.
De-deplication happens progressively during the build, and as a second pass
optimization step. Why do it progressively if there is a whole second pass?
There is an important reason for this: shorts are used for the second to
last level table indexes. If you don't deduplicate progressively, you can
have more than 64K leaf tables before optimization and overflow the short.
There is a second reason: lower memory usage tends to equate to faster
performance. For this reason it's preferable to keep the memory usage low
during the build.
The progressive deduplication algorithm is simple: if the most recently
completed leaf table matches the previously completed leaf table, we use the
previous one in its place.
There is one big complication: there is no strong concept of a completed
leaf table. A leaf table is assumed to be complete when the last entry has
been written to. But there is nothing preventing previous data from being
overwritten, or for tables to be made in reverse order (this can happen in a
regex character class like [zyxw...]). It means that a de-duplicated table
might have to re-duplicated. There is a reference count on each leaf table
which is incremented every time that table is used during de-duplication.
If the reference count is not equal to one during a table write, it is first
copied to a new table (it's a copy on write scheme).
The second-pass deduplication works like this: a hash table of all leaf
tables is built. If a duplicate table is detected during the hash table
build, it is deduplicated (not added to hash table).
#### Anglocentrism
There is one further optimization: we provide faster access to the first 512
character codes (so that ASCII is fast). To do this, the first of the
second to last level tables is embedded in the struct Rtree. If the code is
below 512, then the upper 5 bits index this table to find the leaf table,
and the lower 4 bits index to leaf table. Thus, there is one local access
and one indirect access for codes below 512.
### Rset
struct Rset radix trees are used for simple sets. In this case there are no
leaf tables. Instead, the second to last level table contains bit maps. If
a bit is a one, then the corresponding character is in the set. Each bit
map is 16 bits: the shorts that make up the second to last level table are
used for these bitmaps instead of as leaf-table indexes.
Here are the provided methods:
/* Initialize a struct Rset */
void rset_init(struct Rset *r);
/* Clear a struct Rset: anything malloced for it is freed, but the
* struct Rset itself is not freed (it could be embedded in another
* structure). */
void rset_clr(struct Rset *r);
/* Test if a character is in a set: this one can be called before
* before rset_opt is called. */
int rset_lookup_unopt(struct Rset *r, int ch);
/* Add a range of characters to a set: ch to che */
void rset_add(struct Rset *r, int ch, int che);
/* Add an array of ranges to a set */
void rset_set(struct Rset *r, struct interval *array, ptrdiff_t size);
/* De-duplicate leaf tables and set up embedded table for quicker
* access for characters below 512 */
void rset_opt(struct Rset *r);
/* Test if a character is in a set: only use this after rset_opt has
* been called */
int rset_lookup(struct Rset *r, int ch);
/* Print Rset in human readable form to startup log for debugging */
void rset_show(struct Rset *r);
Rset_add originally accepted just a single character to add, not a range.
This turned out to be too inefficient: the editor was using a lot of CPU
time during start-up just to build the standard character class tables.
Therefore, it has been modified to loop over the range for a fast inner loop
and to lift function call overhead out of the original loop in rset_set.
### Rtree
struct Rtree provides a mapping from characters to pointers. For example,
this is used in the regular expression code to traverse states.
Here are the provided methods:
/* Initialize a struct Rset */
void rtree_init(struct Rtree *r);
/* Clear a struct Rtree: anything malloced for it is freed, but the
* struct Rtree itself is not freed (it could be embedded in another
* structure). */
void rtree_clr(struct Rtree *r);
/* Test if a character is in a tree: return pointer bound to it
* or NULL. This can be called before rtree_opt. */
void *rtree_lookup_unopt(struct Rtree *r, int ch);
/* Add a binding to from a range of characters to a pointer */
void rtree_add(struct Rtree *r, int ch, int che, void *map);
/* Add a binding to from each range of characters in the given array to a pointer */
void rtree_set(struct Rtree *r, struct interval *array, ptrdiff_t len, void *map);
/* Add a binding to from each range of characters in the given
* interval_list list to its pointer */
void rtree_build(struct Rtree *r, struct interval_list *l);
/* De-duplicate leaf tables and set up embedded table for quicker
* access for characters below 512 */
void rtree_opt(struct Rtree *r);
/* Test if a character is in a tree: return pointer bound to it
* or NULL. Only call this after rtree_opt has been called. */
void *rtree_lookup(struct Rtree *r, int ch);
/* Print Rtree in human readable form to startup log for debugging */
void rtree_show(struct Rtree *r);
They are the same as the rset methods, except that a pointer is provided.
Also, a method is provided to convert an interval_list into an Rtree.
### Rmap
struct Rmap provides a mapping from characters to integers. For example,
this is used for Unicode simple case conversion. The integer provides an
offset to be added to the character to convert it from one case to another.
It's important to use an offset, not a direct replacement character for this
so that leaf nodes have the same values and can be de-duplicated.
Here are the provided methods:
void rmap_init(struct Rtree *r);
void rmap_clr(struct Rtree *r);
int rmap_lookup_unopt(
struct Rtree *r,
int ch,
int dflt
);
void rmap_add(
struct Rtree *r,
int ch, int che,
int map,
int dflt
);
void rmap_set(
struct Rtree *r,
struct interval *array,
ptrdiff_t len,
int map,
int dflt
);
void rmap_opt(struct Rtree *r);
int rmap_lookup(struct Rtree *r, int ch, int dflt);
void rmap_show(struct Rtree *r);
They are the same methods as for struct Rtree, except that you provide
'dflt', the value to return if there is no binding for the character.
### struct Cclass
Finally we have struct Cclass, which is made up of an interval array and a
struct Rset and which provides set operations for character classes. The
interval array is used for the set operations, then the array is optimized
into a struct Rset for quick character lookups.
The Unicode database (made from the data in unicat.c and methods in
unicode.c) uses this data structure. For example, you can get a character
class containing all letters with:
struct Cclass *letters = unicode("L");
The set operations allow you to construct complex classes, for example the
character class for \\i, the initial character of identifiers is contructed
like this:
cclass_init(cclass_alnum_);
cclass_union(cclass_alnum_, unicode("L"));
cclass_union(cclass_alnum_, unicode("Pc"));
/* cclass_union(cclass_alpha_, unicode("Sc")); */
cclass_union(cclass_alpha_, unicode("Nl"));
cclass_union(cclass_alnum_, unicode("Mn"));
cclass_union(cclass_alnum_, unicode("Mc"));
cclass_union(cclass_alnum_, unicode("Nd"));
cclass_add(cclass_alnum_, 0x200c, 0x200d);
cclass_opt(cclass_alnum_);
The following methods are provided:
/* Initialize a character class */
void cclass_init(struct Cclass *cclass);
/* Free memory used by a Cclass (does not free cclass itself) */
void cclass_clr(struct Cclass *cclass);
/* Add a range to a character class */
void cclass_add(struct Cclass *cclass, int first, int last);
/* Remove a range from a character class */
void cclass_sub(struct Cclass *cclass, int first, int last);
/* Merge one character class into another */
void cclass_union(struct Cclass *cclass, struct Cclass *n);
/* Merge an interval array into a character class */
void cclass_merge(struct Cclass *cclass, struct interval *intervals, int len);
/* Subtract one character class from another */
void cclass_diff(struct Cclass *cclass, struct Cclass *n);
/* Compute unicode inverse of a character class, for [^a-z] */
void cclass_inv(struct Cclass *cclass);
/* Lookup a character in a character class using binary search.
Return true if character is in the class. */
int cclass_lookup_unopt(struct Cclass *m, int ch);
/* Optimize a character class for fast lookup with cclass_lookup.
* Generates a radix tree version of the character class. */
void cclass_opt(struct Cclass *cclass);
/* Return true if character is in the class */
int cclass_lookup(struct Cclass *cclass, int ch);
/* Print character class to startup log */
void cclass_show(struct Cclass *cclass);
/* Remap a character class from Unicode to a specific locale */
struct Cclass *cclass_remap(struct Cclass *m, struct charmap *map);
Cclass_remap is used to convert a character class from Unicode to some other
character set (represented by struct charmap *). For example, to get
letters in the KOI8-R 8-bit character set, use:
struct Cclass *r = cclass_remap(unicode("L"), find_charmap("KOI8-R"));
Cclass_remap remembers the conversion to save work, so the input struct
Cclass is expected to remain around forever. Basically this is to be used
only for built-in character classes.
## Regular expressions
As of verison 4.1, JOE uses an enhanced version of [Thompson's NFA matching algorithm](https://swtch.com/~rsc/regexp/regexp1.html).
The code is in [regex.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/regex.c) and [regex.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/regex.h).
The regular expression matcher is a subroutine of the the larger text search
algorithm in JOE. For example, text search will use [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)
to find a leading prefix of the regular expression before running the
regular expression matcher. A leading prefix is leading text present in
all of the different possible text which can match the regular expression.
For example, in 'hello(foo|bar)', the leading text is 'hello', but it's empty
in 'hello|there'. One of the jobs of the regular expression code is to
identify this leading text.
The API for the regular expression library is simple:
/* Compile an expression s/len into a struct regcomp
* Check regcomp->err for compile errors.
*/
struct regcomp *joe_regcomp(
struct charmap *charmap, /* The character set of the expression */
const char *s, ptrdiff_t len, /* The regular expression */
int icase, /* Set to ignore case */
int stdfmt, /* Set for standard syntax expressions, vs. JOE syntax */
int debug /* Set to have debug information printed to startup log */
);
/* Free a compiled regular expression */
void joe_regfree(struct regcomp *r);
struct regmatch {
off_t rm_so, rm_eo; /* Starting and ending byte offset of a match, or -1 if no match */
}
/* Returns 0 if buffer at p matches compiled expression r
* 'matches' is filled in with submatch addresses
* 'nmatch' has size of matches array
*/
int joe_regexec(
struct regcomp *r, /* Compiled regular expression */
P *p, /* Pointer to edit buffer */
int nmatch, /* Size of matches array */
struct regmatch *matches, /* Locations of sub-matches */
int fold /* Set to ignore case */
);
Joe_regcomp converts the regular expression into an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
with an operator precedence parser and then generates a byte-code program
representation of the regular expression's [NFA](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton).
Joe_regexec simulates the byte-code program with an arbitrary number of threads machine.
It feeds characters from the buffer into each active thread until one
of the threads encounters the END instruction, there are no more threads, or
the end of the buffer is reached. A side effect is the buffer pointer __P__
is advanced to the first character not part of the matching text.
If the 'v' flag is included in JOE's search and replace options prompt, the
AST and program are sent to JOE's startup log so that they can be viewed
(with ESC x showlog). For example, here is the log when you try to search
for 'hello\\(foo\\|bar\\)\\*there':
~~~~~
Parse tree:
,
,
,
,
,
,
,
,
,
,
'h'
'e'
'l'
'l'
'o'
*
{
|
,
,
'f'
'o'
'o'
,
,
'b'
'a'
'r'
't'
'h'
'e'
'r'
'e'
Leading prefix 'hello'
NFA-program:
PC INSN
-- ----
0: 'h'
4: 'e'
8: 'l'
12: 'l'
16: 'o'
20: fork 92
28: bra 0
36: fork 64
44: 'f'
48: 'o'
52: 'o'
56: jump 76
64: 'b'
68: 'a'
72: 'r'
76: ket 0
84: fork 28
92: 't'
96: 'h'
100: 'e'
104: 'r'
108: 'e'
112: end
Total size = 116
~~~~~
The "Parse tree" (AST) is printed in an indented format: A line begins with an
operator, then all succeeding lines with deeper indentation are its operands.
The ',' operator means succession: match the first operand, then match the
second operand. The '{' operator means record the submatch of its
argument. The '*' operator means 0 or more matches of its argument. The
'|' operator means match one argument or the other.
The "Leading prefix" is the part of the regular expression which can use
Boyer-Moore search.
The "NFA-program" shows the dis-assembled program. A letter like 'h' is an
instruction to match the letter itself. 'fork' is an instruction which
creates a new thread beginning at the specified program counter value.
'jump' means the thread should goto the specified PC value. 'bra' and 'ket'
indicate that the start and ending location of submatch should be recorded.
Here is the log from the regular expression '-\\i\\c-' to show you how
character classes appear (\\i is the initial character of an identifier, and
\\c is the following character of an identifier):
~~~~~
Parse tree:
,
,
,
'-'
[
[41..5a] [5f..5f] [61..7a]
[
[30..39] [41..5a] [5f..5f] [61..7a]
'-'
Leading prefix '-'
NFA-program:
PC INSN
-- ----
0: '-'
4: class [41..5a] [5f..5f] [61..7a]
12: class [30..39] [41..5a] [5f..5f] [61..7a]
20: '-'
24: end
Total size = 28
~~~~~
Note that this is showing the ASCII character encoding so that the character
classes are not too big. With the UTF-8 character encoding, the character
classes will include the full Unicode definitions of \\i and \\c.
### Functions
/* Parse a character or escape sequence from ptr/len.
* Returns with character or -256 and cat (if not NULL) filled in with a character class
*/
int escape(int utf8, const char **ptr, ptrdiff_t *len, struct Cclass **cat);
Escape reads one character from a string. If the character is an escape
sequence, it's parsed into either a single character or a character class.
Escape is used in other parts of the editor: for example, it's used to
parse character lists of syntax highlighting files.
/* Conventional syntax regex */
static int do_parse_conventional(struct regcomp *g, int prec, int fold);
do_parse_conventional is the standard syntax regular expression parser. It
records the AST in regcomp->nodes.
/* JOE syntax regex */
static int do_parse(struct regcomp *g, int prec, int fold);
do_parse is the JOE syntax regular expression parser. The JOE syntax is
just like the regular syntax except that regular expression special
characters must be escaped.
/* Determine leading prefix of search string */
/* We can use boyer-moore on the prefix if:
Character set is byte coded.
Character set is UTF-8 coded, but no folding is requested.
Character set is UTF-8 coded, folding is requested, but character is below 128 */
static int extract(struct regcomp *g, int no, int fold);
Extract finds the leading prefix for Boyer-Moore from the AST.
/* Convert parse-tree into code with infinite loops */
static void codegen(struct regcomp *g, int no, int *end);
Codegen converts the AST into byte-code. Functions from [frag.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/frag.c) / [frag.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/frag.h)
write instructions to the memory image and handle alignment. They also provide a
mechanism for linking the program (to handle forward jumps). Basically a
linked-list of all the jump intructions which target the same destination is
created right in the memory image, using the jump instruction operands as
the link pointers. When the jump destination is finally known, the
link-list is traversed and the link pointers are replaced with the
destination address.
### Matcher
joe_regexec has the arbitrary number of threads simulator. It has two banks
of threads, A and B, and ping-pongs between them for each input character.
It feeds the next input character to all threads in bank A. It executes as
many instructions as possible for each thread, and stops when the thread
finally accepts the character or rejects it. If the thread accepts the
character, it will be active for the next character, so it is moved to bank
B. Otherwise, the thread will die when we transition to bank B. If a
thread in bank A forks, the newly created thread is appended to bank A so
that the current character is fed to it before switching. Once all threads
have been fed the character, we switch to bank B and feed in the next
character.
This scheme is simple and fast, but there are complications. First has to
do with sub-match addressing. Without modification, the matcher will find
all possible solutions to expressions like 'x(a\*)(a\*)y'. This means that
at the end of the match, there will be a thread for each of these
solutions. If the input is xaaaay, it will find x()(aaaa)y, x(a)(aaa)y,
x(aa)(aa)y, x(aaa)(a)y, and x(aaaa)()y.
Research topic: due to this property, I think the Thompson NFA matcher can
be extended to support back-references.
But we don't want all solutions, we want just one solution, the one where
the left-most submatch is longest: x(aaaa)()y. To select the preferred
solution, joe_regexec has a helper function:
static int add_thread(
struct thread *pool, /* Thread pool */
int l, int le, /* Current start and end of pool */
unsigned char *pc, /* Current thread's pc */
Regmatch_t *pos, /* Submatch addresses */
int bra_no, /* Size of submatch addresses */
int *stack, int sp /* Current thread's stack */
);
This adds the thread to the end of the new pool B (which is at
pool[le]). Before it adds, it checks if there are other existing threads
(between l and le) at the same program counter. If there are, then we pick
the one with the "best" submatch, the one with the longest to the left.
The second complication is JOE's extension \\! (which used to be \\c in
versions before 4.1). \\! is like '.': it matches one character. However,
if that one character is the opening bracket of an expression, it matches
the entire expression. \\! will match all of "(1+(2+3))". To support this,
each thread needs a stack to record its progress in parsing the balanced
expression.
## Unicode database
[unicode.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.c), [unicode.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.h) and [unicat.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicat.c) provide
JOE's database of Unicode facts. Unicat.c is generated by [util/uniproc.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/util/uniproc.c) from the
raw data provided on the [unicode.org](ftp://ftp.unicode.org/Public/UNIDATA/) website.
There is a table in unicat.c which has the name of, pointer to and size of
each other table.
struct unicat unicat[] = {
{ "Co", 3, Co_table, 0 },
{ "Cs", 1, Cs_table, 0 },
{ "Zp", 1, Zp_table, 0 },
...
{ "Domino Tiles", 1, uniblocks + 241, 0 },
{ "Mahjong Tiles", 1, uniblocks + 240, 0 },
{ "Arabic Mathematical Alphabetic Symbols", 1, uniblocks + 239, 0 },
...
The function unicode looks up one of these tables and returns a struct
Cclass version of it:
struct Cclass *unicode(const char *name);
If a single letter name is passed to the above function, a Cclass containing
all two-letter named tables with the same matching first letter is returned.
So for example, unicode("L") returns a Cclass of all letters, including
"Lu", "Ll", "Lt", etc.
Joe_iswinit initializes Cclasses for JOE's isw ctype functions. The
following functions are provided:
int joe_iswupper(struct charmap *,int c);
int joe_iswlower(struct charmap *,int c);
int joe_iswalpha(struct charmap *,int c);
int joe_iswalpha_(struct charmap *,int c); /* used for \i */
int joe_iswalnum(struct charmap *,int c);
int joe_iswalnum_(struct charmap *,int c); /* used for \c */
int joe_iswdigit(struct charmap *,int c);
int joe_iswspace(struct charmap *,int c);
int joe_iswctrl(struct charmap *,int c);
int joe_iswpunct(struct charmap *,int c);
int joe_iswgraph(struct charmap *,int c);
int joe_iswprint(struct charmap *,int c);
int joe_iswxdigit(struct charmap *,int c);
int joe_iswblank(struct charmap *,int c);
Another table in unicat.c is the width_table. This indicates which
characters are double-wide on terminal emulators. It is used to implement
the joe_wcwidth function:
int joe_wcwidth(int wide,int c);
'wide' should be true if the character set of c is Unicode. Both the
terminal and the file must be Unicode (UTF-8 or UTF-16) or double-wide
characters to be printed.
Another set of tables in unicat.c includes fold_table and fold_repl for case
folding for string comparisons. Fold_table is an array of ranges. If a
character is in one of these ranges, then the corresponding entry in
fold_repl is a string (of 1 to 3 characters) which should replace it for
case folding. The first character of the string is an offset which should
be added to the original character to get the first character of the
replacement.
rtree_fold is an struct Rtree representation of the fold table. It returns
either a single character replacement offset, or an index into fold_repl if the
character is to be replaced by a string longer than one character.
Lowerize can be called to convert a UTF-32 string into its case folded
version:
int *lowerize(
int *d, ptrdiff_t len, /* Destination string and length */
const int *s /* Source string */
);
Unicat.c also provides simple case conversion tables: toupper_table,
tolower_table and totitle_table along with the corresponding replacement
tables: toupper_cvt, tolower_cvt and totitle_cvt.
These are used to implement joe_towupper and joe_towlower:
int joe_towupper(struct charmap *,int c); /* Convert to uppercase */
int joe_towlower(struct charmap *,int c); /* Convert to lowercase */
Digval is provided to return the decimal value of any unicode digit:
int digval(int ch)
This function uses Nd_table from unicat.c. Nd_table has the property that
each range represents a single set of decimal digits. Normally adjacent
ranges would be combined together, but in this case they are separated to
support digval.
Finally unicode.c provides the following function:
int unictrl(int c);
Which returns true if JOE should treat the character as a control character
which should not be sent directly to the terminal. Instead the character
value is displayed in hexadecimal: JOE 3.6 now uses co-routines to help reduce the amount of event-driven
code which has to be written. All previous versions of JOE were entirely in
the event driven coding style, which made user dialog code a chore to write.
Now, with the help of co-routines, dialog code can be written as old
fashioned blocking code which is much cleaner, even though JOE can have
multiple un-finished dialogs on the screen. The coroutine library has only three (now four) functions: co_call() invokes 'func' as a co-routine- it will run on its own stack.
'func' is called immediately upon the call to co_call. All of the arguments
after func are passed as a va_list to the function. co_call calls va_start
and va_end, 'func' should call va_arg once for each passed argument (see
stdarg.h). The co-routine should not yield before it has finished reading
all of the arguments, otherwise va_end will be called early. When 'func'
returns, the co-routine is destroyed and co_call() returns with the return
value of 'func'. If the co-routine yields by calling co_yield, co_call
returns with the return value specified in the call to co_yield. Note that
co_call only ever returns once for a particular call. co_yield() suspends the current co-routine and returns with the specified
return value to the caller (either to co_call or co_resume). co_yield()
writes information about the co-routine to the 'Coroutine' structure whose
address is passed to co_yield. The Coroutine structure's address is used as
a handle for the co-routine and can be passed to co_resume(). co_resume() suspends the current co-routine (henceforth called the
original co-routine) and resumes the specified co-routine. When the
specified co-routine returns (when the function given in the co_call that
created the co-routine returns), the original co-routine is continued and
co_resume() returns with the function's return value. If the specified
co-routine yields, the original co-routine is continued and co_resume()
returns with the return value specified in the call to co_yield. co_suspend() is like co_resume, except that the top-level task is resumed
instead of the calling co-routine. This has the effect of suspending the
entire chain of co-routines which invoked the caller of co_suspend.
Co_suspend stores the suspended co_routine in another co_routine. That
co_routine will resume this one when it finally returns. The co-routine library is structured this way (which is different from
the usual way where you get a co-routine handle immediately upon creation,
for example see Knuth) to encourage proper stack un-winding. Basically, a
co-routine can not be forcibly destroyed. Instead, each created co-routine
is only destroyed when the initial function returns. The co-routine library is based on ucontext.h and the getcontext,
swapcontext, makecontext and setcontext system calls. If the system does
not have these calls, it uses setjmp/longjmp tricks in a highly portable
way: the stacks for each co-routine are allocated off the top of the main
stack. The main stack grows after each allocation. The stacks are never
freed, but they are reused. This works for any machine with a single normal
stack. It will not work on IA64 (two stacks) or Cray (the stack is really a
linked list). I was originally going to use simple cooperative multi-threading, with
only two basic calls: sleep_on() and wake_up() (inspired by old versions of
the Linux kernel) and a top-level scheduler. Sleep_on would suspend the
current thread, and create a new thread which runs just the scheduler.
Wake_up would mark a thread as ready to run, and on the next return to the
scheduler, it would be run. This turns out to not be powerful enough to implement JOE's keyboard
macro language, at least without much rewriting and inelegance. It could be
done, but the keyboard macro player would have to feed events into the
top-level scheduler or would have to be integrated with it, whereas in older
versions of JOE (even without co-routines) the macro player just calls the
edit functions (more or less) directly. The problem is that sleep_on
(called by a function asking for user input) would suspend the task all the
way up to the scheduler including the macro player, thus preventing the
macro player from simulating user input. On the other hand with real co-routines, the macro player will call each
edit function as a co-routine (which I like to think of as a task with a
very clear return point). If the edit function has to yield, it returns
immediately to the macro player for the next macro step. If the yield is
due to a request for input, the macro player supplies it by calling the
recorded edit functions. If the macro wants to allow the user to provide
input (using the "query" function, also known as an interactive macro), it
can itself yield (since it is also running in a coroutine) back to the
top-level edit loop (in actuality, the macro player calls a function which
calls co_suspend() to do this). The macro player is continued when the user
finishes supplying the input (hits the return key at a prompt). Several
interactive macros can be running at once with this scheme. JOE 3.6 uses "semi-automatic" variables for strings. There is a global
"object-stack", accessed through these functions: These functions remember the order of allocation. When an allocated
block is freed, it and all newer blocks are freed as well. For example, in: b, c, and d are all freed, but a remains allocated. The top-level edit
loop is structured to call obj_free after every editor command: Thus, you rarely have to call obj_free- the only time you might want to do
it is if you are calling obj_malloc in a loop. This could cause a large
amount of memory to be allocated, so obj_free() should be called at the
end of each loop iteration. obj_perm() marks an object as a heap object instead of a stack object.
When a heap object is freed, only itself is freed- the stack is not
modified. Also when the stack is freed, heap objects are not touched. JOE uses dynamic strings built on top of the global object stack
functions. These strings automatically resize themselves to fit their
contents. They also know their own length, which allows you to have NULs in
the middle of a string. They also always have a terminating NUL to make
them compatible with C-strings. In fact they have (almost) the same type as
C-strings: unsigned char *. The length is hidden in the obj_malloc data
structure. These functions are provided: Note that it is OK to pass NULL in place of 's' in any of the above
functions. In this case, a new string is allocated. Most of these function return the address of the variable string, in case
it was modified. One exception is vsgets()- it takes an address of a
variable holding the string. It returns the string address, but it also
writes it to the variable directly. Many of the above functions take a memory block, which is specified by an
address and a length. These macros can be used to help construct these
arguments: So for example, vscatz is defined as: JOE also semi-automatic variable arrays of variable strings. These
functions are provided: When an array is marked as permanent, any strings in the array also
marked as permanent, and any string later added to the array are also marked
as permanent. API:
Look at the comments in b.h for more information.
B *bfind(unsigned char *name);
Load disk file into memory buffer 'B'.
bsave(P *p,unsigned char *name,int size);
Write size bytes from buffer beginning at p to disk file
brm(b); Free data structure
Once you have a B you can access the characters in it via P pointers (which
are like C++ STL iterators).
B *b = bfind("foo"); Load file into memory
P *p = pdup(b->bof); Get pointer to beginning of file (duplicate
b->bof which is a P).
prm(p); Free p when we're done with it.
int c=brch(p); Get character at p. int c=pgetc(p); Get character at p and advance it. int c=prgetc(p); Retract p, then return character at it.
int co_call(int (*func)(va_list args),...); Call a function
as a co-routine. int co_yield(Coroutine *t, int val); Suspend current
co-routine and return to caller with given return value. int co_resume(Coroutine *t, int val); Suspend current
co-routine and resume specified one with given return value. int co_suspend(Coroutine *u, int val); Suspend
current co-routine (remembering its suspended invoking coroutines) and
resume the top-level. The current coroutine is stored in u. When u is
resumed and returns, the suspended chain is resumed (the resumer is
continued when the chain tries to continue the top level.
void *obj_malloc(int size); Allocate a block of memory. void *obj_realloc(void *ptr, int size); Change allocation size. void obj_free(void *ptr); Free a specified block and all newer blocks. void obj_perm(void *ptr); Mark a block as permanent, which in effect moves the block from the stack to the heap. int obj_len(void *ptr); Access hidden "length" part of block (string length). int obj_size(void *ptr); Access hidden "size" part of block (allocation size).
unsigned char *a = obj_malloc(10);
unsigned char *b = obj_malloc(10);
unsigned char *c = obj_malloc(10);
unsigned char *d = obj_malloc(10);
obj_free(b);
unsigned char *gc = obj_malloc(1);
execmd(...); /* Execute edit commands */
obj_free(gc); /* Collect garbage */
unsigned char *vsmk(int prealloc); Allocate a zero length string, but with enough space to grow to the specified length. int vslen(unsigned char *s); Return string length or 0 if s is NULL. unsigned char *vsensure(unsigned char *s, int len); Make sure there is enough space for a string of the specified length. unsigned char *vstrunc(unsigned char *s, int len); Set length of string: the string is truncated or extended with spaces to the specified length. unsigned char *vsncpy(unsigned char *s,int offset,unsigned char *block, int block_size); Copy a memory block to a string. The string is extended with spaces if offset > string length. unsigned char *vsdupz(unsigned char *s); Duplicate a z-string to a variable string. unsigned char *vsdup(unsigned char *s); Duplicate a variable string. unsigned char *vscpyz(unsigned char *s,unsigned char *z); Copy a z-string to a variable string. unsigned char *vscpy(unsigned char *s,unsigned char *block,int size); Copy a memory block to a variable string. unsigned char *vscatz(unsigned char *s,unsigned char *z); Append a z-string to a variable string. unsigned char *vscat(unsigned char *s,unsigned char *block,int size); Append a memory block to a variable string. unsigned char *vsadd(unsigned char *s,unsigned char c); Append a single character to a string. unsigned char *vsfmt(unsigned char *s,int offset,const char *format,...); Sprintf() to a variable string. unsigned char *vsgets(unsigned char **sp,FILE *f); fgets() to a variable string (returns NULL on EOF, does not put '\n' in the string).
#define sc(s) (s), (sizeof(s)-1) For string constants. #define sz(s) (s), zlen(s) For C-strings (zlen is the same as strlen). #define sv(s) (s), vslen(s) For variable strings.
unsigned char *vscatz(unsigned char *s, unsigned char *z);
{
return vsncpy(sv(s), sz(z));
}
## Variable arrays of variable strings
unsigned char *vamk(int prealloc); Allocate a zero length array, but with enough space to grow to the specified length. void varm(unsigned char **a); Free an array. int valen(unsigned char *a); Return array length or 0 if a is NULL. unsigned char *vaensure(unsigned char *s, int len); Make sure there is enough space for an array of the specified length. unsigned char *vatrunc(unsigned char *s, int len); Set length of array: the array is truncated or extended with NULLs to the specified length. unsigned char *vaadd(unsigned char **a,unsigned char *s); Append a single string to an array. unsigned char *vasort(unsigned char **a,int len); Sort an array. unsigned char *vawords(unsigned char **a,unsigned char *s,int len, unsigned char *sep, int seplen); Convert a word list in s (words separated by characters given in sep/seplen) into an array of strings. void vaperm(unsigned char **a); Mark an array as preferment (heap instead of stack).
- These return -1 (NO_MORE_DATA) if you try to read end of file or before beginning of file.
- A pointer can point to any character of the file and right after the end of the file.
- For UTF-8 files, character can be between 0 and 0x7FFFFFFF
Publicly readable members of P:
p->byte The byte offset into the buffer
p->line The line number
p->xcol If P is the cursor, this is the column number where the cursor will be displayed on the screen (which does not have to match the column of the character at P).
Some predicates:
pisbof(p); True if pointer is at beginning of buffer
piseof(p); True if pointer is at end of buffer
pisbol(p); True if pointer is at beginning of line
piseol(p); True if pointer is at end of line
pisbow(p); True if pointer is at beginning of a word
piseow(p); True if pointer is at end of a word
More information about character at p:
piscol(p); Get column number of character at p.
Some other ways of moving a P through a B:
pnextl(p); Go to beginning of next line
pprevl(p); Go to end of previous line
pfwrd(p,int n); Move forward n bytes
pbkwd(p,int n); Move backward n bytes
p_goto_bof(p);
p_goto_eof(p);
p_goto_bol(p);
p_goto_eol(p);
pset(p,q); Move p to same position as q.
pline(p,n); Goto to beginning of a specific line.
pgoto(p,n); Goto a specific byte position.
pfind(P,unsigned char *s,int len); Fast Boyer-Moore search forward.
prfind(P,unsigned char *s,int len); Fast Boyer-Moore search backward.
These are very fast- they look at low level data structure and don't go through pgetc(). Boyer-Moore allows you to skip over many characters without reading them, so you can get something like O(n/len).
Some facts:
Local operations are fast: pgetc(), prgetc().
Copy is fast: pset().
pline() and pgoto() are slower, but look for the closest existing P to start from.
The column number is stored in P, but it is only updated if it is easy to do so. If it's hard (like you crossed a line boundary backward) it's marked as invalid. piscol() then has to recalculate it.
Modifying a buffer:
binsc(p,int c); Insert character at p.
bdel(P *from,P *to); Delete character between two Ps.
Note that when you insert or delete, all of the Ps after the insertion/ deletion point are adjusted so that they continue to point to the same character before the insert or delete.
Insert and Delete create undo records.
Insert and Delete set dirty flags on lines which are currently being displayed on the screen, so that when you return to the edit loop, these lines automatically get redrawn.
Internal:
An edit buffer is made up of a doubly-linked list of fixed sized (4 KB) gap buffers. A gap buffer has two parts: a ~16 byte header, which is always in memory, and the actual buffer, which can be paged out to a swap file (a vfile- see vfile.h). A gap buffer consists of three regions: text before the gap, the gap and text after the gap (which always goes all the way to the end of buffer). (hole and ehole in the header indicate the gap position). The size of the gap may be 0 (which is the case when a file is first loaded). Gap buffers are fast for small inserts and deletes when the cursor is at the gap (for delete you just adjust a pointer, for insert you copy the data into gap). When you reposition the cursor, you have to move the gap before any inserts or deletes occur. If you had only one window and a single large gap buffer for the file, you could always keep the gap at the cursor- the right-arrow key copies one character across the gap.
Of course for edits which span gap buffers or which are larger than a gap buffer, you get a big mess of gap buffer splitting and merging plus doubly-linked list splicing.
Still, this buffer method is quite fast: you never have to do large memory moves since the gap buffers are limited in size. To help search for line numbers, the number of newlines '\n's contained in each gap buffer is stored in the header. Reads are fast as long as you have a P at the place you want to read from, which is almost always the case.
It should be possible to quickly load files by mapping them directly into memory (using mmap()) and treating each 4KB page as a gap buffer with 0 size gap. When page is modified, use copy-on-write to move the page into the swap file (change pointer in header). This is not done now. Instead the file is copied when loaded.
## Windowing SystemThere is a tiny object-oriented windowing system built into JOE. This is the class hierarchy:
SCRN A optimizing terminal screen driver (very similar to 'curses'). has a pointer to a CAP, which has the terminal capabilities read from termcap or terminfo.
writes output to screen with calls to the macro ttputc(). (tty.c is the actual interface to the tty device).
cpos() - set cursor position
outatr() - draw a character at a screen position with attributes
eraeol() - erase from some position to the end of the line
SCREEN Contains list of windows on the screen (W *topwin).
Points to window with focus (W *curwin).
Contains pointer to a 'SCRN', the tty driver for the particular terminal type.
W A window on a screen.
Has position and size of window.
Has: void *object- pointer to a structure which inherits window (W should really be a base class for these objects- since C doesn't have this concept, a pointer to the derived class is used instead- the derived class has a pointer to the base class: it's called 'parent').
Currently this is one of:
BW * a text buffer window (screen update code is here.)
QW * query window (single character yes/no questions)
MENU * file selection menu
BW * is inherited by (in the same way that a BW inherits a W):
PW * a single line prompt window (file name prompts)
TW * a text buffer window (main editing window).
WATOM *watom- Gives type of this window. The WATOM structure has pointers to virtual member functions.
KBD *kbd- The keyboard handler for this window. When window has focus, keyboard events are sent to this object. When key sequences are recognized, macros bound to them are invoked.
Some window are operators on others. For example ^K E, load a file into a window prompt operates on a text window. If you hit tab, a file selection menu which operates on the prompt window appears below this. When a window is the target of operator windows is killed, the operators are killed also.
Currently all windows are currently the width of the screen (should be fixed in the future). The windows are in a big circular list (think of a big loop of paper). The screen is small window onto this list. So unlike emacs, you can have windows which exist, but which are not on the screen.
^K N and ^K P move the cursor to next or previous window. If the next window is off the screen it is moved onto the screen, along with any operator windows are target it.
## Macros- add something here.
## Screen update- add something here.
## Syntax HighlighterThere are two parts to the syntax highlighter: the parser (in syntax.c) and the line attribute database (in lattr.c). The parser is simple enough: it is a finite state machine interpreter. You supply a line of text and the starting state, and it returns the starting state of the next line and an array containing the color to be used for each character. Each state has a color. The line is scanned left to right, character by character: the state and the character index a table, which gives an action, which includes the next state.
The action specifies the next state, and also indicates whether some previous N characters (including the current) should be "recolored" with the color of the next state, or whether the color of the current state should be used for the character. The action also indicates whether to move on to the next character, or to stay put and re-evaluate the current character with the new state. This leads to simpler hand-coding of the state machine, but can create an infinite loop. The action also indicates if the character should be added to a string buffer. The string buffer can be used (when specified by an action) as a key for a hash table look-up (typically filled with key-words). If there is a match, the hash-table entry supplies the action instead of the state table.
This parser is fast and simple and powerful enough to lexically analyze more than 40 languages. However a few enhancements were made to improve both the class of languages which can be highlighted and to improve the ease of writing the state machine. To support "here" documents in perl and sh, as well as to simplify matching delimiters (like open with close parenthesis), a string (from the string buffer) can be stored along with the state. This can be matched against the string buffer. So first the terminating word from the here document is stored in the string, and then the state machine looks forward until it finds this string to determine the end of the here document.
The state is only a single integer, not an entire stack. The parser can therefore not handle recursive grammars. Unfortunately, some languages have recursive elements in their lexical structure. One way to deal with this is to support only a very limited recursion depth by tediously repeating a set of states in the hand written state machine N times, where N is your maximum recursion depth. To ease writing this type of thing, and also to allow state machine to be reused in cases where one language is embedded within another (PHP in HTML, for example), a macro system was added to the state machine loader. Recursion is allowed in the macro calls, but is limited to a depth of 5.
As of JOE 3.6, the macro instantiation was replaced with a real stack, so now there is no recursion depth limit.
## Line attribute cacheThe second part of the syntax highlighter is the line attribute cache. In the original implementation, the state of the first line of each window is maintained and the parser is invoked during the screen update algorithm to highlight all of the lines on the screen. This works well when the window remains in place or when it scrolls forward through the file. However, if the window scrolls backward you may have to reparse the entire file or from the nearest previous window if there happens to be one. This can be slow for large files, and certainly wastes a lot of CPU time when scrolling backward through a file.
One attempt to solve this was to instead of starting at the beginning of the file, to instead assume that the line 200 lines (a configurable option) back was at the initial state, parse forward from there and hope for the best. Sometimes the state machine would find itself back in the proper state, but not always- to the point where most syntaxes had specified to just always reparse from the beginning of the file.
Finally, the line attribute cache was added. It stores the parser state at the beginning of each line, from the beginning of the file to last line actually displayed on the screen. This solves the scrolling problem, but introduces a new one: how to keep the line attribute cache up to date. This turns out to not be so difficult. First of all, the states are stored in a gap-buffer, which makes it easy to perform insertions and deletions. Whenever lines are inserted or deleted in the file, the line attribute cache is expanded or contracted to keep in sync. Now, when any change is made to the file, we reparse from just before the start of the change until the state of the beginning of a line again matches the corresponding entry in the cache. If a match is found, the rest of the cache is assumed to be correct. Some changes will require a large amount of reparsing, but many will not. In any case, the the highlighting on the screen is always correct.
## Filesb.c | Text buffer management |
blocks.c | Library: fast memcpy() functions (necessary on really old versions of UNIX). |
bw.c | A class: text buffer window (screen update code is here) |
charmap.c | UNICODE to 8-bit conversion functions |
cmd.c | Table of user edit functions |
coroutine.c | Coroutine library |
dir.c | Directory reading functions (for old UNIXs). |
hash.c | Library: automatically expanding hash table functions. |
help.c | Implement the on-line help window |
i18n.c | Unicode character type information database |
kbd.c | Keymap data structure (keysequence to macro bindings). |
lattr.c | Line attribute cache |
macro.c | Keyboard and joerc file macros |
main.c | Has main() and top level edit loop |
menu.c | A class: menu windows |
obj.c | Semi-automatic objects, plus dynamic string and array library |
path.c | Library: file name and path manipulation functions |
poshist.c | Cursor position history |
pw.c | A class: prompt windows |
queue.c | Library: doubly linked lists |
qw.c | A class: single-key query windows |
rc.c | Joerc file loader |
regex.c | Regular expressions |
scrn.c | Terminal update functions (like curses) |
selinux.c | Secure linux (for RedHat) functions |
syntax.c | Syntax highlighter |
tab.c | Tab completion for file selection prompt |
termcap.c | Load terminal capabilities from /etc/termcap file or terminfo database |
tw.c | A class: main text editing window |
ublock.c | User edit functions: block moves |
uedit.c | User edit functions: basic edit functions |
uerror.c | User edit functions: parse compiler error messages and goto next error, previous error |
ufile.c | User edit functions: load and save file |
uformat.c | User edit functions: paragraph formatting, centering |
uisrch.c | User edit functions: incremental search |
umath.c | User edit functions: calculator |
undo.c | Undo system |
usearch.c | User edit functions: search & replace |
ushell.c | User edit functions: subshell |
utag.c | User edit functions: tags file search |
utf8.c | UTF-8 to unicode coding functions |
utils.c | Misc. utilities |
vfile.c | Library: virtual memory functions (allows you to edit files larger than memory) |
w.c | A class: base class for all windows |