Terms

Mappings

To tell tersen how you want to abbreviate things, you create a tersen dictionary which contains one or more mappings. A mapping describes a correspondence between a source (a long word or phrase that may occur in the input text) and a destination (the abbreviated version you want to replace that long word or phrase with).

Tokens

Tokens are the units of input text that tersen attempts to match against your dictionary mappings as it decides what to replace. Outer tokens are strings of characters separated by whitespace. Each outer token has an associated inner token, which is the part of the outer token beginning at the first inner-token character and ending just before the next non-inner-token character. Inner-token characters are the alphanumeric characters in your system locale, plus -, ', and ’. The portions of an outer token that are not part of the inner token are the initial and final portions or, together, the token perimeter.

Only the inner token is used when searching for possible replacements, but the token perimeter is preserved in the output (e.g., with a mapping from Internet to I.N., if Internet, was found in the input, I.N., would appear in the output).

Note

The definition above allows the final portion of the token to contain alphanumeric characters. This might happen, for instance, if your source contains a phrase like Wolfram|Alpha – Wolfram would be matched, while |Alpha would be ignored and passed through verbatim.
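The splitting rule described above can be sketched in a few lines of Python. This is only an illustration, not tersen's actual code: the function names are hypothetical, and str.isalnum() is Unicode-aware, which only approximates "alphanumeric in your system locale".

```python
# Assumption: hyphen, ASCII apostrophe, and the right single quotation
# mark are the only non-alphanumeric inner-token characters.
EXTRA_INNER_CHARS = "-'\u2019"


def is_inner_char(ch):
    """Return True if ch may appear inside an inner token."""
    return ch.isalnum() or ch in EXTRA_INNER_CHARS


def split_outer_token(outer):
    """Split an outer token into (initial, inner, final) portions."""
    # Skip ahead to the first inner-token character.
    start = 0
    while start < len(outer) and not is_inner_char(outer[start]):
        start += 1
    # The inner token runs up to the next non-inner-token character.
    end = start
    while end < len(outer) and is_inner_char(outer[end]):
        end += 1
    return outer[:start], outer[start:end], outer[end:]
```

Under these assumptions, split_outer_token("Internet,") gives an empty initial portion, the inner token Internet, and the final portion ",", while Wolfram|Alpha splits into Wolfram and the final portion |Alpha.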

Sources

A source may consist of one token or several tokens. When tersen encounters an input token that could begin multiple mappings, it looks ahead and picks the longest possible match. For instance, if your dictionary contains sources for both Internet and Internet Protocol, the phrase the Internet is becomes the I.N. is, while Internet Protocol becomes IP (never I.N. Protocol). The “longest possible match” is the one that consumes the most tokens from the input right now; it is not necessarily the way of dividing tokens that produces the shortest output. (tersen is greedy; it will never backtrack over a replacement it has already made, even if another division of tokens could produce a shorter output. While that gives up a small amount of possible compression, it keeps tersen fast and simple and makes replacement behavior more predictable.)
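The greedy strategy can be illustrated with a minimal sketch. It assumes the input has already been reduced to a list of inner tokens and that the dictionary is a plain dict keyed by tuples of source tokens; both the function name and that data layout are hypothetical, and tersen's real implementation may differ (for instance in how it handles case or the token perimeter).

```python
def replace_greedy(tokens, mappings):
    """Replace sources with destinations, longest match first.

    tokens   -- list of inner tokens from the input
    mappings -- dict mapping a tuple of source tokens to a destination
    """
    max_len = max((len(source) for source in mappings), default=1)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting here, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in mappings:
                output.append(mappings[candidate])
                i += n  # consumed tokens are never revisited
                break
        else:
            # No mapping begins at this token; pass it through.
            output.append(tokens[i])
            i += 1
    return output
```

With mappings for ("Internet",) and ("Internet", "Protocol"), the tokens of "Internet Protocol" collapse to IP rather than I.N. Protocol. The lack of backtracking also shows up in this sketch: given sources a b and b c d, the input a b c d is replaced at a b first, so b c d never gets a chance to match even though it would leave fewer tokens.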

A source may consist only of inner tokens; that is, it cannot contain punctuation other than hyphens or apostrophes. So you cannot have a source of, say, #&$%, or St. Paul. If such a source is found, a warning will be printed and that mapping will be ignored. However, this does not preclude matching phrases that have punctuation in the middle: if a multi-word phrase matches the input once punctuation in the input is ignored, the replacement will still be made and any medial punctuation will disappear. For instance, if you have a mapping from St Olaf to STO, and the text St. Olaf is found in the input, it will be replaced with STO. One could imagine a case where this would do the wrong thing, such as 222 Somewhere St., Olaf City, CA, but in general this is unlikely (and it will always be possible to fool tersen in some edge cases, because language is complicated!).
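Both rules in this paragraph, rejecting sources that contain punctuation and matching input by inner tokens only, can be sketched as follows. The helper names are hypothetical, and the character test is the same approximation of "alphanumeric in your locale" used for tokens; this is not tersen's own code.

```python
EXTRA_INNER_CHARS = "-'\u2019"


def is_inner_char(ch):
    """Return True if ch may appear inside an inner token."""
    return ch.isalnum() or ch in EXTRA_INNER_CHARS


def is_valid_source(source):
    """A source is valid if every word contains only inner-token characters."""
    words = source.split()
    return bool(words) and all(
        is_inner_char(ch) for word in words for ch in word
    )


def inner_token(outer):
    """Extract the inner token that matching would compare against."""
    chars = []
    for ch in outer:
        if is_inner_char(ch):
            chars.append(ch)
        elif chars:
            break  # first non-inner character after the inner token began
    return "".join(chars)
```

Here is_valid_source("St Olaf") is true while is_valid_source("St. Paul") is false. Because the input St. Olaf reduces to the inner tokens St and Olaf, it matches the source St Olaf; the same reduction is why Somewhere St., Olaf City would also match.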

Only one mapping may be present in the dictionary for each source. If the same source is mapped more than once, the mapping that comes physically first in the dictionary wins. For mappings beyond the first, tersen will print a warning, unless the duplicate is programmatically generated by an annotation or the - flag is used on the entry (see later for more on annotations and flags). This behavior can be customized via the mapping_conflicts hook, so you can have later mappings overwrite earlier ones, for example.
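The default first-one-wins behavior amounts to something like the sketch below. It is a simplification under stated assumptions: entries are modeled as (source, destination) pairs in dictionary order, the function name is hypothetical, and the exemptions for annotations and flags as well as the mapping_conflicts hook are not modeled.

```python
def build_mappings(entries):
    """Build a source -> destination dict in which the first entry wins.

    Returns the mappings plus a list of warning messages, one for each
    later entry that was skipped because its source was already mapped.
    """
    mappings = {}
    conflict_warnings = []
    for source, destination in entries:
        if source in mappings:
            conflict_warnings.append(
                f"duplicate source {source!r}: keeping {mappings[source]!r}, "
                f"ignoring {destination!r}"
            )
            continue
        mappings[source] = destination
    return mappings, conflict_warnings
```

A hook that lets later mappings overwrite earlier ones would, in terms of this sketch, replace the continue branch with an assignment.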

Destinations

A destination may be any string containing any characters, including whitespace and punctuation, with the exception of the newline and the at-sign (@), which both indicate the end of the destination string (the at-sign additionally applies an annotation). Leading and trailing whitespace in the destination string is ignored.

If a destination is longer (consists of more characters) than its corresponding source, tersen will print a warning but still use the mapping. This behavior can be customized via the mapping_verbosens_text hook.
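The two destination rules, where the string ends and when it counts as longer than its source, can be sketched as follows. The function names are hypothetical, the text after the at-sign is treated as an opaque string because annotation syntax is not covered here, and the length check compares raw character counts as an assumption about what "longer" means.

```python
def parse_destination(raw):
    """Return (destination, annotation_text_or_None) from raw entry text.

    The destination ends at the first newline or at-sign, and its
    leading and trailing whitespace is dropped.
    """
    line = raw.split("\n", 1)[0]
    destination, at_sign, rest = line.partition("@")
    annotation = rest.strip() if at_sign else None
    return destination.strip(), annotation


def is_longer_than_source(source, destination):
    """True if the destination has more characters than the source."""
    return len(destination) > len(source)
```

For example, parse_destination("  I.N.  ") yields the destination I.N. with no annotation, and is_longer_than_source("IP", "Internet Protocol") is true, which is the situation in which tersen would print its warning.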