chomski
Paradigm | scripting language |
---|---|
Designed by | mj bishop |
First appeared | 2007 |
Typing discipline | none; all data is treated as a string |
OS | Cross-platform |
Website | bumble bumble |
Major implementations | bumble |
Influenced by | Sed, Awk |
The chomski virtual machine (named after the noted linguist Noam Chomsky) and pp (the pattern parser) refer to both a command-line computer language and the utility that interprets it, which can be used to parse and transform text patterns. The utility reads its input sequentially, one character at a time, applies the operations specified on the command line or in a pp script, and writes the result to output. Development began in 2006 as a Unix and Windows utility, and it is available today for Windows and Linux systems. pp derives a number of ideas and syntax elements from sed, a command-line text stream editor.
Features
The chomski language uses many ideas taken from sed, the Unix stream editor. For example, sed includes two virtual variables or data buffers, known as the "pattern space" and the "hold space". These two variables constitute an extremely simple virtual machine. In the chomski language, this virtual machine has been augmented with several new buffers or registers, along with a number of commands to manipulate them.
The chomski virtual machine includes a tape data structure and a stack, along with a "workspace" (the equivalent of the sed "pattern space") and a number of other buffers of lesser importance. The virtual machine is designed specifically for parsing formal languages. Parsing traditionally involves two phases: the lexical analysis phase and the formal grammar phase. During the lexical analysis phase, a series of tokens is generated. These tokens are then used as the input for a set of formal grammar rules. The chomski virtual machine uses the stack to hold these tokens and the tape to hold their attributes. In a pp script, the two phases, lexing and parsing, are combined in one script file. A series of command words is used to manipulate the different data structures of the virtual machine.
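The division of labour between the stack and the tape can be illustrated with a small model. The following Python sketch is not the chomski implementation: the class `Machine`, its methods `push` and `reduce`, and the helper `parse_sum` are hypothetical names chosen for illustration, and the tape is simplified to a list kept in step with the stack.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy model of the buffers described above (all names are hypothetical)."""
    workspace: str = ""                        # counterpart of sed's pattern space
    stack: list = field(default_factory=list)  # holds the parse tokens
    tape: list = field(default_factory=list)   # holds one attribute per token

    def push(self, token):
        """Lexing step: save the workspace text as the token's attribute."""
        self.stack.append(token)
        self.tape.append(self.workspace)
        self.workspace = ""

    def reduce(self, rhs, lhs, build):
        """Grammar step: replace the tokens `rhs` on top of the stack with `lhs`."""
        n = len(rhs)
        if n == 0 or self.stack[-n:] != list(rhs):
            return False
        attribute = build(self.tape[-n:])      # combine the attributes of the matched tokens
        del self.stack[-n:]
        del self.tape[-n:]
        self.stack.append(lhs)
        self.tape.append(attribute)
        return True


def parse_sum(text):
    """Lex single characters into tokens, then apply the rule sum <- word plus word."""
    machine = Machine()
    for ch in text:
        machine.workspace += ch
        machine.push("plus" if ch == "+" else "word")
    machine.reduce(["word", "plus", "word"], "sum",
                   lambda attrs: f"add({attrs[0]}, {attrs[2]})")
    return machine
```

In this model, `parse_sum("a+b")` leaves the single token `sum` on the stack and the text `add(a, b)` in the matching tape cell, which mirrors the idea that tokens drive the grammar rules while the tape carries the text being transformed.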
Purpose and Motivation
The purpose of the pp tool is to parse and transform text patterns. The patterns it handles are those that conform to the rules of a formal language, including many context-free languages. Whereas traditional Unix tools (such as awk, sed and grep) process text one line at a time and use regular expressions to search or transform text, the pp tool processes text one character at a time and can use context-free grammars to transform (or compile) the text. In keeping with the Unix philosophy, however, the pp tool works on plain text streams, encoded according to the locale of the local computer, and produces another plain text stream as output, so it can be used as part of a standard pipeline.
The motivation for the creation of the pp tool and the chomski virtual machine was to allow the writing of parsing scripts, rather than having to resort to traditional parsing tools such as Lex and Yacc.
Usage
The following example shows a typical use of chomski, where the -s option indicates that the chomski expression follows:
cat inputFileName | chomski -s '/(/ { until ")"; print; } clear;' > outputFileName
In the above script, only text enclosed in parentheses is saved in the output file.
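For readers unfamiliar with character-at-a-time processing, the behaviour described for this script can be approximated in Python. This is a rough sketch, not a translation of the chomski semantics: the function name `extract_parenthesised` is hypothetical, and it assumes that the delimiting parentheses are included in the output and that no separator is written between matches.

```python
import io


def extract_parenthesised(stream, out):
    """Copy each '(' ... ')' span from `stream` to `out`, discarding all other text."""
    while True:
        ch = stream.read(1)              # read one character at a time
        if ch == "":
            break                        # end of input
        if ch != "(":
            continue                     # text outside parentheses is dropped
        workspace = ch
        while not workspace.endswith(")"):
            ch = stream.read(1)
            if ch == "":
                break                    # unterminated span: keep what was read
            workspace += ch
        out.write(workspace)


source = io.StringIO("a (b) c (de) f")
result = io.StringIO()
extract_parenthesised(source, result)
```

Because the reader works on single characters rather than lines, a parenthesised span that crosses a line break is handled in the same way as one on a single line.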
Under Unix (and Windows), chomski can be used as a filter in a pipeline:
generate_data | chomski -s '/x/{clear;add "y";}print;clear;'
That is, generate the data, and then make the small change of replacing x with y.
Several commands can be put together in a file called, for example, substitute.chom and then be applied using the -f option to read the commands from the file:
cat inputFileName | chomski -f substitute.chom > outputFileName
Besides substitution, other forms of simple processing are possible. For example, the following uses the plus and count commands to count the number of lines in a file:
cat inputFileName | chomski -s '[-n]{plus;} <>{count;print;}'
This example used some of the following metacharacters and language features:
- The square brackets ([]) indicate the matching of a character class.
- The -n string matches a newline character.
- The <> string matches the end of the input stream (text file).
- The curly braces ({}) follow tests and group multiple statements.
- The semicolon (;) terminates all statements.
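The logic of the line-counting example can be sketched in Python as follows. The function `count_lines` is a hypothetical stand-in, and the comments map each step to the chomski test or command it is assumed to correspond to, based only on the description above.

```python
import io


def count_lines(stream):
    """Count newline characters one at a time and return the total as text."""
    counter = 0
    while True:
        ch = stream.read(1)
        if ch == "":             # end of the input stream, like the <> test
            return str(counter)  # like 'count; print;', the counter becomes output text
        if ch == "\n":           # newline character, like the [-n] test
            counter += 1         # like the 'plus' command


total = count_lines(io.StringIO("one\ntwo\nthree\n"))
```

The result is returned as a string to reflect that chomski treats all data as text.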
Complex chomski constructs are possible, allowing it to serve as a simple, but highly specialised, programming language. Chomski has only one flow control statement (apart from the test structures <>, [], // etc.), namely the check command, which jumps back to the @@ label (no other labels are permitted).
History
The idea for chomski arose from two limitations of regular expression engines: their line-by-line paradigm and their difficulty in parsing nested text patterns. Chomski evolved as a natural progression from the grep and sed commands. Development began in approximately 2006 and continues sporadically.[1]
Limitations
Chomski is not a general-purpose programming language. Like sed, it is designed for a limited range of uses. It does not currently support Unicode strings, since the current implementation uses standard C character arrays. There is currently no debugger for complex chomski scripts.
References
- [1] Developer’s (M.J. Bishop) personal recollection