Canterbury corpus
The Canterbury corpus is a collection of files intended as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected for their ability to provide representative performance results.[1]
Contents
In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents,[2] totaling 2,810,784 bytes as follows.
| Size (bytes) | File name | Description |
|---|---|---|
| 152,089 | alice29.txt | English text |
| 125,179 | asyoulik.txt | Shakespeare |
| 24,603 | cp.html | HTML source |
| 11,150 | fields.c | C source |
| 3,721 | grammar.lsp | LISP source |
| 1,029,744 | kennedy.xls | Excel spreadsheet |
| 426,754 | lcet10.txt | Technical writing |
| 481,861 | plrabn12.txt | Poetry |
| 513,216 | ptt5 | CCITT test set |
| 38,240 | sum | SPARC executable |
| 4,227 | xargs.1 | GNU manual page |
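As a sketch of how such a corpus is typically used, the function below measures the compression ratio (compressed size divided by original size) that a lossless compressor achieves on a file's bytes. It uses Python's standard zlib (DEFLATE) as a stand-in compressor; the corpus file names in the comment come from the table above, but the loop over them is illustrative, assuming the files are present locally.

```python
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size / original size using zlib (DEFLATE).

    A ratio well below 1.0 means the data compressed; a ratio near or
    above 1.0 means the compressor found little redundancy.
    """
    if not data:
        raise ValueError("cannot measure an empty input")
    return len(zlib.compress(data, level)) / len(data)

# Benchmarking the Canterbury corpus would look like this, assuming the
# files listed in the table have been downloaded to the working directory:
#
#   for name in ["alice29.txt", "asyoulik.txt", "cp.html", "fields.c",
#                "grammar.lsp", "kennedy.xls", "lcet10.txt", "plrabn12.txt",
#                "ptt5", "sum", "xargs.1"]:
#       with open(name, "rb") as f:
#           print(f"{name}: {compression_ratio(f.read()):.3f}")

# Self-contained demonstration on synthetic data:
repetitive = b"the quick brown fox jumps over the lazy dog " * 100
print(f"repetitive text ratio: {compression_ratio(repetitive):.3f}")
```

The same harness generalizes to any lossless compressor by swapping in its compress function, which is what makes a fixed, representative file set like this corpus useful: every algorithm is measured against identical inputs.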
References
1. Ian H. Witten; Alistair Moffat; Timothy C. Bell (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92.
2. Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.
This article is issued from Wikipedia, in the version of 30 September 2016. The text is available under the Creative Commons Attribution-ShareAlike license; additional terms may apply for the media files.