Technology behind 9CHRIS
9CHRIS uses a series of custom programs to find and download individual volumes from the Internet Archive (IA), process them to find the title pages of documents, further process them to generate keywords and create pages for this wiki, and upload the result to this web-based system. A brief outline of each step is below.
Internet Archive Scanning/OCR
The Internet Archive uses an automated scanning machine with two cameras and open-source software to photograph each page of a book, opened on a scanning cradle. The page images are then run through ABBYY FineReader OCR software to convert them into machine-readable text. Several formats are made available, including the entire volume as PDF and as DjVu.
Finding and downloading the volumes
The 9CHRIS software used a heuristic to find and download each volume from archive.org, based on observed variations in IA's naming practices for the volumes.
Most volumes had a name like govuscourtsca9briefs1025. However, some small variations existed. Some volumes had "brief" instead of "briefs" before the number. Most volumes had a 4-digit number, but some in the 900-series had just three. Some volumes had an "x" appended to the number, and others had "a" or "b" in a series as well. In short, the program for downloading had to try several combinations, and keep track of those that didn't work, as it stepped sequentially through the volumes.
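As an illustration only (not the project's actual downloader), the sketch below generates candidate identifiers covering the variations described above and probes archive.org's public metadata endpoint to see which ones exist; using that endpoint, and the volume-number range shown, are my assumptions.

```python
import itertools
import requests

BASES = ["govuscourtsca9briefs", "govuscourtsca9brief"]   # "briefs" vs "brief"
SUFFIXES = ["", "x", "a", "b"]                            # observed suffix variations

def candidate_ids(number):
    """Yield plausible IA identifiers for one volume number."""
    for base, suffix in itertools.product(BASES, SUFFIXES):
        yield f"{base}{number}{suffix}"

def find_volume(number):
    """Return the first identifier that exists on archive.org, or None.
    Probes the public metadata endpoint; an empty JSON object means
    no such item (this probe is an assumption, not the original method)."""
    for ident in candidate_ids(number):
        resp = requests.get(f"https://archive.org/metadata/{ident}", timeout=30)
        if resp.ok and resp.json():
            return ident
    return None

if __name__ == "__main__":
    missing = []
    for num in range(990, 1010):          # illustrative range only
        ident = find_volume(num)
        if ident:
            print("found", ident)
        else:
            missing.append(num)           # keep track of numbers that failed
    print("no match for:", missing)
```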
Analyzing the volumes
The essential idea for finding the documents within volumes was to consider how the volumes were constructed to be used by humans. When binding them, Hastings library staff clearly intended to leave the multi-colored soft covers of each document in place, which would allow a user to flip quickly to a section of the volume and see what was there. Based on this idea, the 9CHRIS software tries to identify those title pages automatically by looking for differently colored pages, but only those with a reasonable amount of text on them. (I couldn't search for the text itself because the OCR was so unreliable, but the quantity of text was still useful.)
After much trial and error, I discovered that a combination of measurements could be used to classify each page within a volume; the mean color value and the number of words (from OCR) proved most helpful. After these were calculated for each page, the resulting log was sorted by mean color value, and most of the title pages grouped together more or less. Figuring out where to cut this off was a challenge; eventually it became clear that a color-value threshold could be set at the point where many sequential pages had a large amount of text, indicating that they were ordinary text pages. Pages before the cut-off were then evaluated and saved if they looked promising. This method produced generally good results, though it struggled when title pages were close in color to text pages, or when a volume contained many darker pages (e.g. photographs or drawings).
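A minimal sketch of that kind of classification is below. It is not the original code: it assumes, for illustration, that colored covers photograph darker on average than white text pages, that OCR word counts are already available, and that thresholds like those shown would need tuning per volume.

```python
from PIL import Image, ImageStat

def page_stats(image_path, word_count):
    """Return (mean grayscale value, OCR word count) for one page image."""
    img = Image.open(image_path).convert("L")
    return ImageStat.Stat(img).mean[0], word_count

def likely_title_pages(pages, text_words=400, min_words=20, run_len=5):
    """pages: list of (page_number, mean_color, word_count) tuples.
    Sort by mean color; assuming colored covers photograph darker than
    white text pages, they cluster at the low end. The cut-off is placed
    where a long run of text-heavy (i.e. ordinary) pages begins."""
    ranked = sorted(pages, key=lambda p: p[1])
    cutoff, run = len(ranked), 0
    for i, (_, _, words) in enumerate(ranked):
        run = run + 1 if words >= text_words else 0
        if run >= run_len:
            cutoff = i - run + 1          # first page of the text-heavy run
            break
    # keep candidate pages before the cut-off that still carry some text
    return [p for p in ranked[:cutoff] if p[2] >= min_words]
```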
Creating the wiki
The wiki platform ikiwiki was chosen as a way of deploying the information, as the title pages needed substantial cleanup before they could be entered into a traditional database. Ikiwiki has several important advantages, including the use of Markdown, a simple markup language; the use of git as the back-end version control system; and the fact that ikiwiki is compiled so that only static pages are served, which is of considerable importance to the speed of a wiki this size.
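As a hypothetical sketch of how a page might be generated and committed for an ikiwiki site like this one (the directory layout, page fields, and commit message here are illustrative assumptions, not the project's actual format):

```python
import subprocess
from pathlib import Path

def write_page(wiki_dir, volume, image_no, keywords):
    """Write a minimal Markdown (.mdwn) page for one title page and
    commit it to the wiki's git repository. Layout and fields are
    illustrative only."""
    path = Path(wiki_dir) / volume / f"image{image_no:04d}.mdwn"
    path.parent.mkdir(parents=True, exist_ok=True)
    body = f"# {volume}, image {image_no}\n\nKeywords: {', '.join(keywords)}\n"
    path.write_text(body, encoding="utf-8")
    subprocess.run(["git", "-C", str(wiki_dir), "add", str(path)], check=True)
    subprocess.run(["git", "-C", str(wiki_dir), "commit", "-m",
                    f"add {volume} image {image_no}"], check=True)
```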
Branchable.com, founded by the author of ikiwiki, was chosen as the host; it had reasonable prices and specialized in hosting ikiwiki sites.
Creating keywords
Using the title pages identified in the previous stage of processing, the text between title pages (i.e. the text of each document) was extracted and analyzed for recurring keywords. These are what is left after the document is broken into individual words, the words are sorted by frequency, and a list of "stop words" (i.e. common words) is removed. A similar process was used to develop key phrases (bi-grams), using the same stop word list. The stop word list was a custom one, developed by running the algorithm over a sample of representative texts and adding meaningless words that turned up, until most results seemed informative. I plan to re-evaluate the stop word list, add more words, and regenerate the keywords in a future phase of the project, so suggestions for additional stop words are welcome.
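A simplified sketch of this kind of keyword and bi-gram extraction is below. It is not the project's actual code: the stop word list shown is a tiny placeholder for the custom list described above, and here stop words are removed before the bi-grams are formed.

```python
import re
from collections import Counter

# tiny placeholder; the project uses a much larger custom stop word list
STOP_WORDS = {"the", "and", "of", "to", "in", "that", "for", "a", "is", "was"}

def keywords_and_bigrams(text, top=20):
    """Return the most frequent non-stop-words and bi-grams in a document."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words).most_common(top), Counter(bigrams).most_common(top)
```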