Digital Text Analysis

University of Chicago Photographic Archive, apf2-03320, Special Collections Research Center, University of Chicago Library

For overviews of and introductions to the concepts mentioned here, see our guides on how text is digitized and on searching for digitized research materials.

As noted in the guides, most text digitization is done on a large scale because of the labor and resources required; most scholars seeking to digitize texts will collaborate with staff at the Library, Humanities Computing, and/or the Research Computing Center. If you plan to conduct some small-scale text digitization of your own, however, you will in many cases need optical character recognition (OCR) software to make the text useful. If you have scanned your text into a PDF, Adobe Acrobat can run its built-in OCR on the document, generating machine-readable text. Adobe Acrobat is available to many University of Chicago faculty members via their departments.

Acrobat's technology may not suffice for many academic projects, however. Acrobat recognizes only a limited number of contemporary languages; if your text is not in one of these languages, uses nonstandard spelling or typefaces, or contains text in multiple languages, a more robust program will be necessary. ABBYY FineReader is a commercial OCR product that is simple to use and capable of recognizing 190 languages (and, in some cases, handwriting). FineReader must be purchased, although a free trial version is offered.

Once a machine-readable version of the printed text has been created, literary scholars can use a number of different algorithmic methods to illuminate language and spur further study. The HathiTrust digital library has created a companion resource, the HathiTrust Research Center, which allows scholars to build corpora from the collections and perform different types of analysis on the text, including word frequency counts and classification. These tools, however, can be used only on materials in the HathiTrust collections. Voyant is a web-based set of tools that allows researchers to load any text and perform basic analytics and visualizations of word frequencies.

Both of these tools are accessible to researchers with no programming background. Many digital text analysis methods, however, require some knowledge of programming. The most common tools for more sophisticated digital textual analysis, such as the programming languages R and Python, are simple yet powerful, but they require that a scholar invest time and effort in learning programming basics. Many scholars at the University are experts in this field; browse People to find researchers, or contact the digital humanities team for support.
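To give a sense of what such programming involves, here is a minimal sketch of a word frequency count in Python, the kind of basic analysis mentioned above. It uses only the standard library. The function name word_frequencies and the sample sentence are illustrative assumptions, not part of any tool named in this guide, and the tokenization is deliberately naive (it splits on anything that is not a letter, so contractions and hyphenated words are broken apart).

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words in text as (word, count) pairs.

    Illustrative helper: lowercases the text and treats any unbroken run
    of letters (including accented and non-Latin letters) as one word.
    """
    # [^\W\d_]+ matches one or more word characters that are not
    # digits or underscores, i.e. letters only.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(top_n)


# Hypothetical sample text; in practice this would be OCR output read from a file.
sample = "The whale, the whale! Call me Ishmael, and call the whale by name."

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

A real project would usually add steps this sketch omits, such as removing very common words or correcting OCR errors before counting.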
