Digital Humanities

Tutorials

Research Computing Center Computational Scientist Jeffrey Tharsen teaches workshops on various digital humanities topics, including text analysis and data visualization. Below are resources and walkthroughs from these workshops.

Introduction to the Digital Humanities

Digital Humanities is a growing, interdisciplinary endeavor aimed at uniting technology and humanistic inquiry. This tutorial is a basic theoretical and methodological introduction to using and developing digital toolkits, digital texts, and digital media; it covers corpus-building, basic text analysis strategies, digital cartography, web-based development, high-performance computing, and creating custom algorithms and data visualizations.

Level:
Introductory

Handout

Introduction to Natural Language Processing and the Natural Language Toolkit

Natural Language Processing (NLP) has become an indispensable tool for the computational analysis of language, providing mechanisms for determining a wide range of linguistic features evidenced in digital texts. This workshop provides a basic introduction to NLP and the NLTK toolkit, plus hands-on exercises for tokenization and automated parts-of-speech tagging, named entity recognition, coreference resolution and dependency parsing.
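
A minimal sketch of the first two hands-on steps, tokenization and parts-of-speech tagging with NLTK (this assumes the punkt and averaged_perceptron_tagger data packages have been downloaded):

    import nltk

    # One-time downloads of the tokenizer and tagger models
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "Natural Language Processing helps humanists analyze digital texts."

    # Split the text into word tokens, then tag each token with a part of speech
    tokens = nltk.word_tokenize(text)
    print(nltk.pos_tag(tokens))  # e.g. [('Natural', 'JJ'), ('Language', 'NN'), ...]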

Level:
Intermediate

Handout

Advanced Natural Language Processing and the Natural Language Toolkit

This workshop introduces advanced Natural Language Processing strategies and how to deploy the NLTK toolkit within Python frameworks. We will review both standard NLP tools (automated tokenization and POS tagging, named entity recognition, coreference resolution and dependency parsing) and also provide hands-on training in writing custom algorithms in Python and developing specific training sets for use with the NLTK toolkit and other frameworks (including machine learning with scikit-learn and advanced discourse and sentiment analysis).
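
As a hedged illustration of the machine-learning portion, a toy text classifier built from a hand-labeled training set with scikit-learn (the sentences and labels below are invented placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # A toy training set: each document paired with a hand-assigned label
    train_docs = ["I loved this novel", "A dull and tedious chapter",
                  "Brilliant, moving prose", "The plot drags terribly"]
    train_labels = ["positive", "negative", "positive", "negative"]

    # Vectorize the texts with TF-IDF and fit a Naive Bayes classifier
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_docs, train_labels)

    print(model.predict(["What a wonderful story"]))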

Level:
Advanced

Handout

Text Analysis for Non-Western Scripts

While there has been much progress in developing digital humanities methods for English, and to a certain extent for other Western languages like French, Spanish and German, the ability to use similar techniques for languages written in “non-Western scripts” (such as Hindi, Chinese, Japanese, Persian, Hebrew or Arabic) has only recently become a reality. This workshop is designed to help users overcome the various initial hurdles one typically encounters when working with sources written in non-Western scripts and to introduce specific methodologies for optical character recognition (OCR), textual analysis and natural language processing (NLP) in non-Western languages.
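
One of those hurdles is that many non-Western scripts do not mark word boundaries with spaces; a sketch of one possible first step for Chinese, assuming the third-party jieba segmentation package alongside Python's built-in Unicode normalization:

    import unicodedata
    import jieba  # third-party Chinese word-segmentation library

    text = "數位人文學是一個跨學科的研究領域"

    # Normalize so that visually identical characters share one encoding
    text = unicodedata.normalize("NFC", text)

    # Segment the character stream into words before any further analysis
    print(list(jieba.cut(text)))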

Level:
Intermediate

Handout

Text Analysis and Visualization Strategies for Digital Humanists

This workshop provides a general introduction to the variety of modern toolkits and platforms most effective for creating different types of visualizations from textual source materials. It will primarily review toolkits and platforms that require no knowledge of code-writing, but will also touch on the range of platforms available to users who have some basic training in computer programming. From close analysis of a single text up to large-scale stylometry and techniques for visualizing massive corpora of millions of texts, and from simple text mining to literary analysis, linguistics and even data-driven digital cartography, the goal of this workshop is to introduce the wide range of methods and platforms available so that users can make informed decisions when choosing visual strategies to complement individual research projects and pedagogical goals.
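
For users who do want to write a little code, the simplest end of that range might look like a word-frequency chart; a minimal sketch using Python's standard library and matplotlib (the sample text is the opening of the Canterbury Tales):

    import collections
    import matplotlib.pyplot as plt

    text = ("whan that aprill with his shoures soote "
            "the droghte of march hath perced to the roote")

    # Count how often each word occurs in the text
    counts = collections.Counter(text.split())
    words, freqs = zip(*counts.most_common(10))

    # Draw a simple bar chart of the ten most frequent words
    plt.bar(words, freqs)
    plt.xticks(rotation=45)
    plt.title("Most frequent words")
    plt.tight_layout()
    plt.show()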

Level:
Introductory

Advanced Methods for Text Visualization

This workshop provides a deep dive into cutting-edge modern toolkits and platforms (e.g. BLAST, Stylo, PhiloLogic, scikit-learn, PCA, Lexos and D3) for creating different types of visualizations from textual source materials. It will use both toolkits and platforms that require no knowledge of code-writing and others appropriate for users who have some training in Python. From the morphology of a simple phrase to large-scale stylometry and techniques for visualizing intertextual relationships in corpora of tens of thousands of texts spanning centuries or millennia, from collocations to generative approaches to phonology and rhetoric, this workshop introduces a range of new methods that researchers will be able to employ in individual research projects at any scale and for any source language or script.
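
By way of illustration, a small stylometry-style sketch combining two of the toolkits named above, scikit-learn and PCA, to project a handful of documents into two dimensions (the documents here are placeholder snippets; real analyses would use full texts):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    docs = ["call me ishmael some years ago",
            "it was the best of times it was the worst of times",
            "whan that aprill with his shoures soote",
            "april is the cruellest month breeding lilacs"]

    # Turn each document into a TF-IDF vector, then reduce to two dimensions
    vectors = TfidfVectorizer().fit_transform(docs).toarray()
    points = PCA(n_components=2).fit_transform(vectors)

    plt.scatter(points[:, 0], points[:, 1])
    for i, (x, y) in enumerate(points):
        plt.annotate(f"doc {i}", (x, y))
    plt.title("Documents projected with PCA")
    plt.show()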

Level:
Intermediate

Handout

Designing Interactive Data Visualizations (D3/Javascript)

This workshop provides a theoretical and practical introduction to data visualization strategies with a focus on the types of interactive visualizations available within the Data-Driven Documents (D3) web-based platform for JavaScript (see github.com/mbostock/d3/wiki/Gallery for a range of examples). It discusses general best practices for the graphic presentation of data, then evaluates the capabilities of the D3 toolkit, reviewing the wide variety of data visualizations it makes possible, from simple graphs and charts to three-dimensional projections, network maps and data-driven digital cartography.

Level:
Advanced

Handout

Advanced Interactive Data Visualizations (D3/Javascript)

This workshop provides a theoretical and practical introduction to advanced data visualization strategies with a focus on the types of interactive visualizations available within the Data-Driven Documents (D3) web-based platform for JavaScript (see github.com/mbostock/d3/wiki/Gallery for a range of examples). It discusses general best practices for the graphic presentation of data, then walks through hands-on exercises for designing interactive maps, chord diagrams, force-directed interactive network visualizations and visualizations of molecular and atomic structures.

Level:
Advanced

Handout

Advanced Interactive Visualizations for Data Analysis (Bokeh + TensorSpace)

Modern computational systems for data analysis now allow developers to create a wide range of interactive visualizations to expose and explore their datasets. One of the most common and powerful of these at present for Python and R is Bokeh, which supports a wide variety of interactive data visualizations that can be built and deployed relatively easily (see https://docs.bokeh.org/en/latest/docs/gallery.html for a few examples). The first part of this workshop explores the range of interactive visualizations available in Bokeh, discusses best practices for choosing the most effective and intuitive methods for visualizing specific types of data, and walks through the steps needed to deploy these visualizations on the web so that others can use them. The second part of the workshop turns to TensorSpace and methods for visualizing the data produced by neural networks built with frameworks like TensorFlow, focusing on the ways the data is rendered and transformed within each layer as the network works toward generating its end result. With TensorSpace we can visually interrogate the data the neural network uses and generates at each step of the process, providing a full picture of how the network produced its final models and results.
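
A minimal sketch of the kind of standalone, shareable Bokeh plot the first part of the workshop builds toward (the data points are invented; output_file writes an HTML page that can then be posted on the web):

    from bokeh.plotting import figure, output_file, show

    # Invented sample data
    years = [2015, 2016, 2017, 2018, 2019]
    counts = [12, 35, 29, 48, 61]

    # Write the interactive plot to a standalone HTML file
    output_file("counts.html")

    p = figure(title="Documents per year", x_axis_label="year",
               y_axis_label="documents", tools="pan,wheel_zoom,hover,reset")
    p.line(years, counts, line_width=2)
    p.scatter(years, counts, size=8)

    show(p)  # opens the HTML page in a browser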

Level:
Advanced

Handout

Deep Learning Frameworks for Natural Language Processing: BERT and GPT-2

This workshop provides a basic introduction to two current cutting-edge deep learning frameworks for generating new text from a custom source corpus. It will teach 1) how to set up and use pre-existing deep learning models on the university’s high-performance computing (HPC) cluster; 2) how to train BERT and GPT-2 models on any corpus of documents, optimizing parameters to get the best results; 3) how to leverage these models to create predictive text in any genre (including poetry); and 4) methods and strategies for evaluating accuracy and variation in the output.
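
As a hedged illustration of step 3, generating predictive text from a pre-trained GPT-2 model via the Hugging Face transformers library (one common route; the workshop's own setup on the HPC cluster may differ):

    from transformers import pipeline

    # Load a pre-trained GPT-2 model for text generation
    generator = pipeline("text-generation", model="gpt2")

    # Continue a prompt; sampling parameters control how varied the output is
    result = generator("Shall I compare thee to", max_length=40,
                       num_return_sequences=1, do_sample=True)
    print(result[0]["generated_text"])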

Level:
Advanced

Handout

Natural Language Processing with GPT-3 by OpenAI

This workshop introduces participants to the wide range of functions made possible by OpenAI’s GPT-3, widely regarded as one of the most capable large language models for NLP (natural language processing) and NLU (natural language understanding) to have emerged in recent years. After a brief discussion of GPT-3’s architecture (a large-scale, generative, transformer-based neural network), the workshop explores some of its current applications, including chatbots and Q&A algorithms, multilingual translation and data conversion, classification, summarization and text generation, as well as how to fine-tune these types of models to produce optimal results for specific NLP/NLU tasks.
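
A minimal sketch of calling GPT-3 for summarization through the openai Python package's legacy (pre-1.0) interface; this assumes an API key is set in the OPENAI_API_KEY environment variable, and model names change over time:

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Ask a GPT-3 completion model to summarize a short passage
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Summarize in one sentence: The Research Computing Center "
               "offers workshops on text analysis and visualization.",
        max_tokens=60,
    )
    print(response["choices"][0]["text"].strip())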

Level:
Intermediate

Handout

Digital Textual Analysis for Large-Scale Repositories (HathiTrust and others)

This workshop provides an overview of ways to access the holdings of various large-scale digital textual repositories; attendees will learn how to perform basic textual analyses of corpora, using both ready-made analytical methods and custom NLP algorithms. With over 16.7 million volumes, the HathiTrust Digital Library (HTDL) is now the largest single archive of digitized reference and literary works in the world, but due to copyright restrictions, accessing and analyzing these holdings is highly constrained. The workshop will guide users through the steps needed to access the HTDL and perform analyses using its custom Python API.
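
One hedged route to such analyses uses the HTRC Feature Reader package (htrc-feature-reader), which works from HathiTrust's freely downloadable Extracted Features files rather than the restricted page images; the file path below is a placeholder:

    from htrc_features import Volume

    # Load one volume from a downloaded Extracted Features file (placeholder path)
    vol = Volume("data/sample-volume.json.bz2")
    print(vol.title)

    # tokenlist() returns a pandas DataFrame of token counts for the volume
    tokens = vol.tokenlist(pages=False, pos=False, case=False)
    print(tokens.sort_values("count", ascending=False).head(10))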

Level:
Introductory

Handout

Introduction to the RCC for Scholars in the Social Sciences and the Humanities

The Research Computing Center (RCC) offers high-end computational resources that make possible new forms of humanistic and social science research. As the diverse fields aligned with digital media and digital data sources continue to grow, researchers in the social sciences and the humanities now regularly employ algorithms, tools and computing platforms to rapidly and systematically process and analyze textual corpora of all types and sizes, to analyze social and linguistic networks, to explore digital imagery, to create complex and interactive data visualizations, and to extract data from a wide variety of archival sources. This introductory workshop is designed to introduce you to basic computing mechanisms and toolkits and to working with large datasets on university computational resources.

Level:
Introductory

Handout

Top: Digital reconstruction of “Raided Village” mural, Temple of the Warriors, Chichen Itza, by Magdalena Glotzer, AB ’19 (2019), based on a 1931 reconstruction by Ann Axtell Morris and diagrams from Morris, Earl Halstead. The Temple of the Warriors at Chichen Itzá, Yucatan. Washington: Carnegie Institution of Washington, 1931.