Artificial Intelligence + Internet Archive, or AI and IA: New Technology for Old Stuff
I am excited to announce the recent integration of AI-driven processes in Special Collections’ digital archiving work alongside the expertise of our very human archivists, librarians, and student workers.
Using OpenAI’s Whisper library for audio transcription and the ChatGPT API to transcribe handwritten documents and generate summaries, I have made major improvements to our digital collections hosted by the Internet Archive.
At the same time, I’ve been making use of Microsoft’s Copilot to write a suite of Python scripts and other custom tools to process archival backlogs at a faster rate than ever.
Some examples of what I’ve been able to accomplish:
- Added over 4,500 new items to the Internet Archive over the past two months.
- Generated selective summaries for 2,997 issues of the Campus and 199 issues of Middlebury Magazine.
- Created 1,679 new transcriptions and summaries for handwritten letters and other documents.
- Transcribed and summarized 423 historic WRMC radio recordings and 209 lectures from the Digital Lecture Archive.
- Compiled transcripts and summaries of 189 lectures from the Bread Loaf Writers’ Conference archives, with hundreds more in progress.
To put these projects in perspective, it takes a human transcriber an average of four hours to transcribe one hour of audio. If each of the 423 WRMC recordings is an hour long, that’s about 1,700 hours of labor. Our students work about 10 hours per week, so a project like this would take one student roughly 170 weeks, or more than three years of uninterrupted work! In comparison, using the Whisper library to transcribe an hour of audio takes roughly ten minutes (along with the very real electricity and environmental costs).
These new, carefully labeled AI summaries appearing across our digital collections will make searching easier for humans and machines alike, amplifying voices from the past. I’m hopeful that these innovations will empower researchers across our campus and the globe to more easily discover Middlebury’s rare and unique Special Collections.
Patrick Wallace is the Digital Projects & Archives Librarian and oversees the digital side of Special Collections and the College Archives. Patrick’s life outside of work is mostly dedicated to film photography, video art, electronic music production, bicycle repair, performance driving, and cats.