
Kate Tasker
Kate is the Industry Documents Library Managing Archivist.

Data Science Fellowship at UCSF Industry Documents Library

The UCSF Industry Documents Library is seeking applicants for a summer data science fellowship to assess the impact of transcription accuracy on text analysis of digital archives. The Senior Data Science Fellow will take a leading role in a collaborative project under the supervision of staff from the Industry Documents Library and the UCSF Library Data Science Initiative.

Senior Data Science Fellow

Library overview

The UCSF Industry Documents Library (IDL) is a digital archive of more than 15 million documents created by industries which impact public health. It contains previously internal records from the tobacco, opioid, drug, chemical, food, and fossil fuel industries. Since the IDL was established in 2002, the documents have been used by researchers, journalists, lawyers, policymakers, community advocates, and others in more than 1,000 publications. These publications have supported significant scientific and investigative research that has facilitated efforts to reduce smoking and related diseases, saving millions of lives worldwide.

The IDL also contains thousands of audiovisual materials, including recordings of internal focus groups and corporate meetings, depositions of tobacco industry employees, Congressional hearings, and radio and TV cigarette advertisements.

The UCSF Library Data Science Initiative (DSI) serves as a campus hub for education and support in data science. Its mission is to build computational and data skills in the UCSF community by providing education and resources to trainees, faculty, and staff.

Fellowship overview

This fellowship will support a project to compare human-evaluated transcripts with computer-generated transcripts for text and audiovisual materials in the IDL collections. Through tagging, human transcription, and computer-generated transcription, the team will assess how accuracy differs between media or document types, and whether the difference is more or less pronounced in certain categories.

By measuring transcript accuracy across the different media types in the collections, we will attempt to provide guidelines to researchers and technical staff for proper analysis, measurement, and reporting of transcript accuracy when working with digital media.
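As an illustration of how transcript accuracy can be measured, the sketch below computes word error rate (WER), a common metric that counts the word-level substitutions, insertions, and deletions needed to turn a computer-generated transcript into a human reference transcript. The posting does not state which metric the project will use, so the metric, the function name `word_error_rate`, and the sample sentences are all assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the human transcript, `hypothesis` the machine transcript.
    Both names are illustrative and not taken from the project.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences, used only to show the call.
human = "the cat sat on the mat"
machine = "the cat sat on a mat"
print(f"WER: {word_error_rate(human, machine):.3f}")
```

A WER of 0 means the two transcripts match word for word; the value can exceed 1 when the machine transcript contains many extra words. Real comparisons would usually also normalize punctuation and numerals before scoring.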

Position overview

The Senior Data Science Fellow will:

  • Assist with designing the project, including gathering project requirements and needs
  • Provide guidance to two Junior Data Science Fellows
  • Tag videos with a pre-defined list of categories
  • Review text extracted from video with Google AutoML
  • Run the Uberi SpeechRecognition library on videos in the archive to extract text
  • Run sentiment analysis and/or topic extraction on the text extracted from videos
  • Study the sentiment/topics produced by Google AutoML and Uberi SpeechRecognition in each category of video and gather statistics
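The last step, gathering statistics per video category, could look something like the following minimal sketch. It groups one numeric score per video (for example, a sentiment score or an error rate) by category tag and reports simple summary statistics. The function name `summarize_by_category`, the category labels, and the scores are hypothetical placeholders, not project data or the project's actual method.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_category(records):
    """Group (category, score) pairs and summarize each category.

    `records` is an iterable of (category, score) tuples; the shape is an
    assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for category, score in records:
        grouped[category].append(score)

    return {
        category: {
            "count": len(scores),
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for category, scores in grouped.items()
    }


# Placeholder values for illustration only; not results from the archive.
sample_records = [
    ("advertisement", 0.25),
    ("deposition", 0.5),
    ("advertisement", 0.75),
]

for category, stats in summarize_by_category(sample_records).items():
    print(category, stats)
```

Keeping the scores as plain (category, score) pairs makes it easy to run the same summary separately for each transcription tool and compare the results side by side.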

What you will be learning

  • Natural Language Processing (NLP) tools in the areas of speech to text, sentiment analysis, and topic modeling
  • Design and carry out a case study and present findings
  • Digital archival methods and practices
  • Participate in staff meetings
  • Attend data science workshops and classes (in-person and virtual options available)
  • Receive mentorship and training from data scientists, programmers, and librarians from the Data Science Initiative and Industry Documents Library

Who we are looking for

  • Must be enrolled in a degree/license program at a 2- or 4-year institution, graduate school, vocational school, etc.
  • Interest in digital curation and collection building for libraries and archives
  • Two years or more of programming knowledge/experience preferred
  • Proficiency in one of the following programming languages preferred: Python, R, Java
  • Familiarity with Natural Language Processing (NLP) tools preferred
  • Excellent analytical and writing skills
  • High level of accuracy and attention to detail
  • Ability to work independently

Compensation and work environment

This position is fully remote and includes a total of up to 160 hours (approximately 20 hours/week for 8 weeks). These work hours are flexible and can be arranged to suit student schedules and course requirements. The Fellow should ideally be available to start by June 1, 2022.

Fellows are paid at least the San Francisco minimum wage, currently $16.32 an hour.

How to apply

Please email a cover letter, contact information for two references, and resume to Kate Tasker, Industry Documents Library Managing Archivist, at kate.tasker@ucsf.edu. The position is considered open until filled.

Photo by Mikael Blomkvist from Pexels
