Digging into Data Sets at the Industry Documents Library

Here at the UCSF Industry Documents Library (also known as IDL), we’re always looking for ways to make our collections more accessible and searchable for anyone who wants to use them. With nearly 15 million documents collected from the tobacco, drug, chemical, food, and fossil fuel industries, IDL supports a huge range of research – whether you’re a post-doc examining e-cigarette marketing methods, a faculty member researching the effects of environmental toxins on reproductive health, or an investigative journalist looking into the influence of industry lobbyists on public health policy.

All of our documents are fully indexed with detailed metadata and retrievable through full-text search and structured queries, but identifying specific topics and sources within our seven terabytes of data can still be an enormous, time-intensive challenge.

To help our users navigate this ever-growing archive, the Industry Documents Library website offers several search options: you can enter keywords, construct Boolean search queries, and sort and filter results by relevance, date, and document type. These search strategies often still return tens of thousands of potentially relevant hits, so we’re also exploring other tools to assist users as they dig through the results.

Developments in computational analysis and natural language processing inspire us to think about how programmatic tools can enhance industry documents research. Public health and data science projects at Virginia Tech, the Wellcome Library, and here at our own UCSF Archives & Special Collections demonstrate the potential of big data to enable new investigations, such as mapping networks of people and organizations, examining the frequency and relationships of specific terms and phrases, and visualizing data in new and creative ways.

Stephan Risi and Robert Proctor at Stanford University have done fascinating analysis of the Truth Tobacco Industry Documents (TTID) collection with their Tobacco Analytics project. They’ve produced several case studies using large data sets, investigating topics such as when the term “addiction” was first widely applied to smoking and nicotine, how euphemisms have been used in marketing tobacco products to teenagers, and how the rhetorical strategies of plaintiffs and defendants differ in tobacco-related litigation.

The Industry Documents Library provides access to all our collections via the search interface on our public website, and also makes the data fully available through our API. Users can send queries directly to our server through the API and export documents in a number of different formats to their own site or system. We’ve also recently created data sets for each collection, so users can easily download metadata and OCR text as a CSV file to analyze with their own programming tools.
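
As one illustration of what analyzing a downloaded data set might look like, here is a minimal Python sketch that counts how often a term appears in OCR text, grouped by year. It is only a sketch under stated assumptions: the column names (id, document_date, ocr_text), the date format, and the sample rows are hypothetical stand-ins, not the actual layout of our CSV files, so check the header row of the file you download and adjust accordingly.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample standing in for a downloaded data set file.
# The column names and the YYYY-MM-DD date format are assumptions
# made for illustration; a real file may differ.
SAMPLE_CSV = """id,document_date,ocr_text
doc1,1964-03-02,Nicotine addiction is discussed in this memo.
doc2,1964-07-15,Sales figures for the quarter.
doc3,1988-05-16,The report describes addiction and nicotine addiction.
"""


def count_term_by_year(csv_file, term,
                       text_column="ocr_text",
                       date_column="document_date"):
    """Count whole-word, case-insensitive occurrences of `term` per year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Take the first four characters of the date as the year.
        year = (row.get(date_column) or "")[:4]
        if not year.isdigit():
            continue  # skip rows with a missing or unparseable date
        hits = len(pattern.findall(row.get(text_column) or ""))
        if hits:
            counts[year] += hits
    # Return a plain dict ordered by year for easy reading.
    return dict(sorted(counts.items()))


counts = count_term_by_year(io.StringIO(SAMPLE_CSV), "addiction")
print(counts)
```

To run this against a real download, you could replace the io.StringIO sample with a file opened via open(path, newline="", encoding="utf-8"). OCR text fields can be very long, so you may also need to raise Python's CSV field size limit with csv.field_size_limit before reading.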

With so many documents to explore, what will you dig into in the Industry Documents Library data sets?

This article from the UCSF Industry Documents Library, part of the UCSF Archives & Special Collections, wraps up our October Archives month programming. Sign up for Archives Digest for the latest news and events from the UCSF Archives.
