Best Practices for Indexing and Abstracting Tobacco Industry Documents
Revised
June 29, 2001
This guide provides a basic orientation to the standards endorsed by the UCSF Tobacco Control Digital Archive. Note: these are the practices we use in-house to create augmented records. They may not directly reflect all collections presently supported by UCSF, or the records in the Legacy Tobacco Documents Library, because the data sources were created in such varied ways. Please contact us with further questions.
Groups worldwide are collecting and indexing tobacco industry documents for purposes ranging from health policy development and litigation to development of tobacco control policy and analysis of scientific work. Because the needs and methods of these groups are as diverse as the topics themselves, a set of commonly used standards must be employed to ensure consistent document description, which will keep these documents accessible well into the future.
These standards encompass both intellectual and technical components. They are:
Records are composed of fields, each of which contains one item of information. In this case, each record describes one tobacco document. The record structure is the way fields are laid out and defined in a database. Any database can support the record structure defined here. Choose which fields you need for your project (note that 15 of these are required for compliance with UCSF standards) and a layout for efficient data entry. Guidelines for entering data into the fields are given below, in Field Definitions. A database schema is available from UCSF upon request.
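As a rough illustration of the record structure described above, the sketch below models one record as a Python dataclass. It is not the UCSF schema: the class name, attribute names, and the choice of fields are assumptions made for illustration, and the sample values are borrowed from separate examples in this guide rather than from a single real record.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentRecord:
    """One record describes one tobacco document (illustrative subset of fields)."""
    document_id: int            # internal database key, e.g. 4400
    start_bates: str            # Bates number on the first page
    end_bates: str              # Bates number on the last page
    title: str                  # bracketed when supplied by the indexer
    # Multi-valued fields keep the order in which names appear on the document.
    authors: List[str] = field(default_factory=list)
    corporate_authors: List[str] = field(default_factory=list)
    page_count: Optional[int] = None
    thesaurus_terms: List[str] = field(default_factory=list)


# Placeholder values taken from separate examples in this guide.
record = DocumentRecord(
    document_id=4400,
    start_bates="202712056",
    end_bates="202712059",
    title="[Letter from R.E. Thornton to S. Chilcote regarding Project ASSIST]",
)
```

Bates numbers are held as text here so that letter prefixes and leading zeros survive; that is a design choice of this sketch, not a stated requirement.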
The Field Definitions were jointly developed by staff of the UCSF Library/CKM Tobacco Control Archives and Tobacco Documents Online. A full listing, including examples, is available at http://www.library.ucsf.edu/tobacco/fielddef.html (also attached) and at http://tobaccodocuments.org/tcml.php (where XML formatting of these may also be viewed). They incorporate fields currently in use in various tobacco document indexing projects, and those in use by the industry itself.
The field definitions define the intellectual components of the record (title, thesaurus terms, date, and so forth) and provide a template for the record structure of a database to hold the information. The format of the database is up to the creator: Microsoft Access, FileMaker Pro, and MySQL have all been used. Further specifications are available from UCSF.
The choice of fields is also up to the creator, based on project needs and other constraints. Few projects will need to use all 52 fields. Below is a listing of Required, Recommended, and Optional fields.
Items with an asterisk (*) have authority lists associated with them. An authority list contains the accepted terminology for the field. Consistent terminology is essential for data that can be searched efficiently. Authority lists for those fields you plan to use in your project can be requested from UCSF.
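To show how an authority list can be applied during data entry, here is a minimal sketch that separates entered values into those found in the list and those that are not. The function name and the sample language list are hypothetical; real authority lists must be requested from UCSF.

```python
def check_against_authority(values, authority_list):
    """Split values into (accepted, unrecognized) using a case-insensitive match.

    Accepted values are returned in the authority list's own spelling.
    """
    canonical = {term.casefold(): term for term in authority_list}
    accepted, unrecognized = [], []
    for value in values:
        key = value.strip().casefold()
        if key in canonical:
            accepted.append(canonical[key])
        else:
            unrecognized.append(value)
    return accepted, unrecognized


# Hypothetical authority list for the Language field.
languages = ["English", "French", "Spanish"]
accepted, unrecognized = check_against_authority(["english", "Frnech"], languages)
```

Unrecognized values are returned rather than silently dropped so that an indexer can review them.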
Document ID
This field, which serves as the "key" in most databases, numerically identifies document records. It is used internally in each database, but is not preserved when data is integrated into other databases.
Example: 4400
Start Bates
The Bates Number is the number, usually 10 digits long, stamped on each page of a document when it is entered in a trial. Bates numbers are sequential. Sometimes they begin with letters, such as TIOK or TIFL, which "name" the trial (e.g., TIFL=Tobacco Industry, Florida). Start Bates is the number that appears on the first page of the document.
Example: 202712056
Sometimes a document will have more than one Bates number. Use the Alias Bates Start/End field to enter the other start and end Bates numbers.
End Bates
The Bates number that appears on the last page of the document. If the document is only one page long, this number is the same as the Start Bates number, but should still be entered in this field.
Example: 202712059
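Because Bates numbers are sequential, a Start/End pair can be checked for consistency and compared with the Page Count field. The sketch below assumes a Bates number is an optional run of letters followed by digits and that the pages of a document are numbered contiguously; the function names are illustrative, and other numbering formats would need different handling.

```python
import re

# Optional letter prefix (e.g. TIOK, TIFL) followed by the numeric part.
_BATES = re.compile(r"^([A-Za-z]*)(\d+)$")


def split_bates(bates):
    """Return (prefix, number) for a Bates number."""
    match = _BATES.match(bates.strip())
    if match is None:
        raise ValueError(f"unrecognized Bates number: {bates!r}")
    return match.group(1).upper(), int(match.group(2))


def bates_page_span(start_bates, end_bates):
    """Number of pages implied by a Start/End Bates pair."""
    start_prefix, start_num = split_bates(start_bates)
    end_prefix, end_num = split_bates(end_bates)
    if start_prefix != end_prefix:
        raise ValueError("Start and End Bates carry different prefixes")
    if end_num < start_num:
        raise ValueError("End Bates precedes Start Bates")
    return end_num - start_num + 1


pages = bates_page_span("202712056", "202712059")  # the examples above: 4 pages
```

A one-page document, where Start Bates and End Bates are identical, yields a span of 1.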
Alias Bates Start/End
These fields are filled in when a document has more than one set of Bates numbers.
Title
The title of the document. If there is no obvious title, create one inside brackets. (The brackets indicate that the title is fabricated.) Attempt to name any project, subject, or other relevant information to assist in contextualizing the information.
Example: [Letter from R.E. Thornton to S. Chilcote regarding Project ASSIST].
Additionally, sometimes it is helpful to add a bracketed subtitle if the given title of the document seems not to give enough information to a researcher, or if a whole series of documents seem to share the same title.
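The bracket convention can be sketched as a small helper. The function name and the placement of a bracketed subtitle after the given title are assumptions; the guide specifies only that indexer-supplied text goes inside brackets.

```python
def format_title(given_title="", supplied_text=""):
    """Combine the document's own title with indexer-supplied text in brackets."""
    given = given_title.strip()
    supplied = supplied_text.strip()
    if given and supplied:
        # Given title plus a bracketed, indexer-supplied subtitle.
        return f"{given} [{supplied}]"
    if given:
        return given
    if supplied:
        # No obvious title: the whole title is supplied, so all of it is bracketed.
        return f"[{supplied}]"
    raise ValueError("a title needs either given or supplied text")


title = format_title(
    supplied_text="Letter from R.E. Thornton to S. Chilcote regarding Project ASSIST"
)
```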
Author
The author of the document. The format for naming should be as formal and complete as possible.
Example: Hayward, Simon P.
There will be many variations on names, as well as misspellings and other errors. For example:
- Merlo, Ellen
- Merlo, E.
- Merlso, Ellen S.
- Merlo, E.S.
- Merlo, Ellen D.
Authority control work - human effort to cohere, correct, and streamline the names list, as well as technology to assist in indexing and identifying names - is under development and will be released to users when feasible.
Until that time, indexer effort is extremely important. Enter the name as accurately as possible and reflect the document as closely as possible. Unlike the Corporate Author field, do not ever assume you know the full or correct name. Keep in mind that many tobacco companies were headed by a series of family members who have little differentiation between their names. IMPORTANT NOTE: The UCSF indexing system does not use punctuation of any kind. Therefore, we enter the name elements into separate fields like so: "Merlo" "Ellen" "S".
A document may also have more than one author. List authors in the order they appear on the document. If a document has no personal author, look for a Corporate Author (see below).
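The punctuation-free name entry described above can be sketched as follows. This assumes names arrive in "Last, First Middle" form and strips only the comma and periods; the function name is hypothetical, and names containing other punctuation (hyphens, apostrophes) would need a local rule.

```python
def split_personal_name(name):
    """Split 'Last, First Middle' into punctuation-free (last, first, middle)."""
    last, _, rest = name.partition(",")
    # Periods become spaces so that 'E.S.' yields two separate initials.
    parts = rest.replace(".", " ").split()
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:])
    return last.strip(), first, middle


elements = split_personal_name("Merlo, Ellen S.")  # ("Merlo", "Ellen", "S")
```

The helper only separates what is printed on the document; it does not expand initials or guess a fuller name.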
Corporate Author
The corporation or organization that authored the document. This differs from the Company field in that the Corporate Author signifies the actual creator of the document, not just who produced it for the courts. For example, Philip Morris may be the producing company of a document authored by the Tobacco Institute. Or, the British-American Tobacco Company may produce a document from the Federal Trade Commission or the New York Times. Sometimes the corporate author will be the only author of a document.
Identify the Corporate Author as completely as it is identified in the document. The format for naming organizations should be as formal and complete as possible.
Example: People's Drug Stores
There will be many variations on names, as well as misspellings and other errors. For example:
- People's Drug Stores
- Peoples
- Peoples Drug, Inc.
- Peoples Drugs
Authority control work - human effort to cohere, correct, and streamline the organizations list, as well as technology to assist in indexing and identifying organizations - is under development and will be released to users when feasible.
Until that time, indexer effort is extremely important. Enter the organization as accurately as possible and reflect the document as closely as possible. However, in order to supply as many access points as possible, if the document refers to a shortened version of an organization and you are positively sure about the longer version, add both.
Example: The document refers to ITL and you know that it means Imperial Tobacco Limited. Add both ITL and Imperial Tobacco Limited.
Do not assume you always know the full or correct name of the corporation. Keep in mind that many organizations' names differ only slightly, e.g. Imasco Limited and Imasco USA.
If a document bears more than one Corporate Author, list these in the order they appear on the document.
Corporate Recipient
The company or organization that is named as the recipient of a document or that is in a CC ("carbon copied") or BCC ("blind carbon copied") field. Sometimes the corporate recipient will be the only recipient of a document.
Identify the Corporate Recipient as completely as it is identified in the document. Follow the same guidelines as for Corporate Author.
If a document bears more than one Corporate Recipient, list these in the order they appear on the document.
Copied (CC)
Individuals named in a CC ("carbon copied") or BCC ("blind carbon copied") field. Follow the same guidelines as with Author.
When organizations are named in a CC ("carbon copied") or BCC ("blind carbon copied") field, put these in Corporate Recipient. Follow the same guidelines as with Corporate Author.
Company
The tobacco company that produced the document for the courts in the litigation process. This will usually be identifiable as part of the litigation process or the process that retrieved the document. If the document was not produced by a company (for example, a newspaper article), do not enter data into this field.
Page Count
The number of pages in a document. Enter a plain count (1, 48, etc.), not Bates numbers.
Format
The format in which you hold the document: print, tiff file, gif file, PDF, other.
Comments
Notes on context or other aspects of a document. (Currently at UCSF, this field is for internal use only.)
Note: Information regarding physical aspects of the document (condition, marginalia, etc.) should be entered in the Characteristics field.
Abstract
A brief description of the document's form and content, approximately 150 words or 5-10 sentences. The abstract both provides an overview of the information and conveys the language, emphasis, and priorities of the writer/industry.
Thoroughly read the document. Take notes on scrap paper as you go, noting subjects covered or referred to, and the tone of the document. Note any phrases or quotations you may want to use. While title, author, and document type are recorded in other fields, it is acceptable to note these at the beginning of an abstract to help frame the information to follow. It is strongly recommended that the abstract be in present tense, and that each sentence begin with an active verb (Reports, States, Suggests, etc.). A list of useful active verbs is attached. This practice helps objectivity and clarity in the writing, and reduces the use of extraneous language.
Keep in mind that your objective is to describe the document, not to report all the information it includes. While it is often challenging to present an unbiased summary, remember that the purpose of an abstract is to provide an objective description of the content. Be careful not to infer meaning from the content. Focus on the points emphasized by the writer, and use what, why, and how to frame the abstract.
Examples:
Memo from Roger Mozingo to Regional Vice Presidents and Regional Directors regarding Fire Codes. Reports efforts in Kansas and Pennsylvania to restrict public smoking by passing or amending municipal fire codes. Defines and clarifies industry policy on the issue: not opposing measures to restrict lighted smoking materials in areas with flammable materials but opposing the restriction of smoking in public places where no imminent danger of flammability exists. Reminds State Activities Division staff to refer all calls on this matter to Public Relation Division to "prevent quotes (and misquotes) from coming back to haunt us."
Memo from William Kloepfer to Kornegay, Mozingo, Chilcote, Henderson, and Milway regarding Coalition of Smoking and Health agenda broken down into federal, state and local initiatives in categories of health, tax, environmental, fire and advertising. Denotes some items being pursued aggressively. Notes examples of other action items within the anti-smoking movement, mentioning aircraft smoking prohibitions and surveys of magazine advertising anti-smoking content.
PM memo on Task Force agenda. Gives agenda for meeting with Covington and Burling to discuss study, sound science coalition, scientific programs research and GEP criteria, and review of legislative initiatives.
Thesaurus Terms
Proper assignment of thesaurus terms requires thorough familiarity with the ANRF-UCSF Thesaurus. Keep in mind an upper limit of about ten thesaurus terms to provide subject access to a document. Remember to consider that researchers may be seeking the document in a way quite different from how you found it. Describe the document, not only your path to it, or the context of your project.
For more information, see the detailed discussion of the ANRF-UCSF Thesaurus at the end of this document.
Named Persons
Names (as complete as represented in the document) of all individuals named in the document. The author, recipient, etc., should be entered in the appropriate fields and need not be re-entered here. Follow the same guidelines as with Author.
Named Organizations
Names (as complete as represented in the document) of all organizations (companies, groups, etc.) named in the document. The company, corporate author, etc., should be entered in the appropriate fields and need not be re-entered here. Follow the same guidelines as with Corporate Author.
Division
Names the division of the company that produced the document. Enter the division as it appears on the document.
As in the Corporate Author field, if the document refers to a shortened version of a division and you are positively sure about the longer version, add both.
Example: The document refers to the division as R&D and you know that it means Research and Development. Add both R&D and Research and Development.
Type
The type of document.
Identify the overall type of document and then add any other terms that are significant and might be helpful to researchers.
Example: The document is a scientific report with many illustrative graphs and a lengthy bibliography. Add: Report-Scientific, Chart/Graph, and Bibliography.
Source
The source from which the document was received.
Example: NAAG data, Physicians for Smokefree Canada.
Brand
Cigarette, cigar, and smokeless tobacco brands are listed here. Name the brand family, not the specific brand.
Judgment call: If a document consists primarily of just a listing of many brands without any pertinent information about each, it is not necessary to enter each brand name. Add just the first 10 brand names. Use the thesaurus term "Brand Names" (and others) to reflect the document. Discuss questions with other indexers.
Characteristics
This field is used to describe the "physical" characteristics of a document.
A note on the characteristic DUPLICATE: The definition for an actual duplicate is very stringent. The documents have to be exactly the same, with the same content, marginalia, stray marks, etc. If there are any minute differences between documents, they are not duplicates.
Language
The language in which the document is written. The default is English. Enter (or pick from authority list) the language in which the majority of the document is written. Sometimes, in the case of a bilingual document, two languages may be chosen.
URL
If the document resides on a website, enter the URL of the document reference here.
Region
The geographical area(s) discussed in the document. Use this field only if the document is primarily about the region.
Example: The document is about marketing in Latin America. Add Latin America. The document discusses legislation issues in Canada. Add Canada.
UCSF uses the ISO region list.
If a state is discussed, add it, and if a city or town is discussed, add this also.
Example: California or Boca Raton, Florida
Judgment call: If a document consists primarily of just a listing of many regions without any pertinent information about each, it is not necessary to enter each region. Use the thesaurus terms "International Trade," "International Level," and/or "Multinational Corporations" (and others) to reflect the document. Discuss questions with other indexers.
Keyword/Identifiers
Non-standard terms used to express concepts in the document. Use this field to list names of bills, court cases, and other elements that are not indexed in any other field. List terms which are not in the thesaurus; this field may also be used to hold terms which are being proposed for inclusion in the thesaurus. Keep in mind an upper limit of about ten keyword terms to provide subject access to a document.
Provenance
The history of ownership of the document.
Example: Philip Morris/NAAG/UCSF.
Litigation Usage
This field indicates what trials this document has been used in; such data is supplied by the industry, and need not be augmented by the indexer except under special circumstances.
Coded Industry Operations
This field names the project or plan code name. For example, Project Kestrel was the reference for a youth marketing campaign.
Attachment
This field may hold the link or reference (Bates numbers) to the document to which this document is attached, if known.
Document Quotes
Enter, in quotation marks, the quote from the document. It is recommended that you also identify the speaker, the page number, and any other information needed to frame the quote in context.
Area
The area in which the document was located by the industry. Industry-supplied data, which need not be augmented.
Request Number
A legal reference. Industry-supplied data, which need not be augmented.
Privilege
A legal reference. Industry-supplied data, which need not be augmented.
Site
A legal reference. Industry-supplied data, which need not be augmented.
Other locations
Other places, whether physical or digital, where this document resides, if known.
Example: TDO, ANR Tobacco Industry Tracking Database.
Box Number
The box number in which the document resides at the Minnesota or Guildford Depository. Industry-supplied data, which need not be augmented.
Court Date
The date upon which this document was introduced in court. Industry-supplied data, which need not be augmented.
Witness
The name of the witness with whose testimony this document was introduced. Industry-supplied data, which need not be augmented.
Indexer Initials
The initials of the individual who added data to the record.
More Specifics
What to do with typos
If you're 100% sure that the name or organization you're adding is a victim of a typo, add it just the way it appears on the document and also add the term that you think is meant.
Example: You've just read many letters going back and forth between Paul Ryan and S.J. Green. In the next letter, the subject matter is the same but the addressee is Piul Ryan. In the Recipient field, add both "Ryan, Piul" and "Ryan, Paul."
What to do with long bibliographies
When confronted with a bibliography of 10-20 citations, there is no need to add every single author. Just add the first author listed in each citation; this should give researchers adequate access. If the bibliography goes on for several pages and you are using OCR technology, check the OCR to see how clean it is. If it's reading the names fairly well, just add the first 10 names, and let the OCR do the rest.
The ANRF-UCSF Thesaurus was originally created by staff of Americans for Nonsmokers' Rights, a California-based advocacy group, for use with their Tobacco Industry Tracking Database. In 2000, UCSF entered into a licensing agreement, through which UCSF retains the right to expand and otherwise modify the original purpose. While this thesaurus, in various incarnations, has been in use at the UCSF Tobacco Control Archive for some time, it is presently undergoing evaluation to identify areas in need of further expansion. For example, technical and scientific terminology may be incorporated to provide more detailed subject access to highly technical documents.
Using the thesaurus is exacting work that requires some time for orientation to the structure and terminology used.
Example:
Standards and practices for this are under development.