Database Indexing in Practice: DNSA

My previous post on database indexing covered key concepts.

Background on The National Security Archive

The National Security Archive (the Archive) is a non-profit organization that was founded in 1985 by a group of investigative journalists and scholars. Their mission was to check rising government secrecy and to legally provide public access to declassified government documents, giving the rest of us an insider look into how decisions are made at the highest levels.

A timeline of the Archive’s history shows the impact that the D.C.-based non-profit organization has had since its first years, as well as its development of Track II diplomacy efforts.

In all the Archivists’ work, they have relied most heavily on documents obtained through the Freedom of Information Act (FOIA). Those documents are maintained by the Archive in its physical storage, and a subset of relevant documents is curated into comprehensive sets (i.e., collections, which consist, on average, of around 2,000 documents) that are published by ProQuest to enable remote access for researchers.

Those document sets are indexed and published in an online database.*

Digital National Security Archive (DNSA)

DNSA provides online access to more than 140,000 declassified government documents. The vast majority are from the U.S. (through FOIA), but some have been provided by foreign governments.

Documents that weren’t born digital (such as letters vs. emails) are scanned when they are released by the responsible agency, and made available to the public.

Since 1989, DNSA has published more than 56 topic-based collections,** and the Archive’s team of indexers produces two new collections each year.

Before the indexers get to work, however, each collection is compiled and curated by subject-area scholars. An introduction to the collection provides a timeline of events and the “big picture” story that puts the individual documents into context.

DNSA allows anyone with a subscription to search for and access the individual documents that shaped responses to issues and events such as the Cold War, Iran-Contra crisis, Vietnam War, the post-9/11 U.S. invasions and wars in Iraq and Afghanistan, U.S. policy toward Mexico and other countries, environmental and energy policies, and more.


*Some documents are made freely available on the Archive’s website, and a Virtual Reading Room allows users to search for specific (Russian and American) documents using text/title keywords and date filters.
Additionally, the documents found on the Archive’s Russian Page number around 20,000 and come from archives in Russia and Eurasia. Their scope includes topics from Perestroika to Chernobyl, from Georgia and Afghanistan, to Cuba, and more. They include Brezhnev diaries and Volkogonov collections. Documents found on the Russian Page are separate from those indexed in DNSA.

**The first DNSA collections were published in print and microfiche but have since been scanned and published in the DNSA ProQuest database.


Indexing Documents for DNSA Publication

The indexers at DNSA don’t just index documents; they also catalog them and write abstracts for them. Each individual document gets the same basic metadata:

  • date (all cables also include timestamp information)

  • title

  • author & recipient names, including personal names or organization/bureaucracy

  • organization names (index terms)

  • subject/key words (index terms)

  • abstract (with a limited word count)

While there might be some additional information, such as citations noting the source of the document (e.g., NARA, a presidential library, or somewhere else), these searchable fields are what make DNSA such a valuable asset to researchers of all levels.
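To make the field list above concrete, here is a minimal sketch of how one such record could be modeled. It is not DNSA's actual schema: the class name, field names, the 100-word abstract limit, and the sample values are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumed value; the post only says abstracts have "a limited word count".
ASSUMED_ABSTRACT_WORD_LIMIT = 100


@dataclass
class CatalogRecord:
    """Hypothetical record mirroring the metadata fields listed above."""
    doc_date: date
    title: str
    authors: List[str]            # personal names or organizations
    recipients: List[str]         # personal names or organizations
    organizations: List[str]      # index terms
    subjects: List[str]           # index terms
    abstract: str
    cable_timestamp: Optional[str] = None   # only cables carry a timestamp
    source_citation: Optional[str] = None   # e.g. where the document is held


def abstract_within_limit(record, limit=ASSUMED_ABSTRACT_WORD_LIMIT):
    """Return True if the abstract has at most `limit` whitespace-separated words."""
    return len(record.abstract.split()) <= limit


# Invented sample record, not a real DNSA document.
example = CatalogRecord(
    doc_date=date(1970, 1, 1),
    title="Invented Cable Title",
    authors=["Invented Embassy"],
    recipients=["Invented Department"],
    organizations=["Invented Department"],
    subjects=["Invented subject term"],
    abstract="Invented abstract text for illustration only.",
    cable_timestamp="1200Z",
)
```

Keeping the index-term fields as lists separate from the free-text abstract reflects the distinction the list draws between controlled terms and prose.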

Indexing as Part of a Team

When I started as an Indexing Librarian for DNSA in the Spring of 2013, we were a team of three who would each claim a portion of the current set of documents. We would work individually to catalog, index, and provide abstracts for those documents, periodically submitting what we had done so far to a teammate. That teammate would provide copyediting and check for consistency among keywords. The abstracts would also be edited for typos and readability. The marked-up catalog records would be returned to the indexer to make the appropriate corrections. This process would be repeated until the whole document set was catalogued.

Once a week, a unified list of all the keyword (and subject) terms being used in the current document set was compiled. We would gather around a central table to discuss the list and the context in which each term was being used, and to determine whether that was the best use of the term or whether another term would be more appropriate, either for consistency with previous sets or for accuracy within the context of the current set.

The goal of these discussions was to ensure that potentially ambiguous terms were being used consistently among the three of us and also that the terms were being used as consistently as possible in relation to previously published sets. We would turn to various reference sources, including the UN Thesaurus, specialized dictionaries, and even DNSA itself, to see if certain terms were previously used, and in what context.
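As a rough illustration of the weekly list, the sketch below merges each indexer's terms into one list and flags the terms that have not appeared in earlier sets, which are the ones most likely to need discussion. The function names, the indexer labels, and the sample terms are invented; this is not a description of the tooling we actually used.

```python
from collections import defaultdict


def unified_term_list(terms_by_indexer):
    """Map each term to the sorted list of indexers who used it, ordered by term."""
    usage = defaultdict(set)
    for indexer, terms in terms_by_indexer.items():
        for term in terms:
            usage[term].add(indexer)
    return {term: sorted(who) for term, who in sorted(usage.items())}


def terms_to_discuss(unified, previously_used):
    """Return the terms in the unified list that earlier sets did not use."""
    return [term for term in unified if term not in previously_used]


# Invented sample data for three hypothetical indexers.
week = {
    "Indexer A": ["Arms control", "Nuclear weapons"],
    "Indexer B": ["Arms control", "Atomic weapons"],
    "Indexer C": ["Nuclear weapons"],
}
previously_used = {"Arms control", "Nuclear weapons"}

unified = unified_term_list(week)
flagged = terms_to_discuss(unified, previously_used)
```

Here `flagged` contains only "Atomic weapons", the kind of near-synonym the group would weigh against the established term.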

Eventually, another coworker joined the team part-time to help us with editing our work. She reviewed all the indexers’ records and, as a result, was able to provide insights into our unique approaches to indexing as well as where inconsistencies were an issue.

In the final stages of each document set’s publication process, we would work together to double-check certain metadata fields for consistency of style and accuracy, and to ensure that each document and catalog record was accurately paired. The final step was submission to ProQuest.

The Importance of Being Consistent

The most important thing for database indexing is consistency — consistency in the way metadata is recorded over time, across the many document sets in DNSA, as well as consistency among indexers within a set. It is crucial, as I have mentioned previously, because only with consistent application of indexing terms will database users retrieve all the relevant results.
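A small sketch can show why inconsistent terms cost the user results. It assumes a simple exact-match lookup on subject terms, which is a simplification and not a claim about how the ProQuest search works; the document IDs and terms are invented.

```python
from collections import defaultdict


def build_index(subjects_by_doc):
    """Build an inverted index: subject term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, terms in subjects_by_doc.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted IDs of documents indexed under exactly this term."""
    return sorted(index.get(term, set()))


# Three invented documents on the same topic, indexed two ways.
inconsistent = {
    "doc1": ["Nuclear weapons"],
    "doc2": ["Atomic weapons"],   # a variant term slipped in
    "doc3": ["Nuclear weapons"],
}
consistent = {doc_id: ["Nuclear weapons"] for doc_id in inconsistent}

missed = search(build_index(inconsistent), "Nuclear weapons")
complete = search(build_index(consistent), "Nuclear weapons")
```

With the inconsistent records, the search returns doc1 and doc3 and silently misses doc2; with consistent terms it returns all three.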

Authority Control

To achieve accuracy and consistency, we used authority records for nearly every type of indexing term. We had authority files for all keyword/subject terms, for names (personal and organizational), for locations of the records (NARA, FOIA reading rooms, etc.), and even for document types.

An extensive physical and digital reference collection was a crucial tool for our daily work.

The thesaurus for DNSA subject terms was (and almost certainly still is) based on the UN Thesaurus; personal name authority records were based on as reputable a source as possible — from directories like the Foreign Service Lists to the LoC Authorities, etc. Additionally, all geographic (place) terms, organization names, and events that made their way into the indexing received authority records.

Organization terms, including names for specific government offices, could be among the trickiest. Government agencies especially have a habit of renaming or reorganizing; they also tend to have complex structures, and organizational charts are not always available. Some intense research could go into creating those authority records.

Scope notes gave context to many authority records, allowing us to confirm that we were choosing the correct term when indexing a document — providing clarity for similar names of people or events, or definitions for potentially ambiguous concepts.
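The idea of an authority record with a scope note can be sketched as a lookup from any known form of a name to one preferred form. The classes, the matching rule (case and spacing ignored), and the office name below are all invented for illustration and do not describe the Archive's actual authority files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuthorityRecord:
    """Hypothetical authority record: one preferred form plus known variants."""
    preferred: str
    variants: List[str] = field(default_factory=list)
    scope_note: str = ""


class AuthorityFile:
    """Resolve any known form of a term to its authority record."""

    def __init__(self, records):
        self._lookup: Dict[str, AuthorityRecord] = {}
        for record in records:
            for form in [record.preferred, *record.variants]:
                self._lookup[self._key(form)] = record

    @staticmethod
    def _key(form):
        # Ignore case and extra whitespace when matching forms.
        return " ".join(form.lower().split())

    def resolve(self, form) -> Optional[AuthorityRecord]:
        """Return the matching record, or None if the form is not yet established."""
        return self._lookup.get(self._key(form))


# Invented entry showing a renamed office and its abbreviation.
organizations = AuthorityFile([
    AuthorityRecord(
        preferred="Office of Example Affairs",
        variants=["Bureau of Example Affairs", "OEA"],
        scope_note="Invented entry. Use for the office under its earlier and later names.",
    ),
])

match = organizations.resolve("  bureau of  example affairs ")
unknown = organizations.resolve("Office of Unlisted Matters")
```

A `None` result is the signal that a new authority record needs research before the term is used.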


Since I left the Archive in December 2019, my DNSA coworkers have migrated to a different database management system. I don’t know what their new system is, but considering their dedication to information access, quality, and consistency, I am certain they found software that allows for a continuation of the precise searchability that DNSA users appreciate.
