Project Outline

This outline, presented to the Archives as a refinement of the original proposal, summarises the context, aims and outcomes of the project.

As archives are increasingly digitised, their collections become available as rich, and very large, datasets. Individual records in these datasets are readily accessible through search interfaces, such as those the Archives already provides. However, it is more difficult to gain any wider sense of these cultural datasets, due to their sheer scale. Conventional text-based displays are unable to offer us any overall impression of the millions of items contained in modern collections such as the National Archives. Searching the collection is something like wandering through narrow paths in a forest: what we need is a map.

This proposal is to research and develop techniques for visualising, or mapping, archival collections in a way that supports their management, administration and use. The specific aim is to develop techniques for revealing context: the patterns, high-level structures and connections between items in a collection.

The practical outcomes of the project will be prototype interactive, browsable maps of the National Archives collection that apply these techniques at different structural levels:

  1. A map of the whole collection, at Series level, will show the "big picture": the size, scope and historical distribution of different series, the relations between series, and their corresponding Agencies and functions.
  2. A more detailed map will focus, as a test case, on a single series (A1), accumulating data from individual records to reveal the distinctive "shape" of that series.

The issue of navigating large digital collections is current and significant; interestingly, some prominent American researchers have recently announced a broadly related project. This project is highly innovative; by supporting it, the Archives would take a leading position in the field. The project would be extensively documented and well disseminated, drawing an international audience.


  • A prototype browsable map showing the structure of the whole National Archives collection at a Series level, including the relationships between Series, collecting and controlling Agencies, and functions.
  • A prototype map of a single series, linking to and contextualising individual items in the series.
  • A set of sketches: static and dynamic visualisations that demonstrate a range of different approaches.
  • A set of techniques and approaches for creating interactive maps of archival datasets. These will be applicable across the archives sector, and among other institutions dealing with digital collections.
  • Documentation and dissemination of the project to an international audience.


Self Organising Maps come to mind, I wonder if they would be useful in this project?

31 July 2008 at 2:11 pm  

Thanks, Jonathan. From what (little) I understand SOMs are designed to map high-dimensional data with lots of continuous parameters. My sense is that archival structures / records are high dimensional but non-continuous or even non-parametric. In other words, how would a SOM deal with a "parameter" for a record or series like "name of creating agency"?

31 July 2008 at 2:31 pm  

One can measure characteristics of a record, for example word frequencies, and construct a vector which represents relevant features of the record as continuous parameters for the purposes of generating a map where similar records are near each other.
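A minimal Python sketch of that idea, assuming only short text fields (such as record titles) are available. The function names and sample titles are hypothetical, and the tokenisation is deliberately crude; the resulting vectors are the kind of continuous input a map-building method could work from.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Turn a text into a sparse vector of relative word frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors (0 to 1)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical record titles, invented purely for illustration.
title_1 = "Immigration correspondence files, general"
title_2 = "Immigration correspondence files, restricted"
title_3 = "Lighthouse maintenance registers"

v1, v2, v3 = (word_frequencies(t) for t in (title_1, title_2, title_3))
print(cosine_similarity(v1, v2))  # titles sharing words score higher
print(cosine_similarity(v1, v3))  # titles with no shared words score 0.0
```

Records whose vectors score close to 1 would sit near each other on the map; records with no words in common score 0.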
Have you heard of this idea? One way of getting a measure of the similarity of two documents is to compare the file size of the two documents "zipped" or compressed together with their sizes when compressed separately; the presence of common substrings leads to better compression.
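The compression trick can be sketched in a few lines of Python. This version uses the standard library's zlib rather than actual .zip files, and the particular formula (often called normalised compression distance) is one common way of turning the sizes into a score; both are assumptions made for illustration, as are the function names and sample strings.

```python
import zlib


def compressed_size(data):
    """Number of bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_distance(x, y):
    """Lower values mean the two texts share more common substrings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    size_x = compressed_size(bx)
    size_y = compressed_size(by)
    size_xy = compressed_size(bx + by)  # the two texts compressed together
    return (size_xy - min(size_x, size_y)) / max(size_x, size_y)


# Hypothetical sample texts, invented purely for illustration.
doc_a = "Correspondence regarding immigration policy and assisted passage schemes."
doc_b = "Correspondence regarding immigration policy and naturalisation schemes."
doc_c = "Quarterly lighthouse inspection logs, with weather observations."

print(compression_distance(doc_a, doc_b))  # similar texts: smaller distance
print(compression_distance(doc_a, doc_c))  # unrelated texts: larger distance
```

On very short texts the fixed overhead of the compressor dominates, so the scores are only meaningful relative to one another and become more reliable as the texts get longer.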

31 July 2008 at 2:56 pm  

Sounds great, but I don't think I'll be working with full text of the items themselves... as far as I know that data doesn't exist. I could conceivably OCR it, but that's a whole other kettle of fish. The .zip methodology is ingenious!

31 July 2008 at 4:24 pm  
