I recently gave this presentation at the National Digital Forum 2011 in Wellington. It proposes a way to think about collection interfaces through the concept of generosity - "sharing abundantly". The presentation argues that collection interfaces dominated by search are stingy, or ungenerous: they don't provide adequate context, and they demand the user make the first move. By contrast, there seems to be a move towards more open, exploratory and generous ways of presenting collections, building on familiar web conventions and extending them. This presentation features "generous interfaces" by developers including Icelab, Tim Sherratt and Paul Hagon, and it includes a preview of some work I am currently doing with the National Gallery of Australia's Prints and Printmaking collection, in collaboration with Ben Ennis Butler.

commonsExplorer

Although the Visible Archive project wound up months ago, its visualisation techniques live on. In particular I've been developing and adapting the title-word-frequency interface of the A1 Explorer, and trying it out on a range of different datasets. One of these spinoff projects - the commonsExplorer - has finally launched. Here, some documentation, reflection and rationale.

commonsExplorer 1.0
My colleague Sam Hinton and I began work on this as a project for MashupAustralia late last year. Our initial focus was the Flickr set of the State Library of NSW, and our aim was a rich, dynamic, "show everything" interface, building on the A1 Explorer work, but with image-based content. Some months later, having totally missed our original deadline, the scope had broadened out to the whole (amazing) Flickr Commons.

The explorer consists of a three-pane interface. The term cloud shows the 150 most frequently occurring words in the titles (not tags) of the current set of images. This will look familiar to anyone who's played with the A1 Explorer. It uses the same co-occurrence visualisation, and the same blocking / focusing navigation, with a few UI refinements. After some strong user feedback, I added a "back" button to step the navigation back one state. It also uses left and right-clicks, rather than modifier keys, to block or focus words. Applying this title-word approach to different sets has shown up its strengths, and a few weaknesses.


Its strengths are that titles and co-occurrence are a reliably rich cue for content, and that for most collections, thanks to the wonder of Zipf's law, the top-level cloud of 150 words will "cover" (refer to) more than 75% of the images in the set - even in a collection numbering in the thousands. Often, in smaller collections, the coverage is more than 95%. One question I haven't answered yet is how to communicate this idea of coverage to the user, and how to make those images not in the top level cloud, more immediately discoverable. Because after all, sometimes it's the outliers or exceptions in a collection, that we are interested in.

The bottom pane is the thumbnail grid, which is where most of the new stuff is. The grid is an attempt at a "show everything" image visualisation that can scale from tens to thousands of elements. As the number of elements grows, the grid size decreases to fit in the available space. Rather than scale images down, we simply crop the thumbnails - the intention isn't to represent the whole image but to provide some rich but unstructured visual clues: a sort of visual core sample through the whole set. The results show how this can help reveal structure within the collection. Different photographic processes are instantly apparent - monochrome, sepia, cyanotype, stereoscopic, Kodachrome. Other similarities also pop out, even in small tiles - landscapes vs portraits, for example.


This "clue" approach actually sums up our visualisation approach nicely. The Explorer presents us with a rich mass of partial information - or rather data: linked fragments of titles, and of images. Moments of discovery come when we see those fragments unified in a source image: the fragments are contextualised and become more meaningful. This contextual information then propagates back to the fragmentary display - when it works best there is a feedback loop from discovery to context and back to discovery. I've argued for a distinction between data and information, which is relevant here: these fragments are data points, abstracted and decontextualised. Information occurs only when we link and interpret those fragments - and it happens strictly on the human side of the screen.

Another feature of the grid that isn't immediately obvious is chronological sorting. Many collections, including the SLNSW set we started with, include dates in image titles. We look for those dates and sort dated images first in the grid. This approach is simple, and prone to the occasional false positive, but it degrades gracefully, and adds a usable layer of structure to the grid layout. Why not use Flickr's "date taken" field instead? Most Commons collections don't set it, so instead it gives the date uploaded. For the same reason we decided not to use tags, or attempt to scrape data from descriptions: these fields are inconsistent across the Commons - some images have no tags, others have dozens. Title and thumbnail seem to be the richest data that is always available.


Sam Hinton did the heavy programming work that makes the grid go. The main technical challenge we faced was memory usage: loading 700 tiny images just eats memory in Processing / Java. Sam devised a system for stashing the square thumbnails locally, optimising memory and acting as a cache to speed up loading. Drawing thousands of little images to the screen also raised performance issues - we draw to a single offscreen PGraphics context, then draw that to the screen.

In the end I think we've done what we set out to do - make a rich experience that encourages an understanding of context, and enables discovery in large collections. We've also shown that this approach is broadly applicable - if you've got a large image collection where you think it might apply, let us know. Most importantly though, try it out and let us know what you think.

Download commonsExplorer for Mac | Windows | Linux (1Mb)

A1 Explorer Screencast

Series Browser Screencast

In the last post I outlined the main approach used to visualise Series A1 - a word frequency cloud based on item titles, showing co-occurrences between terms. Here I'll show how that was expanded into an interactive tool for exploring the Series, all the way down to images of the documents themselves. If you're impatient, skip straight to the latest sketch for Mac, Windows and Linux (1.8Mb Java executables).

To turn the text cloud visualisation into a general-purpose interface, I added the ability to focus on terms (where focus means include only items containing that word). Exclusion and focus have an additive relationship; so I can exclude one term to create a subset of items, then focus on a second term to show only items in that subset, with a given term. I can also exclude, or focus on, multiple terms to further refine a subset. A simple interface allows for terms to be removed from any point in the sequence; so for example I can exclude all "naturalisation" items, then focus in on a second term (in the grab below, "immigration"). While this navigation technique isn't perfect, it is simple and scalable. We can rapidly move from Series level to small groups of items - in the grab below, we have zoomed from 65k items to 233 items, in two clicks. With this iterative navigation process, the co-occurrence display in the cloud becomes a useful way to scope or preview term relationships, and inform the next focus or exclusion.

items_nav_grab
The new visualisation element here is a simple histogram, showing the number of items with start dates in each year of the Series. The histogram visualises the current subset; so refining the text-cloud display also modifies the histogram; as well, hovering on a term in the cloud shows the relative distribution of that term, in the histogram. The date histogram becomes a powerful tool for exploration and discovery in this display. For example in the grab above, there's a big spike in the histogram in 1927: why? Hovering over the most prominent words in the cloud, we get a sense of their different distributions; for example "restriction" appears mainly between 1900 and 1915, whereas "deportation" occurs almost exclusively in items starting in 1927, and makes up most of the spike. Simply clicking either a term, or a histogram column, fills the lower pane with a list of relevant items, and from there we can explore much deeper - more of that later.

darwin_correlations
A second example of how the text cloud, co-occurrences, and histogram can combine to reveal patterns in the dataset, and prompt discoveries in the content of the series. Focusing in on "Darwin" reveals another big spike in the histogram, this time in the year 1937. In this case, the co-occurrences and the distribution of terms give an accurate preview of what the items in that spike are about: a cyclone hit Darwin. The text cloud even reveals the month of the event, and again the item listing shows fine-grained confirmation of the pattern.

The final challenge in this process was to zoom in again, to the level of the individual document. The Archives has digitised a significant chunk of its records - it currently stores 18.2 million images which are accessable through the search interface of RecordSearch. With the invaluable help of the Archives' Tim Sherratt, I can access these images dynamically by passing item details - barcode and page number - to an Archives PHP script. Because the dataset I am working with does not record the number of digitised pages, this is a two-stage process: first, query RecordSearch for the item details, and scrape out the number of digitised pages (shown in the right hand column of the items listing). Then, when an item in the list is clicked, load and display the page images.

This involved getting around a couple of little technical issues. The loading of the images was surprisingly straightforward. Processing's requestImage() function happily grabs an image from the web without bringing the entire applet to a halt. Loading the pages data was slightly harder, because loadStrings() does halt everything while it waits; and in this case, I wanted to load up to 14 URLs at a time. Java threads provided the solution - another case where Processing's ability to call on Java for backup was extremely useful.

unoske_shimomura_grab
The first time I successfully loaded one of these scans - a crinkled, typewritten page, encrusted with notes - was a real thrill. What this shows is that given the opportunity, interactive visualisation can provide not only insights into the structure and content of an archival collection; it can also provide an interface to the (digitised) collection itself. If the text cloud and histogram visualisations hint at historical events in the items content, the page images let us verify or explore their leads in the primary sources. For example in the 1927 deportation items above, the digitised documents reveal cases where recent migrants were deported to their country of origin because of mental illness. The Immigration Act (1901-1925), quoted in these documents, gives the Minister the power to deport recent arrivals who are convicted criminals, prostitutes, or "inmates of insane asylums." Not what I expected to find - but that's good, if the aim here is exploration. There's an amazing wealth of material in here - and it is beautifully material: the screen grab above shows a page from item 1921/22488, which documents the theft of a pearling lugger by its (indentured) Japanese crew. This page shows a handprint of one of the men, Unoske Shimomura, taken in 1914.

You can download the A1 Explorer applet for Mac, Windows and Linux (each is a 1.8Mb Java executable). System requirements are pretty minimal, though you will need a network connection to load images. One more caveat: the user interface is very rudimentary - again UI is not the focus of my research here - so below is a quick cheat sheet that should be enough to get you going. I'd love to hear your feedback on it, or any interesting discoveries you've made.

A1 Explorer Cheat Sheet

Text Cloud view

  • hover over words to see correlations, item distributions and numbers
  • click a word to load a list of its items into the lower pane
  • to exclude a word and regenerate the cloud hold down '-' and click the word
  • to focus on a word hold down '+' and click the word
  • to remove a focused or excluded word, click on it in the central info bar
  • use the up and down arrow keys to scroll through the items list in the lower pane
  • click on an item in the list to load its page images and switch to document view
Document view
  • page through the document with the left and right arrow keys
  • drag the page image to move it
  • press 'Z' to zoom the image up
  • press 'H' to load a higher-resolution image of the current page
  • press 'T' to revert to text-cloud view

The final phase of the project was to focus in on Series A1, and explore techniques for visualising the items it contains. First, a few basic stats on the task at hand. A1 contains some 64,000 registered Items, dating largely from the period 1903-1939. It was recorded to by Agencies including the Department of Home Affairs, the Department of the Interior, and the Department of External Affairs. In the dataset I am working with, each Item has a title, contents start and end dates, a control symbol, and a barcode. Other than dates, the most informative data here about the contents of the item, is the title. That raises some interesting problems: the title is a more or less unstructured field of text. Titles range from "August ZALEWSKI - naturalisation." to "International conference re Bills of Exchange [0.5cm]" and "Northern Territory. Pastoral Permit No.256 in the name of C.J. Scrutton."

The initial approach was to use simple word-frequency techniques to gain a sense of the range and distribution of text in the titles. If we take all 64,397 titles, and split them into their constituent words, and exclude some uninteresting words ("of","and","to","with","the","for", "from") the 150 most frequently occuring words look like this. Note that here text size is proportional to the square root of the word count - in other words text area is proportional to word frequency.


Naturalisation and certificate jump out fairly dramatically. In fact looking at the numbers, over 47000 items contain "naturalisation" - that's around 73% of the Series. Some 17,500 items contain "certificate" - 27%. A quick inspection of the records verifies this impression: the vast majority of the records listed are naturalisation certificates, or similar documents. Also notable in this image is the large number of names. Browsing the records suggests that these appear because the naturalisation documents always include the applicant's name. But underneath these layers are a wide range of more descriptive terms: "war", "papua", and "immigration", for example. Despite the dominance of the naturalisation records, the coverage of this list - the number of items with title words appearing in it - is quite high: over 60,000 of the 64,000 records are represented here, about 94%.

The text cloud gives an effective overview of the collection titles, compressing a huge mass of textual content into a single screen. But 6% of that content is unrepresented here; if this is our interface to the collection, that 6% is effectively invisible. As an initial experiment, I regenerated a cloud that excluded all items containing "naturalisation". The resulting cloud (below) covers some 14,500 items; as expected the names have all but disappeared, but more interestingly there are a rich set of new descriptive terms that were previously buried under the naturalisation records. If we add the coverage of this cloud and the first (14,571 plus the 47,058 containing "naturalisation") we get a total coverage of about 96%; so some, but not all, of that invisible 6% is now represented.


The other addition here uses interaction to extract more information from the cloud. One disadvantage of text clouds is the way they relentlessly decontextualise, breaking the local relations between terms. The lines between terms here - displayed on rolling over each term in the cloud - are an attempt to restore some of that context. They show links between terms that occur together in Item titles; so in the image above we can see that "new" occurs with "guinea" very frequently (not suprisingly). More informative though is that "employment" and "staff" are also correlated. Note also that "papua" is not strongly correlated with "guinea" - a bit of history explains why; Papua and New Guinea were administratively separate until 1945. So here a simple interactive visualisation device adds new context to the display and prompts new questions about the content.

In the next post: expanding these techniques into an interactive browser that can take us from a whole-Series view, to an image of a specific document, in a few clicks.

Template based on Cutline port by Blogcrowds