Inside A1 - Text Clouds from Item Titles

The final phase of the project was to focus in on Series A1, and explore techniques for visualising the items it contains. First, a few basic stats on the task at hand. A1 contains some 64,000 registered Items, dating largely from the period 1903-1939. It was recorded by Agencies including the Department of Home Affairs, the Department of the Interior, and the Department of External Affairs. In the dataset I am working with, each Item has a title, contents start and end dates, a control symbol, and a barcode. Other than the dates, the most informative data here about the contents of an item is its title. That raises some interesting problems: the title is a more or less unstructured field of text. Titles range from "August ZALEWSKI - naturalisation." to "International conference re Bills of Exchange [0.5cm]" and "Northern Territory. Pastoral Permit No.256 in the name of C.J. Scrutton."

The initial approach was to use simple word-frequency techniques to gain a sense of the range and distribution of text in the titles. If we take all 64,397 titles, split them into their constituent words, and exclude some uninteresting words ("of", "and", "to", "with", "the", "for", "from"), the 150 most frequently occurring words look like this. Note that here text size is proportional to the square root of the word count - in other words, text area is proportional to word frequency.
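For anyone wanting to try this on their own data, here is a minimal Python sketch of the counting and scaling steps. It is not the code used for the images in this post: the tokenising rule (lowercase runs of letters, digits and apostrophes), the function names and the 72-point maximum size are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# The "uninteresting" words listed in the post.
STOP_WORDS = {"of", "and", "to", "with", "the", "for", "from"}


def tokenize(title):
    """Split a title into lowercase words (assumed rule: letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", title.lower())


def top_words(titles, n=150):
    """Return the n most frequent (word, count) pairs, ignoring stop words."""
    counts = Counter()
    for title in titles:
        counts.update(w for w in tokenize(title) if w not in STOP_WORDS)
    return counts.most_common(n)


def font_size(count, max_count, max_size=72.0):
    """Scale text size by the square root of the count, so that text
    area (roughly size squared) is proportional to word frequency."""
    return max_size * math.sqrt(count / max_count)
```

With this scaling, a word that occurs a quarter as often as the most frequent word is drawn at half its size, and so takes up roughly a quarter of its area.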

"Naturalisation" and "certificate" jump out fairly dramatically. In fact, looking at the numbers, over 47,000 items contain "naturalisation" - that's around 73% of the Series. Some 17,500 items contain "certificate" - 27%. A quick inspection of the records verifies this impression: the vast majority of the records listed are naturalisation certificates, or similar documents. Also notable in this image is the large number of names. Browsing the records suggests that these appear because the naturalisation documents always include the applicant's name. But underneath these layers is a wide range of more descriptive terms: "war", "papua", and "immigration", for example. Despite the dominance of the naturalisation records, the coverage of this list - the number of items with title words appearing in it - is quite high: over 60,000 of the 64,000 records are represented here, about 94%.
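Coverage, as used here, is just the count of items whose title shares at least one word with the cloud. A rough sketch of that calculation follows; the function names and the tokenising rule are assumptions for illustration, not the code behind the figures above.

```python
import re


def title_words(title):
    """Set of lowercase words in a title (assumed tokenising rule)."""
    return set(re.findall(r"[a-z0-9']+", title.lower()))


def coverage(titles, cloud_words):
    """Return (covered, share): how many titles contain at least one
    cloud word, and that number as a fraction of all titles."""
    cloud = set(cloud_words)
    covered = sum(1 for title in titles if title_words(title) & cloud)
    share = covered / len(titles) if titles else 0.0
    return covered, share
```

Any item with no title word in the cloud counts as uncovered, which is exactly the part of the collection that the cloud cannot show.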

The text cloud gives an effective overview of the collection titles, compressing a huge mass of textual content into a single screen. But 6% of that content is unrepresented here; if this is our interface to the collection, that 6% is effectively invisible. As an initial experiment, I regenerated a cloud that excluded all items containing "naturalisation". The resulting cloud (below) covers some 14,500 items; as expected the names have all but disappeared, but more interestingly there is a rich set of new descriptive terms that were previously buried under the naturalisation records. If we add the coverage of this cloud and the first (14,571 plus the 47,058 containing "naturalisation") we get a total coverage of about 96%; so some, but not all, of that invisible 6% is now represented.
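The two figures can simply be added because the groups do not overlap: the second cloud was built only from items that do not contain "naturalisation". A small sketch of the filtering step and the arithmetic, using the counts quoted above (the helper names and tokenising rule are assumptions for illustration):

```python
import re


def contains_word(title, word):
    """True if the title contains the word as a whole token (assumed rule)."""
    return word in re.findall(r"[a-z0-9']+", title.lower())


def exclude_term(titles, word):
    """Keep only the titles that do not contain the given word."""
    return [t for t in titles if not contains_word(t, word)]


# Figures quoted in the post.
TOTAL_ITEMS = 64_397
WITH_NATURALISATION = 47_058      # items containing "naturalisation"
COVERED_BY_SECOND_CLOUD = 14_571  # items covered once those are excluded

# The two groups are disjoint, so their counts can be summed directly.
combined = WITH_NATURALISATION + COVERED_BY_SECOND_CLOUD
combined_share = combined / TOTAL_ITEMS
print(f"{combined:,} of {TOTAL_ITEMS:,} items = {combined_share:.1%}")
```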

The other addition here uses interaction to extract more information from the cloud. One disadvantage of text clouds is the way they relentlessly decontextualise, breaking the local relations between terms. The lines between terms here - displayed on rolling over each term in the cloud - are an attempt to restore some of that context. They show links between terms that occur together in Item titles; so in the image above we can see that "new" occurs with "guinea" very frequently (not surprisingly). More informative, though, is that "employment" and "staff" are also correlated. Note also that "papua" is not strongly correlated with "guinea" - a bit of history explains why: Papua and New Guinea were administratively separate until 1945. So here a simple interactive visualisation device adds new context to the display and prompts new questions about the content.
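One simple way to get data for links like these is to count, for every pair of cloud terms, how many titles contain both. The sketch below does that with raw co-occurrence counts, which is an assumption on my part: it is a stand-in for whatever measure of link strength the display uses, and the function names are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(titles, cloud_words):
    """Count how many titles contain each pair of cloud words.
    Pairs are stored as alphabetically sorted tuples."""
    cloud = set(cloud_words)
    pairs = Counter()
    for title in titles:
        words = set(re.findall(r"[a-z0-9']+", title.lower()))
        present = sorted(cloud & words)
        pairs.update(combinations(present, 2))
    return pairs


def links_for(term, pairs, min_count=1):
    """Terms linked to `term`, strongest first, as (other_term, count)."""
    links = []
    for (a, b), n in pairs.items():
        if n >= min_count and term in (a, b):
            links.append((b if a == term else a, n))
    return sorted(links, key=lambda link: (-link[1], link[0]))
```

On rollover, something like `links_for("guinea", pairs, min_count=50)` would give the terms to draw lines to; the threshold of 50 is arbitrary and would need tuning against the real counts.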

In the next post: expanding these techniques into an interactive browser that can take us from a whole-Series view, to an image of a specific document, in a few clicks.


Hi Mitchell, Have you looked at SunBurst visualizations? Both the text clouds and the series links seem like they might be a good fit. Piotr Adamczyk, an analyst at the Metropolitan Museum of Art, has been doing some work with SunBursts.

29 July 2009 at 10:14 pm  

Thanks Mark, great to see Piotr's work. I had seen Sunbursts but hadn't thought of applying them here. They seem to need a strict hierarchical structure, which is not the case with the Series links. Maybe you can say more about how they could be used for the text clouds? I can't quite imagine it, because again the structure isn't exclusively hierarchical but highly overlapping, i.e. each term is linked to a set of items in a non-exclusive way (many terms will refer to each item).

30 July 2009 at 2:17 pm  
