Showing posts with label interactive. Show all posts

The latest step in building a browsable Series-based visualisation has been to add in Agency data. The previous post made a first step towards integrating Agencies into the visualisation - essentially using their ID codes to colour the Series squares. But Agencies also offer a powerful way to add context to our visual exploration of the Archives collection, as this post will show. To skip straight to the latest visualisation, download the executables for Mac, Windows or Linux (about 5Mb each - and needs 1280x1024).

To begin with I wanted to get a sense of the quantitative relationships between Series and Agencies. After converting the Series dataset to JSON and ingesting it to Processing, I generated some simple "utility" visualisations.


This graph shows all 9000 or so Agencies in my dataset, ranked in order of the number of Series that they record into. Agencies are arranged from left to right along the x axis; number of Series recorded is graphed on the y axis (click the graph to see a larger version over on Flickr). This shape is a classic power law distribution: there are a very small number of Agencies recording a large number of Series, and conversely, very many Agencies recording very few Series. For example, there are only around 100 Agencies that record to more than 100 Series; and the vast majority of Agencies record to very few series (less than 10, say).
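The ranking behind a graph like this can be sketched in a few lines of Java. This is a minimal illustration, not the code used for the graph: the class and method names are hypothetical, and the input is assumed to be one list of recording Agency IDs per Series.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AgencyRank {
    // Input (assumed shape): one inner list per Series, holding the IDs of its recording Agencies.
    // Output: the number of Series per Agency, sorted from largest to smallest.
    static List<Integer> rankedSeriesCounts(List<List<String>> agenciesPerSeries) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> agencies : agenciesPerSeries) {
            for (String id : agencies) {
                Integer current = counts.get(id);
                counts.put(id, current == null ? 1 : current + 1);
            }
        }
        List<Integer> ranked = new ArrayList<Integer>(counts.values());
        Collections.sort(ranked, Collections.reverseOrder());
        return ranked;
    }
}
```

Plotting the returned list against its index gives the ranked curve; swapping the roles of Series and Agencies gives the second graph.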


This graph shows the same relationship from the other side: here Series are ranked along the x axis, and the y axis shows the number of Agencies recording into each Series. Again this is a power-law distribution, but the quantities are much smaller. We can see that around two thirds of all Series have only one recording Agency; almost all have fewer than 10; and a tiny number have as many as 45.

Where does this get us, in terms of making a visualisation of the whole collection? It shows that Agencies offer a useful way to break the collection up into manageable-sized subsets; the vast majority of Agencies record to fewer than 100 of the 57.5 thousand Series. That's a significant refinement. At the same time most Agencies record to more than one Series: so Agencies should be able to show relationships between usefully-sized groups of Series.

The next step was to integrate the full Agency data with the previous Series visualisations; this was relatively straightforward, and again HashMaps were invaluable in cross-linking Series and Agency data. I rebuilt the floating caption display, to handle a complete listing of recording Agencies and their titles. This alone adds a wealth of context to the visualisation: Series with relatively generic titles ("Correspondence files") are brought into focus with a descriptive list of their Agencies.


For each browsed Series, we can then show other Series recorded to by its Agencies. In the interactive sketch we can select a Series with a mouse click, turn on the Agency display (hit "A"), then scroll through the agencies with the arrow keys. The floating caption box allows us to investigate highlighted Series, select them in turn, and so on. The result is contextually rich and far more browsable than before. The scale of these Agency-based groups is, as the first graphs show, an effective way to break the collection down. The highlights give a slight counterbalance to the size-based bias of the packed-square visualisation, leading us out into smaller series. Also, because the floating caption shows all the Agencies for each highlighted Series, we build up a sense of the range of related Agencies in a certain area; so highlighting CA 51, the (mid-century) Department of Immigration (Central Office), and browsing its Series, reveals other immigration-related Agencies. Have a browse: download the visualisation as a Java executable for Mac, Windows or Linux (about 5Mb each - and needs 1280x1024).


Finally, these latest visualisations also include an important tweak in the packed-square visualisation model. Tim Sherratt commented on an earlier post that the "hollow box" metaphor is potentially misleading, because it's based only on the ratio between shelf metres and recorded items. In other words, the way that a "hollow" suggests un-registered items is just not right. While browsing these visualisations I came across another, more serious problem with the "hollow" approach. Because the overall size of a square is determined by its shelf space, it's possible to have very small squares that represent Series with many thousands of recorded items; as many as, or more than, physically larger Series. The solution is simple, once you think of it: visualise both items and shelf metres. Now, the area of the inner (brighter) square is proportional to items, while the area of the outer (duller) band is proportional to shelf metres. The result is that Series that are physically small, but contain many items, suddenly grow in size (in the visualisation these appear with very thin borders). Interestingly some very recent Series pop out, including a couple documenting the 2005 UN Oil-for-Food / AWB inquiry: with zero shelf metres, I wonder if these are "born digital" records?
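As a rough sketch of the geometry, the two side lengths follow from the two areas. The scale factors (`areaPerItem`, `areaPerMetre`) and the class name are assumptions for illustration; the post does not say what scaling the sketch actually uses.

```java
class NestedSquare {
    // Side of the inner (brighter) square: its area is items * areaPerItem.
    static double innerSide(int items, double areaPerItem) {
        return Math.sqrt(items * areaPerItem);
    }

    // Side of the whole square: the inner area plus the area of the outer band,
    // where the band's area is metres * areaPerMetre.
    static double outerSide(int items, double metres, double areaPerItem, double areaPerMetre) {
        return Math.sqrt(items * areaPerItem + metres * areaPerMetre);
    }
}
```

With zero shelf metres the two sides are equal, so the band disappears; this matches the very thin borders on the Series mentioned above.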

This will be the final step, for the moment, in visualising the whole collection. With a public lecture at the Archives coming up I need to move on to the Items level, visualising the contents of A1. More on that shortly.

[update - the links to the executables were broken, sorry: fixed now (11 May)]

Series Links

After a hiatus over summer and the start of the academic year, I finally have some more progress to report. Using the packed square visualisations as a base, I've been adding more data elements from the Series dataset, and working towards visualising relationships between Series and the Agencies that generate their content. This has taken longer than planned due to more data-plumbing issues, which I'll come to later.

The Archives' Series data records two kinds of links: Series-Agency and Series-Series. The latest sketches make a start in visualising both of these. Here colour (or more accurately hue) is derived from the first listed Recording or Controlling Agency. As the CRS Manual explains, the Recording Agency generates the records; while the Controlling Agency is the "agency currently responsible for some or all of the functions or legislation documented in records." In either case, given that there are some 9000 Agencies involved here, how do we visualise this link? For the moment I'm doing it in the simplest possible way: low Agency numbers have low hue values (red), while high Agency numbers have high hue values (blue to purple). There are a number of problems with this - notably that it's impossible to tell the difference between, for example, CA 11 (Treasury 1901-1976) and CA 12 (the PM's Department 1911-1971) - which is a very significant difference. These two images show the difference between visualising Recording Agency (top) and Controlling Agency (bottom).
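A minimal sketch of this kind of linear hue mapping, assuming Agency IDs of the form "CA 11" and a hue wheel running from 0 to 1. The upper hue limit, the maximum Agency number and all names here are hypothetical, not taken from the sketch itself.

```java
class AgencyHue {
    // Assumed upper end of the range (around purple) on a 0..1 hue wheel.
    static final float MAX_HUE = 0.8f;

    // Pull the numeric part out of an ID such as "CA 11".
    static int parseAgencyNumber(String id) {
        return Integer.parseInt(id.replaceAll("[^0-9]", ""));
    }

    // Low Agency numbers map to low hues (red), high numbers to high hues.
    static float hueFor(int agencyNumber, int maxAgencyNumber) {
        float t = (float) agencyNumber / (float) maxAgencyNumber;
        t = Math.max(0f, Math.min(1f, t)); // clamp in case of out-of-range numbers
        return t * MAX_HUE;
    }
}
```

With around 9000 Agencies spread over the range, neighbouring numbers such as CA 11 and CA 12 differ by roughly 0.0001 in hue, which is why they are indistinguishable on screen.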

grid_squarepack_links_recording

grid_squarepack_links

Series data also records links to other Series, which come in three flavours: Succession (between previous and subsequent series), Controlling (where one Series acts as an index or register for another) and Related (for other relationships). In this dataset (57.5k Series) there are some 7.5k succession links, 6.2k controlling series links, and 25k related series links. My initial attempt to render all of these (by just drawing a line between linked Series) resulted in a giant, unreadable cloud. A simpler and more legible approach is to only draw links for one Series at a time.

In the latest interactive sketch, a single Series' links are drawn as coloured lines: controlling links are red, succession links are blue, and related Series links are yellow. Clicking a Series selects it and draws its links, rendering linked series in colour while dimming the rest to grey (clicking the Series again unselects it and returns to Technicolor mode). This begins to show the potential for a visual interface to the collection, I think. Here's the applet - note that it's fairly screen and memory-hungry. Feedback welcome, as always.


There are a few changes behind the scenes here as well. As outlined earlier, XML has been a mixed blessing: easy to use and human-readable, but the file sizes are large, and the DOM parsing method used in Processing is memory-hungry and slow. For these sketches I've switched to JSON, a simple, lightweight data format with its own Java library. So far, JSON is working nicely; its file sizes are around half those of the equivalent XML files, parsing is much faster, and the parsing code is simpler and neater. This thread has lots of useful info on implementing JSON in Processing.

HashMaps are the other new toy here. I'd never quite found a use for them until now, but because they easily connect an object (in this case a Series) with an index string (in this case a Series ID), they are essential here for building Series-Series links. I simply store each Series' links as a list of ID strings, then to draw the link, feed each ID into a HashMap to access the whole Series object. Thanks to @blprnt and @toxi for reminding me why I needed HashMaps!
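The lookup pattern described here might look something like the following. It is only a sketch: the `Series` fields and the class names are hypothetical stand-ins for whatever the real sketch stores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SeriesIndex {
    static class Series {
        final String id;
        final String title;
        final List<String> linkedIds = new ArrayList<String>(); // links stored as ID strings only

        Series(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    private final Map<String, Series> byId = new HashMap<String, Series>();

    void add(Series s) {
        byId.put(s.id, s);
    }

    // Resolve each stored ID to its full Series object, ready for drawing a link.
    List<Series> linkedSeries(Series s) {
        List<Series> targets = new ArrayList<Series>();
        for (String id : s.linkedIds) {
            Series target = byId.get(id);
            if (target != null) { // skip IDs that are not in the loaded dataset
                targets.add(target);
            }
        }
        return targets;
    }
}
```

Storing IDs rather than object references means the links can be read from the data file in any order, and resolved only once every Series has been loaded.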

Next: digging deeper into the complexities of Agency-Series relations.

Packing Them In

Up to this point the grid visualisations have taken a very simple approach to space: dividing it up equally among the data points, and then using hue and brightness to show attributes such as shelf metres and items. This has the advantage of simplicity, but it has a major disadvantage too: it's attempting to represent size (shelf metres or number of items) using other means. Why not just use size for size? Read on for the blow-by-blow account, or skip straight to the end result: the latest interactive sketch.

Before Christmas I had a first stab at this problem. The approach was basic, as usual. Maintaining the chronological ordering of the series, I drew each series as a square with area proportional to number of items. The packing procedure was simply: starting where the previous series is, step through the grid until we find a big enough space to draw the current series. The result looked like this:

After weeks of regular grids, this was a sight to see. The distribution of the sizes of series (overall and through time) is instantly apparent. This ultra-simple packing method is far from perfect, though, as you can see from all the black gaps. Because it tiles one series at a time, in strict sequence, and only searches forwards through the grid, gaps appear whenever a large square comes up as the search scrolls along to find a free space.

The main restriction here is the chronological ordering of the series. I need to maintain that ordering, but at the same time I need to be able to pack the squares more efficiently, which means changing the order. Luckily there's a loophole: as the first histogram showed, many series share the same start date. So we can change the sequence of those same-year series, without disrupting the overall order. We can pack them starting with the biggest squares and pack in the smaller ones around them. The latest sketches use this method, which can be described in pseudocode:

  1. Make a list of series with a given start year
  2. Working from biggest to smallest, pack each series into the grid, from a given start point: restart the search from the start point each time.
  3. Keep track of the latest point in the grid that this group occupies. For the following year, start from this point.

This improves the packing dramatically:

In this image square area is mapped to shelf metres; as in the earlier sketch hue is derived from the series prefix (roughly A = red, Z = blue). One artefact is apparent here - those lines of squares graded by size occur when nothing gets in the way of the packing process. As a byproduct of this, the biggest squares in those sequences often mark the start of a new year in the grid.
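The year-by-year packing steps listed earlier can be sketched in Java roughly as follows. This is an illustration under assumptions, not the sketch's actual code: the grid is a simple occupancy array, square sides are measured in whole grid cells, the search runs in row-major order, and "latest point" is taken to mean the furthest top-left position used by the group. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class YearPacker {
    static class Square {
        final int year, side;   // side is measured in grid cells
        int x = -1, y = -1;     // top-left cell once placed; -1 if no room was found
        Square(int year, int side) { this.year = year; this.side = side; }
    }

    final int cols, rows;
    final boolean[][] used;     // used[y][x] is true once a cell is covered

    YearPacker(int cols, int rows) {
        this.cols = cols;
        this.rows = rows;
        this.used = new boolean[rows][cols];
    }

    boolean fits(int x, int y, int side) {
        if (x + side > cols || y + side > rows) return false;
        for (int j = y; j < y + side; j++)
            for (int i = x; i < x + side; i++)
                if (used[j][i]) return false;
        return true;
    }

    // Step forward through the grid from 'start'; return the position used, or -1.
    int place(Square s, int start) {
        for (int p = start; p < cols * rows; p++) {
            int x = p % cols, y = p / cols;
            if (fits(x, y, s.side)) {
                for (int j = y; j < y + s.side; j++)
                    for (int i = x; i < x + s.side; i++)
                        used[j][i] = true;
                s.x = x;
                s.y = y;
                return p;
            }
        }
        return -1;
    }

    void pack(List<Square> squares) {
        // Step 1: group by start year; TreeMap keeps the years in chronological order.
        Map<Integer, List<Square>> byYear = new TreeMap<Integer, List<Square>>();
        for (Square s : squares) {
            List<Square> group = byYear.get(s.year);
            if (group == null) {
                group = new ArrayList<Square>();
                byYear.put(s.year, group);
            }
            group.add(s);
        }
        int start = 0;
        for (List<Square> group : byYear.values()) {
            // Step 2: biggest first, restarting the search from 'start' each time.
            Collections.sort(group, new Comparator<Square>() {
                public int compare(Square a, Square b) { return b.side - a.side; }
            });
            int latest = start;
            for (Square s : group) {
                int p = place(s, start);
                if (p > latest) latest = p;
            }
            start = latest; // Step 3: the next year begins where this group ended.
        }
    }
}
```

Because every search within a year restarts from the same point, small squares fall into the gaps left beside the large ones, rather than being pushed further down the grid.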

The latest sketches integrate both shelf metres and described items, and finally add interaction to this visualisation. To combine metres and items the squares are drawn as above, with area proportional to shelf metres; then overlaid with a second grey square, whose size is inversely proportional to the number of items in the series. The result is that series with many items are full of colour, and series with few items have large "hollows" and narrow coloured borders.

Again, there are relations between series here that are instantly apparent. It's easy to see those series that have lots of shelf metres but relatively few items, as well as even medium-sized series with many items. I couldn't find A1 in the earlier grids (though Tim Sherratt from the Archives could); it is much more prominent here. Tim also pointed out that B2455, one of the big series of WWI service records, didn't jump out of the grids: it's very prominent here. As well, that cluster of post-War migration series spotted in the items grid reappears here. Promising signs for the usefulness of this visualisation.

All this is best demonstrated in the interactive version, which like the previous grids adds a caption overlay and some year labels on the vertical axis. Browse around and see what you can find - feedback very welcome.

After being completely buried under end-of-year admin for a few weeks, it's great to be back to work on this project. I've been working on plumbing in the latest dataset from the Archives, which has doubled in size to around 57,500 series. In an attempt to create a browsable overview of the whole collection, I have been developing the earlier grid sketches, feeding in more data, and extra parameters. Also new in this dataset are two interesting features of archival series: items - number of catalogued items in the series - and shelf metres - the amount of physical space the series occupies. In this interactive browser, you can navigate around the whole collection, and switch between modes that display these parameters.

gridbrowser_topright
A brief explanation. Like the earlier grid, series are sorted by start date (still contents start date, rather than accumulation, for the moment) then simply laid out from top left to bottom right. In this version I've added some year labels on the Y axis, which show the distribution of the series through time. Hue is mapped directly to date span: red series have a short date span, blue have a long span. The four modes in this interactive change the mapping for brightness. In the default display brightness is mapped to items (I); M switches the brightness key to shelf metres; P shows items per shelf metre; and S switches the brightness key off (showing span/hue only).

Both these new parameters have a wide range and a very uneven distribution, and as you can see in the visualisation there are many series with zero items and/or zero metres. In fact around 30000 series (over half this collection) have zero digitised items; while around 2600 have between 100 and 1000 items, and 13 have more than 10000 items. Around 20000 series have zero shelf metres, around the same number have 0.1-1m, around 10000 have between 1m and 10m, and the rest have more than 10m - with a couple of dozen series with more than 1km of shelf space! It's important to remember, as Archives staff have mentioned to me, that items here refers to digitised items. Series with zero listed items aren't empty, they just haven't been digitised. Similarly I suspect that a value of zero shelf metres suggests that the data doesn't exist. Even if it can't be taken at face value, items is an interesting metric because the Archives digitises records largely on the basis of demand from users; so a series that is frequently requested is more likely to be digitised. Items, then, is partly a measure of how interesting a series is, to Archives users.

gridbroswer_detail
The items view of the grid allows us to see, for example, that there are more digitised items in series commencing in the 20s and 30s, than there are in series commencing in the 60s and 70s. We can also see a dense band of well-digitised series from the late 90s onwards. I don't know, but I'd suspect that these are "born digital" records - no digitisation required. The most striking feature of the items graph is the narrow red streaks around 1950: these are Displaced Persons records from 1948-52, each series corresponding to a single incoming ship (above). These records show up here because they are well digitised (interesting) but also because there are many sequential series forming visual groups. There are other pockets of "interestingness", but they are less obvious. This reveals one drawback of this grid layout, which is that related series are not necessarily grouped together. I'm hoping to address this when I start looking at agencies, functions, and links between series.

A few technical notes. After running into problems storing data in plain text, I changed the code to read the source XML in, pick out certain fields or elements, and write the data back out as XML. I used Christian Riekoff's ProXML library for Processing, which makes the file writing part very easy (Processing's built-in XML functions don't include a file writer). This worked well, except when it came to exporting web applets, which just refused to load. Rummaging around in the console log, and turning on Java's debugging tools (thanks Sam) showed that the applet was running out of memory while trying to load the XML - admittedly a fairly hefty 27Mb uncompressed. So for the web version at least, I have reverted to storing the data as plain text, which immediately reduced file size and loading time by a factor of 4, and solved the applet problem. Since then Dan and Toxi have suggested alternative ways of handling the XML, such as SAX, which streams the data in and generates events on the fly, rather than loading the whole XML tree into memory before parsing it. I'll be looking into that for any serious web implementation of this stuff.
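For reference, a streaming read with Java's built-in SAX parser might look something like this. It is a minimal sketch only: the `title` element name and the class name are hypothetical, and the real source XML will have its own structure.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SeriesSaxReader {
    // Collect the text of every <title> element (a hypothetical element name)
    // without ever building the whole XML tree in memory.
    static List<String> readTitles(String xml) {
        final List<String> titles = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text = null; // non-null only while inside a <title>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("title".equals(qName)) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new RuntimeException("could not parse XML", e);
        }
        return titles;
    }
}
```

For a large file the same handler could be passed to `parser.parse` with a file input stream instead of an in-memory string, so only the fields being kept take up memory.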

Finally, with almost 60000 objects on the screen, this visualisation raises some basic computation and design issues. Even using accelerated OpenGL, this is a tall order; I found I was getting around one frame per second on a moderately powerful computer. I have solved the issue here with a simple workaround (thanks Geoff for this one) - pre-render an image of the grid, then overlay the interactive elements. Performance issue solved. But there are some limitations: this approach means the grid layout is fixed. It's a significant move away from a truly "dynamic" visualisation, where all the elements are drawn on the fly. For visualisations at this scale, I don't think there's any other way, but as the design develops I'll be trying to push back towards the live, dynamic approach, as the dataset permits.

I've been considering how to develop the stack histograms, but meantime decided to quickly trial a completely different approach to visualising the Series dataset. I don't want to get carried away with one metaphor / approach, when there may be others worth exploring. So, in this visualisation some 27000 series are laid out in a simple grid. Series are ordered by (contents) start date, and sequenced left to right, top to bottom. As in the last histograms, date span is mapped to hue, so long spans are blue, short spans are red. I've been having some weird issues with web applets so far, but this one seems to work (without OpenGL), so there's also an interactive version to play with.


This layout has a number of advantages over the stack approach. The primary one is visual density. This layout makes it possible to see all the series, in a single visual field. In the examples here the grid is 200 columns wide and around 135 rows high; each series is a 4 x 4 pixel square. Even allowing for 40000 series in an expanded dataset (more of which soon), this scale is functional. A related advantage is browsability. In the interactive version of this sketch, we can simply mouse over series to see their details; a usable, if still unstructured way to browse the collection.
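The arithmetic behind a grid like this is simple enough to sketch. The class and method names below are hypothetical; the 200-column width and 4-pixel cell size are the figures quoted above.

```java
class SeriesGrid {
    // Pixel position of the top-left corner of the series at 'index',
    // filling left to right, then top to bottom.
    static int[] cellOrigin(int index, int cols, int cellSize) {
        return new int[] { (index % cols) * cellSize, (index / cols) * cellSize };
    }

    // Inverse mapping for mouse-over: which series sits under this pixel? -1 if none.
    static int indexAt(int mouseX, int mouseY, int cols, int cellSize, int count) {
        if (mouseX < 0 || mouseY < 0) return -1;
        int col = mouseX / cellSize;
        int row = mouseY / cellSize;
        if (col >= cols) return -1;
        int index = row * cols + col;
        return index < count ? index : -1;
    }

    // Number of rows needed to hold 'count' series.
    static int rowsNeeded(int count, int cols) {
        return (count + cols - 1) / cols;
    }
}
```

For 27000 series in 200 columns, `rowsNeeded` gives 135 rows, and at 4 pixels per cell the grid occupies 800 x 540 pixels.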

The grid throws away the emergent histogram-form of the stack approach. However, many related structures are still apparent: for example the pattern of long-span series having early start dates is clear; and the interactive version also reveals the date distribution; the reddish band in the middle of the grid is the wave of short series around WWII. One thing on the list to try is to add a date key to the vertical axis. This would effectively show the same thing as the tall peaks of the original histogram: the relative numbers of series commencing over time. The grid simply structures space according to the data elements (the series), so that the relation of date to visual space becomes nonlinear; but the relationship is still there and easily revealed.

Next on the list of things to try is a word-frequency visualisation based on series titles. This should provide a way to browse the grid more effectively; after that, I need to get to work on a new, expanded dataset with more series, but also useful quantitative measures like shelf space and digitised items, for each series. Then, more layers of structure and browsability: relationships between series, agency and function.

I've been developing the year-span histograms posted earlier. In these sketches, the series are again represented as single horizontal lines that correspond to their date spans. To address the problem of series overlapping each other, this sketch sorts and stacks the series into a single big, non-overlapping heap. The method is fairly simple. First, sort all the series by their span, longest to shortest (this involved learning to implement Java's Comparable interface). Then, place the series in the stack, longest to shortest, bottom to top. A simple 2D array is used to keep track of series positions and check collisions; if a collision is found, simply try the next row up and repeat until a space is found.
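A minimal sketch of this sort-and-stack method, assuming whole-year start and end dates and one array column per year. The names are hypothetical and the details (inclusive end years, row 0 as the bottom row) are assumptions rather than the sketch's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SpanStack {
    static class Span implements Comparable<Span> {
        final int start, end;  // inclusive years
        int row = -1;          // row in the heap once placed (0 = bottom)

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }

        // Longest spans sort first.
        @Override
        public int compareTo(Span other) {
            return Integer.compare(other.length(), length());
        }
    }

    // Places every span in the lowest free row and returns the number of rows used.
    static int stack(List<Span> spans, int firstYear, int lastYear) {
        List<Span> sorted = new ArrayList<Span>(spans);
        Collections.sort(sorted);
        int years = lastYear - firstYear + 1;
        // One row per span is always enough, so the array can be sized up front.
        boolean[][] taken = new boolean[sorted.size()][years];
        int rowsUsed = 0;
        for (Span s : sorted) {
            int row = 0;
            while (collides(taken[row], s, firstYear)) {
                row++; // collision: try the next row up
            }
            for (int y = s.start; y <= s.end; y++) {
                taken[row][y - firstYear] = true;
            }
            s.row = row;
            rowsUsed = Math.max(rowsUsed, row + 1);
        }
        return rowsUsed;
    }

    static boolean collides(boolean[] rowCells, Span s, int firstYear) {
        for (int y = s.start; y <= s.end; y++) {
            if (rowCells[y - firstYear]) return true;
        }
        return false;
    }
}
```

Placing the longest spans first means the short ones fill in beside them where they can, which is what keeps the heap compact.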


The result is almost, but not quite, a histogram (because the packing isn't perfect - there are some gaps). Unlike the earlier sketches though, it visualises the total number of series spanning a given year, rather than just the commencing year; this seems a more generally useful feature to visualise. It's interesting to note though that some of the features obvious in the commencing year histogram are less clear here - notably the spikes around Federation and the Wars.


The real payoff for the stacking is that now we have a potential interface to the entire collection, at series level. Adding interaction makes it easy to browse the visualisation by year, showing the relation between series in that year and the total collection. Sheer scale is still a problem. This "heap" is more than 10,000 series high - too big to usefully show every series even at one pixel each. Interaction allows zooming and panning (above), which helps. Next, I'd like to be able to filter the heap down to a more manageable size, to a point where this can become the interface to browse through individual series.
