Paul Pierce "The Structure of my Online Attic"
The Structure of my Online Attic
Paul Pierce
For the Computer History Museum workshop
"The Attic and the Parlor"
May 5, 2006
Abstract
The web site for my computer collection has "parlor" and "attic" sections. The "attic" works on the database I'm building for the collection. This paper describes some aspects of the database, how it works with the web site, and the direction I think it's headed in.
Introduction
The web site (http://www.piercefuller.com/collect/index.html) for my computer collection is now one of the older sites on the web. It was originally all hand-built and hosted at my first Internet provider, which has now disappeared into one of the large national ISPs. Now I host it myself through a static IP provided by a new local provider, from a Linux server running first Apache and now my own experimental web server in Java.
The original hand-built part still exists in nearly its original form with most of its original content. I feel it corresponds to the parlor notion of this conference, a rather musty parlor used only by guests and rarely inhabited by its owner. Everything there was selected and organized with purpose but little has been changed since. If you visit the site, you will find the original pages have a light blue background.
Even before building the first web site I started to inventory the collection. The first burst of activity happened during the move out of the last leased space to our building where the collection now resides. I assigned numbers to artifacts and their packing boxes and pallets, took pictures, and made notes on which artifacts and (roughly) what documents ended up in which boxes. Later I started to build a database for the inventory and set it up to drive a section of the web site I call the "library", with a light green background. My father had transcribed my notes into comma-delimited form and I used that for the initial data.
The library is like the attic, in that everything I toss into the database shows up there, and unless I go back and change the entries, everything that goes in just sits there the way it landed. But unlike the traditional notion of an attic, the library has quite a bit of structure and is easy to browse (if you're determined and patient). This is particularly important since there is no search capability yet. Also, it's more active than your typical attic, so maybe it's more like a garage shop where I putter with my stuff.
In the rest of the paper I'll explore the structure of the database, how it is projected onto the web site and some of the underlying design goals.
Database Structure
The database consists, essentially, of a set of XML documents. Each XML document represents an artifact or an artifact classification or a way of presenting an artifact or the information it contains. Each XML document is also a node in the tree that is the classification hierarchy (that makes it possible to browse the library on the web site) and/or a node in other graphs of interrelationships.
The XML structure is designed so that, although each node is normally in a separate file, any or all nodes can be combined into a single file if the need arises.
Artifact Nodes
Artifact nodes are the main type of node. Here is a sample document for an artifact that is a reel of tape. This document is in file ktul155.xml:
<collection xsi:schemaLocation="http://www.pcm.org/xml/collection.xsd collection.xsd">
  <artifact id="http://www.pcm.org/library/magtape7.xml"/>
  <artifact id="http://www.pcm.org/library/system.xml"/>
  <artifact id="http://www.pcm.org/library/ibm1410.xml"/>
  <artifact id="http://www.pcm.org/library/ktul155.xml" description="PR155 from Tulsa" date="1998-02-07T20:00:00Z">
    <note date="2003-05-16T16:00:00Z">
      This tape was borrowed and is not part of the collection.
    </note>
    <field name="label1" description="Label">
      FROM TULSA PROG# 1410-PR-155 FEAT 9026 IDENT Basic 7/800 V--M-- archive
    </field>
    <relate to="http://www.pcm.org/library/system.xml" how="class"/>
    <relate to="http://www.pcm.org/library/magtape7.xml" how="class"/>
    <relate to="http://www.pcm.org/library/ibm1410.xml" how="for"/>
  </artifact>
</collection>
Each artifact element (when filled out and not just a forward reference like the first three here) describes either a real artifact or a node in the classification hierarchy, which from the point of view of the database amounts to the same thing. Here we have a real artifact. The id attribute contains the formal URL identifying the artifact, of which only the last part between the '/' and '.xml' is interesting; the rest is boilerplate. The description attribute gives the short name of the artifact. If there is more to say about it there is a discussion element under the artifact element.
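As a rough illustration of the conventions just described, here is a minimal Python sketch that reads a cut-down version of the sample document, separates forward references from filled-out artifacts, and pulls the interesting part out of each id. The function names are hypothetical, and the test for a forward reference (an element with only an id and no children) is my assumption rather than a rule stated by the schema. The xsi:schemaLocation attribute is left out because the sample as printed does not declare the xsi prefix, and a namespace-aware parser would reject it.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Cut-down copy of the sample document (schemaLocation attribute omitted).
SAMPLE = """\
<collection>
  <artifact id="http://www.pcm.org/library/magtape7.xml"/>
  <artifact id="http://www.pcm.org/library/ktul155.xml"
            description="PR155 from Tulsa" date="1998-02-07T20:00:00Z">
    <relate to="http://www.pcm.org/library/magtape7.xml" how="class"/>
  </artifact>
</collection>
"""


def short_name(artifact_id):
    """Return the part of the id between the last '/' and '.xml'."""
    base = posixpath.basename(urlparse(artifact_id).path)
    return base[:-4] if base.endswith(".xml") else base


def is_forward_reference(artifact):
    """Assumption: a forward reference has an id and nothing else."""
    return len(artifact) == 0 and set(artifact.attrib) == {"id"}


root = ET.fromstring(SAMPLE)
for artifact in root.findall("artifact"):
    kind = "reference" if is_forward_reference(artifact) else "artifact"
    print(kind, short_name(artifact.get("id")), artifact.get("description"))
```

The loop prints one line per artifact element, giving its kind, its short name and its description (None for a forward reference).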
Some elements have a date attribute. This is always the time that the element was inserted or last modified, not any time or date intrinsic to the artifact. That would appear in a field element.
There can be several different kinds of elements under the artifact element. Field elements hold structured descriptive information, such as the label text above. They can also contain things like dimensions or provenance information. Fields are inherited down through the classification hierarchy, as for instance the label field is inherited from the top1media node through magtape and magtape7. (Some of the top nodes have strange names because links to nodes are presented in alphabetical order, but I want them to appear in a different order.)
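One way to picture field inheritance is as a walk up the class links, collecting field declarations along the way. The Python sketch below works on plain dictionaries rather than the XML files. The chain from top1media through magtape to magtape7 comes from the text above; the "density" field, the placement of "label1" on top1media, and the rule that the nearest ancestor wins are assumptions made only for illustration.

```python
# Parent ("class") links by short name. The top1media / magtape / magtape7
# chain is from the text; everything else here is illustrative.
PARENTS = {
    "ktul155": ["system", "magtape7"],
    "magtape7": ["magtape"],
    "magtape": ["top1media"],
    "top1media": [],
    "system": [],
}

# Fields declared directly on a node: field name -> description.
# "density" is a hypothetical field added to show a second level.
DECLARED = {
    "top1media": {"label1": "Label"},
    "magtape": {"density": "Recording density"},
}


def inherited_fields(node, parents=PARENTS, declared=DECLARED):
    """Collect the field declarations visible at `node`.

    Walks breadth-first up the class links, so a declaration on a
    nearer ancestor takes precedence over one further up (assumption).
    """
    result = {}
    seen = set()
    queue = [node]
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        for name, description in declared.get(current, {}).items():
            result.setdefault(name, description)
        queue.extend(parents.get(current, []))
    return result


print(inherited_fields("ktul155"))
```

The `seen` set keeps the walk from visiting a node twice when two parents share an ancestor.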
Note elements are for informal, unstructured information about an artifact. They are intended to be added, modified and deleted independently of the formal artifact data.
The classification hierarchy and other interconnecting structure of the database are contained in relate elements. Each relate element refers to the id of another artifact and says how it's connected. For the classification hierarchy, the how attribute is "class". All relate links point up to the node's parent. This way it's easy to add a new child without modifying the parent.
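Because every class link points upward, a parent's children are not recorded in the parent's own file; they have to be found by scanning the other documents. The following Python sketch builds that reverse index from a couple of in-memory documents. It is not the associative database my server uses, just a minimal stand-in. The second document, including its description and its link from magtape7 to magtape, is assumed for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

BASE = "http://www.pcm.org/library/"

# Two small documents; the second one is assumed for illustration.
DOCUMENTS = [
    f"""<collection>
      <artifact id="{BASE}ktul155.xml" description="PR155 from Tulsa">
        <relate to="{BASE}system.xml" how="class"/>
        <relate to="{BASE}magtape7.xml" how="class"/>
        <relate to="{BASE}ibm1410.xml" how="for"/>
      </artifact>
    </collection>""",
    f"""<collection>
      <artifact id="{BASE}magtape7.xml" description="7-track magnetic tape">
        <relate to="{BASE}magtape.xml" how="class"/>
      </artifact>
    </collection>""",
]


def build_children_index(documents, how="class"):
    """Map each parent id to the sorted ids of nodes that relate to it."""
    children = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        for artifact in root.iter("artifact"):
            for relate in artifact.findall("relate"):
                if relate.get("how") == how:
                    children[relate.get("to")].add(artifact.get("id"))
    return {parent: sorted(ids) for parent, ids in children.items()}


index = build_children_index(DOCUMENTS)
for parent, ids in index.items():
    print(parent, "<-", ids)
```

Adding a new tape under magtape7 means adding one new document; the index picks it up on the next scan and magtape7.xml itself is never edited.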
Other Nodes
There are several types of nodes used for presentation of artifact information.
A publish node describes how to present a file relating to or derived from an artifact. The two main uses are to present .pdf files made from scanning an artifact that is a manual or other physical document, and to present files made from reading an artifact that is a data storage medium, as for example a .bcd file from reading a 7-track magnetic tape. The publish node might be combined with elements such as scaninfo/scanprocess or datafile that hold information used to construct the data.
An image node has a similar function but has a radically different form. It describes an image of an artifact in different sizes (its rendering element). The different sizes are automatically used in different contexts on the web site. I think it's worth considering how to generalize the publish node so it can take over the role of the image node, as described later.
Onto the Web
The (experimental) web server uses an experimental associative database that holds the critical interconnecting information from all the XML documents. The library section of the web site consists of some Java code, much like a servlet, that inspects each URL it gets and figures out what to display.
Apart from the top page and a special "Whats New" page, each library page corresponds to an artifact or image node and the artifact id is part of its URL. The library code reads in the XML document and, for artifacts, displays all the information there. Then it looks in the database for other nodes interconnected with this one through their relate elements and displays appropriate links to them.
The "Whats New" page simply supplies links to the 100 or so artifacts with the most recent date attributes.
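The selection behind such a page is simple enough to sketch in a few lines of Python. This is not the server's Java code, only an illustration under two assumptions: that an artifact's "most recent date" is the latest date attribute found on the artifact element or anything under it, and that artifacts with no date at all (such as forward references) are skipped. The second document is made up for the example.

```python
import heapq
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

BASE = "http://www.pcm.org/library/"

DOCUMENTS = [
    f"""<collection>
      <artifact id="{BASE}magtape7.xml"/>
      <artifact id="{BASE}ktul155.xml" date="1998-02-07T20:00:00Z">
        <note date="2003-05-16T16:00:00Z">Borrowed tape.</note>
      </artifact>
    </collection>""",
    # Hypothetical second artifact, present only to give the sort some work.
    f"""<collection>
      <artifact id="{BASE}example1.xml" date="2001-01-01T00:00:00Z"/>
    </collection>""",
]


def parse_date(text):
    """Parse the UTC timestamps used in date attributes."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def latest_date(artifact):
    """Latest date on the artifact element or any element under it."""
    dates = [parse_date(e.get("date")) for e in artifact.iter() if e.get("date")]
    return max(dates) if dates else None


def whats_new(documents, limit=100):
    """Return up to `limit` artifact ids, most recently touched first."""
    dated = []
    for text in documents:
        for artifact in ET.fromstring(text).iter("artifact"):
            when = latest_date(artifact)
            if when is not None:
                dated.append((when, artifact.get("id")))
    return [artifact_id for _, artifact_id in heapq.nlargest(limit, dated)]


print(whats_new(DOCUMENTS))
```

Here ktul155 sorts ahead of the 2001 example because its note was added in 2003, even though the artifact element itself dates from 1998.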
It should be noted that the same (or a better) web site could be built from the same XML data using commonly available web development tools. There isn't supposed to be anything about the data that requires use of my experimental software.
Design Considerations
So why this structure? And why not just use a commercial museum database? After all, they've thought all this through, haven't they? Of course, the primary answer is that I wanted to. I enjoy designing and writing software, and I like learning about the problems that must be solved. But there are some potentially more useful reasons as well.
The American Association of Museums annual meeting was in Portland a couple of years ago, so I joined up and went to it. I was able to get a cursory look at some of the museum database software available on the expo floor, and was not enormously impressed compared to what I had already implemented at the time. Also I went to one talk where the speaker went into detail about how she was able to subvert the existing schema of her database in order to accomplish a slightly unusual task related to moving her museum's collection. From this I resolved to make my database a bit more open-ended so I wouldn't have to go through similar gyrations myself. This is one reason for the general purpose field element, instead of specific predefined elements. The other is that I know so little about museum database schema requirements that I couldn't predefine an adequate set if I wanted to.
I enjoy writing software more than editing HTML or XML, so I've tried to automate as much as possible. I've automated much of the post-processing associated with scanning manuals, reading tapes and preparing photos for the web, and I'm working on bar-code automated shelving to keep track of where everything is. All this is supposed to integrate with the main database so eventually I can scan something and a while later it just shows up on the web site. There are still some missing pieces but it's getting closer.
An important part of the design is its explicit support for presentation of an artifact and its derived or contained information. This information is at the heart of the issue with collecting software, so although the concept must be familiar to all attendees I'll fill the next section bloviating about it. Maybe some part of my perspective will not have occurred to everyone else already.
Embedded Information
In the world of the art museum a painting is an important kind of artifact. A painting is actually a system of artifacts of different importance. Most important is the painted canvas itself. Secondary artifacts are the frame and the framing materials. These may be of great historic and aesthetic interest but can be removed or replaced without changing the essential identity of the object. When on display the environment, position and especially lighting also come into play.
The only way to fully experience a painting is to view the original artifact properly displayed. This experience can include not only the view of the painted image but also an intangible connection, through the physical artifact, to the artist. However, it is also possible to experience a painting through a reproduction of the image, and a good reproduction can convey most of the complete experience, including the artist's intent. For centuries it's been common practice to reproduce fine art paintings in different ways, such as replicas for the home or cuts or photographs in art education books.
So the physical artifact of a painting contains as its most important aspect embedded information in the form of an image. Like other forms of information the image can be reproduced, transmitted and stored, with varying degrees of fidelity. Yet the information is not the artifact, and particularly in the art world the original artifact is always much more valuable than its embedded information.
Books are another kind of sometimes venerable artifact that contain embedded information in the form of text and graphics. Unlike paintings, books are rarely unique original artifacts. There are typically many copies of the first edition of a book, each a unique instance but very similar to all the others except for condition, and each carrying exactly the same embedded information. And then there are often additional editions such as mass market paperbacks with the same content or later revisions with slightly different content. Of these, only those instances that are first editions in good condition of a popular title are likely to have value as artifacts in themselves. For the rest, their value lies mainly in how well they carry and present their embedded information, for that is the primary purpose of a book.
The information in a book originated in a manuscript, physical or electronic, subjected to editing such that a published edition might well be a more authoritative source than any of the original materials. Like the image of a painting, the information from a book can be reproduced, copied, transmitted and stored physically or electronically, with varying degrees of fidelity as to font, layout, etc. Unlike a painting, this typically does not diminish the value of the information very much.
Software is something at a new point further along in the same continuum. Even more so than in a book, the information itself is the essential artifact. Unlike a painting or a book, the information that is software is also a device that can be made active by running it on an appropriate computer system.
Because of these similarities with long existing artifacts, we can look to museum treatment of paintings, books and other artifacts with embedded information for ideas on how to treat software. Conversely, some of the ideas we come up with for software might prove to be valuable new treatments for these older artifacts.
Presenting Embedded Information
I've tried to explicitly recognize the importance and characteristics of embedded information in the design of my database. The key to this is the publish node.
Each publish node represents a single rendering of one single extract of embedded information from an artifact. For example, an old tape might have two "load points", with different data following each. One publish node might present the data at the first load point in my ".bcd" format. Another publish node might present the data at the other load point, or might present the same data in the SIMH ".tap" format.
Like artifact nodes, publish nodes contain a relate element (with how="subject") that links them to their artifacts. In the future other relate elements could link them in other ways, for instance a set of renderings of related information could be linked to a publish node for automatic inclusion in a ZIP archive.
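To make the "one rendering of one extract" idea concrete, here is a small Python sketch that models publish nodes as plain records and groups them by extract for a given subject artifact. The record layout, the node names and the tape called "sometape" are all hypothetical; the actual publish node is an XML element whose details are not shown in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendering:
    """Stand-in for one publish node (hypothetical layout)."""
    publish_id: str   # short name of the publish node
    subject: str      # artifact it relates to with how="subject"
    load_point: int   # which extract of the tape this is
    file_format: str  # how that extract is rendered


# A made-up tape with two load points, the first rendered two ways.
RENDERINGS = [
    Rendering("sometape-1-bcd", "sometape", 1, ".bcd"),
    Rendering("sometape-2-bcd", "sometape", 2, ".bcd"),
    Rendering("sometape-1-tap", "sometape", 1, ".tap"),
    Rendering("othertape-1-bcd", "othertape", 1, ".bcd"),
]


def renderings_by_extract(renderings, subject):
    """Group the formats available for each extract of one artifact."""
    grouped = defaultdict(list)
    for rendering in renderings:
        if rendering.subject == subject:
            grouped[rendering.load_point].append(rendering.file_format)
    return dict(grouped)


print(renderings_by_extract(RENDERINGS, "sometape"))
```

A grouping like this is also the sort of thing a future relate link could drive, for instance to decide which renderings belong together in one ZIP archive.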
Summary
My online "attic" is intended to be a way to present the ongoing inventory of my collection and in particular to expose the embedded information in my artifacts. Right now this amounts to little more than making raw bits available, but I hope to expand its capabilities to make the information ever more accessible to a reasonably wide range of visitors.