Tuesday, 23 November 2010

Analog tools


Sometimes, you just can’t beat an olde-worlde paper notebook. Highly portable, great screen resolution, excellent, intuitive user interface and infinite battery life.

Only problem: it’s hard to back up. On the other hand, it’ll still be readable in 200 years. Which is more than can be said for any of my digital data.

Saturday, 20 November 2010

Digital Humanities

The New York Times has an interesting piece about the renewal of interest in the Digital Humanities.

The next big idea in language, history and the arts? Data.

Members of a new generation of digitally savvy humanists argue it is time to stop looking for inspiration in the next political or philosophical “ism” and start exploring how technology is changing our understanding of the liberal arts. This latest frontier is about method, they say, using powerful technologies and vast stores of digitized materials that previous humanities scholars did not have.

The article goes on to describe a few interesting projects. For example:

In Europe 10 nations have embarked on a large-scale project, beginning in March, that plans to digitize arts and humanities data. Last summer Google awarded $1 million to professors doing digital humanities research, and last year the National Endowment for the Humanities spent $2 million on digital projects.

One of the endowment’s grantees is Dan Edelstein, an associate professor of French and Italian at Stanford University who is charting the flow of ideas during the Enlightenment. The era’s great thinkers — Locke, Newton, Voltaire — exchanged tens of thousands of letters; Voltaire alone wrote more than 18,000.

“You could form an impressionistic sense of the shape and content of a correspondence, but no one could really know the whole picture,” said Mr. Edelstein, who, along with collaborators at Stanford and Oxford University in England, is using a geographic information system to trace the letters’ journeys.

He continued: “Where were these networks going? Did they actually have the breadth that people would often boast about, or were they functioning in a different way? We’re able to ask new questions.”

One surprising revelation of the Mapping the Republic of Letters project was the paucity of exchanges between Paris and London, Mr. Edelstein said. The common narrative is that the Enlightenment started in England and spread to the rest of Europe. “You would think if England was this fountainhead of freedom and religious tolerance,” he said, “there would have been greater continuing interest there than what our correspondence map shows us.”

Saturday, 13 November 2010

Hacking the Library -- ShelfLife@Harvard

What is ShelfLife?
ShelfLife is a web application that uses what libraries know (about books, usage and comments) to allow researchers and scholars to access the riches of Harvard’s collections through a simple search.

Researchers will be able to access, read about, and comment on works using common social network features. ShelfLife will bring Harvard results to the forefront of the research process, allowing users to easily access and explore our vast collections.
What makes it unique?

ShelfLife is designed to help you find the next book. Each search will retrieve a unique web page providing key information about the item searched for, including basic information, fluid links to related neighborhoods, and analytic data about use, all presented in a clean graphical format with intuitive navigation designed with discoverability in mind.

From the Harvard Library Innovation Lab. The site provides no information about ShelfLife beyond the above, but Ethan Zuckerman, who's a Berkman Fellow at the moment, has a useful blog post reporting a presentation by David Weinberger and Kim Dulin, who co-direct the project.

Libraries tend to be very knowledgeable about what they hold in their collections. But they’re much less good about helping people discover that information. There are few systems like Amazon or Netflix recommendations that help scholars and researchers discover the good stuff within libraries. Dulin argues that librarians have been pretty passive in the face of new technology – they’ve purchased fairly primitive systems and had to buy back their content from the companies who build those systems.

Researchers tend to start with Google, Dulin tells us. They might move to Google Books or Amazon to find out more about a specific book. And perhaps a library will come into play if the book can’t be downloaded or purchased inexpensively. Libraries would like to move to the front of that process, rather than sitting passively at the end. And lots of libraries are trying to take on this challenge – new librarians often come out of school with skills in web design and application development.

The Lab hopes to bring fellows into the process, much as Berkman does. It works to build software, often proof of concept software. And innovation happens on open systems and standards, so libraries and other partners can adopt the technology they’re developing.

Two major projects have occupied much of the Lab’s time – Library Cloud and ShelfLife, both of which Weinberger will demo today. There are smaller applications under development as well. Stackview allows the visualization of library stacks. Check Out the Checkouts lets us see what groups of users are borrowing – what are graduate divinity students reading, for instance. And a number of projects are exploring Twitter to share acquisitions, checkouts and returns.

Weinberger explains that ShelfLife is built atop Library Cloud, a server that handles the metadata of multiple libraries and other educational institutions and makes that metadata available via API requests and “data dumps”. Making this data available, Weinberger hopes, will inspire new applications, including ones we can’t even imagine. ShelfLife is one possible application that could live atop Library Cloud. Other applications could include recommendation systems, perhaps customized for different populations (experts versus average users, for instance).

Turns out ShelfLife is in a pre-Alpha state of development. The metaphor behind it is the "neighbourhood" -- i.e. the clusters that a given book might sit within.
We see a search for “a pattern language”, referring to Christopher Alexander’s influential book on architecture and urban design. We see a results page that includes a new factor – a score that indicates how appropriate a title is for the search. We can choose any result and we’ll be brought into “stack view”, where we can see virtual books on a shelf as they are actually sequenced on the physical shelf. Paul explains that it’s actually much more powerful than that – many books at Harvard are in a depository and never see the light of a shelf. And many collections have their own special indices – the virtual shelf allows a mix of the Library of Congress categories with other catalogs.

The system uses a metric called “shelfrank” to determine how the community has interacted with a specific book. The score is an aggregate of circulation information for undergraduates, graduates and faculty, information on whether the book has been assigned for a class, placed on reserve, put on recall, etc. That information exists in Library Cloud as a dump from Harvard’s HOLLIS catalog system – in the future, the system might operate using a weekly refresh of circulation data. The algorithm is pretty arbitrary at this point – it’s more a provocation for discussion than a settled algorithm.
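As a sketch of how such an aggregate might work, here is a minimal Python version of a shelfrank-style score. The event types and weights below are invented for illustration; the post only says that the real (unpublished, deliberately provisional) algorithm aggregates circulation, course-assignment, reserve and recall data from the HOLLIS dump.

```python
# Hypothetical weights -- not Harvard's actual values, which the post
# describes as "pretty arbitrary" and a provocation for discussion.
WEIGHTS = {
    "undergrad_checkout": 1,
    "grad_checkout": 2,
    "faculty_checkout": 3,
    "course_assignment": 5,
    "reserve": 4,
    "recall": 2,
}

def shelfrank(events):
    """Aggregate a book's interaction counts into a single score.

    `events` maps an event type to its count, as might come from a
    periodic dump of a catalog system's circulation data.
    """
    return sum(WEIGHTS[kind] * count
               for kind, count in events.items()
               if kind in WEIGHTS)

print(shelfrank({"undergrad_checkout": 40, "course_assignment": 2}))  # 50
```

The point of keeping the weights in a single table is that, as the post notes, the algorithm is a moving target: tuning it is a matter of editing the table, not the code.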

Ethan reports some of the Q&A and generally does a great job of writing up the event. His post is worth reading in full.

A systems view of digital preservation

The longer I've been around, the more concerned I become about long-term data loss -- in the archival sense. What are the chances that the digital record of our current period will still be accessible in 300 years' time? The honest answer is that we don't know. And my guess is that it won't be, unless we take pretty rigorous steps to ensure it. Otherwise it's posterity be damned.

It's a big mistake to think about this as a technical problem -- to regard it as a matter of bit-rot, digital media and formats. If anything, the technical aspects are the trivial aspects of the problem. The really hard questions are institutional: how can we ensure that there are organisations in place in 300 years that will be capable of taking responsibility for keeping the archive intact, safe and accessible?

Aaron Swartz has written a really thoughtful blog post about this in which he addresses both the technical and institutional aspects. About the latter, he has this to say:

Recall that we have at least three sites in three political jurisdictions. Each site should be operated by an independent organization in that political jurisdiction. Each board should be governed by respected community members with an interest in preservation. Each board should have at least five seats and move quickly to fill any vacancies. An engineer would supervise the systems, an executive director would supervise the engineer, the board would supervise the executive director, and the public would supervise the board.

There are some basic fixed costs for operating such a system. One should calculate the high-end estimate for such costs along with high-end estimates of their growth rate and low-end estimates of the riskless interest rate and set up an endowment in that amount. The endowment would be distributed evenly to each board, which would invest it in riskless securities (probably in banks whose deposits are insured by their political systems).

Whenever someone wants to add something to the collection, you use the same procedure to figure out what to charge them, calculating the high-end cost of maintaining that much more data, and add that fee to the endowments (split evenly as before).

What would the rough cost of such a system be? Perhaps the board and other basic administrative functions would cost $100,000 a year, and the same for an executive director and an engineer. That would be $300,000 a year. Assuming a riskless real interest rate of 1%, a perpetuity for that amount would cost $30 million. Thus the cost for three such institutions would be around $100 million. Expensive, but not unmanageable. (For comparison, the Internet Archive has an annual budget of $10-15M, so this whole project could be funded until the end of time for about what 6-10 years of the Archive costs.)

Storage costs are trickier because the cost of storage and so on falls so rapidly, but a very conservative estimate would be around $2000 a gigabyte. Again, expensive but not unmanageable. For the price of a laptop, you could have a gigabyte of data preserved for perpetuity.

These are both very high-end estimates. I imagine that were someone to try operating such a system it would quickly become apparent that it could be done for much less. Indeed, I suspect a Mad Archivist could set up such a system using only hobbyist levels of money. You can recruit board members in your free time, setting up the paperwork would be a little annoying but not too expensive, and to get started you’d just need three servers. (I’ll volunteer to write the Python code.) You could then build up the endowment through the interest money left over after your lower-than-expected annual costs. (If annual interest payments ever got truly excessive, the money could go to reducing the accession costs for new material.)

Any Mad Archivists around?

Worth reading in full.

LATER: Dan Gillmor has been attending a symposium at the Library of Congress about preserving user-generated content, and has written a thoughtful piece on Salon.com about it.

The reason for libraries and archives like the Library of Congress is simple: We need a record of who we are and what we've said in the public sphere. We build on what we've learned; without understanding the past we can't help but screw up our future.

It was easier for these archiving institutions when media consisted of a relatively small number of publications and, more recently, broadcasts. They've always had to make choices, but the volume of digital material is now so enormous, and expanding at a staggering rate, that it won't be feasible, if it ever really was, for institutions like this to find, much less collect, all the relevant data.

Meanwhile, those of us creating our own media are wondering what will happen to it. We already know we can't fully rely on technology companies to preserve our data when we create it on their sites. Just keeping backups of what we create can be difficult enough. Ensuring that it'll remain in the public sphere -- assuming we want it to remain there -- is practically impossible.

Dan links to another thoughtful piece, this time by Dave Winer. Like Aaron Swartz, Dave is concerned not just with the technological aspects of the problem, but also with the institutional side. Here are his bullet-points:

1. I want my content to be just like most of the rest of the content on the net. That way any tools created to preserve other people's stuff will apply to mine.

2. We need long-lived organizations to take part in a system we create to allow people to future-safe their content. Examples include major universities, the US government, insurance companies. The last place we should turn is the tech industry, where entities are decidedly not long-lived. This is probably not a domain for entrepreneurship.

3. If you can afford to pay to future-safe your content, you should. An endowment is the result, generating annuities that keep the archive running.

4. Rather than converting content, it would be better if it was initially created in future-safe form. That way the professor's archive would already be preserved, from the moment he or she presses Save.

5. The format must be factored for simplicity. Our descendants are going to have to understand it. Let's not embarrass ourselves, or cause them to give up.

6. The format should probably be static HTML.

7. ??

Sunday, 7 November 2010

Put not your faith in cloud services: they may go away

From John Dvorak:
I have complained about the fly-by-night nature of these companies for years, but my concern now seems misplaced. I was concerned about operations that you depend on for deep cloud services. This means complex programs running on the cloud with no real alternative. Over time, I've tended to see these companies as more stable than the "Use our free service. You won't regret it!" model.

I was taken to task by numerous vendors who kept telling me that I was full of crap, because cloud services are professionally managed, and nobody could do the job—whatever the job was—better than a room of pros. With the cloud, the pros would also keep the data safe.

Yeah, until they were all laid off, and the service shut down!

Now here's the problem I am experiencing second-hand. The audio podcast I do with Adam Curry, the No Agenda Show (Google it), has been using Drop.io to store podcast album cover images for convenience. They will all be destroyed, as well as the accumulation of links, tips, curiosities, and other valuable information, in the next few weeks.

Looking back on the idea of using this service, I didn't fully consider the ramifications of its discontinuance, despite my skepticism about cloud services in general. You know, this was just a lot of weird stuff thrown into a bin. But once it was discontinued, it became apparent what you're left with: dead links.