Brett and I met on Thursday to talk about the project. The notable issue that came up is the degree to which LISinfo could borrow from a similar project I’m working on, Helios, a discovery layer we’re using at Drexel to make it easier to browse the DVDs in our catalog. It has also been used at some other places listed at the Google code page.
At the moment, the main difference between the projects is that Helios doesn’t have any record-editing function. Drexel maintains its records in III’s Millennium ILS and I import them into Helios. That model won’t work for LISinfo. We need to be able to edit records.
We have three options:
- Continue to develop Helios and LISinfo as separate, but similar, projects, swapping code where appropriate.
- Use Helios as the discovery layer, and develop an LISinfo catalog application that complements it.
- Rely entirely on Helios. Develop a full catalog application in Helios that serves LISinfo’s needs, but maintains enough flexibility for current Helios uses to continue.
The thing is, Helios already has the beginnings of a catalog app, but it’s
really just a per-record view of the index. That’s because Helios was
originally developed as an OPAC, so there are multiple
paradigms to consider.
The first paradigm is that of the OPAC, in which applications are split between the staff/admin and public. This is also the way most web projects are built. In this paradigm, an application or group of applications handles the search page, the results page, and the view of each record, while another application or group of applications handles the “behind the scenes” interfaces for editing records and altering the site. If LISinfo follows this paradigm, it should be fairly easy to make sure we provide a consistent look and feel across the site.
The second paradigm splits the applications between the discovery layer and the various resources the discovery layer indexes, such as the catalog. This is the search engine model, as well as the one employed by federated search applications. The main benefit is the loose coupling between the search/results application and the various applications that display
and maintain the indexed resources. It’s obviously less centralized, and considering that we plan to keep all of the records for LISinfo in one place, this paradigm might not be a good fit. However, it is the direction in which I’ve been pushing Helios.
If I limit the Helios project to the role of a discovery layer, it becomes one app. I agree with the Unix philosophy—each tool should do one thing well—but I’d rather make Helios more comprehensive and simplify the goal of each app within Helios. I think with some careful planning I can satisfy both paradigms and create something flexible enough to use for LISinfo as well as various library collections.
I think the best fit, then, would be to maintain the discovery app, but include
an optional record-level display within it. That’s three views (index, search,
record), all pulling information from the Solr index. It will serve the role
of the OPAC as well as the discovery layer. A separate app, named “cataloging,”
will handle the staff interface for the creation and maintenance of records.
The cataloging app will include views for both batch and individual editing of
records, and will either update the index directly or alert the discovery app
to do so. Each one should contain hooks for interoperability, but be able to
operate independently if needed, as if the cataloging app is just another
resource for the discovery layer to display.
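The hook arrangement described above can be sketched in a few lines of Python. Everything here is hypothetical, since neither app exists yet: the idea is just that the cataloging app fires a notification when a record is saved, the discovery app may register a listener to refresh its index, and if nothing is registered the cataloging app still works on its own.

```python
# A minimal sketch of the interoperability hooks between the
# cataloging app and the discovery app. All names are placeholders.

class DiscoveryHooks:
    """Registry of callbacks fired when the cataloging app changes a record.

    The discovery app can register a callback here to refresh its Solr
    index; if nothing registers, the cataloging app still operates
    independently, matching the "loose coupling" goal above.
    """

    def __init__(self):
        self._on_record_saved = []

    def register(self, callback):
        self._on_record_saved.append(callback)

    def record_saved(self, record):
        for callback in self._on_record_saved:
            callback(record)


hooks = DiscoveryHooks()


def save_record(record):
    """Cataloging-app save: persist the record, then notify any listeners."""
    # ... write the record to the store here ...
    hooks.record_saved(record)
```

In Django terms this would probably just be a signal, but the shape of the contract is the same either way: the two apps only share the notification, not each other's internals.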
Library catalogs are generally inventories of media types: “Hello, patron, this is our book catalog.” Of course, these days what we actually say is something more along the lines of, “This is our book catalog, into which we’ve shoehorned some audio or video recordings… and, oh yes, you can find information on serials, as long as you don’t want information about the articles that actually appear within the serials.”
To do proper research in a particular area, you have to look in lots of places. In general, these places are organized by media type, not by information domain. That’s why we copy each other’s subject guides, and it’s even how we organize them: look here for books that cover this information domain, look here and here and here and here and here and here and here for articles, and look here for databases, and look here for news, etc.
So far, we’ve managed to extend this sort of thinking to the web: go here if you want to find information on conferences, go here if you want to find blogs, look here and here and here if you want information on journals (and god help you if you want to do a comprehensive literature review).
LISinfo will catalog an information domain rather than a media type or types. It will allow us to dig deeply into any area of LIS quickly, easily, and comprehensively. Want to know everything about Philadelphia, or some aspect of Philadelphia, as it relates to LIS? Want to know everything that’s been written on a subtopic within LIS that you find interesting? We’ll have it. Not immediately, but sooner than you think.
So you hate the idea. Or think you’ve seen it in a million other places in a million other forms. Or think it sounds nice but can’t possibly work. And you most certainly won’t be impressed with the first version of LISinfo. Apparently, these are the reactions that Paul Graham always gets as well. As he wrote recently in “Six Principles for Making New Things”:
I like to find (a) simple solutions (b) to overlooked problems (c) that actually need to be solved, and (d) deliver them as informally as possible, (e) starting with a very crude version 1, then (f) iterating rapidly.
When I first laid out these principles explicitly, I noticed something striking: this is practically a recipe for generating a contemptuous initial reaction. Though simple solutions are better, they don’t seem as impressive as complex ones. Overlooked problems are by definition problems that most people think don’t matter. Delivering solutions in an informal way means that instead of judging something by the way it’s presented, people have to actually understand it, which is more work. And starting with a crude version 1 means your initial effort is always small and incomplete.
At first, LISinfo will be small and incomplete and you’ll look at it and think we’ve been wasting our time developing a solution to something that no one sees as a problem. On top of that, we seem to be depending on the kindness of strangers to populate and maintain LISinfo’s database.
Of course, the database isn’t LISinfo’s or ours, it’s everyone’s. And those strangers will soon realize that their work isn’t merely altruistic. Before long, once the data is there, LISinfo will be incredibly useful to those of us in the LIS field. And then it will be up to all of us to share the idea of the information domain catalog with people in other fields.
Really, it’s a simple idea. Simple enough, we hope, that we can’t mess it up.
I’ve been working on the flat file record store. The idea is that we’ll have a bunch of JSON records that will be indexed with Solr. The files sit in groups of up to a thousand per directory, in up to a thousand directories, two levels deep. In more graphical terms:
    records/000/000/000   <-- record with id# 000000000
    records/000/000/001   <-- record with id# 000000001
    ...
    records/001/204/586   <-- record with id# 001204586
I really have no idea how well this is optimized for file I/O, but I appreciate the simplicity of it. I’m excited to see the performance with a few million records. Comments about alternatives or potential pitfalls are welcome.
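The id-to-path mapping above is simple enough to sketch in a few lines: a nine-digit record id split into three three-digit segments, one segment per directory level. The function name and root directory are placeholders of mine, not existing code.

```python
# Sketch of the flat-file layout: zero-pad the record id to nine
# digits, then split it into three-digit segments, giving up to a
# thousand files per directory, up to a thousand directories, two
# levels deep.

import os


def record_path(record_id, root="records"):
    """Map an integer record id to its flat-file path."""
    padded = f"{record_id:09d}"          # e.g. 1204586 -> "001204586"
    return os.path.join(root, padded[0:3], padded[3:6], padded[6:9])
```

For example, `record_path(1204586)` yields `records/001/204/586`, matching the diagram above.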
The most interesting thing about the LISInfo project, for us as its creators and, we hope, for its users, is that it’s big. Really, really big. Our goal for the LISInfo project is to organize a catalog that includes access points for anything a reasonable person might want to know about library and information science. That’s a lot of metadata. Which means that not only will the data store be big, but the culture around LISInfo is going to have to be big because it’s going to take a lot of volunteers to steward the data.
We thought at first we could build our project in Flamenco, Marti Hearst’s very cool faceted interface. But after using Flamenco for our first prototype, we learned that this sort of thing, a multi-editor database designed to be editable via the web, wasn’t what she had in mind when she created Flamenco.
Our next move was to look at a standard Django installation, but that proved limiting as well. PostgreSQL is wonderful, and SQL is capable of mapping the diverse set of objects (e.g. faculty members at accredited LIS programs, journals, scholarships, conferences) and relationships (e.g. “will be speaking at” or “is the publisher of”) we have in mind, but it isn’t pretty. Attempts to create facets across all those relationships would result in some long and slow SQL queries.
So we’re going with Solr. More specifically, we’re creating an interface between Django, the framework we’re using to build our user and administrative interfaces, and Solr, which will make finding objects in LISInfo faster and more efficient than any other model we’ve found so far. We’re still not sure how the database/index hybrid will play out, but investigating that intersection should be instructive.
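As a rough sketch of the Django-to-Solr side, here is how a record might be flattened into a document for Solr’s JSON update handler. The field names, the multivalued subject facet, and the URL are illustrative assumptions on my part, not a settled LISInfo schema.

```python
# Sketch of flattening a record into a Solr-ready document and
# building the body for a POST to Solr's update handler. Field
# names and the URL are assumptions, not a fixed schema.

import json

SOLR_UPDATE_URL = "http://localhost:8983/solr/lisinfo/update"  # assumed


def to_solr_doc(record):
    """Flatten a record dict into a document for the Solr index."""
    return {
        "id": record["id"],
        "title": record.get("title", ""),
        "type": record.get("type", ""),         # e.g. "journal", "conference"
        "subject": record.get("subjects", []),  # multivalued facet field
    }


def update_payload(records):
    """JSON body for a POST to the Solr update handler."""
    return json.dumps([to_solr_doc(r) for r in records])
```

The faceting we struggled to express in SQL then becomes a one-line Solr query parameter (`facet.field=subject`) instead of a pile of joins and GROUP BYs.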
It appears we have to choose a license in order to make our work on LISInfo useful. Without one, contributors won’t know what rights they have. Nor will end-users of our code, data, or documentation.
What we want is a license that:
- We understand.
- Makes everything we produce as free as possible.
- Has been vetted by an organization we trust like OSI or FSF.
- Is in common use, ideally by several of the software packages we’re using for LISInfo.
- Is recognized by as much of the world as possible.
- Covers everything we produce, including code, documentation, and databases. A set of licenses covering different parts would also be fine, as long as we understand which license applies to each aspect of the project and can explain those licenses to everyone who contributes or who wants access to our code, text, data, etc.
We thought we would just place everything we did into the public domain. That seemed simple enough, until we learned that public domain is an American/Commonwealth concept. So much for that idea.
Looking over our key building blocks, Django and PostgreSQL are BSD-licensed. Our blog software, WordPress, uses GPL, as does advisory board member Joshua Ferraro’s Koha. Another board member, DeWitt Clinton, uses Creative Commons licenses for his projects, such as OpenSearch and Delancey. Solr, which looks like it will be a key component in our software, uses Apache. One of the freest major open source licenses is X11 (popularly known as the MIT license), while there’s a good case to be made for the crass but clear WTFPL.
And all that’s just for the code. For the documentation and the data itself, we have to consider Creative Commons, the GPL, and the documentation licenses supported by GNU (Wikipedia’s choice) and FreeBSD. And kudos to Talis for supporting the Open Database License, so we’re going to take a look at that as well.
We haven’t decided anything, but we will soon. Suggestions on this one would really be appreciated.
Welcome to the LISInfo Log, which should serve as a record of sorts for the thinking going on behind our development of LISInfo. I’ll be posting more soon on my staggering steps into an approach I believe will be a hybrid of database and search engine, tables and flat files.
For now all I’ll say is best of luck to us.