Museum Documentation – Moving from the Closed to the Open World


The Open World

As cultural heritage organisations increase their engagement with the Web and the ‘open’ digital world, it is becoming increasingly important that they don’t simply apply the methods and practices of their internal, closed world. This is particularly important for the documentation of cultural objects, which must change radically if museums are to become valued open world digital organisations.

Current collection management systems are based on standards and techniques designed for a closed world environment. They record information in ways that, when combined with the knowledge of internal curators and experts, are useful for internal purposes. However, the technical transfer or publishing of this data to the Web effectively creates a flat, linear resource that is separated from this internal knowledge, significantly limiting its uses and value.

Using knowledge representation methods that attempt to transfer some of the missing context and semantics enhances the data considerably, but these methods (Semantic Web ontologies) could provide a far better representation if the original method of documentation were not so affected by the closed world mindset. However, in terms of existing documentation this is the legacy that we have, and no one involved in defining past collection digitisation anticipated or understood the potential that open world environments might provide for collection data. Revisiting that documentation raises a number of issues.
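To make the contrast concrete, here is a minimal sketch in Python (using rdflib) of the difference between a flat exported record and an event-based CIDOC CRM representation. Everything here is illustrative: the URIs, identifiers and data values are hypothetical, and a real mapping would be far richer.

```python
# A minimal sketch (assuming hypothetical URIs and data) contrasting a flat
# "closed world" record with an event-based CIDOC CRM representation.
from rdflib import Graph, Literal, Namespace, RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/")  # hypothetical institutional namespace

g = Graph()
g.bind("crm", CRM)

# The flat record a collection system might export:
#   object_id: 1234 | maker: "J. Smith" | production_date: "1850"
# The same information as a CRM production event. The maker and the date
# stay attached to the event that connects them, so the statements remain
# interpretable once separated from the originating system.
obj, production, maker, ts = EX.obj1234, EX.prod1234, EX.smith, EX.ts1234
g.add((obj, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((production, RDF.type, CRM.E12_Production))
g.add((production, CRM.P108_has_produced, obj))
g.add((production, CRM.P14_carried_out_by, maker))
g.add((production, CRM["P4_has_time-span"], ts))
g.add((ts, CRM.P82_at_some_time_within, Literal("1850")))

print(g.serialize(format="turtle"))
```

The flat version answers only the questions its fields anticipated; the event-based version carries the context that open world consumers would otherwise have to guess at.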

In new digitisation projects, however, these closed world mindsets no longer have to be applied. Yet just as much of the cultural heritage world has so far failed to realise the potential of the Web beyond electronic publishing for human consumption, replicating what it did with hard-copy publishing, we also seem intent on continuing with the same type of closed world documentation even though we know that open world requirements and benefits are different. Even for special projects we assume that we need to use the same approach and documentation standards (albeit with different tools) that we use with our internal collection system – a misplaced assumption that new digitisation must conform to legacy data and systems.

This is a big mistake. When we look at new digitisation we need to use approaches that enhance the possibilities of the data, not give it the same ‘closed world’ limitations. We need new approaches to documentation that are not based on the premise of creating an internal inventory catalogue, but rather ones that directly embed more of the experts’ knowledge (curators, archivists, librarians, academics) into the data and therefore provide a richer source for knowledge representation methods that can benefit a wider range of users – including cultural institutions themselves.

Museum curators need to understand these new possibilities, take the initiative and insist that documentation and technical departments still working with legacy closed world standards and approaches do not continue to limit the possibilities of new data. Ultimately, the way that we document objects in museums needs to change to reflect the fact that we no longer digitise simply to keep an internal record, but to provide a valuable, rich and engaging resource for a range of different uses. Without recognising these necessary changes we risk creating a far larger legacy of data that we will inevitably need to revisit.

Dominic Oldman


How should we treat data? Like we were Humanists


It strikes me how we (digital humanists) have a very different relationship with structured data compared to the one we have with text and literature. This seems to me to be reflected in the way that we treat data. While many different initiatives around the world attempt to bring data from cultural organisations together, we seem intent on accepting a narrow view of the possibilities of data, computers and the interaction of people, and as a result are happy to ignore the benefits of capturing the context (and meaning) attached to data by the experts (people) who produced it and who continually update and develop it. If humanist researchers digitise a book to learn more about it, isn’t the objective to discover more, to uncover hidden relationships and meanings and make connections with other evidence that we have? Do we seek to exclude the elements of it that would give us this insight and throw them away? If not, then why do we accept this situation with cultural data?

Many cultural information systems were designed as closed systems, to be used internally in concert with the knowledge of the institution and its experts. The original data schemas were often produced to create a functional inventory or reference, and as an internal system they offer a valuable resource – but they are used in combination with other internal knowledge about the data, built up over time. If you separate data from its institutional knowledge and context then you lose this essential part of the overall ‘information system’. This is why representing data should not be just a technical process. It should involve and add institutional knowledge, to ensure that the data carries with it as much of this additional and valuable local meaning as possible. Data providers, the institutions themselves, could be providing data that is far more expressive and far more likely to help people (researchers, teachers, ‘the public’ and the institutions themselves) understand their relationship with the past – the type of representation that we take for granted when working on digital literature projects.

“…what would we be without memory? We would not be capable of ordering even the simplest thoughts, the most sensitive heart would lose the ability to show affection, our existence would be a mere never-ending chain of meaningless moments, and there would not be the faintest trace of a past.” – W. G. (Max) Sebald, The Rings of Saturn

Let’s not make data meaningless and technical, devoid of memory and perspective. Let’s treat it in such a way that it can also evoke meaningful and long-lasting memories, and let’s allow it to make connections between different memories (perhaps ones separated by time and place), many of which have long since been forgotten and locked away in our knowledge/memory silos. Let’s use data to produce powerful narratives about history – like we do with literature. Let’s treat data like we were humanists.

For a more formal version of this blog see: http://www.dlib.org/dlib/july14/oldman/07oldman.html 


The Costs of Cultural Heritage Data Services: The CIDOC CRM or Aggregator formats?

Martin Doerr (Research Director at the Information Systems Laboratory and Head of the Centre for Cultural Informatics, FORTH)
Dominic Oldman (Principal Investigator of ResearchSpace, Deputy Head IS, British Museum)

June 2013

Many larger cultural institutions are gradually increasing their engagement with the Internet and contributing to the growing provision of integrated and collaborative data services. This occurs in parallel with the emerging, so-called aggregation services, which seemingly strive to achieve the same goal. On closer inspection, however, there are quite fundamental differences that produce very different outcomes.

Traditional knowledge production occurred in an author’s private space or a lab with local records, field research notwithstanding. This space or lab may be part of an institution such as a museum. The author (scholar or scientist) would publish results, and by making content accessible it would then be collected by libraries. The author ultimately knows how to interpret the statements in his/her publication and to relate them to the reality referred to in the publication, whether from the field, from a lab or from a collection. Many authors are also curators of knowledge and things.

The librarian would not know this context and would not be a specialist in the respective field, and therefore must not alter the content in any way. However, (s)he would integrate the literature under common dominant generic concepts and references, such as “Shakespeare studies”, and preserve the content.

In the current cultural-historical knowledge life-cycle, we may distinguish three levels of stewardship of knowledge: (1) the curator or academic; (2) the disciplinary institution (such as the Smithsonian, the British Museum or smaller cultural heritage bodies); and (3) the discipline-neutral aggregator (such as Europeana or IMLS-DCC). Level (2) typically acts as “provider” to the “aggregator”.

Obviously, the highest level can make the fewest assumptions about common concepts, in particular a data model, in order to integrate content. Therefore, it can offer services only for very general relationships in the provided content. On the other hand, questions needing such a global level of knowledge will be equally generic. Therefore, the challenge is NOT to find the most common fields in the provider schemata (“core fields”), but the most relevant generalizations (such as “refers to”), avoiding overgeneralizations (such as “has date”). These generalizations are for accessing content, but should NOT be confused with the demands of documenting knowledge. At that level some dozens of generic properties may be effective.
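To illustrate querying by generalization rather than by core field, the sketch below (Python with rdflib; the provider-side property, namespaces and data are all hypothetical) declares a provider-specific property as a specialization of the generic CRM property “refers to” (P67), so that one generic query answers across heterogeneous provider schemata.

```python
# A sketch of querying at the level of a generalization rather than a
# fixed "core field". The ex: namespace and its data are hypothetical.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix crm:  <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# A provider-specific property declared as a specialization of the
# generic CRM property "refers to".
ex:depicts rdfs:subPropertyOf crm:P67_refers_to .
ex:print42 ex:depicts ex:battle_of_waterloo .
""", format="turtle")

# The property path follows the subPropertyOf declarations, so the generic
# question "what refers to what?" also finds the ex:depicts statements.
results = g.query("""
PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?subject WHERE {
  ?p rdfs:subPropertyOf* crm:P67_refers_to .
  ?thing ?p ?subject .
}
""")
for thing, subject in results:
    print(thing, "refers to", subject)
```

The generalization is computed from declarations the provider already makes; no field of the provider’s own schema has to be flattened or discarded to answer the generic question.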

The preoccupation of providers and aggregators with a common set of fields means that they support only rudimentary connections between the datasets they collect, and as a result reduce researchers’ ability to determine where the most relevant knowledge may be located. As with the library, the aggregator’s infrastructure can only support views of the data (search interfaces) that reflect its own limited knowledge, because the data arrives with little or no context and with over-generalized cross-correlations (“see also”, “relation”, “coverage”).

The common aggregation process itself strips context away from the data, creating silos within the aggregator’s repository. Without adequate contextual information, searching becomes increasingly inadequate the larger the aggregation becomes. This limitation is passed on through any Application Programming Interfaces that the aggregator offers. Aggregators are slowly beginning to understand that metadata is an important form of content, and not only a means to query according to current technical constraints. Some aggregators, such as the German Digital Library, store and return the rich “original metadata” received from providers and derive indexing data at the aggregator side, rather than asking providers to strip down their data.

The institution actually curating content must document it so that it will not only be found, but understood in the future. It therefore needs an adequate representation [1] of the context objects come from and of their meaning. This representation already has some disciplinary focus, and ultimately allows for integrating the more specialized author knowledge or lab data. For instance, chronological data curves from a carbon dating (C14) lab should be integrated at the museum level (2) by exact reference to the excavation event and records, but at the aggregator level (3) may be described just by a creation date.

The current practice of provider institutions manually normalizing their data directly to aggregator formats, at a cost of millions of pounds, dollars or euros, appears to be an unbelievable waste of money and knowledge. The cost of doing so far exceeds the cost of software of whatever sophistication. It appears much more prudent to normalize data at an institutional level to an adequate representation, from which the generic properties of a global aggregator service can be produced automatically, rather than producing, in advance of the aggregation services, another huge set of simplified data for manual integration.

This is precisely the relationship between the CRM and aggregation formats like the EDM. The EDM is the minimal common generalization at the aggregator level, a form in which to index data at a first level. The CRM is a container, open for specialization, for data about cultural-historical contexts and objects. The CRM is not a format prescription. Concepts of the CRM are used as needed when respective data appear on the provider side. There is no notion of any mandatory field. Each department can select what it regards as mandatory for its own purpose, and even specialize further, without losing the capacity for consistent global querying by CRM concepts. CRM data can automatically be transformed to other data formats, but even quite complex data in a CRM-compatible form can effectively be queried with quite simple terms [3].
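As a sketch of what “produced automatically” might look like, the Python function below (using rdflib; the choice of Dublin Core elements on the output side is a stand-in for an aggregator’s flat fields, not the actual EDM specification) collapses CRM production events into the kind of creator/date fields an aggregator indexes.

```python
# A sketch (assumptions: rdflib; Dublin Core as a stand-in for an
# aggregator's flat fields) of mechanically deriving aggregator-style
# records from richer CRM data.
from rdflib import Graph, Namespace

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
DC = Namespace("http://purl.org/dc/elements/1.1/")

def derive_core_fields(crm_graph: Graph) -> Graph:
    """Collapse CRM production events into flat creator/date statements."""
    flat = Graph()
    flat.bind("dc", DC)
    for production, obj in crm_graph.subject_objects(CRM.P108_has_produced):
        # Whoever carried out the production becomes the flat "creator".
        for actor in crm_graph.objects(production, CRM.P14_carried_out_by):
            flat.add((obj, DC.creator, actor))
        # The production's time-span becomes the flat "date".
        for ts in crm_graph.objects(production, CRM["P4_has_time-span"]):
            for date in crm_graph.objects(ts, CRM.P82_at_some_time_within):
                flat.add((obj, DC.date, date))
    return flat
```

The point is the direction of the arrow: the rich event-based statements survive at the institutional level, and the simplified record is a cheap, repeatable derivation rather than a second, hand-made dataset.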

Similarly, institutions may revise their data formats such that the more generic CRM concepts can automatically be produced from them, i.e., make their formats specializations of the CRM to the degree this is needed for more global questions. For instance, the features of the detailed curve of a C14 measurement are not a subject for a query at an institutional level. Researchers would rather query to retrieve the curve as a whole.

The British Museum understands this fundamental distinction and therefore understands the different risks and costs. This means both the long-term financial costs of providing data services, important to organizations with scarce resources, and the cost to cultural heritage knowledge communities and to society in general. As a consequence, they publish using the CRM standard. They also realize that data in the richer CRM format is much more likely to be comprehensible in the future than data in “core metadata” form.

Summarizing, we regard publishing and providing information in a CRM-compatible form [2] at the institutional or disciplinary level as much more effective in terms of research utility (and the benefits of this research to other educational and engagement activities). The long-term costs are reduced even with further specializations of such a form, and the costs of secondary transformation algorithms to aggregation formats like the EDM are marginal.

Dominic Oldman


[1] Smith, B. (2003). Ontology. In L. Floridi (Ed.), The Blackwell Guide to the Philosophy of Computing and Information (pp. 155–166). Oxford: Blackwell.

[2] Crofts, N., Doerr, M., Gill, T., Stead, S., & Stiff, M. (Eds.) (December 2011). Definition of the CIDOC Conceptual Reference Model, version 5.0.4. The official version of the CIDOC CRM reference document.

[3] Tzompanaki, K., & Doerr, M. (2012). A New Framework for Querying Semantic Networks. In Museums and the Web 2012: The International Conference for Culture and Heritage On-line, April 11–14, San Diego, CA, USA.


The British Museum, CIDOC CRM and the Shaping of Knowledge

At the British Museum we are fast approaching a new production version of our current beta Semantic Endpoint. The production version will remove some of the current restrictions and provide a more robust environment to develop applications against. It will also come with much-needed documentation detailing a new mapping to the CIDOC CRM (Conceptual Reference Model), prompted by feedback received on the current version and by requirements to support the ResearchSpace project.
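For developers, querying the Endpoint looks like querying any other SPARQL endpoint. The sketch below uses Python’s SPARQLWrapper; the endpoint address is the one the beta service has been published at but should be treated as illustrative, and the exact classes and properties to query are those given in the mapping documentation rather than the ones assumed here.

```python
# A sketch of querying a CRM-based SPARQL endpoint. The endpoint URL, the
# CRM namespace and the property used in the query are assumptions for
# illustration; consult the Endpoint's own documentation for the mapping.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://collection.britishmuseum.org/sparql")
sparql.setQuery("""
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?object WHERE {
  ?production crm:P108_has_produced ?object .
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Each binding is one produced object, found via its production event.
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["object"]["value"])
```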

The use of the CIDOC CRM itself has raised questions and criticisms, mostly from developers. This comes about for a variety of reasons: the lack of current CRM resources; a lack of experience of using it (an issue with any new method or approach); a lack of documentation about particular implementations; but also, particular to this type of publication, a lack of domain knowledge among those creating cultural heritage web applications. The CRM exposes a real issue in the production and publication of cultural heritage information: the extent to which domain experts are involved in digital publication and, as a result, its quality.

The debate about whether we should focus on providing data in a simple format for others to use in web pages and at hack days, versus a richer and more ontological approach (requiring a deeper understanding of collection data), is one in which the former position is currently dominant. To support this there are some exceptional projects using simple schemas designed to achieve specific and collaborative objectives. However, many linked data points lack the quality to be more than basic information jukeboxes that, in turn, support applications with limited usefulness and shelf life. In short, the current cultural heritage linked data movement, concentrating on access (a fundamental objective), may have ignored some of the reasons for establishing networks of knowledge in the first place.

The British Museum’s source of object data has its stronger and weaker elements, but it has descriptions, associations and taxonomies developed over the last 30 years of digitisation. In order to exploit this accumulated knowledge and provide support for a wide range of users, including humanist scholars, it needs to be described within a rich semantic framework. This is a first step towards developing the new taxonomies needed to allow different relationships and interpretations of harmonised collections to be exposed. Semantic data harmonisation is not just about linking database records together; it is about exploring and discovering (inferring) new knowledge.

The full power of the CRM comes when there is a sufficient mass of conforming data, providing coverage of topics such that the density of information and events generates a resource from which the inference of knowledge can occur. Research tool-kits built around such a collaboration of data could uncover facts that could never be discovered using traditional methodologies. In this respect it is an ontology tailor-made for making intelligent sense of the mass of online cultural heritage data. Its adoption continues to grow, but it has also reached a ‘chicken and egg’ stage, needing the implementation of public applications to clearly demonstrate its unique properties and value to humanities research.

By bringing data together in a meaningful way, rather than just treating it as a technical process or an act of systems integration, we can start to deconstruct the years of separation and institutional classifications designed to support narrower curatorial and administrative aims. Regardless of the resources available to research projects, this historical limitation, and the lack of any cost-effective digital solution, has made asking a broader range of questions a difficult challenge. But asking the broader questions that may lead to more interesting, valuable and sustainable web applications requires appropriate semantic infrastructures. The CRM provides a starting point.

The publication of BM data in the CRM format comes from a concern that many Semantic Web / Linked Data implementations will not provide adequate support for the next generation of collaborative, data-centric humanities projects. They may not support the types of tools necessary for examining, modelling and discovering relationships between knowledge owned by different organisations at a level currently limited to more controlled and localized datasets. Indeed, the proliferation of different, uncoordinated linked data schemas may create a confusing and complex environment of mappings between data stores, thereby limiting the overall effectiveness of semantic technology and producing outputs that don’t push digital publications much beyond those achieved using existing database technology.

The CRM is difficult not because of what it is (a distillation of existing and known cultural heritage concepts and relationships) but because it requires real cross-disciplinary collaboration to implement properly – and this type of collaboration is difficult. The aim of the British Museum Endpoint is to deliver a technical interface, but also to demystify the processes underlying the implementation of the CRM, as well as the BM’s CRM mapping itself. By doing this the Endpoint should support a wide range of publication objectives for different audiences and a wide range of developers with varying experience and domain knowledge, and crucially fulfill the future needs of humanities scholars.

In particular, the aim is to raise the bar on what can be achieved on the Internet and to allow researchers to transfer data modelling techniques that are currently only serviced by specialist relational database models into the online world. These techniques will allow scholars with access to CRM-aligned datasets to make sense of and tackle ‘big data’ littered with many different classifications and taxonomies, and allow a broader, specialist and contextual re-examination of historical data and historical events.

Dominic Oldman