The Costs of Cultural Heritage Data Services: The CIDOC CRM or Aggregator Formats?

Martin Doerr (Research Director at the Information Systems Laboratory and Head of the Centre for Cultural Informatics, FORTH)
Dominic Oldman (Principal Investigator of ResearchSpace, Deputy Head IS, British Museum)

June 2013

Many larger cultural institutions are gradually increasing their engagement with the Internet and contributing to the growing provision of integrated and collaborative data services. This occurs in parallel with the emergence of so-called aggregation services, which seemingly strive to achieve the same goal. On closer inspection, however, there are quite fundamental differences that produce very different outcomes.

Traditional knowledge production occurred in an author’s private space or a lab with local records, field research aside. This space or lab may be part of an institution such as a museum. The author (scholar or scientist) would publish results, and by making content accessible it would then be collected by libraries. The author ultimately knows how to interpret the statements in his/her publication and relate them to the reality referred to in the publication, whether from the field, from a lab or from a collection. Many authors are also curators of knowledge and things.

The librarian would not know this context, would not be a specialist in the respective field, and therefore must not alter the content in any way. However, (s)he would integrate the literature under common dominant generic concepts and references, such as “Shakespeare studies”, and preserve the content.

In the current cultural-historical knowledge life-cycle, we may distinguish three levels of stewardship of knowledge: (1) the curator or academic, (2) the disciplinary institution (such as the Smithsonian, the British Museum or smaller cultural heritage bodies), and (3) the discipline-neutral aggregator (such as Europeana or IMLS-DCC). Level (2) typically acts as “provider” to the “aggregator”.

Obviously, the highest level can make the fewest assumptions about common concepts, in particular a data model, in order to integrate content. Therefore, it can offer services only for very general relationships in the provided content. On the other hand, questions needing such a global level of knowledge will be equally generic. Therefore, the challenge is NOT to find the most common fields in the provider schemata (“core fields”), but the most relevant generalizations (such as “refers to”), while avoiding overgeneralizations (such as “has date”). These generalizations are for accessing content, but should NOT be confused with the demands of documenting knowledge. At that level some dozens of generic properties may be effective.
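To make this concrete, the sketch below (plain Python with invented field and property names, not a prescription for any particular system) shows how provider-specific properties can be registered as specializations of a handful of generic access properties, so that a single generic question retrieves records documented with quite different, richer fields.

```python
# A minimal sketch with invented field names: provider-specific properties are
# registered as specializations of a few generic access properties, and a
# generic query term is expanded to its specializations before matching.

SPECIALIZES = {
    "depicts": "refers to",
    "commemorates": "refers to",
    "excavated at": "took place at",
    "found at": "took place at",
}

RECORDS = [
    {"id": "obj-1", "depicts": "Battle of Actium"},
    {"id": "obj-2", "commemorates": "Battle of Actium"},
    {"id": "obj-3", "found at": "Naukratis"},
]

def query(generic_property, value):
    """Return records that match a generic property via any of its specializations."""
    specific = {p for p, g in SPECIALIZES.items() if g == generic_property}
    specific.add(generic_property)  # the generic term itself also matches
    return [r for r in RECORDS if any(r.get(p) == value for p in specific)]

# One generic question finds records documented with different, richer fields.
print(query("refers to", "Battle of Actium"))  # -> records obj-1 and obj-2
```

The point of the sketch is that the richer provider fields are never discarded; the generic access layer is simply a view over them.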

The preoccupation of providers and aggregators with a common set of fields means that they support only rudimentary connections between the datasets they collect, which in turn reduces researchers’ ability to determine where the most relevant knowledge may be located. As with the library, the aggregator’s infrastructure can only support views of the data (search interfaces) that reflect its own limited knowledge, because the data arrives with little or no context and with over-generalized cross-correlations (“see also”, “relation”, “coverage”).

The common aggregation process itself strips context away from the data, creating silos within the aggregator’s repository. Without adequate contextual information, searching becomes increasingly inadequate the larger the aggregation becomes. This limitation is passed on through any Application Programming Interfaces that the aggregator offers. Aggregators are slowly beginning to understand that metadata is an important form of content, and not only a means to query according to current technical constraints. Some aggregators, such as the German Digital Library, store and return the rich “original metadata” received from providers and derive indexing data on the aggregator side, rather than asking providers to strip down their data.

The institution actually curating content must document it so that it will not only be found, but understood in the future. It therefore needs an adequate [1] representation of the context objects come from and of their meaning. This representation already has some disciplinary focus, and ultimately allows for integrating the more specialized author knowledge or lab data. For instance, chronological data curves from a carbon dating (C14) lab should be integrated at the museum level (2) by exact reference to the excavation event and records, but at the aggregator level (3) may be described just by a creation date.

The current practice of provider institutions spending millions of pounds, dollars or euros to manually normalize their data directly into aggregator formats appears to be an unbelievable waste of money and knowledge. The cost of doing so far exceeds the cost of the software, whatever its sophistication. It appears much more prudent to normalize data at an institutional level into an adequate representation, from which the generic properties of a global aggregator service can be produced automatically, rather than producing, in advance of the aggregation services, another huge set of simplified data for manual integration.
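As a rough illustration (the record structure and field names below are invented for the example), the aggregator-level “core fields” can be derived mechanically from a richer, event-centric institutional record, so that no second, hand-made simplified dataset has to be produced and maintained.

```python
# A sketch with an invented record structure: the flat aggregator view is
# projected automatically from a richer, event-centric institutional record.

rich_record = {
    "id": "obj-42",
    "title": "Bronze figure of a cat",
    "production": {                      # documented as an event, in context
        "carried_out_by": "unknown Egyptian workshop",
        "took_place_at": "Saqqara",
        "date": {"earliest": "-0600", "latest": "-0300"},
    },
    "acquisition": {"from": "excavation season 1907", "date": "1907"},
}

def to_aggregator(record):
    """Derive a flat set of generic 'core' fields from the rich record."""
    production = record.get("production", {})
    return {
        "identifier": record["id"],
        "title": record["title"],
        "creator": production.get("carried_out_by"),
        "coverage": production.get("took_place_at"),
        # The aggregator needs only a coarse date, taken from the event data.
        "date": production.get("date", {}).get("earliest"),
    }

print(to_aggregator(rich_record))
```

The transformation runs in one direction only: the simplified record can always be regenerated from the rich one, but not the reverse.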

This is precisely the relationship between the CRM and aggregation formats like the EDM. The EDM is the minimal common generalization at the aggregator level, a format for indexing data at a first level. The CRM is a container, open for specialization, for data about cultural-historical contexts and objects. The CRM is not a format prescription. Concepts of the CRM are used as needed when respective data appear on the provider side. There is no notion of any mandatory field. Each department can select what it regards as mandatory for its own purpose, and even specialize further, without losing the capacity for consistent global querying by CRM concepts. CRM data can automatically be transformed into other data formats, and even quite complex data in a CRM-compatible form can effectively be queried with quite simple terms [3].
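The following sketch, using Python and rdflib, suggests how querying by simple terms can work in practice. The sample data and example.org names are invented; the property identifiers follow the RDFS encoding of the CRM, in which P11 (had participant) is declared a specialization of the more general P12 (occurred in the presence of), so a query phrased with the general term also retrieves the more specific statements.

```python
# A sketch using rdflib; the event and person resources are invented examples.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/")

g = Graph()
# Part of the CRM property hierarchy: P11 specializes P12.
g.add((CRM.P11_had_participant, RDFS.subPropertyOf,
       CRM.P12_occurred_in_the_presence_of))
# The provider documents an event with the specialized property.
g.add((EX.excavation_1907, CRM.P11_had_participant, EX.flinders_petrie))

# A generic question: who or what was present at the event? The property path
# walks the subproperty hierarchy, so the specialized statement still matches.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
    PREFIX ex:   <http://example.org/>
    SELECT ?thing WHERE {
        ?p rdfs:subPropertyOf* crm:P12_occurred_in_the_presence_of .
        ex:excavation_1907 ?p ?thing .
    }""")

for row in results:
    print(row.thing)  # -> http://example.org/flinders_petrie
```

Nothing in the provider data is simplified to make this possible; the generalization hierarchy itself does the work at query time.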

Similarly, institutions may revise their data formats so that the more generic CRM concepts can automatically be produced from them, i.e., make their formats specializations of the CRM to the degree needed for more global questions. For instance, the features of the detailed curve of a C14 measurement are not a subject for a query at an institutional level; researchers would rather query to retrieve the curve as a whole.

The British Museum understands this fundamental distinction and therefore understands the different risks and costs. This means both the long-term financial costs of providing data services, important to organizations with scarce resources, and the cost to cultural heritage knowledge communities and to society in general. As a consequence, they publish their data using the CRM standard. They also realize that data in the richer CRM format is much more likely to remain comprehensible in the future than data in “core metadata” form.

Summarizing, we regard publishing and providing information in a CRM-compatible form [2] at the institutional or disciplinary level as much more effective in terms of research utility (and the benefits of this research to other educational and engagement activities). The long-term costs are reduced even with further specializations of such a form, and the costs of secondary transformation algorithms to aggregation formats like the EDM are marginal.

Dominic Oldman

 

[1] Smith, B. (2003). Ontology. In L. Floridi (Ed.), The Blackwell Guide to the Philosophy of Computing and Information (pp. 155–166). Oxford: Blackwell.

[2] Crofts, N., Doerr, M., Gill, T., Stead, S., & Stiff, M. (Eds.) (2011). Definition of the CIDOC Conceptual Reference Model, version 5.0.4 (official version of the CIDOC CRM), December 2011.

[3] Tzompanaki, K., & Doerr, M. (2012). A New Framework for Querying Semantic Networks. In Museums and the Web 2012: The International Conference for Culture and Heritage On-line, April 11–14, San Diego, CA, USA.

 

Big Data, Collaboration and Scale


In his book “Small is Beautiful: A Study of Economics as if People Mattered”, the economist Dr E.F. Schumacher, the inspiration for the current British Prime Minister’s ‘Big Society’ idea (the similarities and differences are not for this blog), talks about an appropriate scale for a particular activity. The example Schumacher himself gives is that of teaching. Some things, said Schumacher,

“can only be taught in a very intimate circle, whereas other things can obviously be taught en masse, via the air, via the television, via teaching machines, and so on. What scale is appropriate? It depends on what we are trying to do”  

The scale of a project therefore needs to reflect its objectives, but projects that reach a certain level of scale and largeness will experience limitations and constraints on the type of objective they can pursue satisfactorily. We see and experience the limitations of scale all around us, for example, when we visit different towns and find that the different high streets all host the same shops offering the same goods. It also explains, to some extent, why Google have been so successful. The scale of Google is enormous, with a business model based on attracting, and continuing to attract, as many visitors as possible (I am one!). This model means that most of the services that Google offer are aimed at a mass and general audience.

Schumacher’s conclusions came from observations of a world increasingly obsessed with largeness, economies of scale and globalisation. But this obsession often fosters blandness, commoditisation, repetition and a lack of humanity (note Frederick Winslow Taylor and modern digital comparisons). Thankfully these tendencies towards the large are met by a human reaction towards the stimulating, differentiated, and innovative – often expressed through relative smallness.

The Europeana project might be seen as just this type of reaction and the rhetoric certainly supports this contention. For example,

“Can Europe afford to be inactive and wait, or leave it to one or more private players to digitise our common cultural heritage? Our answer is a resounding ‘no’…. Our goal is to ensure that Europe experiences a digital Renaissance instead of entering into a digital Dark Age”. The New Renaissance – Report of the Comité des Sages

The provision of a publicly funded and open-access cultural heritage resource covering the broadest span of European culture is something that many of us applaud and want to succeed. However, developing a project of such ambition and scale, requiring the cooperation and collaboration of many different countries, organisations and people, and establishing a centralised cultural heritage digital repository that surpasses anything a generalist like Google can offer, is a formidable undertaking. As a portal linking cultural heritage resources together, the structure and scale work well. However, as a repository for knowledge and data reuse, scale introduces some difficult problems and magnifies issues far from resolved at a local level in many locations.

For those of us working in museums, archives and galleries, this type of venture has a number of domain-specific issues and risks that, on reflection, far exceed those that Google would have dealt with (and are able to sidestep) and which are not necessarily solved with money. But more than this, the same issues of scale that support and sustain Google’s mass-market business model tend to work against the more principled ambitions of projects like Europeana.

Just as Google’s success is founded on a model which (at least seemingly) allows friction-free access to resources, so Europeana must do the same. But issues of scale, as with Google, mean that questions of data quality creep in – but in Europeana’s case for different reasons and in potentially more destructive ways. Its scale and structure mean that (particularly in the current economic climate) the project must either herd cats or make compromises that may limit some of its ambitions. Unlike Google, mistakes or changes in direction in such a complex structure of membership can be difficult to rectify, and disagreements about strategy can quickly lead to splintering and balkanisation.

These risks are inherent in a model of largeness, but to a certain degree the Europeana mission has come about as a result of the general inertia of cultural heritage organisations in using the Internet to bring their combined knowledge together (surely this is the Internet’s primary contribution to furthering the development of humankind). The question, now that Europeana is established, is how we, as a sector, can support and sustain these efforts and help it, and other services, to develop into richer and more important resources.

My answer is to confer upon them the gift of smallness and, by doing so, bring the Europeana mission closer to the people and organisations that matter. This distance could be slowly reduced by gradually replacing single mechanical national aggregators with communities of museums, galleries, archives and libraries that have shared interests and that (whether for intellectual or practical reasons) are able to share local infrastructures, services and expertise. In this way smaller, more sustainable networks of knowledge can connect more directly with large portals (in terms of both data and people) and provide them with richer contributions. This distinctively different and innovative approach also requires that portals like Europeana equally reach out and actively support and favour this form (or culture) of collaborative digital curation, to ensure that we don’t repeat past failures of well-intentioned largeness.

In other words, the formation of a truly sustainable ‘Big Cultural Society’ in which big data is also quality data requires the foundation of natural collaborative frameworks formed at an appropriate scale which, when joined together, can create a Web of culture and science.

Dominic Oldman – Oct 2012