In Support of Knowledge: Synergy and the Integration of Cultural Heritage Data

The relationship between information and computers has always been limited to some extent by the latter, and this has been particularly apparent in cultural heritage systems. Even so, interest in cultural heritage computing started as early as the 1960s, and the first museums and computers conference was held at the Metropolitan Museum in New York, sponsored by IBM, in April 1968. The curatorial reviewer of that event, Edward F. Fry, then an assistant curator at the Guggenheim and later an Andrew W. Mellon Distinguished Professor of Art History, set out a positive vision for how computers might become useful to cultural heritage experts (http://www.jstor.org/stable/30199388). This included a different type of ‘standard’, one that was flexible and could support developing knowledge across all disciplines, including the organisation of bibliographic material alongside material culture and other historical information. The ultimate aim, according to Fry, would be to provide scholars with a resource that allowed the pursuit of new knowledge from existing and connected facts.

The success of modern institutional information applications relies heavily on human knowledge and engagement. This often produces an uneasy relationship because knowledge is compromised when it is translated into rows and columns, often for simple catalogue and inventory formats. While user experience (UX) techniques and general computing have developed considerably over the years, systems are still limited by the lack of expression and flexibility available at the point at which knowledge is encoded into structured digital data. While these limitations can be managed within the closed environment of the organisation, information systems do not come close to achieving Fry’s criteria, now 45 years old.

Software applications, built with the help of technologists, attempt to represent data that has been stripped of context and forced into artificial schemas. Systems that carry data derived from semantically complex information (like much humanities data) rely on the knowledge of users to fill in what the computer system leaves implicit, ambiguous or missing. The majority of databases in which information is stored have no ability to store semantics effectively, and therefore the layers of software that make up an application try to compensate through programmatic interpretation. People are therefore an integral part of any internal information system, both in its production and in its final utility and effectiveness. But surely it makes more sense for subject experts to specify the semantics?
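
As a rough illustration of this contrast, consider a single catalogue row and the same fact expressed as explicit, self-describing statements, for instance using RDF and a shared vocabulary such as CIDOC-CRM. This is an indicative sketch only: the identifiers under example.org are hypothetical and the class and property names vary slightly between CIDOC-CRM versions.

```python
# A minimal sketch: the same fact held as a flat catalogue row, whose meaning
# lives only in column names and in the heads of its users, and as explicit,
# self-describing statements. Identifiers under example.org are hypothetical;
# CIDOC-CRM names are illustrative and vary slightly between versions.
# Requires: pip install rdflib
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

# The flat record: "creator" could mean painter, workshop, donor or attributer,
# and nothing in the row itself says which.
flat_row = {"id": "OBJ-1", "title": "A Drawing", "creator": "Raphael"}

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/collection/")

g = Graph()
g.bind("crm", CRM)

obj = EX["object/OBJ-1"]
production = EX["production/OBJ-1"]
artist = EX["person/raphael"]

# The same information with the relationships made explicit: the object was
# produced by an event, and that production event was carried out by a person.
g.add((obj, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((obj, RDFS.label, Literal(flat_row["title"])))
g.add((obj, CRM["P108i_was_produced_by"], production))
g.add((production, RDF.type, CRM["E12_Production"]))
g.add((production, CRM["P14_carried_out_by"], artist))
g.add((artist, RDF.type, CRM["E21_Person"]))
g.add((artist, RDFS.label, Literal(flat_row["creator"])))

print(g.serialize(format="turtle"))
```

In the flat row, the software and its users must know what “creator” means; in the second form, the nature of the relationship travels with the data itself.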

This model is also expensive. Ongoing knowledge production requires changes to the database, and changes to the database require changes to the layers of software that interpret and present the information. Users then often need further training, formal and informal, to understand the changes. These are processes and costs that have been established and normalised over a long period of time, and are broadly accepted and often deeply rooted.

Despite these limitations we happily publish data (orphaned from all of the original layers that make it useful internally) to open environments for others to use. Traditional standards help to some extent, but in many knowledge organisations working with richer datasets (particularly in cultural heritage), making this information useful to wider audiences is difficult and consumes vast amounts of money with little understanding of the benefits. The people most likely to engage with raw data are technologists, who are often not in a position to fully interpret and present the data alone, and conversations with knowledge providers are often limited by traditional database development processes. For example, projects compensate by only consuming data that conforms to a ‘core’ model, which is easier to understand without much additional interpretation and processing (or so it seems).

When only core fields are processed, a mindset gradually develops that there is no need to consult the original producers of the data, since such consultation is perceived as an overhead. In many cases producers of data are asked to provide only that which conforms to the core model, reducing the aggregator’s overheads further. In all of this, aggregators disseminate the notion that publishing any data in any form is always a good thing. Yet aggregation systems have failed to provide the value and benefits that their original mission statements promised, and in some cases do not accurately represent source data. Divorced from its wider context, a core model has a limited set of use cases beyond providing simple finding aids, and as data repositories increase in size these finding aids come under increasing strain, burying information rather than making it more discoverable.

Imagine if you turned this established model on its head! Imagine that, instead of spending vast amounts of money on software layers which require high levels of technical skill to create, manage, support and change (this in itself limiting progress), more of this effort was focussed on the data. Instead of building databases that force the meaning and context out of information just so that it conforms to a database structure, and instead of commissioning technologists to reassemble the meaning using programming code – imagine a method of representation designed for information experts that could incorporate the semantics of the information they produce into the data itself, making the implicit explicit. Just as Edward F. Fry suggested that it should. What would this mean?

Firstly, it would mean that much more of the effort required to build an information system would be placed in the hands of the people who understand the information and who would also use and develop it, since it would better reflect their information needs. That doesn’t mean that technologists are not important, but it would allow a more appropriate use of skills. Secondly, with meaning and context made explicit, fewer resources would be needed to build the application software. Technical experts would focus not on trying to make sense of abstract, artificial data and data structures, but on materialising the meaning already embedded in the data. Thirdly, the data would be far more useful to others because, with its context intact, it becomes suitable for research and engagement alike – researchers need context, but so do generally interested digital visitors – and it would become accessible to a far greater number of potential users. Finally, as information evolves it becomes easier and less costly to change the software, which itself becomes more flexible. Additionally, the data would no longer need to rely on software in order to serve as both a preservation object and a meaningful, practical digital object at the same time.

The final part of this imagined world is a data environment that allows the flexible assertion of patterns of information which carry meaning and context. If the language used to provide this were based on universal concepts rather than artificial standards and specialised terminology, and if this framework supported existing data rather than replaced it, then many other benefits start to arise. In this knowledge orientated information world, new assertions could be added without breaking the systems that carry them, and this would provide the foundation for a very large number of use cases and help connect organisations far more significantly to communities of people and institutions also interested in improving and using similar or related information. Such a semantic framework would also provide the basis for integration, not by attempting to individually link pieces of information as a technical exercise – an impossible and error-prone endeavour – but through an alignment of universal semantics, with provenance, able to transcend organisations, sectors and national borders. Such a change in thinking requires a new set of processes that reflect a greater emphasis on information and the people who create it, and a realignment of the relationship between knowledge producers and the producers and developers of information systems.
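
As an indicative sketch of what flexible assertion with provenance could look like, assuming an RDF-based environment (all identifiers below are hypothetical), new statements from different organisations can simply be added alongside existing ones, each grouped under a named graph that records its source:

```python
# A hedged sketch, assuming an RDF-based data environment: new assertions are
# added as further statements, and provenance is kept by grouping each
# contributor's statements in a named graph. All identifiers are hypothetical.
# Requires: pip install rdflib
from rdflib import Dataset, Namespace, URIRef, Literal
from rdflib.namespace import RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/")

ds = Dataset()

# Statements contributed by one organisation live in its own named graph,
# so the source of every assertion remains visible after integration.
museum_a = ds.graph(URIRef("http://example.org/graph/museum-a"))
museum_a.add((EX["object/1"], CRM["P108i_was_produced_by"], EX["production/1"]))
museum_a.add((EX["production/1"], CRM["P14_carried_out_by"], EX["person/raphael"]))

# Later, a second organisation adds new knowledge about the same entities.
# No schema migration and no change to existing software is needed; the new
# statements simply sit alongside the old ones.
archive_b = ds.graph(URIRef("http://example.org/graph/archive-b"))
archive_b.add((EX["person/raphael"], RDFS.label, Literal("Raffaello Sanzio da Urbino")))

print(ds.serialize(format="trig"))
```

Integration here is not a matter of programmatically reconciling records; the shared semantics do the aligning, while the named graphs preserve who asserted what.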

The Synergy system is about changing the way that we think about digital information and promoting more useful, higher quality information. It comes from a realisation that, without more emphasis on the data, information will remain a second-class citizen in a so-called ‘information world’ and serve very limited aims and objectives. The Synergy model understands that effective global information networks need to represent more fully, and be connected to, the knowledge organisations that produce data on a daily basis and who know how that data should be represented effectively with all its crucial features intact – a Web of Knowledge.

While large amounts of effort have been directed into digitisation, generating huge amounts of digital information, less attention has been given to how that information should be represented to a varied and growing audience. Whether used for research, education or engagement purposes, a key challenge is to represent data in a way that fully conveys its original context so that it can be effectively built upon and related to other connected and supporting information. For people outside cultural heritage institutions to understand the information and use it effectively, the language and concepts used to convey or represent it must be more universal (and conform to real-world logic) while remaining true to its original institutional meaning. The Synergy system describes how organisations can change and influence their engagement with digital information services and communities on the World Wide Web and elsewhere. It puts a far greater emphasis on the curation and representation of data. The key elements of Synergy are these:

  1. It describes in full the ecosystem required for sustainable data provisioning between data providers and data aggregators as an ongoing and collaborative undertaking.
  2. It addresses the lack of functionality and flexibility in current aggregation systems, and therefore the lack of user-orientated tools necessary to generate quality data, including data cleaning, data mapping definitions and associated metadata transfer (a minimal illustration of such a mapping follows this list).
  3. It describes the necessary knowledge and input needed from providers (or provider communities) to create quality sustainable aggregations with meaning and context.
  4. It defines a modular architecture that can be developed and optimized by different contributors with minimal inter-dependencies.
  5. It supports the ongoing management of data transfers between providers and target data repositories, and the delivery of transformed data at defined times, including updates.
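
As an illustration of the kind of provider-maintained mapping referred to in point 2, a declarative mapping definition can drive the transformation of local records into statements expressed in shared semantics. This is a sketch only: the Synergy reference model defines its own components and formats, and the field names, target properties and identifiers below are hypothetical.

```python
# Sketch only: a provider-maintained mapping definition driving the
# transformation of local records into shared semantics. Names are
# hypothetical; a full CIDOC-CRM mapping would express creation via a
# production event rather than a direct property.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/provider-a/")
SHARED = Namespace("http://example.org/shared-model/")  # stand-in for a shared ontology

# Declared once by the data provider, so the aggregator never has to guess
# what a local field such as "maker" actually means.
MAPPING = {
    "title": RDFS.label,
    "maker": SHARED.was_created_by,
}

def transform(record: dict) -> Graph:
    """Apply the provider-maintained mapping to one local record."""
    g = Graph()
    subject = EX["object/" + record["id"]]
    for field, target_property in MAPPING.items():
        if record.get(field):
            g.add((subject, target_property, Literal(record[field])))
    return g

print(transform({"id": "42", "title": "A Drawing", "maker": "Raphael"}).serialize(format="turtle"))
```

The point of the sketch is the division of labour: the mapping is curated by the people who understand the source data, while the transformation and transfer machinery can remain generic.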

In doing this it describes a different model from the one currently implemented, in an attempt to move away from short-term thinking about data provisioning towards longer-term, sustainable systems that are more cost-effective, deliver better quality data and elevate information to a first-class citizen.

The Synergy reference model is described at http://www.cidoc-crm.org/docs/SRM_v0.1.pdf

Dominic Oldman