Big Data, Collaboration and Scale


In his book “Small is Beautiful: A Study of Economics as if People Mattered”, the economist Dr E.F. Schumacher, the inspiration for the current British Prime Minister’s ‘Big Society’ idea (the similarities and differences are not for this blog), talks about an appropriate scale for a particular activity. The example that Schumacher himself gives is that of teaching. Some things, said Schumacher,

“can only be taught in a very intimate circle, whereas other things can obviously be taught en masse, via the air, via the television, via teaching machines, and so on. What scale is appropriate? It depends on what we are trying to do”  

The scale of a project therefore needs to reflect its objectives, but projects that reach a certain scale will experience limitations and constraints on the type of objective they can pursue satisfactorily. We see and experience the limitations of scale all around us, for example, when we visit different towns and find that different high streets all host the same shops offering the same goods. Scale also explains, to some extent, why Google have been so successful. The scale of Google is enormous, with a business model based on attracting, and continuing to attract, as many visitors as possible (I am one!). This model means that most of the services that Google offer are aimed at a mass and general audience.

Schumacher’s conclusions came from observations of a world increasingly obsessed with largeness, economies of scale and globalisation. But this obsession often fosters blandness, commoditisation, repetition and a lack of humanity (note Frederick Winslow Taylor and modern digital comparisons). Thankfully these tendencies towards the large are met by a human reaction towards the stimulating, differentiated, and innovative – often expressed through relative smallness.

The Europeana project might be seen as just this type of reaction and the rhetoric certainly supports this contention. For example,

“Can Europe afford to be inactive and wait, or leave it to one or more private players to digitise our common cultural heritage? Our answer is a resounding ‘no’…. Our goal is to ensure that Europe experiences a digital Renaissance instead of entering into a digital Dark Age”. The New Renaissance – Report of the Comité des Sages

The provision of a publicly funded and open access cultural heritage resource covering the broadest range of European culture is something that many of us applaud and want to succeed. However, developing a project of such scale, requiring the cooperation and collaboration of many different countries, organisations and people, and establishing a centralised cultural heritage digital repository that surpasses anything that a generalist like Google can offer, is an ambitious undertaking. As a portal to link cultural heritage resources together the structure and scale work well. However, as a repository for knowledge and data reuse, scale introduces some difficult problems and magnifies issues far from resolved at a local level in many locations.

For those of us working in museums, archives and galleries this type of venture has a number of domain-specific issues and risks that, on reflection, far exceed those that Google would have dealt with (and are able to sidestep) and which are not necessarily solved with money. But more than this, the same issues of scale that support and sustain Google’s mass market business model tend to work against the more principled ambitions of projects like Europeana.

Just as Google’s success is founded on a model which (at least seemingly) allows friction-free access to resources, so Europeana must do the same. But issues of scale, as with Google, mean that questions of data quality creep in, although in Europeana’s case for different reasons and in potentially more destructive ways. Its scale and structure mean that (particularly in the current economic climate) the project must either herd cats or make compromises that may limit some of its ambitions. Unlike at Google, mistakes or changes in direction in such a complex structure of membership can be difficult to rectify, and disagreements about strategy can quickly lead to splintering and balkanisation.

These risks are inherent in a model of largeness, but to a certain degree the Europeana mission has come about as a result of the general inertia of cultural heritage organisations in using the Internet to bring their combined knowledge together (surely this is the Internet’s primary contribution to furthering the development of humankind). The question, now that Europeana is established, is how we, as a sector, can support and sustain these efforts and help it, and other services, to develop and become richer and more important resources.

My answer is to confer upon them the gift of smallness and by doing so bring the Europeana mission closer to the people and organisations that matter. The distance between them could be slowly reduced by gradually replacing single mechanical national aggregators with communities of museums, galleries, archives and libraries that have shared interests and (whether for intellectual or practical reasons) are able to share local infrastructures, services and expertise. In this way smaller, more sustainable networks of knowledge can connect more directly with large portals (in terms of both data and people) and provide them with richer contributions. This distinctively different and innovative approach also requires that portals like Europeana equally reach out and actively support and favour this form (or culture) of collaborative digital curation, to ensure that we don’t repeat past failures of well-intentioned largeness.

In other words, the formation of a truly sustainable ‘Big Cultural Society’ in which big data is also quality data requires the foundation of natural collaborative frameworks formed at an appropriate scale which, when joined together, can create a Web of culture and science.

Dominic Oldman – Oct 2012




The Semantic Web: The new Enlightenment in an Age of Unreason

Located in the King’s Library of the British Museum, off the east side of the Great Court, you will find the Enlightenment Gallery. This gallery is unique, being the only permanent space that comes close to a genuine time machine. It takes visitors back to the age of the eighteenth-century collector and organises objects to show the broad historical concerns studied by the wealthy scholars of the day. Their vigorous interest, underpinned by a position of economic dominance, was partly directed towards developing a more detailed and systematic (scientific) understanding of the world and humankind from ancient times. It was also a period known as the ‘Age of Reason’.

However, the efforts of these private collectors meant that even the Royal Society’s collection came under increasing pressure due to competition with individual collectors for artefacts and specimens generated often by its own members, most notably Sir Hans Sloane. It was therefore hugely significant that Sloane’s own extensive collection of over 71,000 objects, including flora and fauna, coins, prints and drawings, books, manuscripts and other curiosities, found its way to the world’s first national public museum. In one single act (enshrined by Parliament) a collection previously only accessible to the privileged few became available to visitors who, as today, came to London from around the world.

As a result of this transfer from private to public, the British Museum of the time spanned both the artificial and natural world (including a substantial library), and would have been an awe inspiring (albeit sometimes confusing) experience for the new visitors – and for the new administrators difficult to organise and manage. Nevertheless the themes of scholarship previously only available to a privileged few became available for any visitor to cast their eye over and, over time, the British Museum would become a natural home for other previously private and inaccessible collections.

The more objects in the Museum’s collection the more evidence available to scholars to support developing theories and improved interpretations of our history. In some ways the eighteenth century preoccupation with collecting objects to solve the big questions of humanity equates to the modern day call of Tim Berners-Lee to support the web of data. To the modern day researcher the more available data the more comprehensive and valid the research and the better the quality of the conclusions. There are of course further comparisons with Sloane, the Royal Society and the British Museum in terms of levels of accessibility, competing interests and the ability to manage and make sense of ever increasing bodies of information.

However, the transfer of private collections into public museums and libraries went hand in hand with the development of different classifications that departed from some of the broader (or period) concerns of Sloane and his colleagues, evolving to match the academic and administrative agendas of more specialist museums. The eventual division of the Sloane collection is associated with the creation of the Natural History Museum and the British Library, but the result was not just a physical separation but the start of viewing objects and managing collections with different approaches, separate taxonomies and, as these new organisations established themselves, with very different organisational cultures. These new cultures created further internal divisions along departmental and administrative lines often resulting in more narrow agendas, perspectives and cataloguing habits.

Today an initiative to reconstruct the Sloane collection (‘Reconstructing Sloane’) is confronted with 250 years of separation. This means that the organisations involved need to embrace collaboration and attempt to bring together their accumulated knowledge, stored using different information schema and different terminologies, to answer a new set of questions prompted by digital unification. In this respect the Sloane project confronts the issue of how researchers will transfer the type of analysis currently reserved for smaller more narrowly focused and controlled datasets, to the issue of ‘big data’ typically dispersed and controlled by many different organisations.

The Internet provides the physical infrastructure to bring together different cultural heritage organisations, and the Semantic Web provides the protocols by which we may harmonise our data (or knowledge) and find new ‘enlightenment’. However, to establish true networks of knowledge will require new attitudes towards research, analysis, interaction and collaboration. The Sloane project may provide an interesting model for understanding the dynamics of collections working together to harness the potential of the Internet and to break the current collaborative stalemate created by a continued reliance on ‘Gutenberg’ publication models.

To manually sift through the different materials owned by the Sloane partners and attempt to uncover and understand their relationships would require more person years than any normal cultural heritage project team could hope to allocate. The use of already digitised material and further digitisation efforts mean that computers (and computing) can be used to help perform the analysis. But the requirement to search across natural history, textual, art and antiquity datasets from different institutions and answer questions as if the collection had never been separated requires a new and radical approach. The different proprietary schemas need to be mapped to a common (and the author would argue Semantic Web) framework to create a digital version of the Enlightenment Gallery capable of supporting, not just one, but a multitude of different interpretations.
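As a rough illustration of what mapping different proprietary schemas to a common framework might involve, the Python sketch below translates records from two institutions into a shared set of subject, predicate and object statements and then asks one question across both. The field names, predicates, identifiers and records are all invented for the example; a real project would map to a published ontology rather than these ad hoc labels.

```python
# Hypothetical mappings: each institution's local field name -> shared predicate.
FIELD_MAPS = {
    "museum": {"obj_title": "title", "collector_name": "collected_by"},
    "library": {"Title": "title", "FormerOwner": "collected_by"},
}


def to_triples(source, record_id, record):
    """Translate one local record into (subject, predicate, object) triples."""
    mapping = FIELD_MAPS[source]
    subject = f"{source}:{record_id}"
    return [
        (subject, mapping[field], value)
        for field, value in record.items()
        if field in mapping and value  # unmapped or empty fields are skipped
    ]


def merge(records):
    """Combine records from several institutions into one list of triples."""
    triples = []
    for source, record_id, record in records:
        triples.extend(to_triples(source, record_id, record))
    return triples


def subjects_with(triples, predicate, value):
    """Return every subject, from any institution, matching predicate and value."""
    return sorted(s for s, p, o in triples if p == predicate and o == value)


# Invented sample records using two different local schemas.
records = [
    ("museum", "obj1", {"obj_title": "Specimen tray", "collector_name": "Collector A"}),
    ("library", "ms7", {"Title": "Specimen catalogue", "FormerOwner": "Collector A",
                        "Shelfmark": "X.1"}),
]

triples = merge(records)
print(subjects_with(triples, "collected_by", "Collector A"))
# prints ['library:ms7', 'museum:obj1']
```

Once both sources share the same predicates, a single query spans them, even though neither institution changed its own cataloguing system.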

But what happens when organisations restrict access to knowledge and assets and insist on applying barriers for the sake of licensing revenue and off-setting publication costs? Putting aside the administrative overheads that these restrictions create, it means that semantic relationships, and the potential inferences derived from combining and harmonising knowledge from different organisations, will be frustrated. Paywalls applied at any stage of this process will simply limit its effectiveness and reduce digital projects to staged productions, perpetuating the charges of charlatanism and blandness thrown at many cultural heritage web sites.

The Semantic Web works by bringing together data so that relationships and connections can be discovered and explored rather than predetermined by individual museums’ views of the world. But the process of modelling and analysis of data across networks is fundamentally precluded by primitive commercial barriers and treasure house mentalities. For collaborations such as ‘Reconstructing Sloane’ the only feasible way forward is a reciprocal agreement to provide digital material to the project without access limitations and free from charges.

Why isn’t the principle of reciprocation (the cancelling out of charging between cultural organisations to reduce costs) applied universally and outside formal projects? We now have the strange situation in which anyone can reuse data and high resolution images online (and in real time) from the Yale Center for British Art (YCBA) without any correspondence with them whatsoever, using open access and open standard computer interfaces, and others, like the National Gallery in Washington, are set on a similar path. Yet if these organisations wish to create their own web resource, say on the work of Constable or Turner (artists for whom the Center owns very important examples), they are charged. Specific reciprocation agreements between certain organisations for certain limited projects, however, do not solve the problem of semantic and knowledge networks.

When you take into account the overheads of managing image licensing (and many organisations still do not understand the full cost); the savings that a free exchange of assets would provide (the costs of purchasing assets are never set against licensing income); and the benefits created by friction free networks of knowledge; then one can only conclude that the major concern for museums must be the perception that by providing free access they will somehow miss out on a bonanza of income that might present itself sometime in the future.

In reality any income streams are more likely to be associated with innovative services (what you do with digital assets rather than the assets themselves) which require an engagement, financial investment, resource and a degree of risk that most, if not all, museums are unable to consistently sustain. In any event, services of sufficient interest to a large audience will typically require the raw assets of a number of different institutions – a prime reason why they have not materialised (see the Constable example above).

Nevertheless the cultural heritage sector is fearful that the commercial sector will make profits they have missed or have been unable to generate themselves over the last 20 years. But what would happen if, like Yale University, the whole sector provided completely open and free access? It may well attract interest and may result in services with business potential (in spite of the free availability of the assets used in those services). The assets may be used for merchandising, they may provide services that make better sense of the mass of information made available – and they may or may not be successful in creating a profitable business model. It would certainly allow many more open access Sloane type projects at a fraction of the current cost and provide greater incentives for more organisations to contribute to larger networks of public knowledge.

For those who insist on finding additional income streams, wouldn’t it be better to let others (commercial and non-commercial) take some of the risks and to encourage innovation from third parties, from which we (in our aim to disseminate and educate) can only benefit? Shouldn’t the cultural heritage sector feel confident that successful models could easily be improved upon (if so desired) using those other assets that set the sector apart: our knowledge, expertise and reputations? Alternatively, we can reward the innovation of others, and potentially share in any benefits, by endorsing successful services that meet with our standards and approval.

In this new digital world museums have the opportunity to better use their knowledge, expertise and reputation to more fully and wholeheartedly engage with the cultural Internet if barriers to knowledge and content are lifted. They can still produce their own digital services (perhaps invigorated by a more vibrant digital economy), and they can still attempt to generate income through their own services or through the endorsement of others’ work. But most importantly they can concentrate on their main reason for being and, extending the hopes of the private collectors of the eighteenth century, initiate a more inclusive, accessible and collaborative enlightenment towards a new digital age of reason.

Dominic Oldman

The British Museum, CIDOC CRM and the Shaping of Knowledge

At the British Museum we are fast approaching a new production version of our currently beta Semantic Endpoint. The production version will remove some of the current restrictions and provide a more robust environment to develop applications against. It will also come with much needed documentation detailing a new mapping to the CIDOC CRM (Conceptual Reference Model) prompted by feedback received from the current version and by requirements to support the ResearchSpace project.

The use of the CIDOC CRM itself has raised questions and criticisms, mostly from developers. These come about for a variety of reasons: the lack of current CRM resources; a lack of experience of using it (an issue with any new method or approach); a lack of documentation about particular implementations; but also, particular to this type of publication, a lack of domain knowledge among those creating cultural heritage web applications. The CRM exposes a real issue in the production and publication of cultural heritage information: the extent to which domain experts are involved in digital publication and, as a result, its quality.

The debate about whether we should focus on providing data in a simple format for others to use in web pages and at hack days, against a richer and more ontological approach (requiring a deeper understanding of collection data), is one in which the former position is currently dominant. To support this there are some exceptional projects using simple schemas designed to achieve specific and collaborative objectives. However, many linked data points lack the quality to be more than basic information jukeboxes that, in turn, support applications with limited usefulness and shelf life. In short, the current cultural heritage linked data movement, concentrating on access (a fundamental objective), may have ignored some of the reasons for establishing networks of knowledge in the first place.

The British Museum’s source of object data has its stronger and weaker elements but it has descriptions, associations and taxonomies developed over the last 30 years of digitisation. In order to exploit this accumulated knowledge and provide support for a wide range of users, including humanist scholars, it needs to be described within a rich semantic framework. This is a first step to developing the new taxonomies needed to allow different relationships and interpretations of harmonised collections to be exposed. Semantic data harmonisation is not just about linking database records together but is about exploring and discovering (inferring) new knowledge.

The full power of the CRM comes when there is a sufficient mass of conforming data providing a coverage of topics such that the density of information and events generates a resource from which the inference of knowledge can occur. Research tool-kits built around such a collaboration of data would uncover new facts that could never be discovered using traditional methodologies. In this respect it is an ontology tailor made for making intelligent sense of the mass of online cultural heritage data. Its adoption continues to grow but it has also reached a ‘chicken and egg’ stage needing the implementation of public applications to clearly demonstrate its unique properties and value to humanities research.
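A toy example may help show what inferring knowledge from harmonised data means in practice. The Python sketch below assumes a handful of invented statements using two made-up predicates, `within` and `found_at`; it is not the CRM itself, nor the British Museum’s mapping, only an illustration of how a fact that no single record states can follow from statements combined across sources.

```python
def infer_within(triples):
    """Add (a, 'within', c) whenever a is within b and b is within c.

    Repeats until no new statement can be derived (a simple transitive closure).
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        within = [(s, o) for s, p, o in facts if p == "within"]
        for a, b in within:
            for c, d in within:
                if b == c and (a, "within", d) not in facts:
                    facts.add((a, "within", d))
                    changed = True
    return facts


def found_in(facts, region):
    """Objects found at the region itself or at any place stated to be within it."""
    places = {s for s, p, o in facts if p == "within" and o == region} | {region}
    return sorted(s for s, p, o in facts if p == "found_at" and o in places)


# Invented statements, imagined as coming from different institutions.
data = [
    ("siteA", "within", "regionB"),    # e.g. from one institution's place list
    ("regionB", "within", "countryC"), # e.g. from another institution's gazetteer
    ("obj1", "found_at", "siteA"),
    ("obj2", "found_at", "regionB"),
    ("obj3", "found_at", "elsewhere"),
]

print(found_in(data, "countryC"))                # prints ['obj2']
print(found_in(infer_within(data), "countryC"))  # prints ['obj1', 'obj2']
```

Neither source says that `obj1` was found in `countryC`; that only emerges once the two place statements are brought together, which is why a sufficient mass of conforming data matters.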

By bringing data together in a meaningful way rather than just treating it as a technical process or act of systems integration we can start to deconstruct the years of separation and institutional classifications designed to support narrower curatorial and administrative aims. Regardless of the resources available to research projects, this historical limitation, and the lack of any cost effective digital solution, has made the problem of asking a broader range of questions a difficult challenge. But to ask the broader questions that may lead to more interesting, valuable and sustainable web applications, requires appropriate semantic infrastructures. The CRM provides a starting point.

The publication of BM data in the CRM format comes from a concern that many Semantic Web / Linked Data implementations will not provide adequate support for a next generation of collaborative data centric humanities projects. They may not support the types of tools necessary for examining, modelling and discovering relationships between knowledge owned by different organisations at a level currently limited to more controlled and localized data-sets. Indeed, the proliferation of different uncoordinated linked data schemas may create a confusing and complex environment of mappings between data stores and thereby limit the overall effectiveness of semantic technology and produce outputs that don’t push digital publications much beyond those achieved using existing database technology.
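The concern about a confusing environment of mappings can be made concrete with some simple arithmetic. Assuming, purely for illustration, that every pair of uncoordinated schemas needs its own mapping, the number of mappings grows roughly with the square of the number of schemas, whereas mapping each schema once to a shared model grows only linearly. The function names below are invented for this sketch.

```python
def pairwise_mappings(n_schemas):
    """Mappings needed if every pair of schemas is mapped directly to each other."""
    return n_schemas * (n_schemas - 1) // 2


def shared_model_mappings(n_schemas):
    """Mappings needed if each schema is mapped once to a common model."""
    return n_schemas


for n in (3, 10, 50):
    print(n, pairwise_mappings(n), shared_model_mappings(n))
# prints:
# 3 3 3
# 10 45 10
# 50 1225 50
```

With only three schemas the two approaches cost the same, but the gap widens quickly as more organisations publish their own formats.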

The CRM is difficult not because of what it is (a distillation of existing and known cultural heritage concepts and relationships) but because it requires real cross-disciplinary collaboration to implement properly, and this type of collaboration is difficult. The aim of the British Museum Endpoint is to deliver a technical interface but also to demystify the processes underlying the implementation of the CRM as well as the BM’s CRM mapping itself. By doing this the Endpoint should support a wide range of publication objectives for different audiences, a wide range of developers with varying experience and domain knowledge, and crucially fulfil the future needs of humanities scholars.

In particular the aim is to raise the bar on what can be achieved on the Internet and allow researchers to transfer data modelling techniques that are currently only serviced by specialist relational database models into the online world. These techniques will allow scholars with access to CRM aligned datasets to make sense of and tackle ‘big data’ littered with many different classifications and taxonomies, and allow a broader, specialist and contextual re-examination of historical data and historical events.

Dominic Oldman