



The Next Big Thing: From Hypermedia to Datuments

Unilever Centre for Molecular Informatics, University of Cambridge, UK
*Department of Chemistry, Imperial College London, SW7 2AY, UK

Key features: References; Figures 1, 2, 3, 4, 5

Editor's note: to get the full effect of this paper you will need an SVG viewer and may find the paper is best viewed with current versions of Internet Explorer and Netscape browsers. Download latest SVG viewer from Adobe. In this way we have preserved the original intent of the authors but to assist users, where appropriate, images are also presented in alternative, more commonly viewable formats, what the authors call "dumb" image formats.
Abstract

The concept of a datument as a hyperdocument for transmitting and preserving the complete content of a piece of scientific work is introduced. Currently the scientific publishing process loses almost all of the information environment that the author creates or possesses. It is shown that datuments can record and reproduce experiments and act as a lossless way of publishing science. This is illustrated with specific examples drawn from scientific documents and molecular science, showing how a datument containing molecular coordinates can be viewed in various styles and how typical documents deriving from organic and physical chemistry and expressed in XML can be transformed using XSLT.

1 Background

This article is an expansion of a five minute, slightly tongue-in-cheek, invited presentation (by Peter Murray-Rust) at ACM Hypertext 2003. In subsequent discussion the underlying serious message was felt to be important, and this is the emphasis here.

We start by defining what we mean by the term "data" in an electronic environment. This is used to cover any material which is not usefully human-readable in raw form. Examples include graphs, digitised maps, database tables, computer code, program output, chemical structures, graphics visualisations, audio and video streams, genomic and microarray data, and many more (really a superset of "hypermedia"). Our background may emphasise physical and biological sciences but the concept can be transferred to many domains. We believe that many concepts here are widely applicable to hypertext in all disciplines. However, practice and technology will differ and, for example, the classic concept of transclusion will vary considerably between fields. We often emphasise the "call-by-value" (i.e. direct copy) strategy rather than "call-by-reference" as we feel this is the manner in which open scientific disciplines wish to work. We emphasise the term "open".
This is well understood in software licenses where the term Open Source insists on the preservation of authors' moral rights and the integrity of information and metadata. For "data" it is much less clear and we highlight this concern, without providing complete solutions. "Open", therefore, refers to the desire to make information universally available without hindrance on conditions which preserve authors' rights but make it unnecessary to contact the author for responsible re-use.

Our ideas are implemented as working examples within the chemistry domain that act as proofs of concept and have been peer-reviewed (Murray-Rust et al. 2000, 2001, 2003, 2004; Gkoutos et al. 2001). We illustrate these concepts with working examples that are part of the present article. What is needed is the political will of the scientific community to give the impetus to scaling up.

Some aspects of this discourse may appear as an uncritical diatribe against all scientific publishers. This was indeed one of the themes in the humorous presentation, and it elicited resonance from the audience. We recognise that there are forward-looking publishers and we are currently pleased to be working with them. However, we feel that the scientific publishing community is in many ways holding back the vision of increased scientific communication in the digital age. We will be pleased to hear from publishers who want to explore the concept of datuments further.

2 Problem and opportunity

Most publicly funded scientific information is never fully published and decays rapidly. As an example, the crystallography services in typical chemistry departments such as the University of Cambridge or Imperial College London carry out hundreds of analyses per year. Each is publishable in its own right, but the majority remain as "dusty files" where the effort required to "write them up" for a full peer-reviewed paper cannot be found.
Yet these data are among the highest quality scientific experiments performed in any discipline. All information is produced electronically and only about 1% are found to be incorrect in some way. They contain very rich information. Nearly 1000 peer-reviewed papers have been published on information extraction ("data mining") from such crystallographic data alone. The International Union of Crystallography has produced an impressive electronic-only publication process where the complete "manuscript" is submitted electronically, and reviewed not only by humans but by extensive computer programs ("robots"). Such manuscripts are "almost always" accepted if there are no technical errors. Yet well over 80% of such material lies unpublished and unavailable to science.

Why? In some cases the scientists wish to have first use of their data and do not want competitors to get it. This was common in the protein crystallography community, which has developed acceptable practices such as putting data "on hold" for, say, six months. In each discipline the practice will vary. Often the result is that the scientific public gets a summary of the work in (e)paper form but has not enough information to repeat the experiment. This is particularly true for in silico experiments (such as quantum mechanical calculations on molecules) where unless the reader has complete knowledge of the input information and installation details of the program, they may get different results or behaviour. Frequently a reader will carry out the experiment again from scratch, as the published information is insufficient. A serious consequence is that data- and text-mining is non-existent in many communities - they lack a sufficiently large corpus to make it useful. Crystallography had J. D. Bernal (Goldsmith 1980), a visionary far ahead of his time like Bush (1945) and, in a more general scientific sense, Garfield (1962).
Both the latter foresaw the globalisation of information and laid the infrastructure for the archiving of scientific and crystallographic information. A feature of many sciences is that information is "micropublished" in many different journals. There are c. 3,000,000 new chemical compounds reported each year but few journals carry more than about 50 in any one article. Thus information about chemical compounds becomes spread over perhaps half a million articles each year. There are three main approaches to integrating and coordinating such micropublished information:
Until recently this was inevitable, but now we have the technology to address this. Many information components in a hyperdocument can be recast as context-free XML and integrated with XML text and XML graphics. Here we show the overall information architecture with reference to the latest proofs-of-concept in the chemical field.

3 Robot readers and the digital age

The current transition to e-journals seems to be welcomed by many - but not us. E-journals published in portable document format (PDF) have missed a great opportunity for change and brought little value to the scientific community (in this sense, portable really means print anywhere rather than re-use anywhere). Many readers still print their reading on paper, so the effect is merely to transfer the cost of producing paper journals (including mail) to the readers' printing bills. Even where readers use the screen there are few or no tools to manage this information - each scientific article is a distinct entity whose linear concept dates from the 19th century. Electronic TOCs and bibliographic hyperlinks may provide some value but the idea of a dynamic knowledge base for the benefit of the community is wholly lacking. We accept that business goals and methods cannot change overnight, but novel forms of communication have usually been ignored. For example, the authors have pioneered e-conferences (Rzepa et al. 1995), e-courses (Murray-Rust et al. 1995) and sit on the board of an innovative e-journal where datuments can be published (Gkoutos et al. 2001). These and similar efforts in other disciplines have been largely ignored. The brave new world articulated in many of the talks at the first World-Wide Web conference in 1994 (Rzepa 1995), which foresaw radically new ways in the digital age, has been largely stifled by conventional business interests and methods.

A common feature of all mainstream science publication is the universal destruction of high-quality information.
Spectra, graphs, etc., are semantically rich but are either never published or must be reduced to an emasculated chunk of linear text to fit the paper model. The reader has to carry out "information archeology" using the few bricks that remain from the building. The true vision of the digital age is to use information beyond the limitations of paper. We use the test of the "robotic scientific reader". This robot can read and understand scientific discourse such as papers and emails. The understanding is very limited and has carefully controlled semantics but it has several major advantages over human reading:
Figure 1. Information loss in the current publication process. The author (a human/machine symbiote) has a rich (if legacy) information environment. This is downgraded to PDF during publication. The two images have "identical" content but use different technologies: a, SVG, can be scaled indefinitely without corruption, can be used for information extraction (e.g. "PDF" can be retrieved) and can be re-used in whole or part (human readers new to SVG should visit the W3C site http://www.w3.org/Graphics/SVG/ to get a plugin or other viewing technology); b, corresponding JPEG (ten times as large a file), shows the loss and the near impossibility of any information extraction.

This is not science fiction. A program undertaken at Cambridge (Murray-Rust et al. 2003) has resulted in robots that can read and understand most of the data in a typical paper on the synthesis of new chemical compounds. The robots can read a paper in c. 5 seconds and create a complete datument of all analytical information. Using XSLT stylesheets the robots can answer trivial (chemical) questions like:
It is easily conceivable that robots could take action on reading papers, such as "find all inhibitors of HIV protease in J. Med. Chem., order them from suppliers or where unavailable repeat the syntheses". In practice this will still require human oversight for some years, but it illustrates the power of the semantics. This discourse, therefore, is a call for "accessibility for robots as well as humans".

4 Datuments, transclusion and integrity

A datument is a hyperdocument for transmitting "complete" information including content and behaviour. We differentiate between "machine-readability", merely that a document such as a JPEG image can be read into a system, and "understandability", where the machine is supplied with tools which are semantically aware of the document content. Examples of the latter are domain-specific XML components such as maps (GML), graphics (SVG) and molecules (Chemical Markup Language, CML). Understandability may require ontological (meaning) or semantic (behaviour) support for components. Neither is yet fully formalised but within domains it is often possible to find that certain concepts are sufficiently agreed that programs from different authors will behave in acceptable manners on the same documents. We shall assume that most scientific disciplines can, given the will, support machine-understandability for large parts of their information.

In principle datuments can be infinite in size, both in terms of the semantic and ontological recursion and the need to provide complete information for every component. For example, a scientific paper has citations that are also datuments and which may be required to create the complete knowledge environment. In principle, also, a datument can be dynamic with components changing in time. Nonetheless we believe that in many sciences bounded static datuments are of great value and that many primary publications are valuable as such.
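The distinction drawn in this section between "machine-readability" and "understandability" can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' software: it assumes a simplified, namespace-free CML-style fragment (the `molecule`, `atomArray` and `atom` elements and the `elementType` attribute follow common CML usage, while the molecule and its ids are invented here), and `hill_formula` is a hypothetical helper name. Because the markup says which items are atoms and which element each one is, a program can derive a molecular formula, something it could not do from a JPEG of the same structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified CML-style fragment (assumption: no namespace, no coordinates or
# bonds; the molecule and the atom ids are invented for this sketch).
CML = """
<molecule id="methanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
    <atom id="a3" elementType="H"/>
    <atom id="a4" elementType="H"/>
    <atom id="a5" elementType="H"/>
    <atom id="a6" elementType="H"/>
  </atomArray>
</molecule>
"""


def hill_formula(cml_text: str) -> str:
    """Return the molecular formula in Hill order for a CML-style molecule."""
    root = ET.fromstring(cml_text)
    counts = Counter(atom.get("elementType") for atom in root.iter("atom"))
    if "C" in counts:
        # Hill order: carbon first, hydrogen second, the rest alphabetically.
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # Without carbon, every element is listed alphabetically.
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


if __name__ == "__main__":
    print(hill_formula(CML))
```

The same function applies unchanged to any fragment that follows the same element and attribute conventions, which is the sense in which agreed domain markup lets programs from different authors behave consistently on the same documents.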
Classical transclusion normalises information by providing a single copy of each component and providing links to, rather than copies of, such sources. This works well on the Web as long as integrity is regarded as relatively unimportant (or at least poor integrity can be "lived with"). It also works where a single (monopolistic) supplier has control over all the transcluded information. In a heterogeneous environment it does not yet work. A supplier of transcludable content may have little business or moral motivation to provide continued integrity. A primary publisher may have no contractual obligation to continue to support authors' supplemental data or even full text indefinitely. While transclusion may work where microcontent is of very high value (e.g. arts and literature) it is difficult to see a business model in science. An alternative model is the datument "snapshot" where all the components are copied and aggregated at "time of publication" (Figure 2). While this forgoes the power of dynamic linking, it enriches the original material enormously. An example could be a scientific thesis with multiple components, including generic components such as:
Remarkably, models for such aggregation are already arising within the so-called "blogging" communities, which are united by their published "Web logs" and some degree of semantic and ontological unity achieved using RSS metadata feeds (Murray-Rust and Rzepa 2003a, 2003b).

5 Open information

This article is addressed to those communities who genuinely wish to share scientific information. We believe that "most" scientists wish their data to be re-used, even if it occasionally leads to embarrassing retractions and revisions. Many authors do not recognise the value of aggregating their micropublished work, although this tradition has been common for 200+ years. We hope the datument will show that mutual contribution leads to a vastly richer resource for scientific discovery.

We accept that certain data cannot be made freely available through patient confidentiality, patentability, etc. We are, however, urging that all data published in the primary literature be openly available for re-use. "Free" does not necessarily mean open, as re-use may be prohibited. By "open" we mean that the information can be aggregated, filtered and redistributed, and derivative works can be made, subject to appropriate license conditions. In open source software these licenses are well explored and (to paraphrase) include the preservation of original authorship, details of any changes in derivative works (if allowed) and full access to source code (not merely executable functionality). A datument is generally composed of components from many sources. If these sources have any barriers to re-use the distributability and re-use of the datument is severely limited. Among the barriers are:
This could be simplified if authors made it clear they were making the complete scientific datument openly available. In most cases it has been created before submission to the publishers and we see little reason why copyright should be reassigned. If compromise seems inevitable we have heard of a recent case where authors keep copyright of the original manuscript and the publishers have copyright of the form that appears "in print" with pagination. The international scientific unions have emphasised the importance of data being publicly available to the scientific community. In our view authors must not hand over copyright of the "data" to publishers. The datument (perhaps eviscerated of some of its "text") should be regarded as "data" and published in open view. We show how this is technically straightforward and manageable with marginal costs.

6 The practice of publishing datuments

Although datuments are expressed in XML (Figure 3), this is not (yet) the format in which most scientists work. Data and text are collected in a variety of (often proprietary) non-extensible legacy formats, many in binary form. The two strategies are:
Each domain will have to create a significant amount of infrastructure and technology. In some cases this is well understood and under construction. We illustrate it from our own subject of molecular science (with the CML family of languages) and expect that the structure will map to other disciplines. With the help of the open source community we have created:
7 Datuments and Hypertext

The datument is therefore a hypermedia document accessible to robots and humans. At the ACM Hypertext conference we were impressed by the developments in human-understandable hypermedia but felt that robots were neglected in comparison. Web hypermedia systems are largely aimed at human readers and have few concessions for robots. Much of the analysis is post facto - analysing how humans and metadata-deprived robots navigate rather than building global hyperstructures ab initio. Developments such as ZigZag (Nelson 2004) with a non-traditional information structure are exciting but it will require much evangelism before they become tools in mainstream publishing.

8 Datument technology: a novel approach

This article contains two small examples of datuments (Figures 4 and 5) of published scientific information and both incorporate a mixture of "text" and "data". Their subject matter is chemistry but readers need no detailed domain knowledge. They are interactive, but are not just another example of scientific multimedia or hypermedia. We stress that the content is independent of the presentation and the graphical displays are created by tools operating on the display-neutral datuments. For example, a graphical display is irrelevant to a robot reader.

We argue that a cultural change in our approach to information is needed and that money on its own will not solve it. Indeed, greater investment in mainstream publishing may worsen the situation. The publishers' primary selling point is their impact factor, not necessarily the functionality of the product. Funders and academic bodies compound this, and novel initiatives are often not welcomed if they have low impact. The model of publication must therefore change. Realistically this will take time but we have to create something where the benefit is to the scientific community, and where the practitioners can be visionary. We propose students and their theses or reports as fertile ground.
Students have less fear of the impossible and less legacy to unlearn. We have involved both undergraduate and postgraduate students in authoring XML in many of the ways shown above and they have not only picked it up quickly but added their innovations. We therefore suggest that positive incentives should be given to students to create their theses as XML datuments. We illustrate this approach with an example derived from a small part of a typical student chemistry thesis (Figure 4). The original component of the thesis is written in XML, with the chemistry carried directly using CML, itself an XML language. This datument can then be transformed into different representations for human assimilation. Figure 4a illustrates its conversion to an Acrobat file, destined largely for those humans who wish to print or archive the content, whereas the same datument can be transformed into e.g. Figure 4b, where the chemical content can now be viewed using either SVG (for 2D perception) or directly using a Java applet (where 3D perception might be needed).

Figure 4. a, Acrobat file derived from a chemistry datument; b, the same content presented using SVG/JMol viewers (both presentation styles are derived from the same XML datument, documents will display in a separate browser window).

The molecular structures emphasize the re-use of XML in three ways:
What are the immediate benefits of this approach? Some examples, which we contend may immediately save the student work:
Figure 5. The use of XML and XSLT to provide a variety of rendering and transformation styles for scientific documents (this should be viewed using an XML/XSLT compliant browser such as Internet Explorer 6, document will display in a separate browser window).

Here the two datuments (organic chemical synthesis and computational chemistry) are cast in XML and retransformed on-the-fly by XSLT stylesheets. These transformations involve re-use of the information (filtering, sorting, tabulation, transformation of values). The stylesheets are independent of the precise content of each article and therefore applicable to a wide range of datuments.
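The four kinds of re-use named above (filtering, sorting, tabulation, transformation of values) can be sketched in a few lines. The article's own examples use XSLT stylesheets; the sketch below swaps in Python's standard `xml.etree.ElementTree` purely so that it is self-contained, and it is not the authors' stylesheet. The `datument` and `compound` elements, their attributes, the data values and the `tabulate` function are all hypothetical names invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical datument fragment: element names, attribute names and values
# are invented for this sketch and do not come from the article's XML.
DATUMENT = """
<datument>
  <compound name="A" yield="62" mpCelsius="101.5"/>
  <compound name="B" yield="35" mpCelsius="88.0"/>
  <compound name="C" yield="91" mpCelsius="143.0"/>
</datument>
"""


def tabulate(xml_text: str, min_yield: float = 50.0) -> str:
    """Filter, convert, sort and tabulate the compounds in a datument."""
    root = ET.fromstring(xml_text)
    rows = []
    for compound in root.iter("compound"):
        percent = float(compound.get("yield"))
        if percent < min_yield:
            continue  # filtering: drop low-yield compounds
        # transformation of values: melting point from Celsius to kelvin
        kelvin = float(compound.get("mpCelsius")) + 273.15
        rows.append((compound.get("name"), percent, kelvin))
    rows.sort(key=lambda row: row[1], reverse=True)  # sorting: best yield first
    # tabulation: one tab-separated line per compound under a header line
    lines = ["name\tyield/%\tmp/K"]
    lines += [f"{name}\t{percent:.0f}\t{kelvin:.2f}" for name, percent, kelvin in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(tabulate(DATUMENT))
```

The function inspects only element and attribute names, never the particular compounds, so it applies to any datument that follows the same conventions. That mirrors the point made above that the stylesheets are independent of the precise content of each article.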
Acknowledgements

PMR thanks Adam Moore, Helen Ashman and ACM Hypertext 2003 for the invitation to speak on the panel and discussions with many delegates. They have also made valuable editorial suggestions. We thank our students (Sam Adams, Vanessa de Souza, Joe Townsend, Chris Waudby (Cambridge) and Mark Williamson (Imperial College)) for inspiration from their projects (to be reported elsewhere).

References

Bush, V. (1945) "As we may think". Atlantic Monthly, July http://www.stanford.edu/class/history34q/readings/Bush/Bush_AsWeMayThink.html

Garfield, E. (1962) "The Ideal Library - The Informatorium". Current Contents, June 19

Gkoutos, G. V., Murray-Rust, P., Rzepa, H. S., Viravaidya, C. and Wright, M. (2001) "The Application of XML Languages for Integrating Molecular Resources". Internet J. Chemistry, Vol. 4, article 12

Gkoutos, G. V., Rzepa, H. S., Clark, R. M., Adjei, O. and Johal, H. (2003) "Chemical Machine Vision: Automated extraction of chemical meta-data from raster images". J. Chem. Inf. Comp. Sci., Vol. 43, 1342-1355

Goldsmith, M. (1980) A Life of J. D. Bernal (London: Hutchinson), pp. 219

Harrison, K., May, P. and Rzepa, H. S. (Editors) (2003) The Exemplarchem Project: an Internet based exhibition of exemplary project work in Chemistry http://www.exemplarchem.org/

Mills, A. and Murray-Rust, P. (1995) Principles of Protein Science, course notes http://www.cryst.bbk.ac.uk/PPS2/top.html

Murray-Rust, P., Adams, S., De Sousa, V., Townsend, J. and Waudby, C. (2003) unpublished projects

Murray-Rust, P. and Rzepa, H. S. (2003a) "XML for scientific publishing". OCLC Systems & Services, 19(4), 163-169

Murray-Rust, P. and Rzepa, H. S. (2003b) "Towards the Chemical Semantic Web. An introduction to RSS". Internet J. Chem., Vol. 6, article 4 http://www.ijc.com/abstracts/abstract6n4.html

Murray-Rust, P., Rzepa, H. S., Williamson, M. J. and Willighagen, E. L. (2004) "Chemical Markup, XML and the Worldwide Web. Part 5. Applications of Chemical Metadata in RSS Aggregators". J. Chem. Inf. Comp. Sci., Vol. 44, No. 4

Murray-Rust, P., Rzepa, H. S. and Wright, M. (2001) "Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content". New J. Chem., No. 4, 618-634 http://www.rsc.org/CFmuscat/intermediate_abstract.cfm?FURL=/ej/NJ/2001/B008780G.PDF&TYP=

Murray-Rust, P., Rzepa, H. S., Wright, M. and Zara, S. (2000) "A Universal approach to Web-based Chemistry using XML and CML". ChemComm, 1471-1472

Nelson, T. (2004) Zigzag Software: Design for a New Computer Universe, January

Rzepa, H. S., Goodman, J. M. and Leach, C. (eds) (1995) Proceedings of the First Electronic Conference on Trends in Organic Chemistry (ECTOC-1) (The Royal Society of Chemistry) http://www.ch.ic.ac.uk/ectoc/ectoc_conf.html#proceedings

Rzepa, H. S. (editor) (1995) "WWW94 Proceedings". Computer Networks and ISDN systems, 27(2), 135-341 http://www94.web.cern.ch/WWW94/PrelimProcs.html