



Semantic Problems of Thesaurus Mapping

Institute of Computer Science, Foundation for Research and Technology – Hellas, Heraklion, Crete, Greece
Email: martin@ics.forth.gr
Abstract

With networked information access to heterogeneous data sources, the problem of terminology provision and interoperability of controlled vocabulary schemes such as thesauri becomes increasingly urgent. Solutions are needed to improve the performance of full-text retrieval systems and to guide the design of controlled terminology schemes for use in structured data, including metadata. Thesauri are created in different languages, with different scope and points of view and at different levels of abstraction and detail, to accommodate access to a specific group of collections. In any wider search accessing distributed collections, the user would like to start with familiar terminology and let the system find out the correspondences to other terminologies in order to retrieve equivalent results from all addressed collections. This paper investigates possible semantic differences that may hinder the unambiguous mapping and transition from one thesaurus to another. It focusses on the differences of meaning of terms and their relations as intended by their creators for indexing and querying a specific collection, in contrast to methods investigating the statistical relevance of terms for objects in a collection. It develops a notion of optimal mapping, paying particular attention to the intellectual quality of mappings between terms from different vocabularies and to problems of polysemy. Proposals are made to limit the vagueness introduced by the transition from one vocabulary to another. The paper shows ways in which thesaurus creators can improve their methodology to meet the challenges of networked access of distributed collections created under varying conditions.
For system implementers, the discussion will lead to a better understanding of the complexity of the problem.

1 Introduction

Terminological resources are increasingly important for information retrieval in wide area networks, for retrieving documents by querying databases and metadata employing controlled vocabularies. In particular, thesauri which organize terms and associated concepts in the form of simple semantic networks become important tools for searching through the rapidly growing electronic information flood. There is growing interest in developing automated intermediaries to negotiate the differences between controlled vocabulary schemes so that a user can use a familiar set of terms to search collections using other vocabulary schemes.

The paper discusses the effect of thesaurus mapping on the vagueness in retrieval from a theoretical and logical point of view, separate from the effects of the relation of the thesaurus to the collection it addresses. Therefore, it makes ideal assumptions for the latter without going into any detail. It assumes that a certain collection is tightly connected to a terminological resource, which is given in the form of a thesaurus containing typed relations between terms and concepts, as defined by Foskett (1997). By "tight connection" we mean, in one case, that data values or classification terms are restricted to this thesaurus and consistently used. In that case, a query with some term should retrieve exactly the objects meant (this may actually require the expansion of a query term into its narrower terms). Let's refer to this case as the controlled vocabulary situation. Alternatively, the collection may use free text or free keywords. In this case, what we mean by "tight connection" is that a search-aid thesaurus exists that approximates the (expert) language used in the collection and the associated concepts. In this case, we assume that this thesaurus provides better results for that resource than any other.
Let's refer to this as the free text situation. Finally, we are not concerned with documents only, but with museum object descriptions and others as well. The term objects is used here to cover all collection objects, not just text documents.

1.1 Related Work

There is a vast literature and established practice about creating thesauri for the purpose of information description and retrieval (for example, publications of the Getty Research Institute, the British Arts and Humanities Data Service (AHDS), and various national standards). Thesauri are designed as an agreement and compromise on a set of shared terms for common concepts. Even though the agreement on term definitions is sought over large communities, thesauri even in the same domain differ significantly (Krause 2000), and there is limited interoperability between tools and digital resources employing different thesauri.

To resolve the incompatibility between different terminology resources, research has initially concentrated on attempts to unify thesauri by merging, e.g. Mili and Rada (1988), Mannino et al. (1988), Rada and Martin (1987). The Unified Medical Language System (UMLS) merges concepts from some 50 sources into a metathesaurus, which retains links to its original sources. It is probably the largest merging effort undertaken so far (Nelson 1999). Two problems have turned out to be the most difficult:
The problem of combining subject headings and classification systems, such as Library of Congress Subject Headings (LCSH) and Dewey Decimal Classification (DDC), has been investigated by Chan (2000) and Vizine-Goetz (1998). Chan writes: "How to combine the salient features of a rich vocabulary like LCSH and the structured hierarchy found in classification schemes such as LCC and DDC to improve retrieval of networked resources remains a fertile field for research and exploration", and advocates the harmonization of vocabularies. Chan refers to the efforts of the Library of Congress to make the LCSH more useful for networked access by improving its faceted structure, the principles of term construction, and rigorous term relationships. In this paper, we argue for development in the same direction.

1.2 Intellectual versus Automatic Term Correlation

Term correlations may be intellectually created, as by MACS, the French Ministry of Culture (see Merimee), the HEREIN Project, and CARMEN (Krause 2000). Krause calls intellectually created term correlation lists "cross-concordances." Alternatively, they may be created by statistical methods or even by neural networks, as in CARMEN and Vizine-Goetz (1998). Chen et al. (1996) use a concept space approach to create thesauri automatically and to traverse between two thesauri from different biological subdomains. They report a considerable increase of recall in a cross-domain search experiment, but differences between the links provided by the algorithm and those given by experts. Statistical and neural network methods do not easily allow interpretation of the intellectual nature of a given link, if at all. They are far cheaper, however, and can detect relations of which humans are unaware. The ultimate precision is usually low. As Chan (2000) puts it: "the tension between quality and quantity has never been keener."
As this paper deals with semantic problems, we do not consider statistical methods, even though we are convinced that the future lies in the coordinated combination of intellectual and statistical methods, as pursued by the CARMEN project and others.

An interesting point in Chen et al. (1996) is the report that the term associations users made in cross-domain searching were context-driven: "...Based on our protocol analysis, we found that several contexts for these similarities and differences existed, including, two genes were identified by similar (or different) experimental strategies; their cellular structures had similar (or different) composition; ... genes manifested similar or different phenotypes; genes or proteins had similar or dissimilar sequences (homology) or contained similar motifs or domains; proteins or genes performed similar (or dissimilar) functions;... ." This paper contributes to a better understanding of such phenomena. Even though semantic heterogeneity of terminological resources has frequently been referred to as a problem, a systematic analysis of its intellectual basis and structure has not been carried out. Krause (2000) writes: "... the information market over the past twenty years... mainly views the development in distributed databases, user interfaces and the Internet as technological improvements and problems of standardization, without addressing the conceptual challenges involved."

1.3 Thesaurus Mapping in One Domain

We regard thesaurus mapping as the process of identifying terms, concepts and hierarchical relationships that are approximately equivalent. It is a central process for merging thesauri, metathesaurus and cross-concordance construction, and thesaurus switching. This paper investigates the problems of finding appropriate equivalents, in particular focussing on issues related to polyhierarchies and the relationships between compound and non-compound terms.
Figure 1. Scenario using correlated thesauri

The following assumes a general scenario (Figure 1) of a user addressing different digital collections using a particular thesaurus of choice, which is mapped to thesauri in other languages, to more specialized vocabularies, or to different versions of the thesaurus. We adopt the notion of a two-step process from Krause (2000a), which separates the vagueness introduced by thesaurus mapping (step one) from that introduced by the relationship between the user query and the document (step two). To separate these effects intellectually, each thesaurus is assumed to be consistent with the indexing of one or more collections, i.e. a correct user query to one collection through its own thesaurus yields full recall and high precision. Such ideal conditions can exist, for instance, in databases indexing museum objects with a thesaurus about the physical object types. We further assume that the set of objects in all collections is basically of the same nature and from one domain, to exclude another source of vagueness. Under these conditions, we attribute the remaining heterogeneity between different thesauri to (see Doerr 1996, p.3):
Section 2 presents well-known ideas of concept-based thesauri in order to clarify several notions. First, we define the kind of mapping we mean, to clarify the differences to other approaches. Second, we discuss distinct classes of "multilingual thesaurus", a term that is used in a fuzzy way in the literature. Third, we refine the current notions of equivalence expressions, as given in ISO5964, in order to conform to certain logical requirements. Based on that background, we make a novel proposal for a methodology of mapping that allows for controlling vagueness in cross-thesaurus retrieval. This becomes increasingly relevant, as several projects are beginning to create such mappings on a large scale. Section 3 studies effects that may either impede the definition of equivalences between terms or between hierarchies, or impede the exploitation of semantic relations in the target thesaurus for query expansion. Chan (2000) regards thesauri as "a query expansion device", a virtue that should be preserved through mapping.

2 Application of Concept-based Mapping

Even though the ISO Guidelines for the establishment and development of multilingual thesauri (ISO5964) were published in 1985, it is only now that such mappings are being attempted on a larger scale. Since its publication, no specific methodology has been proposed about which terms should be correlated with equivalence relations and which not in the absence of exact equivalence, and how this would affect the query or information retrieval quality of the pair of correlated thesauri as a whole.

2.1 Concept-based mapping

In this section and some of the following we talk about sets in the mathematical sense. By objects we do not mean only documents. An object may be anything in an electronically registered collection: a potsherd, a stool, a palace, an image, or a text. By sets we usually refer here to sets of such objects, typically defined by the sharing of one or more common properties.
We cannot go into more details about Description Logic and similar theories here, and we can mention only the basic idea:

Under certain assumptions, preferred terms, so-called "descriptors", can be identified with concepts. Each concept in turn can be identified with the intension of a set of objects. Subsequently, we can transform the mapping problems into a mathematical problem about sets, i.e. terms are identified with the sets of objects they correctly classify (see Doerr and Fundulaki 1998 for details). "Correctly" is a question of user convention, and we assume that users can in general positively decide which term is correctly applied and which not. This assumption provides an absolute measure to compare concepts in thesauri even between multiple languages. As long as objects in a large enough database are classified in a well-defined way with two thesauri in parallel, set-relations between the concepts of both thesauri can be approximated automatically (as by Amba et al. 1996). Any inconsistencies can then be reduced to human errors. Such assumptions are well known and basic to Description Logic (DL), e.g. Baader et al. (1992), Borgida (1995), DL Web site, and implicit in many thesauri describing physical objects. In practice, not all subset relations may have been expressed in a thesaurus and term interpretation can be context dependent in a complex way, as will be discussed later. To make a clear distinction from statistical or neural network methods, let's define "concept-based mapping". The principles are:
2.2 About Multilingual Thesauri

Often any kind of relation between terms from thesauri in different natural languages is referred to as a translation. In our opinion, translation in the proper sense differs from concept-based mapping and cross-concordances in significant ways. In the AQUARELLE project, Dachelet (1997) proposed distinctions between different kinds of multilingual thesauri. To clarify the differences, we define the following classes of "multilingual thesaurus":
Figure 2. Demonstrating different notions of a multilingual thesaurus in one context

2.3 Equivalence Expressions

Equivalence expressions similar to those in ISO 5964 are used with increasing frequency for thesaurus mapping: in the Merimee "Thesaurus Architecture", CARMEN, MACS, the HEREIN Project and others. Based on the idea of concept-based mapping and on the argument that correlated thesauri should serve equivalent retrieval results across systems employing different terminological resources, we have proposed that the expressive power of the mapping should be at least equivalent to the expressive power of the search paradigm. Otherwise, the user could express better queries in each target system than the mapping mechanism could provide. Doerr and Fundulaki (1998) investigated the mapping equivalent to the Boolean expressions (AND, OR, NOT) foreseen by the Z39.50 protocol. We found that a slight extension to ISO5964 is sufficient to achieve equivalence expressions with the expressive power of Boolean queries. For that purpose, we interpret the equivalence expressions of ISO5964 as concept-based mappings, i.e. as set relations of the associated sets of objects. This seems to be justified by the Venn-diagram-like illustrations in ISO5964. We make the following interpretations and extensions:
So far, these equivalence expressions provide a means to express initial query terms in terms of any target thesaurus. Obviously, any Boolean combination of terms in the initial query can be converted into a Boolean combination of target terms (see below). Figures 3 and 4 demonstrate the semantics of equivalence expressions with Boolean compounds. As an example for Figure 3, the French term bergerie has an exact equivalence to "sheep barns OR sheep folds" in the AAT. The first common broader term of both terms in the AAT is single built works. The obvious broader term animal housing is a broader term in the AAT only of sheep folds, probably because of the AAT's monohierarchy design (see section 3.3). For Figure 4, please see the list in the Appendix. In Figure 4, the dotted circle on the right-hand side indicates where the approximated concept would appear in the target hierarchy, under the assumption that the BT relation expresses subsumption.
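The conversion of a Boolean source query into a Boolean combination of target terms can be sketched as a simple substitution over the query tree. The following is a minimal illustration under stated assumptions, not the mechanism of any of the cited projects: the names `Op`, `translate` and `evaluate`, the mapping entry for "ruine", and the toy index of object identifiers are all invented for the example; only the bergerie equivalence is taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple, Union


@dataclass(frozen=True)
class Op:
    """A Boolean node; kind is "AND", "OR" or "NOT"."""
    kind: str
    args: Tuple["Query", ...]


Query = Union[str, Op]  # a query is a bare term or a Boolean node

# Hypothetical mapping table: source term -> expression over target terms.
# The "bergerie" entry follows the example in the text; "ruine" is invented.
EXACT: Dict[str, Query] = {
    "bergerie": Op("OR", ("sheep barns", "sheep folds")),
    "ruine": "ruins",
}


def translate(query: Query, mapping: Dict[str, Query]) -> Query:
    """Replace each source term by its target expression, keeping the Boolean structure."""
    if isinstance(query, str):
        if query not in mapping:
            raise KeyError(f"no equivalence expression for {query!r}")
        return mapping[query]
    return Op(query.kind, tuple(translate(a, mapping) for a in query.args))


def evaluate(query: Query, index: Dict[str, Set[int]], universe: Set[int]) -> Set[int]:
    """Evaluate a target query against a toy index (term -> set of object ids)."""
    if isinstance(query, str):
        return set(index.get(query, set()))
    parts = [evaluate(a, index, universe) for a in query.args]
    if query.kind == "AND":
        return set.intersection(*parts)
    if query.kind == "OR":
        return set.union(*parts)
    if query.kind == "NOT":
        return universe - parts[0]
    raise ValueError(f"unknown operator {query.kind!r}")


# Source query: bergerie AND NOT ruine
source_query = Op("AND", ("bergerie", Op("NOT", ("ruine",))))
target_query = translate(source_query, EXACT)

# Invented target collection: object ids indexed with target terms.
INDEX = {"sheep barns": {1, 2}, "sheep folds": {2, 3}, "ruins": {3, 4}}
UNIVERSE = {1, 2, 3, 4, 5}
result = evaluate(target_query, INDEX, UNIVERSE)
```

Because terms are read as sets of objects, the substitution preserves the Boolean structure of the query. A source term without any equivalence expression makes the translation fail, which is the situation the mapping methodology in section 2.4 is meant to prevent.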
Figure 3. Demonstrating equivalence to OR combinations of terms
Figure 4. Demonstrating equivalence to AND combinations of terms

2.4 Methodological Aspects

Above, the arguments are constrained to the controlled vocabulary situation: concept-based mappings, derived from ISO5964, used to create correlated thesauri. Under these restrictions the effect of the following methodological arguments can be evaluated theoretically. These restrictions are realistic. For example, the HEREIN project will connect databases about material cultural heritage with thesauri correlated in the style of ISO5964. Manual object classification is precise in the way assumed in section 2.1. Other obvious cases of precise classification are the use of place name authorities (gazetteers) like the Thesaurus of Geographic Names (TGN) and cultural period authorities. Professional users in cultural heritage administration (as investigated in the AQUARELLE and Term-IT projects) and many other disciplines require stricter standards for recall and precision than general users seeking information on the Internet. The optimization of the recall/precision ratio usual in information retrieval does not satisfy the needs of a statistical survey. Chan (2000) also refers to the requirement for recall and precision as distinct: "Subject access tools are used to enable optimal recall ... to enable optimal precision...". In the free-text situation, basically the same arguments presented below should hold, but the effect will not be so explicit because several factors introduce additional vagueness. The question of when the effect of more elaborate correlations vanishes in the vagueness coming from other sources is interesting, but beyond the scope of this paper. The same holds for the question of whether the effort to create mappings intellectually or semi-automatically is affordable or not.
We are satisfied here with the fact that people are increasingly undertaking that effort.

To illustrate the relevance of the following, Table 1 presents some statistics about the carefully produced equivalence expressions from the 1997 editions of the French Merimee "Thesaurus Architecture" to the AAT and NMR thesaurus. Table 2 compares the frequency of equivalence expressions among these three thesauri.
Table 2. Distribution of equivalence relation types in the Merimee mappings
In the mapping to the AAT, AND combinations mainly reflect post-coordination rules. Of the 199 so-called AND combinations used by Merimee to map to the AAT, at least 174 turned out to be role restrictions rather than true AND combinations (see section 3.3). They are listed in the Appendix. NMR is far more detailed and pre-coordinated, therefore OR combinations dominate, with up to 6 terms combined. The AND combinations to NMR follow the logic of Figure 4. As one team has created the above equivalences, the differences must be attributed to the nature of the target vocabularies and not to differences in the practice of the editors. Some 40% of the Merimee terms are not mapped, with no indication at all which terms they relate to in the other thesauri. For example, the French term EDIFICE FUNERAIRE has no equivalence to the NMR. Its narrower terms MAUSOLEE and OSSUAIRE have one equivalence each: mausoleum, ossuary. Obviously, the broader term of mausoleum in NMR, funerary site, could be a broader equivalence of EDIFICE FUNERAIRE, but that has not been declared. Probably the editors felt that it would be "too far". This is an example of the proposal illustrated in Figure 5. Table 2 illustrates the complexity of thesaurus mapping, and shows that mappings not created with a well-founded methodology for networked information retrieval do not provide the necessary qualities, as will be discussed in the rest of this paper. Moreover, we think that providing such quality would not require a much greater effort than that invested in a mapping like that presented above.
Figure 5. Demonstrating semantics of term inclusion

Now consider the question of whether an optimal mapping is possible. If someone uses an arbitrary set of equivalence expressions for correlation, the existence of an equivalence link for some query terms is not guaranteed. So a query using such a term would simply fail against the respective target. Even if an equivalence expression exists, the relation between the intended and the actual query with replaced terms is a priori unpredictable. Equivalence relation creation in a larger environment (Hutchins 1995) causes some "combinatorial explosion"; therefore "switching languages" are proposed as intermediaries. The results of such subsequent mappings are even more undefined. We therefore propose to approximate concepts of a source thesaurus systematically by confining them within the nearest broader and narrower concepts of the target thesaurus (Figure 5). The idea can be stated in the following rules:
Rule 1 defines a notion of "completeness" of the mapping, that is, if Rule 1 holds, the replacement of any query term is possible. Of course, it may be impossible to find any narrower equivalence, in which case we use the empty concept ("bottom" in DL). In this case, the query returns no result, which is consistent. In the case of negation, however, it would return the universal concept ("top" in DL), that is, the whole target base. This is a case to be prohibited. Even so, such a situation can be tolerable in a query refinement cycle, as the actual intermediate results may not be transferred. In rare cases it may even be impossible to find a broader concept, in which case one must map to the universal concept. This situation could be avoided if thesaurus providers agree to share some high-level concepts. In addition, other AND combinations as they appear in a typical user query may "absorb" the universal concepts and return reasonable results. If the result of a translated query would be the whole target base, the user should be informed and given the choice to cancel the query. We see here an area for pragmatic solutions in the framework of application development. Note also that concept inclusions propagate without problem through multiple intermediate translations.

The above qualitative reasoning holds in this form only for simple conjunctive queries. In general, the problem can be reduced to a query containment problem, if the database schema plus terminology are interpreted as a schema altogether. For the latter, elaborate theories about the complexity and decidability of query containment exist (Calvanese et al. 1998 on PODS), which have to be applied on a case-by-case basis. Calvanese et al. (1998) on KR'98 present a fairly general, unified framework for information integration from heterogeneous sources. Rules 2 and 3 above define a notion of an "optimal mapping" in the sense that no closer mapping can be found with these kinds of expressions.
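The behaviour of the empty and universal concepts under negation and conjunction can be made concrete with a small sketch. It is only an illustration of the reasoning above, under loudly stated assumptions: each source term is confined between a narrower and a broader target expression, results are tracked as a pair of bounds, and all object identifiers, the term "TERM WITHOUT NARROWER", and the function names are invented. The broader link from EDIFICE FUNERAIRE to funerary site is the undeclared link suggested earlier in this section, used here hypothetically.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Objects = FrozenSet[int]
Bounds = Tuple[Objects, Objects]  # (largest smaller result, smallest larger result)

UNIVERSE: Objects = frozenset(range(1, 9))  # toy target base ("top")
BOTTOM: Objects = frozenset()               # empty concept ("bottom")

# Hypothetical extensions of target terms (object ids are invented).
TARGET: Dict[str, Objects] = {
    "funerary site": frozenset({1, 2, 3, 4}),
    "mausoleum": frozenset({1, 2}),
    "ossuary": frozenset({3}),
}

# source term -> (narrower target terms, broader target term or None).
# Assumed for illustration only; the second source term is invented.
MAPPING: Dict[str, Tuple[Tuple[str, ...], Optional[str]]] = {
    "EDIFICE FUNERAIRE": (("mausoleum", "ossuary"), "funerary site"),
    "TERM WITHOUT NARROWER": ((), "funerary site"),
}


def bounds(term: str) -> Bounds:
    """Confine a source term between target concepts; fall back to bottom/top."""
    narrower, broader = MAPPING[term]
    lower = BOTTOM.union(*(TARGET[t] for t in narrower))
    upper = TARGET[broader] if broader is not None else UNIVERSE
    return lower, upper


def b_and(a: Bounds, b: Bounds) -> Bounds:
    return a[0] & b[0], a[1] & b[1]


def b_or(a: Bounds, b: Bounds) -> Bounds:
    return a[0] | b[0], a[1] | b[1]


def b_not(a: Bounds) -> Bounds:
    # Negation swaps the roles of the bounds: an empty lower bound
    # turns into the whole target base as upper bound.
    return UNIVERSE - a[1], UNIVERSE - a[0]


def returns_whole_base(b: Bounds) -> bool:
    """True if the larger result is the whole base, so the user should be warned."""
    return b[1] == UNIVERSE


edifice = bounds("EDIFICE FUNERAIRE")
negated = b_not(bounds("TERM WITHOUT NARROWER"))
combined = b_and(edifice, negated)
```

In this toy setting, `negated` has the whole target base as its larger result, which is the case the text asks to flag to the user, while the conjunction in `combined` absorbs the universal concept and yields a bounded result again.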
If one actually chooses Boolean combinations rather than only the primitive concepts for the term inclusion, things may become algorithmically complicated and can go beyond the capacity of normal expert insight. Not even typical Description Logic implementations such as CLASSIC or FaCT answer such questions. On the other hand, if the mapping is close to optimal, the vagueness control still exists. Here is an area for further applied research. Finally, the use of Boolean compounds poses some more methodological questions. Whereas other equivalence expressions can be read (anti-)symmetrically (reciprocally), e.g. narrower equivalence reads reciprocally as broader equivalence, a Boolean compound cannot be easily interpreted in the opposite direction (see Figures 3 and 4). Thesaurus or DL tools don’t even normally indicate which concepts are used in a Boolean compound. Our laboratory has implemented a research prototype (Ntoas 1999) on top of the thesaurus management system SIS-TMS (Doerr and Fundulaki 1998a), which indexes expressions containing Boolean operators and role restriction ("restrict [p,C]") and relates them to their immediate broader and narrower terms. Future research could address the question of the degree to which Boolean compounds or more complex DL expressions of some mapping can be exploited to calculate equivalence expressions in the opposite direction. Summarizing, under the given assumption the proposed mapping methodology allows for the propagation of any query to collections classified with thesauri that have been mapped to one another. The query will not fail for any given query term, except if the whole database should be returned. As concepts do not precisely match, the results cannot always be equivalent. Rather, following the choice of the user, a (smallest possible) larger result or a (largest possible) smaller result can be returned, which could be further refined through post-processing steps.
Such results can be used for statistical purposes.

For the free text situation (see section 1), inexact equivalence as defined in ISO5964 should be quite useful in balancing recall and precision. I guess, however, that the appropriate generalization of an inexact equivalence for full-text retrieval is a correlation based on associated relevance weights. We shall not follow this subject further here.

All of the above is based on the ideal assumption that the correlated thesauri follow the same rules and that their BT relations are complete, consistent, and follow the same logic. If this is not the case, some consistency checks can be carried out. At least it can be verified whether manually or automatically derived equivalence relations cause cycles with the given BT relations on either side, i.e. whether some concept seems to include one of its broader concepts. In some cases the search for equivalences may reveal missing additional BT relations on either side, as sheep barns BT animal housing above. The rest of this paper is devoted to thoughts about the reasons for inconsistencies between hierarchies, the problems appearing in reality.

3 Heterogeneities of the Hierarchical Structure

If two correlated thesauri use subsumption hierarchies (or IsA relationship) and declare explicitly all direct subsumption relations between their primitive (non-compound) concepts, many useful applications become possible. The transitivity of subsumption allows expansion of query terms into their narrower terms to arbitrary depth, in particular by correlated concepts of another thesaurus and their narrower terms. This allows switching use from one thesaurus to another, e.g. to a more specialized one. Thus a general high-level thesaurus can be federated with a series of application-specific thesauri.
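Both the query expansion across correlated thesauri and the cycle check mentioned above operate on the same structure: a directed graph of "narrower than" links drawn from the BT relations of each thesaurus and from the equivalence links between them. The following minimal sketch assumes exactly that representation; the prefixes "MER:" and "AAT:", the function names, and the deliberately wrong link used in the usage note are assumptions made for illustration, while the three links themselves follow the bergerie example in section 2.3.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]  # (narrower concept, broader concept)

# BT relation inside the AAT plus the two inclusions implied by
# "bergerie = sheep barns OR sheep folds" (example from section 2.3).
LINKS: List[Link] = [
    ("AAT:sheep folds", "AAT:animal housing"),
    ("AAT:sheep barns", "MER:bergerie"),
    ("AAT:sheep folds", "MER:bergerie"),
]


def expand(term: str, links: Iterable[Link]) -> Set[str]:
    """Return the term plus all concepts transitively narrower than it."""
    narrower: Dict[str, Set[str]] = defaultdict(set)
    for n, b in links:
        narrower[b].add(n)
    result, stack = {term}, [term]
    while stack:
        current = stack.pop()
        for n in narrower.get(current, ()):
            if n not in result:
                result.add(n)
                stack.append(n)
    return result


def has_cycle(links: Iterable[Link]) -> bool:
    """Detect whether some concept includes one of its own broader concepts."""
    broader: Dict[str, Set[str]] = defaultdict(set)
    nodes: Set[str] = set()
    for n, b in links:
        broader[n].add(b)
        nodes.update((n, b))
    indegree = {v: 0 for v in nodes}
    for targets in broader.values():
        for b in targets:
            indegree[b] += 1
    # Kahn's algorithm: repeatedly remove concepts with no narrower concept left.
    queue = [v for v in nodes if indegree[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for b in broader.get(v, ()):
            indegree[b] -= 1
            if indegree[b] == 0:
                queue.append(b)
    return removed != len(nodes)
```

Expanding `"MER:bergerie"` over `LINKS` reaches both AAT terms, whereas expanding `"AAT:animal housing"` reaches only sheep folds, which is how the missing relation sheep barns BT animal housing shows up in practice. Adding a deliberately wrong link such as `("AAT:animal housing", "AAT:sheep folds")` makes `has_cycle` report an inconsistency.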
Further, the subsumption relations between all terms of two thesauri can be calculated from a complete mapping in the above sense, and eventual logical inconsistencies can be reduced to human errors and eliminated. In practice, however, term hierarchies often (1) do not express subsumption, (2) are ambiguous, or (3) do not express all immediate subsumption relations. Some reasons and possible solutions are analysed below.

3.1 Hierarchical relation without subsumption

Traditionally, thesauri were printed books, and the hierarchies were used as an association mechanism to lead a user most effectively to a concept for which he or she does not know the term. The sequencing into book pages does not foster the use of polyhierarchies. Hence, thesaurus hierarchies were more like decision trees than semantic relations. With computers, representation restrictions become obsolete (Welty and Jenkins 1999). Fascinating in this context are Ranganathan’s (1965) classical considerations about the obstacles the "notational plane" causes to the development of the "ideal plane". The traditions from editing printed books are not easily overcome, however. So often any hierarchical relation is confused with subsumption, as they are equally useful for user guidance. In our opinion, user guidance and semantic relations are not the same and should coexist in the same thesaurus.

ISO 2788 still regards the part-whole relation (BTP) as a kind of Broader Term relation, whereas e.g. the AAT Editorial Manual (1998) already regards them as a kind of Related Term (RT, "Code 2B"), and no longer as subsumption. Another example is the inclusion of geographical areas in place name thesauri (e.g. the Thesaurus of Geographical Names (TGN)). Obviously, France isA Europe does not hold. Different hierarchical relations hold for temporal intervals and cultural periods, even though there are still few examples of thesauri about periods.
The CIDOC CRM ontology (Doerr and Crofts 1999) refers to these four relations as "forms part of". These four relations can be explained by their extensions to different related sets, i.e. sets of points on the surface of the earth, in an object, on the time-line, and in space-time. Hence they inherit the partial order relation from the subsumption of the respective sets, form (poly)hierarchies, and are therefore frequently mistaken for broader terms. Transitivity does not extend among them, however (see Motschnik (1993) about limited transitivity between different part-whole relations), and therefore they cannot be mingled. Nevertheless, expansion of query terms, e.g. from object types into their parts, can be useful if explicitly required.

Another source of confusion lies in the semantic relations within a set of derived concepts, parallel hierarchies as described by Soergel (1995) or the DL role restriction (Baader et al. 1992). For example, even though "Greece IsA Europe" does not hold, "Greek person IsA European person" does hold. In this case, Greek person is interpreted as Person.who lives in: Greece. Who lives in can be seen as a DL role, here restricted to Greece. Another example: even though "bridge construction IsA bridge" does not hold, "book about bridge construction IsA book about bridges" may be regarded as valid. The use of terms in a specific database field may hide a concept derivation, e.g. when object names are used as subjects or place names as nationalities. Editors may introduce in their thesaurus the subsumption relations correct for that use, i.e. of the derived concepts. Out of context those can be completely wrong. In particular, subject headings often refer to physical objects (e.g. objects in museum collections), but their hierarchies cannot be directly used for object classification, causing frequent disputes between librarians and museum curators.
Characteristically, the Getty Information Institute has rearranged large amounts of terms from the LCSH (Library of Congress Subject Headings) into the AAT, a thesaurus mainly about physical objects (its Object facet, except for the hierarchy information objects). We regard it as worthwhile to investigate under which conditions associations such as the above can cause subsumption relations in their derived concepts. Related to this is Welty and Jenkins' (1999) thorough study on modeling subjects. We would not expect the opposite, i.e. that concept derivation from a subsumption hierarchy may not preserve this hierarchy in the derived concepts. At least theoretically it should not happen (Ntoas 1999). Subject terms used for library cataloging are sometimes interpreted as applying to books which cover the breadth of the term. For example, biology of mammals would be used to denote books about the biology of all mammals, rather than some mammals. In this interpretation, narrower terms are not subsumed. Finally, thesauri such as the "dmoz" project, which lets users act freely, end up with associations motivated by any contextual link, and may even contain cycles and other relationships that break thesaurus rules. For example, both "Top: Arts: Classical Studies: Journals" and "Top: Arts: Classical Studies: Academic Departments" are declared as narrower terms of "Top: Arts: Classical Studies" (DMOZ). Summarizing, there are hierarchical relationships used in thesauri which are not subsumptions. They have to be clearly marked as being of different nature so that correct reasoning can be done. Otherwise, screws may be taken for cars, villages for nations, Andorra for a continent, etc. The use of terms in a specific database field may hide a concept derivation, and out of context hierarchies made for that use could be wrong. 
Therefore, the assumed semantics of hierarchical relations should be made clear before thesauri are correlated and should be communicated as thesaurus metadata or clear notations for the relationships.

3.2 Context-induced ambiguity

Terms and concepts often reveal a polysemy which is disambiguated by the context in which they are used. English, for instance, is full of so-called homonyms or contrastive ambiguity (Pustejovsky 1995), like: orange (color) and orange (fruit); pink (color) and pink (vessel); column (architectural element) and column (text arrangement). Even though some older thesaurus maintainers insist on word-based hierarchies, concept-based hierarchy organization prevails now in computer science. Consequently, the concepts have to be disambiguated. For example, in the AAT, a domain determinator like color in orange (color) disambiguates the concept. The actual word (e.g. orange) can be attached as a non-preferred term or synonym to all possible meanings. WordNet (Miller et al. 1993), for example, uses many-to-many relations between words and concepts, the most consistent approach to represent the real relation.

So far, the problems of homonymy seem to be solved by this so-called sense enumeration (Pustejovsky 1995). Each sense of a term represents a concept independent of contextual influence, and subsumption hierarchies can be designed independent of use. Homonymy is a language-specific feature, i.e. the different senses of one term in one language are normally translated into different terms in another language. There are, however, the more subtle cases, which Pustejovsky calls complementary polysemy, an expression of the dynamic power of the concept formation behind our languages. For example, is door an object or an opening? Is neck a part or a place on a body? Is school an organization or a building? These terms are typically translated one-to-one into other languages for the same set of meanings.
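Sense enumeration with many-to-many links between words and concepts, as described above for contrastive ambiguity, can be sketched in a few lines. This is a minimal illustration, not the data model of WordNet or the AAT: the class `Lexicon` and its methods are hypothetical, the concept identifiers reuse the qualified labels from the examples in the text, and the German word is added here purely to show that homonymy is language-specific.

```python
from collections import defaultdict


class Lexicon:
    """Many-to-many links between words and concept identifiers."""

    def __init__(self):
        self._senses = defaultdict(set)  # word -> concept identifiers
        self._words = defaultdict(set)   # concept identifier -> words

    def link(self, word, concept):
        self._senses[word].add(concept)
        self._words[concept].add(word)

    def senses(self, word):
        """All enumerated senses of a word, in alphabetical order."""
        return sorted(self._senses.get(word, ()))

    def words(self, concept):
        """All words attached to a concept, in code-point order."""
        return sorted(self._words.get(concept, ()))

    def is_ambiguous(self, word):
        return len(self._senses.get(word, ())) > 1


lexicon = Lexicon()
lexicon.link("orange", "orange (color)")
lexicon.link("orange", "orange (fruit)")
lexicon.link("pink", "pink (color)")
lexicon.link("pink", "pink (vessel)")
lexicon.link("column", "column (architectural element)")
lexicon.link("column", "column (text arrangement)")
# Illustrative addition, not from the paper: a German word for the fruit only.
lexicon.link("Apfelsine", "orange (fruit)")

print(lexicon.senses("orange"))        # ['orange (color)', 'orange (fruit)']
print(lexicon.is_ambiguous("orange"))  # True
print(lexicon.senses("Apfelsine"))     # ['orange (fruit)']
```

Such a table handles contrastive ambiguity well, because each sense is an independent concept. It does not by itself address complementary polysemy, where the senses of door or school cannot be cleanly separated into independent concepts.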
Pustejovsky introduces the notion of qualia, the different aspects that may cause a word to change meaning in context. I have the impression that this polysemy may be intrinsic to the concept itself. He talks about the Qualia Structure of nominals, which he analyses in the following main categories (referring also to Aristotle’s notion of modes of explanation):
A clear distinction between concept and term, as in the case of contrastive ambiguity, cannot be made, and Pustejovsky regards sense enumeration as impractical. Three problems arise from that:
Figure 6. Multiple broader terms under multiple aspects

To the best of our knowledge, the problems arising from complementary polysemy for thesaurus design have not yet been studied. Under the above considerations, it is useful to make the qualia of the use of a thesaurus explicit and to add such characteristics to thesaurus metadata. For example, a hammer in the morphological sense, as archeologists would classify items, has nothing in common with a steam hammer. The morphological classification may even provide incorrect broader terms for the functional aspect an engineer needs, who sees both concepts as closely related types of impact devices. On the other hand, thesaurus editors should systematically take into account other aspects of use and identify the additional broader term relations the other aspects require, as long as they are not contradictory. Maybe a better solution would be to make BT relations aspect-specific, as suggested by the colors in Figure 6. A similar situation is shown by Pustejovsky (1995) on p. 145. Of course, we are aware that this may become quite labor intensive if no automatic methods are found. The whole topic leaves many questions open. The work of Chen et al. (1996) on context-driven associations, cited in section 1, confirms our impression that these problems are highly relevant for thesaurus mapping. Such difficulties seem to be often ignored by technicians or taken as unavoidable vagueness of human argument.

3.3 Missing subsumption relations

Besides reasons of different contexts of use, we see the enforcement of monohierarchies and post-coordination rules as the major reasons for missing subsumption relations. As mentioned above, monohierarchies were preferred as long as books were the predominant medium for thesauri. A nice example is the colorant hierarchy of the AAT. Look at the position of crimson lake and carmine in this hierarchy in Figure 7.
- <materials by function>
- <materials by function>
Figure 7. Position in hierarchy of the AAT terms "crimson lake" and "carmine"

There are three inherent aspects: functional form (pigment, dye, lake...), appearance (red, blue, brown…), production or provenance (artificial, organic…). The above sequencing structure seems to have placed the two terms arbitrarily. One could as well start with color or provenance. Carmine does not appear at all under its characteristic color, because priority was given to other prominent features (carmine is used for dye and pigment). The scope note of crimson lake states: "Deep, transparent, ruby-red lake pigment with bluish undertone, made from kermes, a natural dyestuff of insect origin; carmine, a better pigment introduced in the 16th century, became its chief competitor. MAYER." Hence carmine is a red organic pigment, but it does not appear under that term and is far away in the hierarchy from a very similar one.

Imagine a thesaurus federation: a thesaurus may have a leaf node of organic colorant and we would like to employ the AAT as the source for more specific terminology. Even though all concepts are in principle in the AAT, and there is no difference in conceptualization, we cannot continue from organic colorant to narrower terms in the AAT. In a student project, we have created an experimental colorant hierarchy, where each colorant was put directly under three broader terms: functional, color, and provenance terms. The complete result is difficult to show graphically, but browsing is effective, because on descending one branch or the other one comes down to the correct end and no possible broader term is missing.

This brings us to the other point: post-coordination. Obviously it is quite inefficient to define all combinations of terms like artificial inorganic red pigment, synthetic organic green pigment, etc. Therefore, thesauri like the AAT and the German subject headings Schlagwortnormdatei (SWD) use rules to combine terms dynamically.
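The experimental colorant hierarchy described above, in which each colorant receives a broader term for each of the three aspects, can be sketched as follows. This is not the student project's implementation, which the paper does not describe in detail: the class `AspectThesaurus`, its method names and the broader-term labels such as "red colorant" and "lake pigment" are assumptions made for illustration and are not claimed to be AAT descriptors. The assignments for carmine and crimson lake follow the scope note quoted in the text.

```python
from collections import defaultdict

# The three aspects named in the text; the labels are chosen for this sketch.
ASPECTS = ("function", "color", "provenance")


class AspectThesaurus:
    """Polyhierarchy in which every broader-term link is tagged with an aspect."""

    def __init__(self):
        self._broader = defaultdict(set)   # (term, aspect) -> broader terms
        self._narrower = defaultdict(set)  # broader term -> direct narrower terms

    def add_bt(self, term, aspect, broader):
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        self._broader[(term, aspect)].add(broader)
        self._narrower[broader].add(term)

    def broader_terms(self, term):
        """Broader terms of `term`, grouped by aspect."""
        return {
            aspect: sorted(self._broader[(term, aspect)])
            for aspect in ASPECTS
            if (term, aspect) in self._broader
        }

    def expand(self, term):
        """`term` plus every narrower term reachable along any branch."""
        found, stack = set(), [term]
        while stack:
            current = stack.pop()
            if current not in found:
                found.add(current)
                stack.extend(self._narrower.get(current, ()))
        return found


# Hypothetical labels; only the colorant names come from the text.
colorants = AspectThesaurus()
colorants.add_bt("lake pigment", "function", "pigment")
colorants.add_bt("crimson lake", "function", "lake pigment")
colorants.add_bt("crimson lake", "color", "red colorant")
colorants.add_bt("crimson lake", "provenance", "organic colorant")
colorants.add_bt("carmine", "function", "pigment")
colorants.add_bt("carmine", "function", "dye")
colorants.add_bt("carmine", "color", "red colorant")
colorants.add_bt("carmine", "provenance", "organic colorant")

print(colorants.broader_terms("carmine"))
# Descending from "organic colorant" now reaches both terms.
print(sorted(colorants.expand("organic colorant")))
```

With aspect-tagged links, a federated thesaurus whose leaf node is organic colorant can continue into the more specific terms, and a search for red organic pigments can be answered by intersecting the expansions of the three broader terms.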
For reasons of simplicity, only the "+" and "&" signs are used, both heavily overloaded with different interpretations. For example, "factories + grinding" in the AAT means a "factory which does grinding", i.e. nothing more than a mill. The term mill has been sacrificed to post-coordination (it is a decoordinated subject in the AAT terminology), as have many other useful terms, as can be seen in the Appendix from the Merimee mappings. See also Soergel's (1995) extensive analysis of such problems in the AAT. In current practice post-coordination suffers from three problems:
On the assumption that it makes sense for specific parts of hierarchies to be post-coordinated with specific relative roles, Ntoas (1999) designed a mechanism in which the user can declare that a specific concept or subhierarchy can be refined by restriction of a specific role to another subhierarchy. For example, factories which do can be declared to be valid for a subhierarchy of processes, like factories which do grinding, etc. Subsequently, the user can browse through the virtual hierarchy induced by the subsumption properties of the respective processes. Further, explicitly declared natural concepts like mill will appear at their natural position in the virtual hierarchy, and narrower terms of mill can be added, which is impossible in the AAT. Boolean combinations, which do not define or lead to a natural concept, were not included in the browsing. They play a minor role in natural concept formation and are easily handled and understood by users.

Roles were taken from concepts in the CIDOC CRM (Doerr and Crofts 1999) ontology and from roles we found to be implicit in AAT terms. The semantic relations of the UMLS Semantic Network are another source of relevant roles, not only for the medical domain. It seems that a few relatively generic roles may actually be sufficient for most cases. The mechanism was verified with term combinations from the equivalence expressions between the 1997 editions of the French Merimee "Thesaurus Architecture" and the AAT listed in the Appendix. For example, the compound "factory & owner's & houses" is interpreted as "houses which has owner: Person, which is owner of: factory"; "wood & roofs" as "roofs which consists of: wood"; "umbrellas & factories" as "factories which produce: umbrellas", etc. The above examples and Figure 6 demonstrate that post-coordination is useful at all levels of hierarchy, but that there must be a mechanism to embed natural concepts in post-coordinated schemes.
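The idea of a virtual hierarchy induced by role restriction can be made concrete with a short sketch. It is written under explicit assumptions and is not the mechanism of Ntoas (1999), whose details are not given here: the classes `Restricted` and `VirtualHierarchy`, their methods, and the small process hierarchy ("size reduction", "wet grinding") are hypothetical. Only the examples "factories which do: grinding", mill, and "roofs which consists of: wood" come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restricted:
    """A post-coordinated concept: `base` with `role` restricted to `filler`."""
    base: str
    role: str
    filler: str

    def label(self):
        return f"{self.base} {self.role}: {self.filler}"


class VirtualHierarchy:
    def __init__(self, broader):
        self._broader = broader  # term -> set of direct broader terms (IsA)
        self._declared = {}      # (base, role) -> root of the allowed fillers
        self._natural = {}       # Restricted -> declared natural term

    def _is_under(self, term, root):
        """True if `term` equals `root` or lies below it in the IsA hierarchy."""
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current == root:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self._broader.get(current, ()))
        return False

    def declare(self, base, role, filler_root):
        """Allow `base` to be refined by `role` with fillers under `filler_root`."""
        self._declared[(base, role)] = filler_root

    def restrict(self, base, role, filler):
        root = self._declared.get((base, role))
        if root is None or not self._is_under(filler, root):
            raise ValueError(f"'{base} {role}: {filler}' has not been declared valid")
        return Restricted(base, role, filler)

    def name(self, concept, natural_term):
        """Attach a natural (pre-coordinated) term to a restricted concept."""
        self._natural[concept] = natural_term

    def display(self, concept):
        return self._natural.get(concept, concept.label())

    def broader(self, concept):
        """Labels of the direct broader concepts, induced by the filler's parents."""
        root = self._declared[(concept.base, concept.role)]
        if concept.filler == root:
            return [concept.base]
        parents = sorted(self._broader.get(concept.filler, ()))
        return [
            self.display(Restricted(concept.base, concept.role, parent))
            for parent in parents
            if self._is_under(parent, root)
        ]


# Hypothetical process hierarchy used only for this illustration.
process_isa = {
    "wet grinding": {"grinding"},
    "grinding": {"size reduction"},
    "size reduction": {"processes"},
}
vh = VirtualHierarchy(process_isa)
vh.declare("factories", "which do", "processes")

mill = vh.restrict("factories", "which do", "grinding")
vh.name(mill, "mill")

print(vh.display(mill))                                              # mill
print(vh.broader(mill))
print(vh.broader(vh.restrict("factories", "which do", "wet grinding")))  # ['mill']
print(Restricted("roofs", "which consists of", "wood").label())
```

In this sketch the natural concept mill is found as the broader term of a more specific restricted concept, which is the behaviour the text asks for: declared natural concepts take their place inside the post-coordinated scheme and can receive narrower terms.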
We regard mechanisms simulating hierarchies of post-coordinated concepts as necessary to mediate effectively between pre- and post-coordination in thesaurus mapping and federation.

3.4 Summary

Section 2 showed that a suitable methodology to create thesaurus mappings can provide well-defined global recall and precision qualities for transitions between thesauri, qualities that have so far not been considered explicitly. For this we made assumptions that are realistic but often not met in practice. In this section, we have studied effects that may undermine those assumptions. Some can be avoided, either by better awareness of the thesaurus providers or by specific reasoning services. In other cases, information about implicit assumptions can help avoid comparing incomparable structures. Therefore, we have repeatedly proposed that certain thesaurus characteristics be documented in metadata, to avoid semantic heterogeneity conflicts and to facilitate interoperability of reasoning mechanisms. These characteristics are:
4 Conclusions

Many things can be done to bring forward information integration with thesauri. There is still a large gap between practitioners and scholars on one side, and theoreticians in knowledge representation and system engineers on the other. Whereas the practitioners administer the domain knowledge, the others have the technology to improve its handling. It is not easy for the practitioners to understand and appreciate the potential and limitations of technology, and often the theoreticians do not show particular understanding for intellectual problems in practice that do not directly conform with their models. We see a need for increased interdisciplinary empirical studies and verification of theoretical results, both to demonstrate the utility of theory to practitioners and to identify the limits of these results and the need for better theoretical understanding.

The results of several excellent implementations of thesaurus federations seem to have remained relatively unevaluated in terms of the real quality of concept mediation achieved. Some technology providers seem to see their task end at the point of installation and optimize their systems to work with any thesaurus. As we have tried to make clear in sections 2 and 3, we believe that thesaurus creators (scholars, experts and practitioners) have a responsibility to improve their methodology to meet the challenges of advanced technology (e.g. the completeness of hierarchies) and technologists have a responsibility to understand the complexity of the problem (e.g. contrastive polysemy). If this were to happen in a coordinated way, we could soon achieve a new quality of applications. We believe that some concrete steps could be undertaken:
Acknowledgement

I wish to express my thanks to the reviewers of this paper, and particularly to Linda Hill, for their valuable comments, which greatly helped to improve its quality.

References

Amba, S., N. Narasimhamurthi, K.C. O’Kane and P.M. Turner (1996) "Automatic linking of thesauri". In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Konstanz: Hartung-Gorre, pp. 181-187

AQUARELLE Telematics Applications Programme, Information Engineering Sector, Project IE 2005, "Final Report", http://aqua.inria.fr

Baader, F., H-J. Burckert, J. Heinsohn, B. Hollunder, J. Muller, B. Nebel, W. Nutt and H. Profitlich (1992) Terminological knowledge representation: a proposal for a terminological logic, DFKI Report, DFKI, Saarbruecken

Borgida, A. (1995) "Description logics in data management". IEEE Trans. on Knowledge and Data Engineering, 7(5):671-682

Calvanese, D., G. De Giacomo and M. Lenzerini (1998) "On the decidability of query containment under constraints". In Proc. of the 17th ACM SIGACT SIGMOD SIGART Sym. on Principles of Database Systems (PODS'98), pp. 149-158

Calvanese, D., G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati (1998) "Description logic framework for information integration". In Proc. of the 6th Int. Conf. on the Principles of Knowledge Representation and Reasoning (KR'98), pp. 2-13

Chan, Lois Mai (2000) Exploiting LCSH, LCC, and DDC To Retrieve Networked Resources: Issues and Challenges, Library of Congress, December 19

Chen, Hsinchun, J. Martinez, T. D. Ng, and B. R. Schatz (1996) "A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System". Journal of the American Society for Information Science, Vol. 47, No. 8, August

Constantopoulos, P., M. Sintichakis (1997) "A Method for Monolingual Thesauri Merging". Proc. 20th International Conference on Research and Development in Information Retrieval, ACM SIGIR, July, Philadelphia, PA, USA

Dachelet, R. (1997) "Multilingual querying and multilingual thesauri in Aquarelle", Technical Report, INRIA-Aquarelle, March

The DL Website: More information can be found on http://www.ida.liu.se/labs/iislab/people/patla/DL/index.html

DMOZ Open Directory Project, http://dmoz.org/

Doerr, M. (1996) "Authority Services in Global Information Spaces: A requirements analysis and feasibility study", Technical Report FORTH-ICS/TR-163, February

Doerr, M., I. Fundulaki (1998) "A proposal on extended interthesaurus links semantics". Technical Report FORTH-ICS/TR-215, FORTH, Institute of Computer Science, Heraklion - Crete, Greece

Doerr, M. (1998) "Effective Terminology Support for Distributed Digital Collections". In Sixth DELOS Workshop, Preservation of Digital Information, Tomar, Portugal, June

Doerr, M., I. Fundulaki (1998a) "SIS - TMS: A Thesaurus Management System for Distributed Digital Collections". Proc. 2nd European Conference, ECDL'98, September, Heraklion, Crete, Greece

Doerr, M., N. Crofts (1999) "Electronic Esperanto: The Role of the Object Oriented CIDOC Reference Model". Proc. ICHIM'99, Washington, DC, September

Doerr, M., D. Kalomoirakis (2000) "A Metastructure for Thesauri in Archeology". Proc. CAA2000, Ljubljana, April

EBTI: A short description of the EBTI (European Binding Tariff Information) Thesaurus can be found in: http://www.bjl.be/2_3_1.htm

English Heritage, National Monuments Record (2000) NMR Monument Type Thesaurus, June 19 http://www.rchme.gov.uk/thesaurus/mon_types/default.htm

Foskett, D.J. (1997) "Thesaurus". In Readings in Information Retrieval, edited by K. Sparck Jones and P. Willet (Morgan Kaufmann), pp. 111-134

Getty AHIP (1994) Introduction to the Art & Architecture Thesaurus. Published on behalf of The Getty Art History Information Program (New York: Oxford University Press)

Getty Information Institute (1996) Guidelines for Forming Language Equivalents: A Model Based on the Art & Architecture Thesaurus, International Terminology Working Group (for copies contact Murtha Baca, mbaca@getty.edu)

The HEREIN Project http://www.european-heritage.net/fr/Thesaurus/Contenu.html

Hutchins, W. J. (1995) "Machine Translation: A Brief History". In Concise history of the language sciences: from the Sumerians to the cognitivists, edited by E.F.K. Koerner and R.E. Asher (Oxford: Pergamon Press), pp. 431-445

ICOM/CIDOC Documentation Standards Group (1998) "CIDOC Conceptual Reference Model", http://www.ville-ge.ch/musinfo/cidoc/oomodel/index.htm

ISO 2788-1986 (1986) Documentation - Guidelines for the establishment and development of monolingual thesauri, International Organization for Standardization, Ref. No. ISO 2788-1986

ISO 5964-1985 (1985) Documentation - Guidelines for the establishment and development of multilingual thesauri, International Organization for Standardization, Ref. No. ISO 5964-1985

Kramer, R., R. Nikolai, and C. Habeck (1997) "Thesaurus federations: loosely integrated thesauri for document retrieval in networks based on Internet technologies". In International Journal on Digital Libraries (1), 122-131

Krause, J. (2000) "Virtual libraries, library content analysis, metadata and the remaining heterogeneity". Proc. ICADL 2000, the 3rd International Conference of Asian Digital Library, Seoul, Korea

Krause, J. (2000a) "Information Systems for Social Science Research. A perspective from Information Science". In Proceedings of the Symposium Information System for Social Sciences, Mannheim, Germany

Landry, P. (2000) "The MACS Project: Multilingual Access to Subjects (LCSH, RAMEAU, SWD)". Classification and Indexing Workshop, 66th IFLA Council and General Conference, Meeting No. 181 http://www.ifla.org/IV/ifla66/papers/165-181e.pdf

Mannino, M.V., S. B. Navathe, and W. Effelsberg (1988) "A Rule-Based Approach for Merging Generalization Hierarchies". Information Systems, 13(3):257-272

MERIMEE, "THESAURUS ARCHITECTURE" for the indexing of complexes, buildings and built works described in the national database "Merimee" about the French Heritage http://www.culture.gouv.fr/documentation/thesarch/pres.htm

Mili, H., R. Rada (1988) "Merging Thesauri: Principles and Evaluation". IEEE Transactions On Pattern Analysis and Machine Intelligence, 10(2):204-220

Miller, A. G., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller (1993) Introduction to WordNet: An On-Line Lexical Database

Motschnik-Pitrik, R. (1993) "The Semantics of Parts Versus Aggregates in Data/Knowledge Modelling". In Proc. CAISE’93, Paris, June (Berlin: Springer-Verlag), pp. 352-361

Nelson, S. J. (1999) "The Role of the Unified Medical Language System (UMLS) in Vocabulary Control". CENDI Conference on Controlled Vocabulary and the Internet

Nikolai, R., R. Kramer, M. Steinhaus, B. Felluga, and P. Plini (1999) "GenThes: A General Thesaurus Browser for Web-based Catalogue Systems". In Proceedings of the Third IEEE Meta-Data Conference, Bethesda, Maryland, April

NKOS, Networked Knowledge Organization Systems/Services

NKOS Registry (1998) Draft Set of Attributes, based on Controlled Vocabulary Registry developed by Linda L. Hill and Interconnect Technologies in 1996, with some modification, last revision: 7/30/98 http://alexandria.sdc.ucsb.edu/~lhill/nkos/Thesaurus_Registry.html

Ntoas, D. (1999) "Economy and consistency in Thesauri". Technical Report FORTH-ICS-TR-262, FORTH, Institute of Computer Science, Heraklion - Crete, Greece

OCLC, Online Computer Library Center (2000) Dewey Decimal Classification, Dublin, OH, USA (Forest Press)

OntoWeb Workshop (2000) Semantic Web Project Proposal, organised by Dieter Fensel and Ying Ding at Vrije Universiteit Amsterdam (the Netherlands), Dec. http://www.ontoweb.org/workshop/amsterdamdec8/index.html

Pustejovsky, J. (1995) The Generative Lexicon (MIT Press)

Rada, R., B. K. Martin (1987) "Augmenting Thesauri for Information Systems". ACM Transactions on Office Information Systems, 5(4)

Ranganathan, S.R. (1965) A descriptive account of Colon Classification (Bangalore: Sarada Ranganathan Endowment for Library Science)

Rector, A., S. Bechhofer, C. A. Goble, I. Horrocks, W. A. Nowlan, and W. D. Solomon (1997) "The GRAIL concept modelling language for medical terminology". Artificial Intelligence in Medicine, 9:139-171

Soergel, D. (1995) "The Art and Architecture Thesaurus (AAT): A critical appraisal". Visual Resources, X, pp. 369-400

Term-IT Project home page http://www.mda.org.uk/term-it/

U.S. National Library of Medicine (2001) 2001 UMLS Metathesaurus, January 12, section 2 http://www.nlm.nih.gov/research/umls/META2.HTML

U.S. National Library of Medicine (1998) Fact Sheet UMLS Semantic Network, February 19 http://www.nlm.nih.gov/pubs/factsheets/umlssemn.html

Vizine-Goetz, D. (May/June 1998) "Subject Headings for Everyone: Popular Library of Congress Subject Headings with Dewey Numbers". OCLC Newsletter, 233:29-33

Welty, C., J. Jenkins (1999) "Formal Ontology for Subject". J. Knowledge and Data Engineering, 31(2), September, 155-182

Z39.50, ANSI/NISO Z39.50 or ISO 23950: Information Retrieval (Z39.50): Application Service Definition and Protocol Specification
Appendix: "&" combinations from the Merimee thesaurus to the AAT
|