



Reengineering Thesauri for New Applications: the AGROVOC Example
Abstract
Existing classification schemes and thesauri are lacking in well-defined semantics and structural consistency. Empowering end users in searching collections of ever increasing magnitudes with performance far exceeding plain free-text searching (as used in many Web search engines), and developing systems that not only find but also process information for action, require far more powerful and complex knowledge organization systems (KOSs). The paper presents a conceptual structure and transition procedure to support the shift from a traditional KOS towards a full-fledged and semantically rich KOS. The proposed structure also complies with other interoperability approaches like RDFS and XML in the Web environment. AGROVOC, a traditional thesaurus developed and maintained by the Food and Agriculture Organization (FAO) of the United Nations, serves as a case study for exploring the reengineering of a traditional thesaurus into a fully-fledged ontology. We start the process of developing an inventory of specific relationship types with well-defined semantics for the agricultural domain and explore the rules-as-you-go approach to streamlining the reengineering process.
1 From thesauri to rich ontologies
1.1 The problem
Empowering end users in searching collections of ever increasing magnitudes with performance far exceeding plain free-text searching (as used in many Web search engines), and developing systems that not only find but also process information for action, require considerably more powerful - and complex - knowledge organization systems (KOS) than the classification schemes and thesauri that currently exist. Such systems must serve the following functions, among others:
A typical scenario in information retrieval illustrates some of the shortcomings of current free-text search engines such as Google. A farmer is interested in finding out about rice and starts a search by entering the string 'rice'. The results returned in response to the query immediately indicate several problems. First, because the system performs the search based on the actual text string entered rather than on an interpretation of the meaning of the string, many irrelevant results are retrieved. This occurs because the query term itself is ambiguous (i.e. rice can refer to the grain, to the university in Houston, or to the name of an author, among others). Further, there are millions of results with no apparently meaningful arrangement. To find something of possible relevance, the user may need to click and scan page after page of the retrieved results. Finally, the user is stuck with the results that have been retrieved; to find other related resources, such as rice cultivation, the user must start from the beginning again and formulate a different query, despite the fact that the new query corresponds to concepts related to the original query. The problem becomes evident: The biggest challenge in information retrieval is concept identification in a specific domain of interest! In contrast, in a semantics-driven information retrieval system, the system would recognize, i.e. "understand", that the string 'rice' was ambiguous; it would then request clarification from the user as to which of the possible meanings was intended. Only then, after the user disambiguated the term, would the system execute the search. The system would then retrieve only those resources that had been semantically marked up (through manual or automatic indexing) with the concept of rice, no matter what words or even languages are used in the resources to refer to rice. 
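The contrast between string matching and concept-based retrieval can be sketched in a few lines of Python. Everything in the sketch is a hypothetical illustration: the concept identifiers, the tiny lexicon, and the document index are invented for the example and are not taken from AGROVOC or from any existing system.

```python
# Hypothetical lexicon: each search string maps to the concepts it may designate.
LEXICON = {
    "rice": ["c_rice_grain", "c_rice_university", "c_rice_surname"],
    "riz": ["c_rice_grain"],  # French string for the same concept
    "rice weevil": ["c_rice_weevil"],
}

# Hypothetical index: resources are marked up with concept identifiers,
# not with the words that happen to occur in them.
INDEX = {
    "c_rice_grain": ["doc_en_12", "doc_fr_07"],
    "c_rice_university": ["doc_en_40"],
    "c_rice_weevil": ["doc_en_31"],
}


def candidate_concepts(query):
    """Return every concept the query string may designate."""
    return LEXICON.get(query.lower(), [])


def search(query, chosen=None):
    """Retrieve by concept; ask for clarification if the string is ambiguous."""
    candidates = candidate_concepts(query)
    if len(candidates) > 1 and chosen is None:
        # The system "understands" that the string is ambiguous and asks back.
        return {"status": "ambiguous", "choices": candidates}
    if chosen is None:
        concept = candidates[0] if candidates else None
    else:
        concept = chosen if chosen in candidates else None
    if concept is None:
        return {"status": "unknown", "results": []}
    return {"status": "ok", "results": INDEX.get(concept, [])}
```

In this sketch a query for 'rice' first returns the list of candidate concepts; only after the user picks one does the search run, and it then finds English and French resources alike because both are indexed under the same concept.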
Moreover, because the system is semantically rich, it not only presents results that are based on understanding the user's request, it also offers related concepts the user might not have thought of initially. Based on a <hasPest> relation, the system could display such concepts as rice weevil and rice moth. Searching on these latter concepts could in turn lead to concepts on pesticides used on rice, and so on. The system could retrieve not only information directly pertinent to the user's query but also help the user explore and clarify the information need and find useful related information. In this scenario, a KOS has two functions: assisting the user with exploring the topic of the query, and supporting intelligent automatic indexing (metadata assignment) through statistical and syntactic-semantic analysis and "understanding" of text; both functions require a KOS with a rich and precisely defined semantic structure. To accomplish these and other more sophisticated tasks, the new KOS must marry the conceptual structure of full-fledged ontologies - well-structured hierarchies of concepts connected through a rich network of detailed relations that support concept retrieval and reasoning - with the terminological richness of good thesauri. While existing KOSs do not provide the full set of precise concept relations needed for reasoning, existing KOSs, both large and small, represent much intellectual capital. This paper explores the question of how this intellectual capital can be put to use in constructing full-fledged KOSs.
1.2 The relationship of traditional KOS to ontologies
Reengineering thesauri, classification schemes, etc., into ontologies means building on the information contained in them and refining that information as needed. Compare the relationships given in the ERIC Thesaurus (ERIC = Educational Resources Information Center) with those given in a hypothetical ontology as shown in Table 1.
The inferences given rely on the detailed semantic relationships given in the ontology. But the ERIC thesaurus gives only some poorly defined broader term (BT) and related term (RT) relationships. These relationships are not differentiated enough to support inference. For another example, consider the hypothetical ontological relationships and rules we could formulate with these relationships in an example taken from the AGROVOC thesaurus (described in detail in section 2) in Table 2.
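How differentiated relationships and rules support inference can be sketched in Python. The statements and the two rules below are assumptions made for illustration, chosen to resemble the kind of content an ontology for this domain might hold; they do not reproduce Table 2, and the relationship names <madeFrom> and <ingests> are hypothetical.

```python
# Hypothetical statements (subject, relationship, object); not taken from Table 2.
FACTS = {
    ("Cheddar cheese", "madeFrom", "cow milk"),
    ("cow milk", "containsSubstance", "milk fat"),
    ("cow", "hasComponent", "cow milk"),
    ("cow", "ingests", "mercury"),
}


def infer(facts):
    """Apply two illustrative rules until no new statement can be derived."""
    known = set(facts)
    while True:
        new = set()
        for (a, rel1, b) in known:
            for (c, rel2, d) in known:
                # Rule 1 (assumed): a substance ingested by an animal may end
                # up in the components of that animal.
                if rel1 == "hasComponent" and rel2 == "ingests" and a == c:
                    new.add((b, "mayContainSubstance", d))
                # Rule 2 (assumed): a product contains (or may contain)
                # whatever its source material contains (or may contain).
                if rel1 == "madeFrom" and b == c and rel2 in (
                    "containsSubstance",
                    "mayContainSubstance",
                ):
                    new.add((a, rel2, d))
        if new <= known:
            return known
        known |= new


if __name__ == "__main__":
    for statement in sorted(infer(FACTS) - FACTS):
        print(statement)
```

With undifferentiated NT/BT links in place of <madeFrom>, <hasComponent>, and <containsSubstance>, neither rule could be stated, because the rules depend on knowing which kind of relationship holds.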
From the statements and rules given in the ontology, a system could infer that Cheddar cheese <containsSubstance> milk fat and, if cows on a given farm are fed mercury-contaminated feed, that Cheddar cheese made from milk from these cows <mayContainSubstance> mercury. But the present AGROVOC Thesaurus gives only narrower term/broader term (NT/BT) relationships without differentiation. The limitations of existing KOS can be summarized as follows:
In contrast to traditional KOSs, ontologies provide conceptual abstraction and differentiated relationships. Ontologies specifically separate concepts from lexicalizations and thereby better reflect the structure of human understanding of a domain. In ontologies, the semantics are developed through ensuring that each concept within the domain is uniquely and precisely defined and by specifying elaborated relationships among the concepts. The relationships in an ontology are explicitly named and developed with specification of rules and constraints so that they reflect the context of the domain for which the knowledge is modeled. Given their more precise and unambiguous semantics, ontologies allow further knowledge to be inferred from the knowledge explicitly represented in the ontology. The new (implicit) knowledge could be derived by applying generalization or transitivity rules, the level of applicability of which is limited in a poorly defined KOS like a traditional thesaurus. This added knowledge in the ontology makes it powerful when employed for intelligent information processing. Although there is a huge cost involved in moving from thesauri to ontologies, there is an expectation that the added power of consistency, precision, and completeness will be worth the investment even though reliable numbers on the return on investment (ROI) of ontology development are hard to come by.
1.3 Potential benefits of future generation KOSs
For emerging KOSs to satisfy user needs, they must improve both information organization and retrieval in a way that was not possible with traditional KOSs. The following potential benefits are expected from such systems:
1.4 The process of reengineering: the rules-as-you-go approach
Reengineering a thesaurus into an ontology entails refining thesaurus relationships, a laborious process. The steps in the process are:
Step 3 is the most laborious. We have plans to streamline this process by implementing intelligent conversion using a "rules as you go" approach. The idea is as follows: The KOS editor watches out for patterns; based on these patterns the editor formulates rules that can be applied immediately to all subsequent similar cases as illustrated in the following:
For example, cow NT cow milk should become cow <hasComponent> cow milk. The underlying pattern is animal <hasComponent> milk (or, even more generally, animal <hasComponent> body part). By this pattern, goat NT goat milk should become goat <hasComponent> goat milk, since goat is an animal and goat milk ends with the word milk and thus can be seen to be a type of milk.
These patterns are a special type of constraint. Other constraints can be formulated and used to limit the options presented to the human editor as thesaurus relationships are refined. The bases for such constraints are the thesaurus relationships, on the one hand, and the entity types of the concepts involved, on the other. Table 3 shows some examples of constraints based on thesaurus relationships.
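The rules-as-you-go idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of an implemented tool: the entity-type assignments, the sample NT pairs for sheep and grain, and the function names milk_rule and refine are hypothetical.

```python
# Hypothetical entity types assigned to concepts.
ENTITY_TYPE = {
    "cow": "animal",
    "goat": "animal",
    "sheep": "animal",
    "grain": "plant product",
}

# Unrefined thesaurus relationships: (broader term, "NT", narrower term).
THESAURUS = [
    ("cow", "NT", "cow milk"),
    ("goat", "NT", "goat milk"),
    ("sheep", "NT", "sheep milk"),
    ("grain", "NT", "rice"),
]


def milk_rule(broader, narrower):
    """Pattern noticed by the editor: animal NT '... milk' means <hasComponent>."""
    if ENTITY_TYPE.get(broader) == "animal" and narrower.endswith(" milk"):
        return "hasComponent"
    return None


def refine(pairs, rules):
    """Apply the rules formulated so far; leave all other cases to the editor."""
    refined, pending = [], []
    for broader, rel, narrower in pairs:
        proposals = []
        if rel == "NT":
            proposals = [rule(broader, narrower) for rule in rules]
            proposals = [p for p in proposals if p is not None]
        if proposals:
            refined.append((broader, proposals[0], narrower))
        else:
            pending.append((broader, rel, narrower))
    return refined, pending


if __name__ == "__main__":
    refined, pending = refine(THESAURUS, [milk_rule])
    print("refined:", refined)
    print("left for the editor:", pending)
```

Each time the editor recognizes a further pattern, a new rule function is appended to the list, so the share of cases that must be refined by hand shrinks as the work proceeds.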
Such an inventory of constraints limits the available choices when manually refining a thesaurus relationship to a more specific ontology relationship. Of course, an authorized ontology editor can override such constraints and thereby update the relationship table. Whenever a relationship is added or refined, the inverse relationship is automatically added or refined as well.
2 AGROVOC: a multilingual agricultural thesaurus
This section describes the AGROVOC Thesaurus and further illustrates the problem of semantic under-specification.
2.1 Background
AGROVOC is a multilingual, structured, controlled vocabulary/thesaurus designed to cover concepts and terminology in agriculture, forestry, fisheries, food and related domains (e.g. environment). It was developed by the Food and Agriculture Organization (FAO) of the United Nations and the Commission of the European Communities in the early 1980s to describe documents and other information resources in a controlled language for indexing and searching. It contains approximately 16,500 descriptors and 10,000 non-descriptors.
AGROVOC is available online and for download in the five FAO official languages (English, French, Spanish, Chinese and Arabic). It has also been translated into other national languages such as Czech, Danish, German, Italian, Polish, Portuguese, Slovak and Thai.
2.2 Applications and related terminologies
AGROVOC is used for controlled-vocabulary indexing and searching globally and in various systems throughout the FAO. Systems where AGROVOC is used include:
2.3 Conceptual structure of AGROVOC
AGROVOC follows a traditional thesaurus approach. It is a collection of terms, definitions, and term relationships. As is the case with most thesauri, a small, standard, non-adaptable set of relationship types is applied to interlink terms.
2.3.1 Equivalence relationships
USE: Since thesauri have been primarily developed for the purpose of indexing and retrieval, this relationship indicates that any term preceding the USE relation should be replaced, for the purposes of indexing documents and formulating queries, by the term following the USE relation. The relationship usually (but not always) expresses synonymy between two terms.
USED FOR (UF): This is the inverse of USE and indicates that term A is USED FOR term B for indexing purposes.
2.3.2 Hierarchical relationships
Narrower Term (NT): if X is an NT of Y, then X is narrower in some sense than Y. For example, milk NT cow milk, grain NT rice.
Broader Term (BT): if Y is a BT of X, then Y is broader than X; for example cow milk BT milk, rice BT grain. BT is the inverse of NT. Given these rather unspecific definitions, BT and NT relationships can be applied to express generic relations, meronymic relations, instantiations, and many others (see section 4).
2.3.3 Associative relationships
Related Term (RT): the thesaurus conceptual model contains the RT relationship to express any kind of associative relationship between two terms that is not a hierarchical relationship. This relationship is hence very ambiguous in that it is the default for all other relationships.
Hierarchical (NT, BT) and associative (RT) relationships are relationships between concepts. In the thesaurus, these exist only between descriptors. Following a traditional thesaurus approach, AGROVOC distinguishes between descriptors and non-descriptors (often referred to imprecisely as preferred terms and non-preferred terms).
The rationale behind this is that only a descriptor should be used when referring to the concept (for example, for indexing and retrieval); each descriptor uniquely and unambiguously designates a concept. A non-descriptor must not be used for indexing or retrieval; it is linked through a USE cross-reference to the corresponding descriptor that must be used instead. There are no relationships from one non-descriptor to another.
2.3.4 Scope notes
Many descriptors in AGROVOC have a scope note, which can be a definition of a term, a history note, instructions to the indexer or searcher, or simply a comment. The purpose is to provide the user with more detail about the term and its usage.
2.3.5 Top level structure
Currently AGROVOC has more than 1500 top-level terms, i.e. descriptors which do not have a broader term, making it cumbersome to access the thesaurus from a top-level approach and browse through the hierarchy. Superimposed on AGROVOC is the AGRIS categorization scheme; it has more than 100 top-level categories, ordered in a shallow two-level hierarchy. AGROVOC descriptors are mapped to the second level of AGRIS categories. For example, the AGROVOC descriptor fish farms is mapped to the AGRIS category aquaculture production which is a subcategory of fishery and aquaculture. Thus the AGRIS categorization scheme provides high-level organization for information that has been tagged with AGROVOC descriptors.
2.4 Semantic problems of AGROVOC
Given its minimalist conceptual structure, AGROVOC (as other thesauri) has a number of semantic flaws. In the following we will use examples to point out the major drawbacks of the current system and develop the rationale for the shift towards a more powerful, expressive, and unambiguous conceptual model.
2.4.1 Ambiguous descriptor to non-descriptor relationship
In AGROVOC, as indicated, USE/UF covers synonyms and formal variants.
In addition, the relation links quasi-synonyms and very specific narrower terms, which AGROVOC defines as any of the following:
famine
Definition 2 involves concepts on opposite ends of a scale or otherwise in opposition to each other. (With a few exceptions, the terms designating such concepts are antonyms). Example: hydrophilicity
Definition 3 indicates that USE/UF can also express a hierarchical relationship, for example: biological competition
where the fine distinction between interspecific competition and intraspecific competition is deemed unnecessary for retrieval and therefore abandoned in favor of the more general category.
2.4.2 Ambiguous hierarchical definitions
The BT/NT relationship used to build up the hierarchy is very ambiguous; it lumps together several different types of relationships as the following examples show:
2.4.2.1 <includesSpecific> relationship (erythrocytes are a specific kind of blood cell): blood cells
2.4.2.2 <hasComponent> relationship (blood contains as a component blood cells): blood
2.4.2.3 The following example shows clearly the discrepancies between different thesauri that apply the ambiguously defined modeling principles: AGROVOC and CABI: water
But ASFA: water
Water vapor and ice are phases of water while fresh water and drinking water are kinds of water, so in AGROVOC and CABI hierarchical relationships lump together several different semantic relationships. For retrieval this is generally useful (a search for water should generally find documents on ice as well), but for more differentiated retrieval a user may want to ask for water in all phases or for all kinds of water. There are many other purposes of semantic processing that need more differentiated relationships. In ASFA the phase relationship is treated as an RT, an example of how grouping relationships may lead to inconsistency. Note, by the way, that none of these thesauri includes the concept liquid water, which is logically necessary if water means water in any phase. There are many more examples in AGROVOC where the currently used BT/NT relationship is used to describe different relationships. The most obvious ones have been identified and are used in our proposal below in section 4.
2.4.3 Ambiguous associative relationships
Like the BT/NT relationships, the associative RT relationships can be refined into more specific relationships. Some examples are given below.
2.4.3.1 <hasMember> relationship (Anglophone Africa <hasMember> Botswana) Anglophone Africa
2.4.3.2 <causes> relationship (bleaching <causes> discoloration): bleaching
2.5 The need for reengineering AGROVOC into an ontology
The examples above indicate clearly the ambiguous nature of the relationships in AGROVOC. With respect to future information retrieval and intelligent processing needs, where it will be necessary to combine different KOS in order to give access to different information systems, it becomes evident that a more rigid structure is required. A reassessment of AGROVOC (as well as other thesauri) to transform its UF, NT, BT, and RT relationships into unambiguously defined relationships and hierarchical order will provide the first step towards solving the problem of ambiguity and inconsistency in information description and retrieval.
3 Conceptual model: combining thesauri and ontologies
This section introduces a conceptual model that provides the necessary structure to create precise semantics to facilitate the transition from traditional thesauri to ontologies. Figure 1 shows the high level conceptual model we propose. Its chief characteristic is a clear separation of the concept level, the term or lexicalization level, and the string level. Present thesauri give a more or less muddled representation of information about concepts and information about terms. The proposed structure allows for a clear separation of concept information and term information. This model owes much to the structure of the UMLS.
Figure 1. Conceptual model for combining thesauri and ontologies
3.1 The basic model
The following is just the broad outline of the model. Many more types of information could be added. In any event, we consider the model extensible. On the other hand, not all applications will use all features of the model. For example, our model provides for relationships between notes (for example, as hypertext links). This is not possible in all environments but very useful in some. Our intent is to present a framework that can be used for the simplest thesaurus or the most complex and rich ontology in a format that communicates equally to thesaurus and ontology editors with a background in information science, artificial intelligence, or linguistics.
Concepts take center stage in our proposed thesaurus/ontology information model; accordingly, relationships between concepts are central. Concepts are arranged in hierarchies and have additional relationships to other concepts in the network; a hierarchy can be defined on any weak ordering relationship including isa, part-whole, spatial containment, etc. (the relationship must be transitive and not symmetric, but must have an existing inverse relationship, for example <componentOf> is the inverse relationship of <hasComponent>). There are many other relationship types, such as <causes>; a scheme of relationship types needs to be defined for the domain of the respective thesaurus. One source for finding relationship types is the detailed analysis of concept relationships present in the thesaurus that is to be reengineered into a richer ontology (see section 4). Each concept should be assigned to an entity type or facet, such as process, function, substance, living organism (see, for example, the semantic types in the UMLS Semantic Network); the type of a concept constrains its participation in relationships. A concept is designated or represented by one or more lexicalizations or terms in one or more languages; this is the linkage between the concept level and the term level. For examples see Table 4.
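The concept level described above can be sketched in Python: relationship types carry an inverse and a transitivity flag, and the entity type of a concept constrains the relationships it may take part in. The sketch is a hypothetical illustration; the class ConceptNetwork, the entity-type assignments, and the allowed source and target types are assumptions, not part of the proposed model.

```python
# Hypothetical relationship types: inverse name, transitivity, and the entity
# types allowed for the source and the target concept.
RELATIONSHIP_TYPES = {
    "hasComponent": {
        "inverse": "componentOf",
        "transitive": True,
        "source": {"living organism", "substance"},
        "target": {"substance"},
    },
    "componentOf": {
        "inverse": "hasComponent",
        "transitive": True,
        "source": {"substance"},
        "target": {"living organism", "substance"},
    },
}

# Hypothetical entity types (facets) of a few concepts.
ENTITY_TYPE = {
    "cow": "living organism",
    "blood": "substance",
    "blood cells": "substance",
    "bleaching": "process",
}


class ConceptNetwork:
    def __init__(self):
        self.statements = set()

    def add(self, source, relationship, target):
        """Add a statement and its inverse, enforcing the type constraints."""
        spec = RELATIONSHIP_TYPES[relationship]
        if ENTITY_TYPE[source] not in spec["source"]:
            raise ValueError(f"{source} cannot be the source of <{relationship}>")
        if ENTITY_TYPE[target] not in spec["target"]:
            raise ValueError(f"{target} cannot be the target of <{relationship}>")
        self.statements.add((source, relationship, target))
        self.statements.add((target, spec["inverse"], source))

    def reachable(self, source, relationship):
        """Follow a transitive relationship to all direct and indirect targets."""
        if not RELATIONSHIP_TYPES[relationship]["transitive"]:
            raise ValueError(f"<{relationship}> is not transitive")
        found, frontier = set(), [source]
        while frontier:
            current = frontier.pop()
            for s, r, t in self.statements:
                if s == current and r == relationship and t not in found:
                    found.add(t)
                    frontier.append(t)
        return found
```

Adding cow <hasComponent> blood and blood <hasComponent> blood cells to such a network also records both <componentOf> statements, and following <hasComponent> from cow reaches blood cells through transitivity, while an attempt to make the process bleaching the source of <hasComponent> is rejected.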
If a term is a homonym (designates more than one concept), several disambiguated terms are introduced. The homonym is linked to each of the disambiguated terms, and each disambiguated term is linked to the corresponding concept. Two terms designating the same concept are called synonyms. Alternatively, if one does not agree that concepts per se exist, one can simply view "concept" as a convenient shorthand for an equivalence class of terms that are linked by the <hasSynonym> relationship, such as the synsets in WordNet. A KOS may select a preferred term as the term used to represent the concept or it may make that choice dependent on the audience (for example, veterinarians versus farmers). Terms can be connected through many relationships such as <hasSynonym> (with <hasScientificName> as a special case), <hasAntonym>, <hasCognate> (term in a different language from the same root), and <hasTranslation>. One might think that the synonym and translation relationships are not needed since all terms linked to the same concept would be synonyms or translations. However, two terms may be linked to the same concept yet be used in different contexts, i.e. they are not strict synonyms. If a concept has linked to it several English terms and several French terms, it is not true that just any of the French terms is a good translation for a given English term (see the examples in Table 4). Another example of term-specific relationships is <hasAntonym>. For example, big and small designate opposite concepts but are not antonyms. (The antonym pairs are big versus little and large versus small; see WordNet.) Finally, a term is manifested in one or more
strings, as shown in Table 4. Strings can be connected through relationships such as <hasCaseVariant>, <hasSpellingVariant>, <hasAbbreviationOrAcronym>, and <pluralOf> / <singularOf>, which are all subordinate to a broader relationship <hasStringVariant>. A term can be seen as a convenient shorthand for an equivalence class of strings that are linked by the <hasStringVariant> relationship. A KOS may select a preferred variant as the string used to represent the term or it may make that choice dependent on the audience (as in British versus American spelling).
In addition, a concept, a lexicalization/term, a string, or a relationship type can have several types of notes (definitions, usage notes, comments, image, etc.) in different languages (in the case of multilingual thesauri). Just like concepts and terms, notes can be related to each other through relationships such as <hasTranslation>, <hasSimplifiedVersion>, <hasOtherDefinition>, or any other type of hyperlink. Many other pieces of information about terms can be added, for example, case frames for verbs (in case the verb has a case frame different from the case frame for the corresponding action concept) or register (see below) or whether the term is the preferred term for the concept. Administrative data will be accommodated as well.
Relationship types themselves can form a relationship hierarchy (i.e. a relationship of relationships), in which more generic relationships are further up in the hierarchy than more specific relationships; for example, <componentOf> is a specific kind of <partOf> relationship.
Why define concepts, terms, and strings as separate entity types? First, each of these entity types takes different types of information. Conceptual relationships and other information are associated with concepts. Linguistic information, such as part of speech and how a term combines with other terms into sentences, usage, or information on etymology, is associated with terms. Information such as that a string is an acronym is associated with strings. Usage information may sometimes be associated with strings; for example, lay people may commonly use a slang abbreviation while professionals use the full string. Definitions are primarily associated with concepts but may also be associated with terms.
Second, this distinction avoids confusion. In a standard thesaurus like AGROVOC, for each concept that is to be used in indexing and searching, a preferred term, and for that term a preferred string, is selected; this string is the descriptor.
Non-descriptors are linked only to descriptors, not among themselves. As a result, BSE, mad cow disease, and MCD [which we made up for illustration] are all linked to bovine spongiform encephalopathy as synonyms (or, in some thesauri, as synonyms and as abbreviations). But the information that BSE belongs with bovine spongiform encephalopathy and MCD with mad cow disease is lost. Furthermore, if decisions on terms are made (for example, omitting mad cow disease as a non-scientific name), these decisions should apply to all term variants, in the example MCD as well.
3.2 Model extensions
As was mentioned above, many more types of information could be added to concepts, terms, strings, notes, and relationships. For example, we might specify an audience (general lay public, K-12 students by grade level, university students, experts), a subject domain, a scope (as in Topic Maps), or a specially selected subset of concepts and terms to be used for a given application, or all concepts and terms taken from a given source.
Scopes could be defined in many ways. For example, one might define a scope as the conceptual system embedded and expressed in a language (whereas the link from terms and notes to language simply refers to the surface form). Consider the conceptual system underlying Walpiri (an Australian indigenous language); one of its noun classifiers includes women, fire, and dangerous things (Lakoff 1987). A native speaker of English would find this classifier and the corresponding <isa> relationships very curious. Thus one would introduce the category and the <isa> relationships with a scope of the Walpiri conceptual system. (By the way, the relationship between these relationships makes sense in the context: fire is dangerous; fire is sometimes started by or anyway related to the sun; the gender of the noun for sun is female). Many such problems, if more subtle, occur in thesauri for international use.
A subvocabulary can be extracted using any type of information about concepts, terms, strings, and relationships that is available in the thesaurus. Thus one could extract as a subvocabulary
3.3 Limitations
The separation into the concept layer and the term layer is appealing for its simplicity and elegance but it is somewhat of an oversimplification. Terms, particularly terms in different languages, rarely mean exactly the same thing. So the question arises as to when to map two terms to the same concept - and possibly explain shades of meaning and associations in the definition of each term that complements the definition of the concept - and when to create two closely related concepts, possibly under one broader concept. Our model permits any type of relationship between terms. Thus it is possible to introduce conceptually motivated relationships between terms that more accurately reflect the reality of language than the mapping of terms to "concepts". These two representations of conceptual information can coexist within the same system.
3.4 Implementation
All relationships from all layers (concept, term, string) can be stored in the same format within a database. The type of each element should be explicitly given to enable integrity constraints (so that the relationship <hasSpellingVariant> is not allowed between two concepts, for example). A concept can be identified by a URI or other number (cleanest solution) or by its preferred term in the base language of the thesaurus (the term being typed as preferred). Likewise, a term can be identified by a URI or other number (cleanest solution) or by the preferred string (the term being typed as preferred). The same holds for strings. The main difference with implementations in most existing thesaurus management software is that relationships between non-descriptors are allowed. Thoughts for an XML/RDF schema for KOS data are presented in the Appendix.
3.5 Related approaches
The proposed conceptual model integrates well with standardization approaches regarding Web technologies like RDFS. The proposed structure shows all aspects of a proposed RDFS-compatible Thesaurus Interchange Format by Matthews et al.
(2002), which will appear as a W3C note. The proposal is being developed in the context of the SWAD-Europe project. The Appendix presents another approach for representing ontology and thesaurus data in XML/RDFS.
4 The AGROVOC case: exploring conceptual relationships in the agricultural domain
The model we introduced places no restrictions on the relationships that can be applied. The model is extensible, and any possible specific relationships can be included. We carried out a preliminary linguistic and conceptual analysis of AGROVOC and found a set of relationships; most of them are well known (but it is important to know that they are needed in the food and agriculture domain), some of them add new nuances. Table 5 lists relationship types found in AGROVOC or otherwise proposed here, and subsections 4.1 - 4.3 give an explanation and examples for some of these relationship types; others appear in examples throughout the paper. This section is not in any way intended as a complete list of relationship types; it merely gives examples to illustrate the additional information and clarity of conceptual structure that can be conveyed through more specific relationships. Much more work, including comparison, is needed to converge on a set of relationships to replace the currently used thesaurus relationships BT, NT, RT, USE and UF in a reengineering of AGROVOC.