



Vocabulary Mapping for Terminology Services

OCLC Research, OCLC Online Computer Library Center, Inc.
Email: vizine@oclc.org; hickeyc@oclc.org; houghton@oclc.org; thompson@oclc.org
Project Web Site: http://www.oclc.org/research/projects/termservices/
Key Features: References; Figures 1, 2, 3; Tables 1, 2, 3; Appendix 1, 2, 3, 4

Abstract

The paper describes a project to add
value to controlled vocabularies by making inter-vocabulary associations.
A methodology for mapping terms from one vocabulary to another is presented
in the form of a case study applying the approach to the Educational Resources
Information Center (ERIC) Thesaurus and the Library of Congress Subject
Headings (LCSH). Our approach to mapping involves encoding vocabularies
according to Machine-Readable Cataloging (MARC) standards, machine matching
of vocabulary terms, and categorizing candidate mappings by likelihood
of valid mapping. Mapping data is then stored as machine links. Vocabularies
with associations to other schemes will be a key component of Web-based
terminology services. The paper briefly describes how the Open Archives
Initiative Protocol for Metadata Harvesting (OAI-PMH) is used to provide
access to a vocabulary with mappings.

1 Introduction

Most tools and features for accessing
the names, subjects, and classification categories assigned to content
objects are not easily used by people or computers. The knowledge organization
schemes and the features found in cataloging and retrieval systems are
often deeply embedded in proprietary formats and software. Even when knowledge
organization resources are openly available, they are rarely linked with
other compatible schemes or services. This paper describes a project to
add value to controlled vocabularies through vocabulary mapping. The vocabulary
associations are then made accessible through Web services.
In this paper, 'terminology services' is used to describe Web services involving various types of knowledge organization resources, including authority files, subject heading systems, thesauri, Web taxonomies, and classification schemes. The term 'vocabulary' is used to refer to these knowledge organization resources. Vocabularies with associations to other schemes will be a key component of Web-based terminology services. Web services are modular, Web-based, machine-to-machine applications that can be combined in various ways. For background information on Web services, see Gardner (2001) and Tennant (2002). Web services can be accessed at various points in the metadata lifecycle, for example, when a work is authored or created, at the time an object is indexed or cataloged, or during search and retrieval. A Web service that provides mappings from a term in one vocabulary to one or more terms in another vocabulary is an example of a terminology service.

2 Vocabulary compatibility

Researchers have been interested in achieving
compatibility among controlled vocabularies for many years. Lancaster
and Smith (1983) published an overview of the issues involved in integrating
vocabularies, which is still relevant today. They describe several factors
that influence how successfully one vocabulary can be associated with another,
including:

3 OCLC vocabulary projects

3.1 Dewey mappings

In 1994, OCLC staff began linking Library of
Congress Subject Headings (LCSH) to the Dewey Decimal Classification (DDC)
scheme. DDC/LCSH pairs were generated from OCLC WorldCat records that contained
both DDC numbers and LCSH. Co-occurrence mappings were made for frequently
occurring pairs. Later, an association measure was introduced in the co-occurrence
mapping process to provide a better indicator of association than simple
pair frequencies (Vizine-Goetz
1998). Approximately 90,000 co-occurrence mappings have been made in
WebDewey, the electronic version of the DDC. An example of DDC/LCSH co-occurrence
mappings is shown below for DDC class, 617.522 Oral region—surgery:
The mapped LCSH provide additional indexing vocabulary for the electronic version of the DDC and also assist catalogers in assigning subject headings. These terms are also included in versions of the DDC used in automated classification services.

3.2 Other mappings

The scope of OCLC's vocabulary mapping research
projects has expanded to include additional classification schemes, subject
heading systems, and thesauri. A list of OCLC vocabulary associations and
the mapping approach used (direct, co-occurrence, or both) is shown in
Table 1. In addition to DDC/LCSH co-occurrence mappings, direct mappings
have been made between selected classes from the Library of Congress Classification
(LCC) and the National Library of Medicine Classification (NLMC) and DDC.
The LCC/DDC mappings and NLMC/DDC mappings are used to profile questions
and expertise for virtual reference services. Project staff members have
also made direct mappings of genre terms for fiction and drama (GSAFD)
to LCSH and to LCSHac (headings for children's materials) using the procedures
outlined in this paper. Because the GSAFD vocabulary is quite small - only
153 preferred terms - and based largely on LCSH, the GSAFD mapping effort
was not considered a suitable test of our mapping approach. For these reasons,
the approach was applied to another vocabulary.
The GSAFD vocabulary terms with mappings are accessible using the OAI-PMH. The OAI-PMH specifies a simple HTTP-based protocol for automated sharing of metadata, but, as the OAI-Cat effort has shown, the approach works equally well for sharing other XML content. The content of the GSAFD records is MARC in XML (MARC Standards). The records are accessible to users via a browser (http://alcme.oclc.org/gsafd/) and to machines through the OAI-PMH Web services mechanisms. See Van de Sompel et al. (2003) for a more complete description of how the file can be accessed using the OAI-PMH. The GSAFD/LCSH mapping file can also be downloaded from our project Web site. The file is encoded in MARC in XML and also according to version 0.5 of the Zthes schema. We have also prototyped some experimental Web services using co-occurrence mappings between the GSAFD vocabulary and LCSH.

3.3 Mapping to LCSH

As Table 1 shows, much of our mapping activity
involves LCSH. Describing the relationship between the Art and Architecture
Thesaurus (AAT) and LCSH, Whitehead
(1990, p. 82) asks: "Why map to LCSH?" and replies:
"Despite the weaknesses and the critical assessments that have plagued LCSH over the years, the fact remains that LCSH is the standard vocabulary used by the majority of information resources, especially libraries, in the United States."

She also notes that efforts to improve or replace LCSH must take into account its widespread use and the probability that it will be maintained for a long time. Others have reached similar conclusions. For example, the FAST project sponsored by OCLC selected LCSH as the basis for creating a faceted vocabulary for metadata. O'Neill and Chan (2003) cite the following reasons for choosing the LCSH scheme:

3.4 Vocabulary encoding standards

Many standards exist for encoding vocabularies: see Koch (2003) and the SWAD-Europe Thesaurus Activity thesaurus link page for listings of some current standards. For authority files, subject headings, and thesauri, we have decided to use the MARC21 Format for Authority Data. For classification data, we use the MARC21 Format for Classification Data. MARC was chosen because many large vocabularies are available in the MARC formats, and the MARC Authority format supports inter-vocabulary relationships, which are particularly important to us because of our mapping work. Some examples of vocabularies available in the MARC format include:
In the remainder of this paper we describe our approach to mapping the ERIC Thesaurus to LCSH. The ERIC Thesaurus was chosen because it is a well-established vocabulary, publicly accessible on the Web, and large enough to provide a meaningful test of our mapping approach. The ERIC Thesaurus is produced by the Educational Resources Information Center, an education information network sponsored by the U.S. Department of Education that provides public access to education literature (ERIC 2004).

4 Mapping the ERIC Thesaurus to LCSH

4.1 Converting ERIC to MARC

Vocabularies to be mapped are first converted to the MARC21 Authority Format. The effort involved in this step varies depending on the format of the source vocabulary (the vocabulary being mapped). We have converted vocabularies from formats primarily intended for display, e.g. word processing documents without extensive use of styles, as well as vocabularies in more structured formats, such as the ERIC file (Figure 1).
Multiple instances of broader terms (BT), narrower terms (NT), and related terms (RT) stored in single ERIC fields are encoded as separate fields in the MARC format (Figure 2). The RT field shown below generates 14 fields in the MARC record. These are the fields labeled with MARC tag 550 (without $w subfields). The field labeled UF is similarly converted into two MARC fields (tag 450). One of the terms, Student ability, represents a formerly valid term. The notation in parentheses in the ERIC record indicates this and gives the lifespan of the term. When this data is converted to MARC, a 688 field (Application History Note) is constructed to hold it, and subfield $w is added to the 450 field to indicate that the term was formerly valid. By encoding the source and target vocabularies in the MARC21 Authority Format, we are able to standardize the representation of similar information and improve our ability to match vocabularies.
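The conversion described above can be illustrated with a minimal Python sketch. It is not the project's conversion software. The input layout (a dictionary whose multi-valued fields are separated by semicolons), the lifespan notation, the sample record, the wording of the 688 note, and the $w value `nne` for a formerly valid term are all assumptions made for illustration. The $w values `g` (broader term) and `h` (narrower term) follow the MARC21 Authority Format tracing conventions.

```python
import re

# Trailing lifespan such as "(1966 1980)"; the exact ERIC notation is assumed.
LIFESPAN = re.compile(r"^(?P<term>.+?)\s*\((?P<span>\d{4}\s+\d{4})\)$")


def split_terms(value):
    """Split one multi-valued ERIC field into individual terms."""
    return [t.strip() for t in value.split(";") if t.strip()]


def eric_to_marc(record):
    """Convert one ERIC record (a dict) into a list of (tag, subfields) pairs."""
    fields = [("150", {"a": record["TERM"]})]  # preferred term
    for term in split_terms(record.get("UF", "")):  # non-preferred terms
        match = LIFESPAN.match(term)
        if match:
            # Formerly valid term: flag it in $w (assumed value) and
            # keep its lifespan in a 688 Application History Note.
            fields.append(("450", {"w": "nne", "a": match["term"]}))
            fields.append(("688", {"a": f"{match['term']} ({match['span']})"}))
        else:
            fields.append(("450", {"a": term}))
    # Each BT, NT, and RT value becomes its own 550 field; RT carries no $w.
    for key, control in (("BT", "g"), ("NT", "h"), ("RT", None)):
        for term in split_terms(record.get(key, "")):
            subfields = {"w": control, "a": term} if control else {"a": term}
            fields.append(("550", subfields))
    return fields


# Illustrative record only; it is not copied from the ERIC file.
sample = {
    "TERM": "Academic Ability",
    "UF": "Scholastic Ability; Student Ability (1966 1980)",
    "BT": "Ability",
    "RT": "Academic Achievement; Academic Aptitude",
}

for tag, subfields in eric_to_marc(sample):
    print(tag, " ".join(f"${code} {value}" for code, value in subfields.items()))
```

Keeping each relationship in its own field, as the sketch does, is what allows the later matching step to treat every preferred and non-preferred term as a separate candidate.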
MARC field and subfield statistics are provided in Appendices 1–4 for the following versions of the files:

4.2 Matching vocabulary terms

After the ERIC file is encoded in the MARC Authority
format, the ERIC vocabulary is matched to the LCSH vocabulary. Using a
series of computer programs, all preferred terms (MARC tag 150) and non-preferred
terms (MARC tag 450) in the source and target vocabularies are matched.
Differences in spacing, capitalization, and punctuation are ignored during
the matching process. The following terms are considered matches:
Currently, plural versus singular forms, terms that differ only by the presence or absence of a parenthetical qualifier, and terms with a qualifier introduced by a comma are not being matched. These refinements would likely improve the match rate and will be employed in the next phase of the project.
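The matching step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's matching programs: records are modeled as dictionaries keyed by MARC tag, punctuation is treated as spacing, runs of spaces are collapsed, and each match is labeled source role/target role (PT for a tag 150 term, NPT for a tag 450 term). The records are toy data; only the Adolescents/Adolescence pair is modeled on an example discussed in this paper.

```python
import string
from collections import defaultdict

# Every ASCII punctuation character is mapped to a space (an assumption).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(term):
    """Build a match key that ignores capitalization, punctuation, and spacing."""
    return " ".join(term.lower().translate(_PUNCT_TO_SPACE).split())


def build_index(records):
    """Map each normalized term to the (preferred term, role) pairs that carry it."""
    index = defaultdict(list)
    for rec in records:
        index[normalize(rec["150"])].append((rec["150"], "PT"))
        for variant in rec.get("450", []):
            index[normalize(variant)].append((rec["150"], "NPT"))
    return index


def match_vocabularies(source, target):
    """Return (source PT, target PT, match type) for every identical match key."""
    target_index = build_index(target)
    matches = []
    for rec in source:
        terms = [(rec["150"], "PT")] + [(v, "NPT") for v in rec.get("450", [])]
        for term, source_role in terms:
            for target_pt, target_role in target_index.get(normalize(term), []):
                matches.append((rec["150"], target_pt, f"{source_role}/{target_role}"))
    return matches


# Toy records standing in for the source (ERIC) and target (LCSH) files.
eric = [
    {"150": "Adolescents", "450": ["Adolescence"]},
    {"150": "Toy Term-One", "450": []},
]
lcsh = [
    {"150": "Adolescence", "450": []},
    {"150": "Toy term one", "450": ["Toy term 1"]},
]

matches = match_vocabularies(eric, lcsh)
for source_pt, target_pt, match_type in matches:
    print(f"{match_type}: {source_pt} -> {target_pt}")
```

The refinements mentioned above (singular and plural forms, parenthetical and comma-introduced qualifiers) would fit naturally as extra steps inside `normalize`, which is why the sketch keeps all key-building logic in one function.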
A total of 3,797 ERIC terms were matched to LCSH and categorized according to the following match types:

4.3 Evaluating matches

Four categories of ERIC terms were reviewed and
analyzed (numbers in parentheses are ERIC category codes):
About 99% of PT/PT matches were found to represent equivalent concepts in the two vocabularies and 91% of PT/NPT matches represent equivalent concepts. Very few false matches were observed for these two match types. A false match occurs when terms from the vocabularies are identical but the concepts represented are different. Some examples of false matches are:
A total of 365 (294 + 71) equivalent concepts were identified. This is 47% (365/773) of the preferred terms in the ERIC subset. All matches in the subset were manually reviewed to determine which matches represented valid mappings. The following guidelines established in the Northwestern University LCSH/MeSH mapping project (Olson and Strawn 1997) were applied in the evaluation:
The match types guided our review of the matches. Matches were coded by type and each type was assigned a different color. PT/PT (white) matches were reviewed first, followed by PT/NPT (green). Evaluation of these matches was relatively straightforward since most involved one-to-one matches. NPT/NPT (yellow) and NPT/PT (blue) were more complex to review because they often involved matches to multiple terms in the target vocabulary.
In the example above, the NPT/PT match on the term Adolescence is an invalid mapping because the ERIC term and the LCSH term represent different concepts. The ERIC term Adolescents is for works on young people, 13-17 years of age. The LCSH term Adolescence is for works on the physiological, psychological, or social development of adolescents. The ERIC term, Adolescent Development, is a better match for the latter term. For terms that matched three or more LCSH, e.g. Neurological Impairments, the review could be quite time-consuming and sometimes did not yield a correct mapping.

In the subset, NPT/NPT matches represent equivalent concepts about 81% of the time, and NPT/PT matches represent equivalent concepts about 55% of the time. This last set of statistics should be viewed with some caution, given the small number of matches analyzed.

Even so, the mapping results do have some interesting implications for future mapping projects. If the term/concept-mapping rate is constant within a vocabulary, it should be possible to predict the expected mapping rate for a vocabulary based on a review of a sample of matches. Further, if the false match rate can be predicted reliably, review of matches with a high term/concept-mapping rate (PT/PT and PT/NPT, Table 2) could be dispensed with when the false match rate is below a particular threshold. Only those types of matches with low term/concept-mapping rates (NPT/NPT and NPT/PT, Table 3) would need to be reviewed. Further, for matches requiring review, more experienced reviewers could be assigned to complex matches while less experienced reviewers could be given simpler matches.
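The review policy suggested above can be expressed as a short Python sketch. The function names, the threshold value, and the sample counts are hypothetical: the denominators of 100 are invented so that the rates echo the percentages reported in this section, and they are not the counts from Tables 2 and 3. As a simplification, the sketch uses the share of matches not judged equivalent as a stand-in for the false match rate.

```python
def mapping_rates(reviewed):
    """reviewed maps match type -> (matches judged equivalent, matches reviewed)."""
    return {mtype: valid / total for mtype, (valid, total) in reviewed.items()}


def types_needing_review(rates, max_nonequivalent_rate):
    """Match types whose non-equivalent share exceeds the chosen threshold."""
    return sorted(
        mtype for mtype, rate in rates.items() if 1 - rate > max_nonequivalent_rate
    )


# Invented sample sizes; only the resulting rates mirror the text.
sample = {
    "PT/PT": (99, 100),
    "PT/NPT": (91, 100),
    "NPT/NPT": (81, 100),
    "NPT/PT": (55, 100),
}

rates = mapping_rates(sample)
print(types_needing_review(rates, 0.10))  # -> ['NPT/NPT', 'NPT/PT']

# Share of preferred terms in the ERIC subset with an equivalent concept,
# using the figures reported in the text (365 of 773).
print(f"{365 / 773:.0%}")  # -> 47%
```

With a 10% threshold, only the NPT/NPT and NPT/PT matches would be routed to reviewers, which is the division of labor proposed in the paragraph above.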

5 Inter-vocabulary linking

Vocabulary links are stored in MARC fields 7XX. Using
these fields, we can encode the following:
A legitimate concern about vocabulary mapping is how the mappings will be maintained. Although not a trivial task, mappings can be maintained with the help of software that tracks changes to vocabulary term records. Changes to vocabulary terms are recorded in a number of ways, e.g. by data in a vocabulary record that indicates when the record was last modified, by notes fields that chronicle changes to a vocabulary term (see field 688 in the MARC record examples), and through notifications of additions and changes distributed by vocabulary owners. Depending on the nature of the changes, human review may be needed to determine if mappings are still valid when a vocabulary term changes.
In this example, the LCSH terms are linked to LC subject authority records accessible through the OAI-Cat framework. These records are accessible to users via a browser and to machines through the OAI-PMH Web services mechanisms. The MeSH link generates a search of the MeSH vocabulary using the search features of the MeSH Browser.

6 Next steps

Our plans for the near term include refining the
matching software and developing improved tools for reviewers. When the
review of the ERIC/LCSH matches is complete, the file of mappings will
be made available to other researchers. The file will be available in MARC
in XML and also encoded according to version 0.5 of the Zthes schema. We
also anticipate making this file available via OAI-PMH and for searching
using SRU/SRW and the Zthes profile. See the Terminology Services project
Web site for details.

Acknowledgements

We thank the reviewers of this paper for
their many helpful comments and suggestions.

References

Doerr, M. (2001) "Semantic Problems of Thesaurus Mapping". Journal of Digital Information 1(8) http://jodi.tamu.edu/Articles/v01/i08/Doerr/
Gardner, T. (2001) "An Introduction to Web Services". Ariadne (29) http://www.ariadne.ac.uk/issue29/gardner/
Koch, T. (2003) "Activities to advance the powerful use of vocabularies in the digital environment - Structured overview" http://www.lub.lu.se/~traugott/drafts/seattlespec-vocab.html
Lancaster, F. W. and L. Smith (1983) "Compatibility Issues Affecting Information Systems and Services". General Information Programme and UNISIST, PGI-83/WS/23 (Paris: UNESCO)
Mandel, C. (1987) "Multiple Thesauri in Online Library Bibliographic Systems". Cataloging Distribution Service (Library of Congress: Washington, D.C.)
Olson, T. and G. Strawn (1997) "Mapping the LCSH and MeSH Systems". Information Technology and Libraries, 16(1), 5-19
O'Neill, E. and L. Chan (2003) "FAST (Faceted Application of Subject Terminology): A Simplified LCSH-based Vocabulary". World Library and Information Congress: 69th IFLA General Conference and Council, 1-9 August, Berlin http://www.ifla.org/IV/ifla69/papers/010e-ONeill_Mai-Chan.pdf
Tennant, R. (2002) "Digital Libraries-What To Know About Web Services". Library Journal 12 (July 15) http://www.libraryjournal.com/index.asp?layout=articleArchive&articleid=CA231639
Van de Sompel, H., Young, J. and T. Hickey (2003) "Using the OAI-PMH... Differently". D-Lib Magazine 9(7/8) http://www.dlib.org/dlib/july03/young/07young.html
Vizine-Goetz, D. (1998) "Popular LCSH with Dewey Numbers". In Annual Review of OCLC Research 1997 http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003449
Whitehead, C. (1990) "Mapping LCSH into Thesauri: the AAT Model". In Beyond the Book: Extending MARC for Subject Access, edited by T. Petersen and P. Molholt (Boston: G.K. Hall), p. 81
Zeng, M. and L. Chan (2003) "Trends and issues in establishing interoperability among knowledge organization systems". Journal of the American Society for Information Science and Technology, published online 16 Dec 2003

Links

Canadian Subject Headings (CSH) http://www.nlc-bnc.ca/6/23/index-e.html
Colorado Digitization Program Western States Dublin Core Metadata Best Practices http://www.cdpheritage.org/westerntrails/wt_bpmetadata.html
Dspace "Metadata" http://dspace.org/technology/metadata.html
ePrints UK. "Using simple Dublin Core to describe eprints" http://www.rdn.ac.uk/projects/eprints-uk/docs/simpledc-guidelines/
ERIC (2004) Educational Resources Information Center (ERIC) http://www.eric.ed.gov/index.html
Getty Vocabulary Program http://www.getty.edu/research/conducting_research/vocabularies/
GSAFD experimental Web services http://research.oclc.org/WebServices/GenreTermsAndSubjectHeadings/GenreTermsAndSubjectHeadings.asmx
Library of Congress Subject Headings (LCSH) http://www.loc.gov/cds/lcsh.html
Library of Congress Classification (LCC) http://lcweb.loc.gov/cds/mds.html#lccr
MARC21 Format for Authority Data (2003) Concise edition http://www.loc.gov/marc/authority/ecadhome.html
MARC21 Format for Classification Data (2002) Concise edition http://www.loc.gov/marc/classification/eccdhome.html
MARC Code List for Organizations (2004) http://www.loc.gov/marc/organizations/orgshome.html
MARC Standards: MARC in XML http://www.loc.gov/marc/marcxml.html
Medical Subject Headings (MeSH) http://www.nlm.nih.gov/pubs/factsheets/mesh.html
MeSH Browser http://www.nlm.nih.gov/mesh/mbinfo.html
OCLC Research: FAST: Faceted Application of Subject Terminology http://www.oclc.org/research/projects/fast/default.htm
OCLC Research: OAICat repository framework http://www.oclc.org/research/software/oai/cat.htm
OCLC Research: Search & Retrieve on the Web http://www.oclc.org/research/projects/webservices/default.htm
OCLC Research: Terminology Services http://www.oclc.org/research/projects/mswitch/4_termservs.htm
SWAD-Europe Thesaurus Activity http://www.w3c.rl.ac.uk/SWAD/thes_links.html
Zthes: a Z39.50 Profile for Thesaurus Navigation http://zthes.z3950.org/

Appendices

Appendix 1
Appendix 2
Appendix 3