Thesaurus Construction

Theory and practical theory

What is a thesaurus?

A thesaurus may be defined as a compilation of words and phrases organised in such a way as to show their relationships and to provide a standardised vocabulary.

The relationships displayed in a thesaurus form both a conceptual structure (i.e. showing relationships among concepts in a hierarchical format) and a synonym structure (i.e. showing the relationship among terms having the same meaning).

A more formal definition of a thesaurus may be found on pages 38-39 of Dagobert Soergel's 1974 book, Indexing Languages and Thesauri: Construction and Maintenance:

"A thesaurus in the field of information storage and retrieval is a list of terms and/or of other signs (or symbols) indicating relationships among these elements, provided that the following criteria hold:

  1. the list contains a significant proportion of non-preferred terms and/or preferred terms not used as descriptor;
  2. terminological control is intended."

But stick with our easier approach to understanding a thesaurus and you will do fine.

Why do we need a thesaurus?

The two main purposes of a thesaurus are essentially:

  1. to help people choose the appropriate concept term(s) quickly and easily so that a search can later be made on a retrieval system for information relating to the term(s); and
  2. to determine the standardised vocabulary used for these terms.

Is a thesaurus a dictionary, an index or a classification system?

A thesaurus is like having a combination of an index in a book, a dictionary and a classification system all in one reference work.

However, the main difference between an index in a book and a thesaurus is that the book index helps you to find a word on the appropriate pages within the book. The thesaurus also points you in the right direction, but not by page numbers. Rather, a thesaurus will point you to the correct word usage and show which words are broader or narrower in meaning via its hierarchical structure.

A thesaurus also has elements of a dictionary in that concept terms may contain an explanatory note discussing their meaning to help you distinguish one similar concept from another. But a thesaurus is not meant to replace a dictionary because it is not designed to cover every single word used in the English language, nor does it need to explain a term that is commonly understood.

A thesaurus also has a classification system of its own. However, this classification system is not about helping you find the concept in one or more books on a library shelf via a number. The classification is merely there to help you see how various terms are interrelated in a hierarchical manner and to let you know precisely what term to use when searching for information on a retrieval system.

How is a thesaurus structured?

A thesaurus has all of its terms arranged in alphabetical order just like a Library of Congress (LC) subject headings list or a standard dictionary. So if the thesaurus is about information technology, you can run through the thesaurus to find the word "Computer", which will come after, say, the term "Cartridge (magnetic)".

Some of the terms will have what are called scope notes underneath them. These explain the meaning of a term, much as a dictionary does. They will normally appear directly below the word or phrase used as the concept term and will start with the abbreviation "SN".

Finally, all similar, narrower and broader terms relating in some way to the principal concept term are displayed below the scope note. These words are arranged using the following categories (or classification codes):

BT: Broader term(s)

This shows the hierarchical relationship in an upward manner in the "upside-down" knowledge tree covered by the thesaurus in the particular subject area. It merely tells you that the term you are searching for has a broader term. Remember, every broader term listed in the "BT..." section must also appear as an entry of its own in the thesaurus, together with its own list of narrower terms specified using the "NT..." code.

NT: Narrower term(s)

This shows the hierarchical relationship in a downward manner through the knowledge tree. It merely tells you that the term you are looking at has one or more specific terms that may help you to narrow down your search for a preferred term.

RT: Related term(s)

Related terms are neither more specific nor broader in meaning than the original term. They merely show term(s) that have the same or a very similar meaning to the original term, or that are closely associated with it. In other words, these are the synonymous or near-synonymous terms.
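
The BT, NT and RT codes can be pictured as fields on a record. The following is a minimal Python sketch, not a standard format: the Entry class, its field names and the sample terms are all assumptions made for illustration. It also checks the reciprocity rule described under "BT", namely that a broader term should list the original term among its own narrower terms.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One thesaurus entry (hypothetical layout, for illustration only)."""
    term: str
    scope_note: str = ""                           # SN
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT


def missing_reciprocals(entries):
    """Return (term, broader_term) pairs where the broader term
    does not list the term under its own NT."""
    problems = []
    for entry in entries.values():
        for bt in entry.broader:
            parent = entries.get(bt)
            if parent is None or entry.term not in parent.narrower:
                problems.append((entry.term, bt))
    return problems


# Sample data (assumed): "Tablets" names a BT, but the parent lacks the NT.
entries = {
    "Computers": Entry("Computers", narrower=["Laptops"]),
    "Laptops": Entry("Laptops", broader=["Computers"]),
    "Tablets": Entry("Tablets", broader=["Computers"]),
}

print(missing_reciprocals(entries))
```

Any pair reported by the check points to a "BT..." line that has no matching "NT..." line elsewhere in the thesaurus.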

Finally, to choose the correct term from various other terms, you will find something that says "Use..." or "U..." and "Used for..." or "UF..."

If you see the word "Use..." then the word or phrase that follows is the preferred term (or descriptor in the technical jargon).

If you see "Used for...", this will usually tell you that the word(s) and/or phrase(s) that appear in this section are not necessarily the preferred term(s) but will help you to see how other terms have been applied for the same concept. For example,

Cats

 UF Persian cats

    Siamese cats

In other words, the use of the terms "Persian cats" and "Siamese cats" when searching on a retrieval system may help you to find what you want, but you will have a better chance of finding what you want by using the more general and preferred term called "Cats".

Remember, wherever you see a "U" in one part of the thesaurus, you can always expect to find a "UF" in another part of the thesaurus. Just look at the terms and you will see what we mean by this.

NOTE: "Use..." and "See..." are interchangeable classification codes. You will find one thesaurus may use one type of code, and another thesaurus a different type of code.
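
Because every "Use..." has a matching "Used for...", one set of references can be derived from the other. The sketch below assumes the "Use" references are held in a plain Python dictionary (an illustrative layout, not a prescribed one) and inverts it to obtain the "UF" lists for the Cats example.

```python
from collections import defaultdict

# Non-preferred term -> preferred term ("Use" references).
# Sample data from the Cats example; the dict layout is an assumption.
use_refs = {
    "Persian cats": "Cats",
    "Siamese cats": "Cats",
}


def build_used_for(use_refs):
    """Invert the "Use" references to obtain the "Used for" lists."""
    used_for = defaultdict(list)
    for non_preferred, preferred in use_refs.items():
        used_for[preferred].append(non_preferred)
    # Sort each list so the display is alphabetical.
    return {term: sorted(variants) for term, variants in used_for.items()}


used_for = build_used_for(use_refs)
for term, variants in used_for.items():
    print(term)
    for variant in variants:
        print("  UF", variant)
```

Keeping only one direction by hand and generating the other is one way to make sure the two never drift apart.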

This is how a thesaurus is structured!

Must the terms listed in a thesaurus always display the preferred term?

Certainly there must be a number of terms displayed as the preferred terms. You will see this in any thesaurus when the terms are highlighted in bold (or CAPITAL LETTERS in some cases). Whenever you see a highlighted term, this means it is a standardised term and you will have a high probability of finding what you want from a retrieval system using this preferred term.

However, there must also be non-preferred terms so that anyone with non-standard vocabulary can find their own specific terms they are familiar with and then later be directed to the preferred term by the thesaurus.

Hence, there should be plenty of non-preferred entry terms in a thesaurus to help direct people to the limited list of preferred indexing terms. Likewise, you should remind users of these non-preferred terms under each indexing term using the "UF..." code.

Must each thesaurus term be just one word in length?

No. A term can consist of a single word or a combination of words called a compound term. However, do not expect to find a compound term like "History of art". A thesaurus will keep the terms compact and direct. So instead of looking for "History of art", look for "Art history". Also, don't look for an indirect entry like "Schools, library"; look for a term like "Library schools".

Fortunately, a good thesaurus constructor will often realise this fact. So they will try to see how people may look for a particular term, and then construct the thesaurus in such a way as to help other people quickly find the preferred term by applying the "U..." code.

For example, for a thesaurus to cross-reference inverted/indirect entries like "Antennas, Radar" to the preferred and more direct word "Radar antennas", it will show something like this:

Antennas, Radar

Use Radar antennas
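
As a rough illustration, inverted entry terms of this kind can be generated mechanically for simple two-word terms. The function names below are hypothetical, and the sketch deliberately ignores longer compound terms, which would need a human decision about which word to lead with.

```python
def inverted_entry(term):
    """Return the inverted form of a two-word term, e.g.
    "Radar antennas" -> "Antennas, Radar". Returns None otherwise."""
    words = term.split()
    if len(words) != 2:
        return None
    modifier, head = words
    # Capitalise only the first letter of the new lead word.
    return f"{head[0].upper()}{head[1:]}, {modifier}"


def use_reference(term):
    """Format the cross-reference from the inverted entry to the direct term."""
    inverted = inverted_entry(term)
    if inverted is None:
        return None
    return f"{inverted}\nUse {term}"


print(use_reference("Radar antennas"))
```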

Will I find acronyms like RADAR and EDTA in a thesaurus?

It depends on whether the thesaurus was designed for a particular set of clients or for a wider group of people.

If the thesaurus was designed for chemists, the term EDTA would not necessarily be spelt out as "ethylenediaminetetraacetic acid". Rather it would be kept as EDTA because most chemists know what this term is. However, if the thesaurus was designed for both chemists and non-chemists alike, it might be found under the full name.

The general rule of thumb is to use the common acronym only (e.g. RADAR). If the meaning of the acronym is ambiguous (e.g. NPL), then spell it out so that people will not be confused over its meaning.

Why does a thesaurus use the plural form for selected terms?

This is standard practice. The idea in constructing a thesaurus is that if a term names things that can be counted (e.g. "cats"), you use the plural form of the term. Otherwise, terms that are more conceptual like "honesty" and "knowledge", terms that are unique like "nitrogen" and "Earth", and certain other terms are kept in the singular form.

What about variations in spelling?

Variations in spelling are to be expected. Any good thesaurus will at least let you know about them and show you the preferred term.

But do remember that some thesauri are designed with a particular client in mind. For example, a thesaurus could have been designed for the American community, and therefore a preferred term like "color" may be different from what an Australian community would choose as its preferred term (e.g. "colour").

Hence you will find that no thesaurus is the absolute reference from which all other thesauri are derived. Each thesaurus is a guide to helping you select the most common and preferred words or phrases used by a particular group of people. In that way, you have a high probability of finding exactly what you want from your local retrieval system.

If you want to design a good thesaurus and want to take care of variations in spelling, list each variant under the preferred term using "UF [non-preferred term]".

How will I know if I am referring to the correct term in a thesaurus?

There will be instances where you want to find a term like "Plant" but realise there can be two definitions for the term. For example, a plant can be a biological thing like a tree. But a plant can also be equipment used to run a business.

How does a thesaurus make a distinction between these two terms (known as homographs)?

A thesaurus will often make the distinction by placing next to each term, in parentheses, the area it refers to. For example,

Plants (botany)

Plants (industry)

You may get a thesaurus that will pedantically write a number next to the term followed by the parenthetical qualifiers "( )" to let you know how many terms there are of the same spelling but with different meanings. But the approach is still the same. Just look at the information lying between "(" and ")" to give you the clue to selecting the right term.
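
A short sketch of how the parenthetical qualifiers might be read by a program is given below. It assumes every qualifier sits in a single pair of parentheses at the end of the heading; the function names and the sample list are illustrative only.

```python
import re
from collections import defaultdict

# A heading such as "Plants (botany)": term, then a qualifier in parentheses.
QUALIFIED = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_qualifier(heading):
    """Split "Plants (botany)" into ("Plants", "botany").
    Headings without a qualifier return (heading, None)."""
    match = QUALIFIED.match(heading)
    if match:
        return match.group("term"), match.group("qualifier")
    return heading, None


def group_homographs(headings):
    """Return {term: [qualifiers]} for terms with more than one meaning."""
    groups = defaultdict(list)
    for heading in headings:
        term, qualifier = split_qualifier(heading)
        if qualifier is not None:
            groups[term].append(qualifier)
    return {term: quals for term, quals in groups.items() if len(quals) > 1}


headings = ["Plants (botany)", "Plants (industry)",
            "Cartridge (magnetic)", "Computers"]
print(group_homographs(headings))
```

A searcher who types only "Plants" could then be shown both qualifiers and asked to pick one.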

Another situation you may encounter is when you are not sure if the term you have in mind is the current terminology. A good thesaurus should recognise this situation and tell you what you need to know. For example, if you looked for the term "Learning laboratories" in a thesaurus covering the field of education, you would be likely to see the following:

Learning laboratories

Use Learning resource centres

This means that the current terminology we should be using is "Learning resource centres" rather than the old "Learning laboratories".

Now what if you happened to be searching for the term "Learning resource centres" for the first time? How will you know that this is the preferred term? Look in the thesaurus for the term you want to search by and you should find the following:

Learning resource centres

Used for Learning laboratories

When you see this, you will know that the term you were searching for is the preferred term because it is printed in bold and/or you see the "UF..." classification code below the term.
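
The two lookups just described can be sketched in a few lines. The data structures and the resolve function are assumptions for illustration; a real retrieval system may handle this quite differently.

```python
# "Use" references: non-preferred term -> preferred term (sample data).
use_refs = {"Learning laboratories": "Learning resource centres"}
# Terms that carry a "UF" list, i.e. the preferred terms (sample data).
preferred_terms = {"Learning resource centres"}


def resolve(term):
    """Return (preferred_term, was_redirected) for a searcher's term,
    or (None, False) if the term is not in the thesaurus."""
    if term in preferred_terms:
        return term, False
    if term in use_refs:
        return use_refs[term], True
    return None, False


print(resolve("Learning laboratories"))
print(resolve("Learning resource centres"))
```

The second value tells the caller whether the searcher's own term was already the preferred one or whether a "Use..." reference was followed.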

For a group of terms with similar concepts, must these terms always have one preferred term?

No, not always. Although it would be nice to have just one preferred term among a variety of terms with similar or the same concepts, sometimes there can be more than one preferred term.

If you encounter this situation with a thesaurus, you have a choice of which preferred term to use. Or sometimes the person(s) who constructed the thesaurus may make that decision for you. It all depends on the type of thesaurus you are using and how it is constructed and whether a good reason is provided for the choices another person makes in choosing the preferred term on your behalf.

Do I ever need to use a thesaurus?

No, you probably don't need to. Most retrieval systems designed for your particular part of the world or organisation are usually equipped to handle the terms you commonly use. A thesaurus is mainly helpful for novices who need to communicate in a standard way with another, more specialised group of people in a particular subject field, or to find something quickly using a more specialised retrieval system. However, the ideas and concepts behind the terms, despite all the variations and combinations of words and phrases we tend to use, do not change that much.

The purpose of a thesaurus is merely to help some people find the preferred terms (often in a specialised subject field), to see how the preferred terms relate to other terms, and to provide a brief explanation of the meaning of some terms in case there could be confusion with respect to other similar terms.

There is nothing else that a thesaurus does.

Creating a thesaurus - the practical steps

When creating a thesaurus, there are several steps to be taken before the final product is ready for use. They are as follows:

  1. Identify the subject field.

    This is important so that you will know precisely how much of the subject you will cover and in what areas you will need to emphasise or give cursory examination during the construction of your thesaurus.

    It is important to estimate the parameters (e.g. the precision of the search results, the size of the collection, etc.) before building a thesaurus. Otherwise, a thesaurus that does not suit the collection may give inadequate retrieval results.

  2. Gather all the literature related to this subject which needs to be indexed.

    The thesaurus you will construct is based on this literature. In other words, many of the terms you will use to create the thesaurus will have already been printed in this literature.

    A thesaurus is normally built from several documents in the literature, and you have to read each document to understand what it contains. This is the most time-consuming aspect of creating a thesaurus.

    NOTE: Abstracts may help, but they have to cover all the essential concepts in the documents very well for this to work. Also it probably helps an indexer/thesaurus maker to know something about the subject before proceeding with the indexing of the documents. Otherwise it will take a long time to complete the job properly.

  3. Identify the users who will use the thesaurus.

    A thesaurus is usually useful to only a particular group of people involved in the subject field. Therefore, it is important to ask about their information needs through such questions as "Will you be doing your own searching or will someone else be doing it for you?", "Will your search for terms be broad or specific?" and "What terms are important to you?"

  4. Identify the file structure of the thesaurus.

    The file structure will be either a pre-coordinated (i.e. form compound terms before creating the final thesaurus structure) or a post-coordinated (i.e. form compound terms after creating a thesaurus structure) system. Choose which one will serve the users of the thesaurus best.

  5. Highlight as many keywords in the literature as possible.

    For a small number of articles in a journal, newspaper or magazine, this is best done by highlighting or underlining the keywords in the sentences. These highlighted words are known as content-bearing words. For larger and more comprehensive publications like books, use the index section at the back to help you choose your keywords.

  6. Examine the keywords you have highlighted and make sure they each represent a single concept.

    This process involves looking at all the keywords you have highlighted and asking yourself, "Do they represent a single concept?" If not, try to break them down into basic building blocks until you do find a single concept.

    Once your keywords each represent a single concept, they are called concept terms.

    NOTE: The size of a thesaurus depends on the level of detail required (i.e. number of keywords) and the size and complexity of the subject field. To keep the number of words to a minimum and to keep things simple, use commonly accepted terms and definitions.

  7. Go through the list of terms and show synonyms.

    Look at the terms and start showing their relationships. To begin with, look for synonyms. These are terms which have the same meaning. These will be useful later when creating the "RT..." and "UF..." parts of the thesaurus.

    Now choose the preferred term from the class of synonymous terms and use it to designate the fundamental concepts underlying the class. You may find it easiest to highlight the preferred terms in bold.

    NOTE: Developing a thesaurus requires a certain level of intellectual effort to work out the relationships among a raw list of terms, for example when creating groups of synonymous terms in which each group represents one concept. You have to know something of the meaning of each term to decide whether terms should be consolidated ("lumped together") or kept separate. The same goes for certain words that have the same spelling, for example "seal" for a marine animal and "seal" for the seal of approval on a document. Look at how the organisation using the list of terms defines them to determine their precise meaning.

  8. Determine broad categories.

    Because the literature you have chosen is likely to be fairly specialised, you will have lots of specific and narrowly defined terms in your list. Your next task is to create broad categories, called clusters, for these and many other terms. So look closely at the terms and see if you can choose the most important broad categories that essentially cover the entire subject matter of your literature.

    To help you see these broad categories, begin by placing the terms under Time, Place and other common subdivisions which seem to leap out at you from the subject matter. Then create the names for all these broad categories.

    Remember, the names you give to these broad categories (and which will later form part of the preferred terms in your thesaurus) may not necessarily be included in the articles. You have to think of broader terms to help combine the actual terms in the articles.

    NOTE: Again, all this work requires some intellectual effort to see the hierarchical relationships between words. To help with this process, it is a good idea to draw a tree diagram showing the relationships between the various terms gathered from the documents. Each level of the tree represents a list of descriptors, and moving down one level takes you from a main descriptor to its sub-descriptors.

  9. Move the appropriate terms into those broad categories.

    Once you have identified the broad categories, it is an easy task to allocate the terms to those categories.

  10. Develop further subdivisions in each of the broad categories to help develop your hierarchy.

    You start making the hierarchical structure for each category by looking at the terms in each broad category and deciding whether one term is broader or narrower in meaning than another.

    Sometimes you may find the same term exists in different categories. Choose whether to keep the term in one category and suppress all others with a "Use..." or "Used for..." reference, or simply to allow the term to appear in several places, in what are called poly-hierarchies. The way to choose is to ask yourself (or better still, your clients) what is the easiest and quickest way to use the thesaurus.

    At other times you may need to simply add extra terms to fill in gaps. For example, the articles may contain the term "ship", but you realise at this point that "vessel" could also be added to your list of terms because it means the same thing as "ship" even though it is not mentioned anywhere in the articles. You will have to use your judgement on this one. But it will require some lateral thinking as you start to show the relationships between terms.

    As for terms from a foreign language, if they are common, then use them as normal. If they are not common, apply the "Use..." reference.

  11. Put the terms into alphabetical order.

    The main terms used to represent each category are now ready to be ordered alphabetically. So do it. The thesaurus is now taking shape and we are close to finishing the task.

  12. Clean up the terms.

    We are into the final stages of thesaurus construction. Begin by cleaning up your terms. For example, where you have countable nouns in your terms, make sure they are in the plural form. For preferred terms, use bold print. Standardise spelling and hyphenation.

    NOTE: The choice of appropriate preferred terms depends not only on what you think the preferred term should be, but also on what the client wants. You may prefer one term, but you may have to use the client's preferred term.

  13. Make sure the main terms can stand alone.

    Each term must clearly represent the concept. If there is a chance of confusion, make term combinations where necessary.

  14. Create synonyms and antonyms as additional entry terms.

    Some more lateral thinking here. Try to imagine you are in the shoes of a typical searcher. What words would you use to find what you want in the thesaurus? Create these extra terms and put in the "Use..." and "Used for..." relationships to link the terms to the preferred terms.

  15. Put in the broader, narrower and related relationships.

    Using all the terms available to you in the hierarchical structure formed in step 10 and the synonyms in steps 7 and 14, create the BT, NT and RT in your preferred terms.

  16. Add scope notes.

    Where necessary, add scoping notes to the terms to help explain the coverage and intended meaning. The scoping notes should only be added to terms that are nouns (e.g. add a scoping note to a term like "plants", but not to "planting"). Remember that not all terms will require a scoping note. Only those terms considered ambiguous in meaning should have the scoping notes attached to them.

  17. Omit terms not useful for searching.

    There may be some terms that were useful for building up your thesaurus but are no longer important or useful during the search phase, when clients start using the thesaurus. Omit these terms, if any.

  18. Decide whether to include geographical names with the subject access terms or place them on a separate thesaurus list.

  19. Add an introduction explaining how much of the original literature and other literature the thesaurus covers, how the thesaurus works, and anything that is different about it.

  20. Add a title page.
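
Several of the steps above (choosing preferred terms in step 7, building the hierarchy in step 10, alphabetical ordering in step 11 and adding the relationship codes in steps 14 and 15) can be sketched together in Python. This is only an illustration under stated assumptions: the sample terms, the convention that the first term in each synonym group is the preferred one, and the function names are all invented for the example.

```python
from collections import defaultdict

# Step 7: synonym groups, preferred term listed first (sample data, assumed).
synonym_groups = [
    ["Ships", "Vessels"],
    ["Water transport"],
    ["Copper", "Cu"],
]

# Step 10: preferred term -> its broader term (sample hierarchy, assumed).
broader_of = {"Ships": "Water transport"}


def build_thesaurus(synonym_groups, broader_of):
    """Return {heading: [(code, target), ...]} covering Use, UF, BT and NT."""
    records = defaultdict(list)
    for group in synonym_groups:
        preferred, *variants = group
        records.setdefault(preferred, [])  # every preferred term gets a record
        for variant in variants:
            records[preferred].append(("UF", variant))
            records[variant].append(("Use", preferred))
    # Each BT line automatically gets its reciprocal NT line.
    for narrower, broader in broader_of.items():
        records[narrower].append(("BT", broader))
        records[broader].append(("NT", narrower))
    return dict(records)


def display(records):
    """Step 11: return the alphabetical display as lines of text."""
    lines = []
    for heading in sorted(records, key=str.lower):
        lines.append(heading)
        for code, target in sorted(records[heading]):
            lines.append(f"  {code} {target}")
    return lines


thesaurus = build_thesaurus(synonym_groups, broader_of)
print("\n".join(display(thesaurus)))
```

Scope notes, related terms and bold print for preferred terms are left out to keep the sketch short; the intellectual work of deciding which terms belong together still has to be done by hand before any such code is useful.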

You are free to design a thesaurus of your choice!

The task of constructing a thesaurus involves making a lot of decisions on your own such as choosing the preferred terms and working out what's a broader or narrower term.

For example, "Cu" and "Copper" are synonymous terms. "Cable" and "Wire, cable" are equivalent terms. "Polyethylene" is a narrower term than "Plastic". Likewise, "Vessel" or "Ship" is a narrower term than "Vehicle" or "Water transport". Note that "Vessel" and "Ship" are synonymous terms.

These sorts of decisions can only come with experience and some knowledge of the subject you are working with.

But the overall aim when making such decisions is to think about the people who will use the thesaurus and ask yourself how you will make it easy for these people (i) to find what they want; (ii) to see all the relationships; and (iii) to choose the appropriate preferred terms.

Just use common sense. And whatever system you develop, explain to your readers why you have chosen this system and how to use it via a set of instructions.