Thesaurus Based Searching in SQL Server 2005

It’s been a while since I have posted something here. I was able to got some time to check out again some features of SQL Server which I have been eager to explore. I have been waiting for this eversince I experimented on it in version 2000.

Ever heard of thesaurus-based searching in SQL Server? This feature has been present since version 2000. However, it was an undocumented feature which I learned from Hilary Cotter. Google this guy and he has a lot of things to say about SQL Server. But he always remind me everytime I do fulltext stuff with SQL Server. He is the go to guy when it comes to fulltext searching. I hope he still uses the bibliographic data I gave him few years ago.

So what is this feature? For example, if you want to search the word “color“, records having the words synonymous to “color” such as “hue” and “tint” can be returned as well. The feature also allows you to expand your search keys. For example, you are searching for “100“, it can return records containing the words “100“, “one hundred” or “isang daan” (100 in Filipino).

How can this be done?

First check out the files in \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. The folder contains XML-based thesaurus definition files. SQL Server uses one of these files depending on the language setting you have. For English, check out tsENU.xml. In the file, you may add the following entries:

<replacement>
<pat>color</pat>
<sub>hue</sub>
</replacement>

<replacement>
<pat>hue</pat>
<sub>color</sub>
</replacement>

<expansion>
<sub>100</sub>
<sub>one hundred</sub>
<sub>isang daan</sub>
</expansion>

Then, enable your database for fulltext. I’d leave this for you to explore. You just need a few mouse clicks to enable your DB to fulltext searching.

The following queries will somehow give you an idea on how to use this feature. You may try and modify the scripts for you to use in your applications.

SELECT title FROM book
WHERE CONTAINS(title, ‘ FORMSOF (THESAURUS, “hue”) ‘)

This query will return records having the words “color” and “hue”.

SELECT title FROM book
WHERE CONTAINS(title, ‘ FORMSOF (THESAURUS, “one hundred”) ‘)

This query will return records having the word “100”, “one hundred”, “isang daan”.

SELECT title FROM book
WHERE CONTAINS(title, ‘ FORMSOF (THESAURUS, “isang daan”) ‘)

This query will return records having the word “100”, “one hundred”, “isang daan”.

However, SQL Server search is limited to what is defined in your XML-based thesaurus definition files. As far as I know, Microsoft has not provided a fully functional thesaurus definition file. MS left us only with templates (Don’t forget to remove the comments in the files). It is up for us to define words we want included in searching. Yeah Yeah … quite a lot of linguistic work for us techies. Let us hope MS be kind enough to include a fully functional thesaurus definition file next time.

This is all folks. I hope this is enough to give you an idea and start implementing the feature in the apps that you develop.

**************************************************
Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com

Advertisements

4 thoughts on “Thesaurus Based Searching in SQL Server 2005

  1. From : http://msdn.microsoft.com/en-us/library/ms345187.aspx

    The following restrictions apply to editing a thesaurus file:

    *

    Only system administrators can update, modify, or delete thesaurus files.
    *

    When editing thesaurus files using text editor tools, the files must be saved in Unicode format, and Byte Order Marks must be specified.
    *

    Thesaurus entries cannot be empty or word break to an empty string.
    *

    Phrases in the thesaurus file must be no longer than 512 characters.
    *

    A thesaurus must not contain any duplicate entries among the entries of expansion sets and the elements of replacement sets.

    • This info gets closest to solving the problem I’m having. I want a search to identify ‘Circle Drive’, ‘Circle Dr’ and then replace them with ‘CIRD’ in the SQL result. Your example takes it half way there, but it doesn’t quite work. The best info related to my issue on the internet, but one step away.

      • HI Peter, thanks for dropping by. Where is the difficulty you are encountering? the searching part or the replacing part?

  2. Hi Toto,

    Please let us know how to pass multiple words from Thesaurus xml file in the FORMSOF clause? For example: We wish to pass ‘hue’ & “100′ in the FORMSOF clause
    SELECT title FROM book WHERE CONTAINS(title, ‘ FORMSOF (THESAURUS, “hue 100”) ‘)
    We expect records from both hue and 100.

    Thanks in advance.

    regards,

    Parimal

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s