A Custom Text Search and Highlight Method ...


Finding exact matches in text and avoiding search words embedded in larger words...


By: John Kilgo    Date: June 25, 2006

I recently completed a small custom content management system for a client to facilitate frequent editing of a 1,200-page manual. The manual is viewable on a large intranet, and the client wanted a specific type of search mechanism for employee use. Since regular expressions and I don't get along very well, I wanted to use the .NET regular expression Match method so that I didn't have to write any complicated expressions myself.

The .NET Match method works very well and is very fast. See Highlighting Multiple Search Keywords in ASP.NET. The problem with it, however, is that it finds every match, even inappropriate ones. For example, a certain program abbreviated SSI appears a number of times in the manual. If you search on "ssi", it will also be found in "assistance", a word used many times in the manual. That means many of the "hits" returned by the search are meaningless. Making the search case-sensitive might solve this particular problem most of the time, but most of the searches must be case-insensitive.

My first thought was to search on "<space>ssi<space>", but this doesn't work for a number of reasons. What if SSI is the first or last word in a sentence, or is followed by an apostrophe, period, quote mark, question mark, etc.? I finally decided to use string handling, slower though it is, to solve the problem before handing the search word/phrase off to .NET's Match method.
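To make the problem concrete, here is a minimal sketch that counts matches both ways. The class name, method name, and sample sentence are made up for illustration; only the Regex calls come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveSearchDemo
{
    // Counts case-insensitive occurrences of a literal search word.
    public static int CountMatches ( string text, string searchWord )
    {
        Regex regex = new Regex ( Regex.Escape ( searchWord ), RegexOptions.IgnoreCase );
        return regex.Matches ( text ).Count;
    }

    public static void Main ( )
    {
        string text = "SSI is a program. Ask about assistance (SSI).";

        // Plain match: both real SSI occurrences plus the one inside "assistance".
        Console.WriteLine ( CountMatches ( text, "ssi" ) );    // prints 3

        // Space-padded match: misses SSI at the start and inside the parentheses.
        Console.WriteLine ( CountMatches ( text, " ssi " ) );  // prints 0
    }
}
```

The plain search over-reports and the space-padded search under-reports, which is why neither was good enough here.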

My client provided a list of preceding and following characters that were acceptable to produce a search hit on the word or phrase. I should also mention that the client wanted only exact matches on phrases. In other words, if searching for "eligible client", they wanted hits only on that exact phrase, not hits on each word separately.
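For readers who are comfortable with regular expressions, the same rule can probably be expressed with lookarounds. This is not the approach used in this article, only an alternative sketch. The two character lists are assumptions modelled on the characters the Highlight method tests for, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryRegexSketch
{
    // Assumed lists of characters allowed right before and right after a hit.
    private const string NotAllowedBefore = "(?<![^ \"(>/])";
    private const string NotAllowedAfter  = "(?![^ \"'),./:;>?])";

    // Builds a case-insensitive pattern for the exact phrase. The negative
    // lookarounds reject a hit whose neighbour is outside the allowed lists;
    // the start and end of the text are accepted because no neighbour exists.
    public static Regex BuildPattern ( string phrase )
    {
        string pattern = NotAllowedBefore + Regex.Escape ( phrase ) + NotAllowedAfter;
        return new Regex ( pattern, RegexOptions.IgnoreCase );
    }

    // Wraps every accepted hit in a bold tag, keeping the text's own casing.
    public static string Highlight ( string phrase, string text )
    {
        return BuildPattern ( phrase ).Replace ( text, m => "<b>" + m.Value + "</b>" );
    }

    public static void Main ( )
    {
        Console.WriteLine ( Highlight ( "ssi", "SSI helps. Ask about assistance (SSI)." ) );
        // prints: <b>SSI</b> helps. Ask about assistance (<b>SSI</b>).
    }
}
```

Because the whole phrase is escaped and matched as one literal, "eligible client" only hits where both words appear together, which matches the exact-phrase requirement.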

Another complication was that each passage at the lowest hierarchical level of the manual includes a title. The client wanted a search match if the search word/phrase appeared in either the text or the title. The passage text and the titles are stored in separate SQL Server tables. This requirement was easily satisfied by appending the text to the title before sending the string to the method performing the search. The titles of passages with hits in the title or the text were bound to a GridView control, which renders each passage title as a LinkButton. The user can then click a link to see the highlighted search words in the title and/or text.
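The title-plus-text idea can be sketched roughly as follows. The Passage class, the TitlesWithHits method, and the newline separator are all assumptions made for this illustration, and the plain substring test stands in for the real search, which also applies the character checks described in this article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one lowest-level passage of the manual.
public sealed class Passage
{
    public string Title;
    public string Text;

    public Passage ( string title, string text )
    {
        Title = title;
        Text = text;
    }
}

public static class PassageSearch
{
    // Returns the titles of all passages whose title or text contains the phrase.
    public static List<string> TitlesWithHits ( IEnumerable<Passage> passages, string phrase )
    {
        List<string> titles = new List<string> ( );

        foreach ( Passage passage in passages )
        {
            // Search title and text as one string. The newline separator is an
            // assumption; it keeps a phrase from matching across the join.
            string combined = passage.Title + "\n" + passage.Text;

            if ( combined.IndexOf ( phrase, StringComparison.OrdinalIgnoreCase ) >= 0 )
            {
                titles.Add ( passage.Title );
            }
        }

        return titles;
    }
}
```

A list like this is the kind of result that could then be bound to the GridView.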

The following code is contained in a utility class that imports the System.Text.RegularExpressions namespace. The search word/phrase and the text to be searched are passed to the Highlight method.

  267     #region Highlight
  268
  269     // *********************************
  270     //                                 *
  271     // Highlight and Bold Search Words *
  272     //                                 *
  273     // *********************************
  274
  275     public string Highlight ( string strSearchWord, string strTextToSearch )
  276     {
  277         // --------------------------------------------------------------------
  278         // First, modify strTextToSearch so that unwanted (embedded)
  279         // occurrences of strSearchWord are not highlighted, e.g. if "ssi" is
  280         // the search word we don't want to highlight (find) it in "assistance"
  281         // --------------------------------------------------------------------
  282
  283         int    intPosition;
  284         char   chrValueBefore;
  285         char   chrValueAfter;
  286         int    intAsciiBefore = 0;
  287         int    intAsciiAfter = 0;
  288         int    intSearchWordLength;
  289         string strFill = "";
  290         string strSearchText = strTextToSearch;
  291
  292         intSearchWordLength = strSearchWord.Length;
  293
  294         // ------------------------------------------------------
  295         // To avoid highlighting an embedded search word, find it
  296         // and replace it with a unique character string ("|...")
  297         // ------------------------------------------------------
  298
  299         for ( int i = 0; i < strSearchWord.Length; i++ )
  300         {
  301             strFill += "|";
  302         }
  303
  304         // --------------------------------------------------------------------------
  305         // Loop through the "string to search" and replace unwanted occurrences of the
  306         // search word with the unique string so that the regular expression will not
  307         // find the search word.
  308         // --------------------------------------------------------------------------
  309
  310         /*
  311         * Note ASCII values to be looked for immediately before and/or after the search word:
  312         *   32 = <space>
  313         *   34 = "
  314         *   39 = '
  315         *   40 = (
  316         *   41 = )
  317         *   44 = ,
  318         *   46 = .
  319         *   47 = /
  320         *   58 = :
  321         *   59 = ;
  322         *   63 = ?
  323         *   65 - 122 = A-Z, a-z
  324         *   60 = "<"  (for passage text searches we are actually looking at HTML)
  325         *   62 = ">"  (                        ""                              )
  326         */
  327
  328         for ( int i = 0; i < strSearchText.Length; i++ )
  329         {
  330             intPosition = strSearchText.ToUpper().IndexOf ( strSearchWord.ToUpper(), i );
  331
  332             // ----------------------------------------------
  333             // Don't do anything if search word was not found
  334             // ----------------------------------------------
  335
  336             if ( intPosition != -1 )
  337             {
  338                 // --------------------------------------------
  339                 // Handle any characters before the search word
  340                 // --------------------------------------------
  341
  342                 chrValueBefore = ( intPosition > 0 ) ? strSearchText [ intPosition - 1 ] : ' ';   // start of text counts as a space
  343                 intAsciiBefore = ( int ) chrValueBefore;
  344
  345                 if  ( ( intAsciiBefore >= 65 && intAsciiBefore <= 122 ) ||
  346                     ( intAsciiBefore != 32 && intAsciiBefore != 34 && intAsciiBefore != 62 &&
  347                       intAsciiBefore != 40 && intAsciiBefore != 47 ) )
  348                 {
  349                     strSearchText = strSearchText.Substring ( 0, intPosition ) + strFill + strSearchText.Substring ( intPosition + intSearchWordLength );
  350                 }
  351
  352                 // -------------------------------------------
  353                 // Handle any characters after the search word
  354                 // -------------------------------------------
  355
  356                 chrValueAfter = ( intPosition + intSearchWordLength < strSearchText.Length ) ? strSearchText [ intPosition + intSearchWordLength ] : ' ';   // end of text counts as a space
  357                 intAsciiAfter = ( int ) chrValueAfter;
  358
  359                 if  ( ( intAsciiAfter >= 65 && intAsciiAfter <= 122 ) ||
  360                     ( intAsciiAfter != 32 && intAsciiAfter != 34 && intAsciiAfter != 39 &&
  361                       intAsciiAfter != 41 && intAsciiAfter != 44 && intAsciiAfter != 46 &&
  362                       intAsciiAfter != 47 && intAsciiAfter != 58 && intAsciiAfter != 59 &&
  363                       intAsciiAfter != 62 && intAsciiAfter != 63 ) )
  364                 {
  365                     strSearchText = strSearchText.Substring ( 0, intPosition ) + strFill + strSearchText.Substring ( intPosition + intSearchWordLength );
  366                 }
  367             }
  368         }
  369
  370         // ------------------------------------------------------------------
  371         // Use a .NET Regular Expression method to highlight the search words
  372         // ------------------------------------------------------------------
  373
  374         Regex regExpression = new Regex ( Regex.Escape ( strSearchWord.Trim ( ) ), RegexOptions.IgnoreCase );
  375
  376         // --------------------------------------------------------------
  377         // Highlight keywords by calling the delegate each time a keyword
  378         // is found then replace strFill with the real search word
  379         // --------------------------------------------------------------
  380
  381         return regExpression.Replace ( strSearchText, new MatchEvaluator ( ReplaceKeyWords ) ).Replace( strFill, strSearchWord );
  382     }
  383
  384     public string ReplaceKeyWords ( Match m )
  385     {
  386         // --------------------------------------------------------------------
  387         // Add an inline styled span tag to highlight and bold the search words
  388         // --------------------------------------------------------------------
  389
  390         return "<span style=\"text-decoration: none; font-weight: bold; color: black; background: yellow;\">" + m.Value + "</span>";
  391     }
  392     #endregion

The code at line 381 calls the ReplaceKeyWords method at line 384 for each match before returning the marked-up string to the caller. Line 381 also replaces the unique character strings ("|...") with the real search word.
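The two steps performed on line 381 can be tried in isolation with a small sketch like the one below. It assumes the text has already been masked, and the names MaskedHighlightDemo, HighlightMasked, and Wrap are invented for the example; a plain bold tag stands in for the styled span.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MaskedHighlightDemo
{
    // Called once per regex match; m.Value keeps the casing found in the text.
    private static string Wrap ( Match m )
    {
        return "<b>" + m.Value + "</b>";
    }

    // Highlights searchWord in text whose embedded occurrences were already
    // replaced by fill, then swaps the fill back for the search word.
    public static string HighlightMasked ( string maskedText, string searchWord, string fill )
    {
        Regex regex = new Regex ( Regex.Escape ( searchWord ), RegexOptions.IgnoreCase );
        string highlighted = regex.Replace ( maskedText, new MatchEvaluator ( Wrap ) );
        return highlighted.Replace ( fill, searchWord );
    }

    public static void Main ( )
    {
        // "ASSISTANCE" was masked to "A|||STANCE" before this call.
        Console.WriteLine ( HighlightMasked ( "A|||STANCE and SSI", "ssi", "|||" ) );
        // prints: AssiSTANCE and <b>SSI</b>
    }
}
```

One thing the output makes visible: a masked word comes back with the casing of the search word as typed, not the casing it had in the text. If that matters, the original substrings would have to be remembered and restored instead.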

Note that I used ASCII values for the characters that are acceptable or unacceptable on either side of the search word. That was to avoid having to deal with large ranges of characters such as A-Z and a-z individually. Be aware that the range 65 to 122 also takes in the six punctuation characters that sit between Z and a in the ASCII table. Even with the string handling involved in this method it is very fast. The users are very happy with the speed of the search process.
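If numeric codes feel hard to read, the same test can be written against character lists instead. The following is only a sketch under that assumption; BoundaryChars and IsEmbedded are hypothetical names, and the two lists simply spell out the ASCII values compared in the Highlight method.

```csharp
using System;

public static class BoundaryChars
{
    // Characters accepted right before a hit (ASCII 32, 34, 40, 62, 47).
    private const string AllowedBefore = " \"(>/";

    // Characters accepted right after a hit
    // (ASCII 32, 34, 39, 41, 44, 46, 47, 58, 59, 62, 63).
    private const string AllowedAfter = " \"'),./:;>?";

    // Returns true when the occurrence starting at position, with the given
    // length, is embedded in a larger word and should not count as a hit.
    public static bool IsEmbedded ( string text, int position, int length )
    {
        int end = position + length;

        // The start and end of the text are treated as acceptable neighbours.
        bool okBefore = position == 0 || AllowedBefore.IndexOf ( text [ position - 1 ] ) >= 0;
        bool okAfter = end >= text.Length || AllowedAfter.IndexOf ( text [ end ] ) >= 0;

        return !( okBefore && okAfter );
    }
}
```

For example, IsEmbedded ( "assistance", 1, 3 ) returns true, while IsEmbedded ( "(SSI)", 1, 3 ) returns false. Anything not on a list, letters included, counts as embedding, so no letter range is needed.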

I hope you find something helpful or instructive in this code.