Finding exact matches in text and avoiding search words embedded in larger words...
I recently completed a small custom content management system for a client to facilitate frequent editing of a
1,200-page manual. The manual is viewable on a large intranet, and the client wanted a specific type of search mechanism
for employee use. Since regular expressions and I don't get along very well, I wanted to use the .NET Regex class with
the search word itself as the pattern so that I didn't have to write any complicated expressions myself.
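The search had to find whole words only: a search for "SSI", for example, should not hit the "ssi" embedded in
"assistance". My first thought was to search on "<space>ssi<space>", but this doesn't work for a number of reasons. What
if SSI is the first or last word in a sentence, or is followed by an apostrophe, period, quote mark, question mark, etc.?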
I finally decided just to use string handling, albeit slower, to solve the problem before handing the search word/phrase
off to .NET's Regex.Replace method.
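To make the problem with the space-padded search concrete, here is a minimal sketch. The class and method names are hypothetical and are not part of the project code; the helper simply looks for the search word with a space on each side.

```csharp
using System;

public static class NaiveSearchDemo
{
    // Hypothetical helper: the "<space>word<space>" idea, case-insensitive.
    public static bool ContainsSpaceDelimited ( string text, string word )
    {
        return text.IndexOf ( " " + word + " ", StringComparison.OrdinalIgnoreCase ) >= 0;
    }
}
```

With this helper, "Apply for SSI today." produces a hit, but "SSI is a federal program." (first word, no leading space) and "Is the client on SSI?" (followed by a question mark) do not.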
My client provided a list of the preceding and following characters that were acceptable for a search hit on the word
or phrase. I should also mention that the client wanted only exact matches on phrases. In other words, when searching
for "eligible client" they wanted hits only on that exact phrase, not hits on each word separately.
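The idea of acceptable neighboring characters can be sketched as a small boundary check. This is only an illustration: the class and member names are hypothetical, the two character lists are copied from the ASCII values used in the Highlight method in this post, and treating the start and end of the text as acceptable boundaries is my assumption.

```csharp
using System;

public static class BoundaryCheck
{
    // Characters accepted immediately before a hit: space " > ( /
    private const string AllowedBefore = " \">(/";

    // Characters accepted immediately after a hit: space " ' ) , . / : ; > ?
    private const string AllowedAfter = " \"'),./:;>?";

    // Returns true when the text at [position, position + length) is
    // surrounded by acceptable characters (or by the text boundaries).
    public static bool IsWholeMatch ( string text, int position, int length )
    {
        int end = position + length;

        bool okBefore = position == 0 || AllowedBefore.IndexOf ( text [ position - 1 ] ) >= 0;
        bool okAfter = end == text.Length || AllowedAfter.IndexOf ( text [ end ] ) >= 0;

        return okBefore && okAfter;
    }
}
```

Because the whole phrase is checked as one unit, "eligible client" in "an eligible client." passes, while the same characters inside "ineligible clients" fail on both sides.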
Another complication was that each of the lowest hierarchical levels of the manual included a title. They wanted a
search match if the search word/phrase appeared in either the text or the title. The text at this lowest level and the
titles are stored in separate SQL Server tables. This requirement was easily satisfied by appending the text to the
title before sending the string to the method performing the search. The titles of passages with a hit in either the
title or the text were bound to a GridView control that renders each passage title as a LinkButton. The user can then
click a link to see the highlighted search words in the title and/or text.
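A rough sketch of the title-plus-text idea follows. The Passage type, the method names, and the single space inserted between title and text are all hypothetical, and a plain case-insensitive substring test stands in for the real whole-word search.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one lowest-level passage.
public sealed class Passage
{
    public string Title { get; set; }
    public string Text { get; set; }
}

public static class PassageSearch
{
    // Append the text to the title so one search covers both.
    // The space keeps the last title word from fusing with the first text word.
    public static string BuildSearchString ( Passage passage )
    {
        return passage.Title + " " + passage.Text;
    }

    // Collect the titles of passages whose title or text contains the phrase.
    public static List<string> TitlesWithHits ( IEnumerable<Passage> passages, string phrase )
    {
        var titles = new List<string> ( );

        foreach ( Passage passage in passages )
        {
            if ( BuildSearchString ( passage ).IndexOf ( phrase, StringComparison.OrdinalIgnoreCase ) >= 0 )
            {
                titles.Add ( passage.Title );
            }
        }

        return titles;
    }
}
```

The resulting list of titles is the kind of data that could then be bound to a grid of links.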
The following code is contained in a utility class. The search word/phrase and the text to be searched are
passed to the Highlight method.
267 #region Highlight
268
269 // ***********************************
270 // *                                 *
271 // * Highlight and Bold Search Words *
272 // *                                 *
273 // ***********************************
274
275 public string Highlight ( string strSearchWord, string strTextToSearch )
276 {
277     // --------------------------------------------------------------------
278     // First, modify strTextToSearch so that unwanted (embedded)
279     // occurrences of strSearchWord are not highlighted, e.g. if "ssi" is
280     // the search word we don't want to highlight (find) it in "assistance"
281     // --------------------------------------------------------------------
282
283     int intPosition;
284     char chrValueBefore;
285     char chrValueAfter;
286     int intAsciiBefore = 0;
287     int intAsciiAfter = 0;
288     int intSearchWordLength;
289     string strFill = "";
290     string strSearchText = strTextToSearch;
291
292     intSearchWordLength = strSearchWord.Length;
293
294     // ------------------------------------------------------
295     // To avoid highlighting an embedded search word, find it
296     // and replace it with a unique character string ("|...")
297     // ------------------------------------------------------
298
299     for ( int i = 0; i < strSearchWord.Length; i++ )
300     {
301         strFill += "|";
302     }
303
304     // ---------------------------------------------------------------------------
305     // Loop through the "string to search" and replace unwanted occurrences of the
306     // search word with the unique string so that the regular expression will not
307     // find the search word.
308     // ---------------------------------------------------------------------------
309
310     /*
311      * Note ASCII values to be looked for immediately before and/or after the search word:
312      * 32 = <space>
313      * 34 = "
314      * 39 = '
315      * 40 = (
316      * 41 = )
317      * 44 = ,
318      * 46 = .
319      * 47 = /
320      * 58 = :
321      * 59 = ;
322      * 63 = ?
323      * 65 - 122 = A-Z, a-z (the range also covers 91 - 96: [ \ ] ^ _ `)
324      * 60 = "<" (for passage text searches we are actually looking at HTML)
325      * 62 = ">" ( "" )
326      */
327
328     for ( int i = 0; i < strSearchText.Length; i++ )
329     {
330         intPosition = strSearchText.IndexOf ( strSearchWord, i, StringComparison.OrdinalIgnoreCase );
331
332         // ----------------------------------------------
333         // Don't do anything if search word was not found
334         // ----------------------------------------------
335
336         if ( intPosition != -1 )
337         {
338             // ----------------------------------------------------------------------------
339             // Check the character before the search word (start of text counts as a space)
340             // ----------------------------------------------------------------------------
341
342             chrValueBefore = intPosition > 0 ? strSearchText [ intPosition - 1 ] : ' ';
343             intAsciiBefore = ( int ) chrValueBefore;
344
345             if ( ( intAsciiBefore >= 65 && intAsciiBefore <= 122 ) ||
346                  ( intAsciiBefore != 32 && intAsciiBefore != 34 && intAsciiBefore != 62 &&
347                    intAsciiBefore != 40 && intAsciiBefore != 47 ) )
348             {
349                 strSearchText = strSearchText.Substring ( 0, intPosition ) + strFill + strSearchText.Substring ( intPosition + intSearchWordLength );
350             }
351
352             // -------------------------------------------------------------------------
353             // Check the character after the search word (end of text counts as a space)
354             // -------------------------------------------------------------------------
355
356             chrValueAfter = intPosition + intSearchWordLength < strSearchText.Length ? strSearchText [ intPosition + intSearchWordLength ] : ' ';
357             intAsciiAfter = ( int ) chrValueAfter;
358
359             if ( ( intAsciiAfter >= 65 && intAsciiAfter <= 122 ) ||
360                  ( intAsciiAfter != 32 && intAsciiAfter != 34 && intAsciiAfter != 39 &&
361                    intAsciiAfter != 41 && intAsciiAfter != 44 && intAsciiAfter != 46 &&
362                    intAsciiAfter != 47 && intAsciiAfter != 58 && intAsciiAfter != 59 &&
363                    intAsciiAfter != 62 && intAsciiAfter != 63 ) )
364             {
365                 strSearchText = strSearchText.Substring ( 0, intPosition ) + strFill + strSearchText.Substring ( intPosition + intSearchWordLength );
366             }
367         }
368     }
369
370     // --------------------------------------------------------------------------------------
371     // Use a .NET Regular Expression to highlight the search words (escaped, so it is literal)
372     // --------------------------------------------------------------------------------------
373
374     Regex regExpression = new Regex ( Regex.Escape ( strSearchWord.Trim ( ) ), RegexOptions.IgnoreCase );
375
376     // ---------------------------------------------------------------
377     // Highlight keywords by calling the delegate each time a keyword
378     // is found, then replace strFill with the search word as typed
379     // ---------------------------------------------------------------
380
381     return regExpression.Replace ( strSearchText, new MatchEvaluator ( ReplaceKeyWords ) ).Replace ( strFill, strSearchWord );
382 }
383
384 public string ReplaceKeyWords ( Match m )
385 {
386     // --------------------------------------------------------------------
387     // Add an inline styled span tag to highlight and bold the search words
388     // --------------------------------------------------------------------
389
390     return "<span style=\"text-decoration: none; font-weight: bold; color: black; background: yellow;\">" + m.Value + "</span>";
391 }
392 #endregion
The code at line 381 calls the ReplaceKeyWords method at line 384 for each match before returning the marked-up string
to the caller. Line 381 also replaces the unique character strings ("|...") with the real search word.
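For readers who are more comfortable with regular expressions than I am, the same boundary rules can probably be expressed with lookaround assertions instead of the fill-string step. This is a different technique from the one used in the Highlight method, offered only as a sketch: the class name is hypothetical, the character classes are copied from the ASCII values listed in the code, a plain bold tag stands in for the styled span, and accepting the start and end of the text as boundaries is my assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundHighlighter
{
    // Accept a hit only at the start of the text or after: space " > ( /
    private const string Before = "(?<=^|[ \">(/])";

    // Accept a hit only at the end of the text or before: space " ' ) , . / : ; > ?
    private const string After = "(?=$|[ \"'),./:;>?])";

    public static string Highlight ( string searchWord, string text )
    {
        // Regex.Escape makes the search word match literally.
        string pattern = Before + Regex.Escape ( searchWord.Trim ( ) ) + After;

        return Regex.Replace ( text, pattern, m => "<b>" + m.Value + "</b>", RegexOptions.IgnoreCase );
    }
}
```

Called with "ssi" and the text "SSI assistance (ssi).", this sketch marks up the first and last occurrences and leaves "assistance" untouched.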
Note that I used ASCII values for the characters allowed or disallowed on either side of the search word. That let me
cover large ranges of characters such as A-Z and a-z with simple numeric comparisons. Even with the string handling
involved, this method is very fast, and the users are very happy with the speed of the search process.
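One detail about the numeric range is worth a small illustration. In ASCII, the codes 91 through 96 sit between "Z" and "a", so a single 65 to 122 comparison also covers the characters [ \ ] ^ _ and the backtick. The sketch below, with hypothetical names, compares that single range with a letters-only test written with character literals.

```csharp
using System;

public static class AsciiNotes
{
    // The single numeric range used in the Highlight method.
    public static bool IsInNumericRange ( char c )
    {
        return c >= 65 && c <= 122;
    }

    // Letters only, written with character literals instead of codes.
    public static bool IsAsciiLetter ( char c )
    {
        return ( c >= 'A' && c <= 'Z' ) || ( c >= 'a' && c <= 'z' );
    }
}
```

For this search the difference hardly matters, because none of those six characters is on the list of acceptable neighbors, so a hit next to one of them is rejected under either test.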
I hope you find something helpful or instructive in this code.