| SearchEngine: 
  Eliminating Words

  This chapter discusses the filters available for eliminating all words from entire
  files, eliminating useless words such as "and" or "the", reducing words
  such as "www.javasoft.com", and removing words within specific HTML
  tags.

  Before looking at the various methods of eliminating words, it is necessary
  to describe what the compiler considers a 'word' to be. The word parser, incorporated
  into the compiler, parses words according to two separate algorithms.
  
  Numbers
    Any numeric value (0 to 9 or a valid ISO-Latin1 numeric value) followed
    by other numeric values, or "." or ",",
    is considered to be a number. Trailing "." or ","
    characters are ignored.

  Words
    Any letter, followed by letters, numeric values, ".",
    "-", or "_", is considered to
    be a word. Trailing ".", "-",
    or "_" characters are ignored.

  If you want a hyphenated word to be split into its components, use the &shy;
  (&#173;) ampersand entity, also known as a soft hyphen, instead
  of the hyphen character '-', such as profit&shy;margin.
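  The two parsing rules can be sketched in Python as follows. This is an illustration only, not the compiler's own parser: the function name parse_tokens and the regular expressions are assumptions, letters are taken to be ASCII letters plus the accented ISO-Latin1 letters, and "a valid ISO-Latin1 numeric value" is approximated by the digits 0 to 9.

```python
import re

# Assumption: letters are ASCII letters plus the accented ISO-Latin1 letters;
# numeric values are approximated by the digits 0-9.
LETTER = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
NUMBER_RE = "[0-9][0-9.,]*"
WORD_RE = f"[{LETTER}][{LETTER}0-9._-]*"
TOKEN_RE = re.compile(f"(?P<word>{WORD_RE})|(?P<number>{NUMBER_RE})")


def parse_tokens(text):
    """Return (kind, token) pairs, where kind is 'word' or 'number'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.group("word") is not None:
            # Trailing ".", "-" and "_" characters are ignored.
            tokens.append(("word", match.group("word").rstrip(".-_")))
        else:
            # Trailing "." and "," characters are ignored.
            tokens.append(("number", match.group("number").rstrip(".,")))
    return tokens


print(parse_tokens("wasn't"))              # [('word', 'wasn'), ('word', 't')]
print(parse_tokens("profit\u00admargin"))  # [('word', 'profit'), ('word', 'margin')]
```

  In this sketch the apostrophe and the soft hyphen (character 173) are simply not part of either pattern, so they act as separators, while an ordinary hyphen keeps "profit-margin" together as one word.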
  Values such as "1.0" or "1,000",
  or even Dewey decimal values such as "1.2.3", would all
  be considered to be numbers. Note, however, that "1..6"
  would also be considered to be a number.

  The compiler provides the -xn option, which removes all numbers
  from the word list.

  Values such as "wasn't" would be considered to be two
  separate words: "wasn" and "t".
  The apostrophe is not tested by the word parser, as it would then have been 
  required to understand single quoted phrases. Since there are no syntactical 
  rules in HTML for #PCDATA (the text within tags), 
  it would be impossible to tell when an apostrophe marks the start or end of 
  a single quoted phrase, and when it is, well, just an apostrophe. Some people 
  also prefer to use the "`" character to start a single 
  quoted phrase.

  Removing documents from the word list

  A table of contents (TOC) document is an ideal candidate for word removal.
  Although needed to generate the dependency list, it would be unproductive for 
  the TOC document contents to appear in the word database, since the descriptors 
  (words) in that document invariably link the user to other pages.

  In this case, all words within a document can be removed from the word list
  in the same way as documents are removed from the dependency list, described
  below.

  Removing a specific document from the word list

  To remove all words in a specific document from the word list, use the -xwu
  option, and specify the document's URL path and filename components,
  for example:
  
-xwu /www/rational/application/search/doc/TOC.html

  Removing multiple documents from the word list

  To remove all words in multiple documents from the word list, use the -xwu
  option, and a filter using the wildcard character '*'. For example:
  
  
-xwu */TOC.html
  In this example, all words in all URLs ending with /TOC.html
  will be excluded from the word list.

  Another, more dangerous, example of filtering is:
  
-xwu /www/extawt/*
  In the above example, all words in all URLs beginning with
  /www/extawt/ will be excluded from the word list.

  Finally, an even more dangerous example of filtering is:
  
-xwu */extawt/*
  In this example, all words in all URLs containing /extawt/
  will be excluded from the word list.

  No other combinations of the wildcard character '*' are valid.
  A filter definition of */extawt/*remove.* will result in a (probably 
  useless) filter to remove all words from URLs containing /extawt/*remove., 
  and not the probable intention of removing all words in all URLs 
  containing /extawt/ and also remove. 
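  The filter forms shown in these examples can be sketched as a small Python matching function. This is not the compiler's code: the name url_filter_matches is hypothetical, and treating a filter without any wildcard as an exact match is an assumption. A '*' counts as a wildcard only when it is the first or last character of the filter.

```python
def url_filter_matches(pattern, url):
    """Return True if url is selected by a -xwu style filter pattern."""
    leading = pattern.startswith("*")
    if leading:
        pattern = pattern[1:]
    trailing = pattern.endswith("*")
    if trailing:
        pattern = pattern[:-1]
    # Any '*' still left in pattern is treated as an ordinary character.
    if leading and trailing:
        return pattern in url
    if leading:
        return url.endswith(pattern)
    if trailing:
        return url.startswith(pattern)
    # Assumption: with no wildcard, the filter must equal the URL exactly.
    return url == pattern


print(url_filter_matches("*/TOC.html", "/www/rational/application/search/doc/TOC.html"))  # True
print(url_filter_matches("*/extawt/*remove.*", "/www/extawt/remove.html"))                # False
```

  The second call shows why the */extawt/*remove.* filter is probably useless: the inner '*' must appear literally in the URL for the filter to match.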
  The wildcard character '*' can appear at the start of the URL,
  and/or at the end of the URL; anywhere else it is treated as an
  ordinary character.

  Generating a word list

  Before individual words can be removed, you have to know what words appear
  in the search database. The compiler provides the -lw filename 
  option, which lists all filtered words in HTML document format 
  to the specified filename.

  The following is an excerpt from the generated word list:
  
<dl>
<dt>absolute
<dt>accept
<dt>acceptable
<dt>access
<dt>according
<dt>accumulates
<dt>achieve
<dt>achieved
<dt>acronyms
<dt>add
<dt>added
<dt>addition
<dt>address
...
</dl>

  Creating word filter documents

  Common usage words, or useless words, can be removed from the database using
  word lists, which are stored in an HTML document, known as a word 
  filter document. The same format is used as the parsed documents of the dependency 
  list, so that HTML entity characters (&) can 
  be used to represent ISO-Latin1 characters in ASCII files. The current list 
  of valid ampersand entities is given in the appendix Ampersand 
  entities.

  Since the word filter document (see below) and generated word list file are
  both in HTML format, you can use your favorite text editor to cut
  and paste the words to be removed from the generated word list into the word
  filter document.

  Eliminating a word

  A specific word can be eliminated by simply having the word appear in a word
  filter document. This is a file in HTML format, which lists the 
  specific words or word filters to be used when removing words. It is a good 
  idea to list them one per line, for readability, and ease of editing. The following 
  is an excerpt from the exclude.english.html file:  
  
<dl>
<dt>a
<dt>able
<dt>about
<dt>above
<dt>accomplish
<dt>accomplished
<dt>accomplishes
<dt>across
<dt>act
<dt>acts
<dt>actual
...
</dl>
 Word filter documents are specified using the -xwf option, for 
  example:   
  
-xwf exclude.english.html
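  As an illustration of how the plain words in a word filter document might be applied, the Python sketch below extracts the <dt> entries and removes matching words from a word list. The function names read_exclusion_words and exclude_words are hypothetical, the regular expression is an assumption rather than the compiler's own HTML parser, and entity decoding uses Python's standard html.unescape, which may accept more entities than the appendix lists.

```python
import html
import re

# Assumption: each entry is the text following a <dt> tag, up to the next tag or line end.
DT_ENTRY = re.compile(r"<dt>[ \t]*([^<\r\n]+)", re.IGNORECASE)


def read_exclusion_words(document_text):
    """Return the plain words listed in a word filter document."""
    words = []
    for raw in DT_ENTRY.findall(document_text):
        entry = html.unescape(raw).strip()
        # Entries containing '*' are filters rather than plain words; skip them here.
        if entry and "*" not in entry:
            words.append(entry)
    return words


def exclude_words(word_list, excluded):
    """Return word_list without any word that appears in excluded."""
    excluded = set(excluded)
    return [word for word in word_list if word not in excluded]


filter_doc = "<dl>\n<dt>a\n<dt>able\n<dt>about\n...\n</dl>\n"
print(exclude_words(["a", "absolute", "about", "accept"], read_exclusion_words(filter_doc)))
# ['absolute', 'accept']
```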

  Reducing words

  The compiler also provides simple, though potentially dangerous, word reduction
  filters, which trim or reduce words. Generally, word reduction filters should 
  be avoided, since they can have unexpected side-effects, similar to the filters 
  used for eliminating URLs from the dependency list or word list. 
 In addition, word reduction filters slow down compilation, since
  each word parsed (there may be several thousand of them) has to be checked against 
  each filter, until a filter is matched, or all the filters have been checked. 
 Word reduction filters have the same form as URL filters, only 
  that, instead of being declared on the command line, they are placed in a word 
  filter document. If a word matches a filter, that word is not eliminated, but 
  reduced and put back into the word list.

  For example, after a first compilation, the word list might contain words (taken
  from the text of links), such as:
  
ftp.javasoft.com
...
splash.javasoft.com
...
www.javasoft.com
 In this case, say you are interested in keeping the javasoft part 
  as a word in the database, and discarding the rest. You can achieve this by 
  creating the word reduction filter (in your word filter document) as follows: 
 
  
<dl>
<dt>*javasoft*
...
</dl>
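  One way to picture this reduction is the Python sketch below. It assumes that a filter of the form *text* matches any word containing text and reduces that word to text, which is a reading of the example above and not a statement of the compiler's implementation. The function name reduce_word is hypothetical, and only the *text* form is handled.

```python
def reduce_word(word, filters):
    """Return word reduced by the first matching '*text*' filter, else unchanged."""
    for pattern in filters:
        # Only the '*text*' form is handled in this sketch.
        if len(pattern) > 2 and pattern.startswith("*") and pattern.endswith("*"):
            kept = pattern[1:-1]
            if kept in word:
                return kept  # checking stops at the first filter that matches
    return word


words = ["ftp.javasoft.com", "splash.javasoft.com", "www.javasoft.com", "address"]
reduced = sorted({reduce_word(w, ["*javasoft*"]) for w in words})
print(reduced)  # ['address', 'javasoft']
```

  The three host names collapse into the single word javasoft, while words that match no filter are left as they are.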
 You might think that such filters can be used for reducing plurals, or reducing 
  adjectives, but this is not the case. If you create word reduction filters 
  such as:  
  
<dl>
<dt>*s
<dt>*ing
...
</dl>
  they will reduce, for example, cards to card and playing
  to play, but will also reduce miss to mis
  and king to k. Caveat emptor.

  Removing words in specific HTML tags

  The compiler can remove words found in specific tags. There are four such tag
  groups:
  -nt  exclude <TITLE> tagged words.
  -nh  exclude <H1..H6> and <CAPTION> tagged words.
  -nl  exclude <DT> and <LI> tagged words.
  -nb  exclude words not inside the above listed tags.

  The order of filtering

  The compiler takes the parsed word list, and filters it to produce the final word
  list in the following order:
  1. All words are converted to lower case.
  2. If any of the -nb, -nh, -nl, or -nt
     flags are set, all words corresponding to those HTML tags are removed from
     the list.
  3. If the -xn flag is set, all numbers are removed from the list.
  4. The resulting word list is tested against the word reduction filters; matches
     are removed, reduced, and put back into the list.
  5. The resulting word list is tested against the exclusion word lists, and
     matching words are removed.

  This ordering allows for words which were reduced to then be removed.
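  The filtering order can be sketched end to end in Python. Everything below is an illustration under stated assumptions, not the compiler's implementation: the (word, tag group) input pairs, the group labels, and the function names are hypothetical, and the reduction rule is only one reading that is consistent with the examples in this chapter (a *text* filter keeps text, a *text filter trims the trailing text).

```python
# Hypothetical mapping from the tag-group flags to labels used in this sketch.
TAG_FLAGS = {"-nt": "title", "-nh": "heading", "-nl": "list", "-nb": "body"}


def reduce_word(word, filters):
    """Apply the first matching reduction filter (assumed semantics)."""
    for pattern in filters:
        if len(pattern) > 2 and pattern.startswith("*") and pattern.endswith("*"):
            if pattern[1:-1] in word:
                return pattern[1:-1]          # '*text*' keeps text
        elif len(pattern) > 1 and pattern.startswith("*"):
            suffix = pattern[1:]
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[: -len(suffix)]   # '*text' trims the trailing text
    return word


def filter_words(tagged_words, flags=(), reductions=(), exclusions=()):
    """Filter (word, tag_group) pairs in the documented order."""
    dropped_groups = {TAG_FLAGS[f] for f in flags if f in TAG_FLAGS}
    # 1. Convert all words to lower case.
    words = [(word.lower(), group) for word, group in tagged_words]
    # 2. Remove words belonging to the excluded tag groups.
    words = [word for word, group in words if group not in dropped_groups]
    # 3. Remove numbers (tokens that start with a digit) if -xn is set.
    if "-xn" in flags:
        words = [word for word in words if not word[:1].isdigit()]
    # 4. Reduce words that match a reduction filter.
    words = [reduce_word(word, reductions) for word in words]
    # 5. Remove words that appear in the exclusion list.
    excluded = set(exclusions)
    return sorted({word for word in words if word not in excluded})


tagged = [("Contents", "title"), ("Playing", "body"), ("KING", "body"),
          ("1.0", "body"), ("www.javasoft.com", "list"), ("The", "heading")]
print(filter_words(tagged, flags={"-nt", "-xn"},
                   reductions=["*javasoft*", "*ing"], exclusions=["the", "play"]))
# ['javasoft', 'k']
```

  Here "playing" is first reduced to "play" and then removed by the exclusion list, while "king" is reduced to the unhelpful "k" and survives, which illustrates both the benefit of the ordering and the risk of suffix filters.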
Copyright 
© 1987 - 2001 Rational Software Corporation
 |