Annotation of /mitgcm.org/devel/buildweb/pkg/swish-e/pod/SWISH-CONFIG.pod

Revision 1.1.1.1 (vendor branch)
Fri Sep 20 19:47:29 2002 UTC (22 years, 10 months ago) by adcroft
Branch: Import, MAIN
CVS Tags: baseline, HEAD
Changes since 1.1: +0 -0 lines
Importing web-site building process.

1     =head1 NAME
2    
3     SWISH-CONFIG - Configuration File Directives
4    
5     =head1 Swish-e CONFIGURATION FILE
6    
7     Which files Swish-e indexes, how they are indexed, and where the index
8     is written can all be controlled by a configuration file.
9    
10     The configuration file is a text file composed of comments, blank
11     lines, and B<configuration directives>. The order of the directives
12     is not important. Some directives may be used more than once in the
13     configuration file, while others can only be used once (i.e. additional
14     directives will overwrite preceding directives). Case of the directive
15     is not important -- you may use upper, lower, or mixed case.
16    
17     A comment is any line that begins with a "#".
18    
19     # This is a comment
20    
21     Directives may take more than one parameter. Enclose single parameters
22     that include whitespace in quotes (single or double). Inside of quotes
23     the backslash escapes the next character.
24    
25     ReplaceRules append "foo bar" <- define "foo bar" as a single parameter
26    
27     If you need to include a quote character in the value either use a
28     backslash to escape it, or enclose it in quotes of the other type.
29    
30     For example, under Unix you can use quotes to include white space in a
31     single parameter. Here, to protect against path names (%p) that might
32     have embedded white space, use single quotes (this also protects against
33     shell expansion or metacharacters):
34    
35     FileFilter .foo foofilter "'%p'" <- parameter passed through the shell in single quotes
36     FileFilter .foo foofilter '"%p"' <- windows uses double-quotes
37     FileFilter .foo foofilter '\'%p\'' <- silly example
38    
39    
40     Backslashes also have special meaning in regular expressions.
41    
42     FileFilterMatch pdftotext "'%p' -" /\.pdf$/
43    
44     This says that the dot is a real dot (instead of matching any character).
45     If you place the regular expression in quotes then you must use
46     double-backslashes.
47    
48     FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"
49    
50     Swish-e will convert the double backslash into a single backslash before
51     passing the parameter to the regular expression compiler.
52    
53     Commented example configuration files are included in the F<conf>
54     directory of the Swish-e distribution.
55    
56     Some command line arguments can override directives specified in the
57     configuration file. Please see also the L<SWISH-RUN|SWISH-RUN> page for
58     instructions on running Swish-e, and the L<SWISH-SEARCH|SWISH-SEARCH>
59     page for information and examples on how to search your index.
60    
61     The configuration file is specified to Swish-e by the C<-c> switch.
62     For example,
63    
64     swish-e -c myconfig.conf
65    
66     You may also split your directives up into different configuration files.
67     This allows you to have a master configuration file used for many
68     different indexes, and smaller configuration files for each separate
69     index. You can specify the different configuration files when running
70     from the command line with the C<-c> switch (see L<SWISH-RUN|SWISH-RUN>),
71     or you may include other configuration files with the B<IncludeConfigFile>
72     directive below.
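
As a minimal sketch of the master/per-index split described above, the two files below share common settings through C<IncludeConfigFile>. The file names and paths (F<common.conf>, F<docs.conf>, F</usr/local/swish/...>) are hypothetical; only the directives themselves come from this document.

```conf
# /usr/local/swish/conf/common.conf  (hypothetical master file)
# Settings shared by every index
IndexReport 1
IndexContents HTML .htm .html .shtml
IndexContents TXT .txt .log .text

# /usr/local/swish/conf/docs.conf  (hypothetical per-index file)
# Pull in the shared settings, then add what is specific to this index
IncludeConfigFile /usr/local/swish/conf/common.conf
IndexDir ./docs
IndexFile /usr/local/swish/docs.index
```

The per-index file would then be passed to Swish-e with C<swish-e -c /usr/local/swish/conf/docs.conf>.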
73    
74     Typically, in a configuration file the directives are grouped together in
75     some logical order -- that is, directives that control the source of the
76     documents would be grouped together first, and directives that control
77     how each document is filtered or its words are indexed would form another
78     group. (The directives listed below are grouped in this order.)
79    
80     The configuration file directives are listed below in these groups:
81    
82     =over 4
83    
84     =item *
85    
86     L<Administrative Headers Directives|/"Administrative Headers Directives">
87     -- You may add administrative information to the header of the index file.
88    
89     =item *
90    
91     L<Document Source Directives|/"Document Source Directives"> -- Directives
92     for selecting the source documents and the location of the index file.
93    
94     =item *
95    
96     L<Document Contents Directives|/"Document Contents Directives"> --
97     Directives that control how a document's content is indexed.
98    
99     =item *
100    
101     L<Directives for the File Access method only|/"Directives for the File
102     Access method only"> -- These directives are only applicable to the File
103     Access indexing method.
104    
105     =item *
106    
107     L<Directives for the HTTP Access Method Only|/"Directives for the HTTP
108     Access Method Only"> -- Likewise, these only apply to the HTTP Access
109     method.
110    
111     =item *
112    
113     L<Directives for the prog Access Method Only|/"Directives for the prog
114     Access Method Only"> -- These only apply to the prog Access method.
115    
116     =item *
117    
118     L<Document Filter Directives|/"Document Filter Directives"> -- This is
119     a special section that describes using document filters with Swish-e.
120    
121     =back
122    
123     =head2 Alphabetical Listing of Directives
124    
125     =over 4
126    
127     =item *
128    
129     L<AbsoluteLinks|/"item_AbsoluteLinks"> [yes|NO]
130    
131     =item *
132    
133     L<BeginCharacters|/"item_BeginCharacters"> *string of characters*
134    
135     =item *
136    
137     L<BumpPositionCounterCharacters|/"item_BumpPositionCounterCharacters"> *string*
138    
139     =item *
140    
141     L<Buzzwords|/"item_Buzzwords"> [*list of buzzwords*|File: path]
142    
143    
144     =item *
145    
146     L<ConvertHTMLEntities|/"item_ConvertHTMLEntities"> [YES|no]
147    
148     =item *
149    
150     L<DefaultContents|/"item_DefaultContents"> [TXT|HTML|XML|WML]
151    
152     =item *
153    
154     L<Delay|/"item_Delay"> *seconds*
155    
156     =item *
157    
158     L<DontBumpPositionOnEndTags|/"item_DontBumpPositionOnEndTags"> *list of names*
159    
160     =item *
161    
162     L<DontBumpPositionOnStartTags|/"item_DontBumpPositionOnStartTags"> *list of names*
163    
164     =item *
165    
166     L<EnableAltSearchSyntax|/"item_EnableAltSearchSyntax"> [yes|NO]
167    
168     =item *
169    
170     L<EndCharacters|/"item_EndCharacters"> *string of characters*
171    
172     =item *
173    
174     L<EquivalentServer|/"item_EquivalentServer"> *server alias*
175    
176     =item *
177    
178     L<ExtractPath|/"item_ExtractPath"> *metaname* [replace|remove|prepend|append|regex]
179    
180     =item *
181    
182     L<FileFilter|/"item_FileFilter"> *suffix* *program* [options]
183    
184     =item *
185    
186     L<FileFilterMatch|/"item_FileFilterMatch"> *program* *options* *regex* [*regex* ...]
187    
188     =item *
189    
190     L<FileInfoCompression|/"item_FileInfoCompression"> [yes|NO]
191    
192     =item *
193    
194     L<FileMatch|/"item_FileMatch"> [contains|is|regex] *regular expression*
195    
196     =item *
197    
198     L<FileRules|/"item_FileRules"> [contains|is|regex] *regular expression*
199    
200     =item *
201    
202     L<FuzzyIndexingMode|/"item_FuzzyIndexingMode"> [NONE|Stemming|Soundex|Metaphone|DoubleMetaphone]
203    
204     =item *
205    
206     L<FollowSymLinks|/"item_FollowSymLinks"> [yes|NO]
207    
208     =item *
209    
210     L<HTMLLinksMetaName|/"item_HTMLLinksMetaName"> *metaname*
211    
212     =item *
213    
214     L<IgnoreFirstChar|/"item_IgnoreFirstChar"> *string of characters*
215    
216     =item *
217    
218     L<IgnoreLastChar|/"item_IgnoreLastChar"> *string of characters*
219    
220     =item *
221    
222     L<IgnoreLimit|/"item_IgnoreLimit"> *integer integer*
223    
224     =item *
225    
226     L<IgnoreMetaTags|/"item_IgnoreMetaTags"> *list of names*
227    
228     =item *
229    
230     L<IgnoreNumberChars|/"item_IgnoreNumberChars"> *list of characters*
231    
232     =item *
233    
234     L<IgnoreTotalWordCountWhenRanking|/"item_IgnoreTotalWordCountWhenRanking"> [YES|no]
235    
236     =item *
237    
238     L<IgnoreWords|/"item_IgnoreWords"> [*list of stop words*|File: path]
239    
240     =item *
241    
242     L<ImageLinksMetaName|/"item_ImageLinksMetaName"> *metaname*
243    
244     =item *
245    
246     L<IncludeConfigFile|/"item_IncludeConfigFile">
247    
248     =item *
249    
250     L<IndexAdmin|/"item_IndexAdmin"> *text*
251    
252     =item *
253    
254     L<IndexAltTagMetaName|/"item_IndexAltTagMetaName"> *tagname*|as-text
255    
256     =item *
257    
258     L<IndexComments|/"item_IndexComments"> [YES|no]
259    
260     =item *
261    
262     L<IndexContents|/"item_IndexContents"> [TXT|HTML|XML|WML|TXT2|HTML2|XML2] *file extensions*
263    
264     =item *
265    
266     L<IndexDescription|/"item_IndexDescription"> *text*
267    
268     =item *
269    
270     L<IndexDir|/"item_IndexDir"> [URL|directories or files]
271    
272     =item *
273    
274     L<IndexFile|/"item_IndexFile"> *path*
275    
276     =item *
277    
278     L<IndexName|/"item_IndexName"> *text*
279    
280     =item *
281    
282     L<IndexOnly|/"item_IndexOnly"> *list of file suffixes*
283    
284     =item *
285    
286     L<IndexPointer|/"item_IndexPointer"> *text*
287    
288     =item *
289    
290     L<IndexReport|/"item_IndexReport"> [0|1|2|3]
291    
292     =item *
293    
294     L<MaxDepth|/"item_MaxDepth"> *integer*
295    
296     =item *
297    
298     L<MaxWordLimit|/"item_MaxWordLimit"> *integer*
299    
300     =item *
301    
302     L<MetaNameAlias|/"item_MetaNameAlias"> *meta name* *list of aliases*
303    
304     =item *
305    
306     L<MetaNames|/"item_MetaNames"> *list of names*
307    
308     =item *
309    
310     L<MinWordLimit|/"item_MinWordLimit"> *integer*
311    
312     =item *
313    
314     L<NoContents|/"item_NoContents"> *list of file suffixes*
315    
316     =item *
317    
318     L<obeyRobotsNoIndex|/"item_obeyRobotsNoIndex"> [yes|NO]
319    
320     =item *
321    
322     L<ParserWarnLevel|/"item_ParserWarnLevel"> [0|1|2|3]
323    
324     =item *
325    
326     L<PreSortedIndex|/"item_PreSortedIndex"> *list of property names*
327    
328     =item *
329    
330     L<PropCompressionLevel|/"item_PropCompressionLevel"> [0-9]
331    
332     =item *
333    
334     L<PropertyNameAlias|/"item_PropertyNameAlias"> *property name* *list of aliases*
335    
336     =item *
337    
338     L<PropertyNames|/"item_PropertyNames"> *list of meta names*
339    
340     =item *
341    
342     L<PropertyNamesCompareCase|/"item_PropertyNamesCompareCase"> *list of meta names*
343    
344     =item *
345    
346     L<PropertyNamesIgnoreCase|/"item_PropertyNamesIgnoreCase"> *list of meta names*
347    
348     =item *
349    
350     L<PropertyNamesDate|/"item_PropertyNamesDate"> *list of meta names*
351    
352     =item *
353    
354     L<PropertyNamesNumeric|/"item_PropertyNamesNumeric"> *list of meta names*
355    
356     =item *
357    
358     L<PropertyNamesMaxLength|/"item_PropertyNamesMaxLength"> integer *list of meta names*
359    
360     =item *
361    
362     L<ReplaceRules|/"item_ReplaceRules"> [replace|remove|prepend|append|regex]
363    
364     =item *
365    
366     L<ResultExtFormatName|/"item_ResultExtFormatName"> name -x format string
367    
368     =item *
369    
370     L<SpiderDirectory|/"item_SpiderDirectory"> *path*
371    
372     =item *
373    
374     L<StoreDescription|/"item_StoreDescription"> [XML <tag>|HTML <meta>|TXT size]
375    
376     =item *
377    
378     L<SwishProgParameters|/"item_SwishProgParameters"> *list of parameters*
379    
380     =item *
381    
382     L<SwishSearchDefaultRule|/"item_SwishSearchDefaultRule"> [<AND-WORD>|<or-word>]
383    
384     =item *
385    
386     L<SwishSearchOperators|/"item_SwishSearchOperators"> <and-word> <or-word> <not-word>
387    
388     =item *
389    
390     L<TmpDir|/"item_TmpDir"> *path*
391    
392     =item *
393    
394     L<TranslateCharacters|/"item_TranslateCharacters"> [*string1 string2*|:ascii7:]
395    
396     =item *
397    
398     L<TruncateDocSize|/"item_TruncateDocSize">
399     *number of characters*
400    
401     =item *
402    
403     L<UndefinedMetaTags|/"item_UndefinedMetaTags"> [error|ignore|INDEX|auto]
404    
405     =item *
406    
407     L<UndefinedXMLAttributes|/"item_UndefinedXMLAttributes"> [DISABLE|error|ignore|index|auto]
408    
409     =item *
410    
411     L<UseStemming|/"item_UseStemming"> [yes|NO]
412    
413     =item *
414    
415     L<UseSoundex|/"item_UseSoundex"> [yes|NO]
416    
417     =item *
418    
419     L<UseWords|/"item_UseWords"> [*list of words*|File: path]
420    
421     =item *
422    
423     L<WordCharacters|/"item_WordCharacters"> *string of characters*
424    
425     =item *
426    
427     L<XMLClassAttributes|/"item_XMLClassAttributes"> *list of XML attribute names*
428    
429     =back
430    
431     =head2 Directives that Control Swish
432    
433     These configuration directives control the general behavior of Swish-e.
434    
435     =over 4
436    
437     =item IncludeConfigFile *path to config file*
438    
439     This directive can be used to include configuration directives located
440     in another file.
441    
442     IncludeConfigFile /usr/local/swish/conf/site_config.config
443    
444     =item IndexReport [0|1|2|3]
445    
446     This sets how detailed the reporting is while indexing. You can specify
447     numbers 0 to 3. 0 is totally silent, 3 is the most verbose. The default
448     is 1.
449    
450     This may be overridden from the command line via the C<-v> switch (see
451     L<SWISH-RUN|SWISH-RUN>).
452    
453     =item ParserWarnLevel [0|1|2|3]
454    
455     Sets the error level when using the libxml2 parser for XML and HTML.
456     libxml2 will point out structural errors in your documents.
457    
458     0 = no report
459     1 = fatal errors
460     2 = errors
461     3 = warnings
462    
463     The exception is that UTF-8 to Latin-1 conversion errors are reported at
464     level 1, because words may be indexed incorrectly in these cases.
465    
466     Note that unlike other errors generated by Swish-e, these errors are
467     sent to stderr.
468    
469     =item IndexFile *path*
470    
471     IndexFile specifies the location of the generated index file. If not
472     specified, Swish-e will create the file F<index.swish-e> in the current
473     directory.
474    
475     IndexFile /usr/local/swish/site.index
476    
477     =item obeyRobotsNoIndex [yes|NO]
478    
479     When enabled, Swish-e will not index any HTML file that contains:
480    
481     <meta name="robots" content="noindex">
482    
483     The default is to ignore these meta tags and index the document.
484     This tag is described at http://www.robotstxt.org/wc/exclusion.html.
485    
486     Note: This feature is only available with the libxml2 HTML parser.
487    
488     Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following
489     comments in your documents to prevent indexing:
490    
491     <!-- SwishCommand noindex -->
492     <!-- SwishCommand index -->
493    
494     and/or these may be used also:
495    
496     <!-- noindex -->
497     <!-- index -->
498    
499     For example, these are very helpful to prevent indexing of common headers, footers, and menus.
500    
501    
502     =back
503    
504     B<NOTE>: The following items are currently not available. These items
505     require Swish-e to parse the configuration file while searching.
506    
507    
508     =over 4
509    
510     =item EnableAltSearchSyntax [yes|NO]
511    
512     B<NOTE>: The following item is currently not available.
513    
514     Enable alternate search syntax. This allows a basic search syntax like
515     that of "Altavista(c)", "Lycos(c)", etc., meaning a search query can
516     contain "+" and "-" as syntax operators.
517    
518     Example:
519    
520     swish-e -w "+word1 +word2 -word3 word4 word5"
521     "+" = following word has to be in all found documents
522     "-" = following word may not be in any document found
523     " " = following word will be searched in documents
524    
525     =item SwishSearchOperators <and-word> <or-word> <not-word>
526    
527     B<NOTE>: The following item is currently not available.
528    
529     Using this config directive you can change the boolean search operators of
530     Swish-e, e.g. to adapt these to your language.
531     The default is: AND OR NOT
532    
533     Example (german):
534    
535     SwishSearchOperators UND ODER NICHT
536    
537     =item SwishSearchDefaultRule [<AND-WORD>|<or-word>]
538    
539     B<NOTE>: The following item is currently not available.
540    
541     C<SwishSearchDefaultRule> defines the default Boolean operator to use
542     if none is specified between words or phrases. The default is C<AND>.
543    
544     The word you specify must match one of the available
545     C<SwishSearchOperators>.
546    
547     Example:
548    
549     SwishSearchOperators UND ODER NICHT
550     # Make it act like a web search engine
551     SwishSearchDefaultRule ODER
552    
553     =item ResultExtFormatName name -x format string
554    
555     B<NOTE>: The following item is currently not available.
556    
557     The output of Swish-e can be defined by specifying a format string with
558     the C<-x> command line argument. Using C<ResultExtFormatName> you can
559     assign a predefined format string to a name.
560    
561     Examples:
562    
563     ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"
564    
565     Then when searching you can specify the format string's name:
566    
567     swish-e ... -x moreinfo ...
568    
569     See the C<-x> switch in L<SWISH-RUN|SWISH-RUN> for more information
570     about output formats.
571    
572     =back
573    
574    
575     =head2 Administrative Headers Directives
576    
577     Swish-e stores configuration information in the header of the index file.
578     This information can be retrieved while searching or by functions in
579     the Swish-e C library. There are a number of fields available for your
580     own use. None of these fields are required:
581    
582     =over 4
583    
584     =item IndexName *text*
585    
586     =item IndexDescription *text*
587    
588     =item IndexPointer *text*
589    
590     =item IndexAdmin *text*
591    
592     These variables specify information that goes into index files to help
593     users and administrators. IndexName should be the name of your index,
594     like a book title. IndexDescription is a short description of the index
595     or a URL pointing to a fuller description. IndexPointer should be
596     a pointer to the original information, most likely a URL. IndexAdmin
597     should be the name of the index maintainer and can include name and email
598     information. These values should not be more than 70 or so characters
599     and should be contained in quotes. Note that the automatically generated
600     date in index files is in D/M/Y and 24-hour format.
601    
602     Examples:
603    
604     IndexName "Linux Documentation"
605     IndexDescription "This is an index of /usr/doc on our Linux machine."
606     IndexPointer http://localhost/swish/linux/index.html
607     IndexAdmin webmaster
608    
609    
610     =back
611    
612     =head2 Document Source Directives
613    
614     These directives control I<what> documents are indexed and I<how>
615     they are accessed. See also L<Directives for the File Access method
616     only|/"Directives for the File Access method only"> and L<Directives for
617     the HTTP Access Method Only|/"Directives for the HTTP Access Method Only">
618     for directives that are specific to those access methods.
619    
620    
621     =over 4
622    
623     =item IndexDir [directories or files|URL|external program]
624    
625     IndexDir defines the source of the documents for Swish-e. Swish-e
626     currently supports three file access methods: B<File system>, B<HTTP>
627     (also called B<spidering>), and B<prog> for reading files from an
628     external program.
629    
630     The C<-S> command line argument is used to select the file access method.
631    
632     swish-e -c swish.config -S fs - file system
633     swish-e -c swish.config -S http - internal http spider
634     swish-e -c swish.config -S prog - external program of any type
635    
636     For the B<fs> method of access B<IndexDir> is a space-separated
637     list of files and directories to index. Use a forward slash as the path
638     separator in MS Windows.
639    
640     For the B<http> method the B<IndexDir> setting is a list of space-separated
641     URLs.
642    
643     For the B<prog> method the B<IndexDir> setting is a list of space-separated
644     programs to run (which generate documents for Swish-e to index).
645    
646     You may specify more than one B<IndexDir> directive.
647    
648     Any sub-directories of any listed directory will also be indexed.
649    
650     Note: While I<processing> directories, Swish-e will ignore any files
651     or directories that begin with a dot ("."). You may index files
652     or directories that begin with a dot by specifying their name with
653     C<IndexDir> or C<-i>.
654    
655     Examples:
656    
657     # Index this directory and any subdirectories
658     IndexDir /usr/local/home/http
659    
660     # Index the docs directory in current directory
661     IndexDir ./docs
662    
663     # Index these files in the current directory
664     IndexDir ./index.html ./page1.html ./page2.html
665     # and index this directory, too
666     IndexDir ../public_html
667    
668     For the B<HTTP> method of access specify the URLs from which
669     you want the spidering to begin.
670    
671     Example:
672    
673     IndexDir http://www.my-site.com/index.html
674     IndexDir http://localhost/index.html
675    
676     Obviously, using the B<HTTP> method to index is B<much> slower than
677     indexing local files. Be well aware that some sites do not appreciate
678     spidering and may block your IP address. You may wish to contact the
679     remote site before spidering their web site. More information about
680     spidering can be found in L<Directives for the HTTP Access Method
681     Only|/"Directives for the HTTP Access Method Only"> below.
682    
683     For the L<prog|SWISH-RUN/"item_prog"> method of access B<IndexDir>
684     specifies the path to the program(s) to execute. The external program
685     must correctly format the documents being passed back to Swish-e.
686     Examples of external programs are provided in the F<prog-bin> directory.
687    
688     IndexDir ./myprogram.pl
689    
690     See L<prog|SWISH-RUN/"item_prog"> for details.
691    
692    
693     Note: Not all directives work with all methods.
694    
695     =item NoContents *list of file suffixes*
696    
697     Files with these suffixes will B<not> have their contents indexed.
698    
699     If the file's type is HTML (as set by C<IndexContents> or
700     C<DefaultContents>) then the file will be parsed for an HTML title and
701     that title will be indexed. Note that you must set the file's type:
702     C<.html> and C<.htm> are NOT type HTML by default.
703    
704     If a title is found, it will still be checked for C<FileRules title>,
705     and the file will be skipped if a match is found. See C<FileRules>.
706    
707     If the file's type is not HTML, or it is HTML and no title is found,
708     then the file's path will be indexed. For example, you might wish to
709     search for image files by file name.
710    
711     Example:
712    
713     NoContents .gif .xbm .au .mov .mpg .pdf .ps
714    
715     Note: Using this directive will not cause files with those suffixes
716     to be indexed. That is, if you use C<IndexOnly> to limit the types of
717     files that are indexed, then you must specify in C<IndexOnly> the same
718     suffixes listed in C<NoContents>.
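
The interaction between C<IndexOnly> and C<NoContents> noted above can be sketched as follows. The particular suffixes are only illustrative; the point is that a suffix must appear in both directives if C<IndexOnly> is in use.

```conf
# Only files with these suffixes are considered for indexing at all
IndexOnly .html .htm .gif .pdf

# Of those, do not index the contents of these; per the description
# above, only the path (or the HTML title, for type HTML) is indexed
NoContents .gif .pdf
```

Listing C<.gif> in C<NoContents> alone, without also listing it in C<IndexOnly>, would leave those files out of the index entirely.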
719    
720     A C<-S prog> program may set the C<No-Contents:> header (to anything)
721     to enable this feature for a specific document (although it would be
722     smarter for the C<-S prog> program to simply send only the pathname or
723     title to be indexed).
724    
725     =item ReplaceRules [replace|remove|prepend|append|regex]
726    
727     ReplaceRules allows you to make changes to file pathnames before
728     they're indexed. These changed file names or URLs will be returned in
729     search results.
730    
731     For example, you may index your files locally (with the File system
732     indexing method), yet return a URL in search results. This directive can
733     be used to map the file names to their respective URLs on your web server.
734    
735     There are five operations you can specify: B<replace>, B<append>,
736     B<remove>, B<prepend>, and B<regex>. They are applied to the pathname in the
737     order you've typed these commands.
738    
739     This directive uses C library regex.h regular expressions.
740    
741     replace "the string you want replaced" "what to change it to"
742     remove "a string to remove"
743     prepend "a string to add before the result"
744     append "a string to add after the result"
745     regex "/search string/replace string/options"
746    
747     Remember, quotes are needed if an expression contains white space,
748     and backslashes have special meaning.
749    
750     Regex is an Extended Regular Expression. The first character found is
751     the delimiter (but it's not smart enough to use matched chars such as [],
752     (), and {}).
753    
754     The B<replace> string may use substitution variables:
755    
756     $0 the entire matched (sub)string
757     $1-$9 returns patterns captured in "(" ")" pairs
758     $` the string before the matched pattern
759     $' the string after the matched pattern
760    
761     The B<options> change the behavior of the expression:
762    
763     i ignore the case when matching
764     g repeat the substitution for the entire pattern
765    
766     Examples:
767    
768     ReplaceRules replace testdir/ anotherdir/
769     ReplaceRules replace [a-z_0-9]*_m.*\.html index.html
770    
771     ReplaceRules remove testdir/
772    
773     ReplaceRules prepend http://localhost/
774     ReplaceRules append .html
775    
776     ReplaceRules regex !^/web/(.+)/!http://$1.domain.com/!
777     replaces a file path:
778     /web/search/foo/index.html
779     with
780     http://search.domain.com/foo/index.html
781    
782     ReplaceRules regex #^#http://localhost/www#
783     ReplaceRules prepend http://localhost/www (same thing)
784    
785     # Remove all extensions from C source files
786     ReplaceRules remove .c # ERROR! That "." is *any char*
787     ReplaceRules remove \.c # much better...
788    
789     ReplaceRules remove "\\.c" # if in quotes you need double-backslash!
790     ReplaceRules remove "\.c" # ERROR! "\." -> "." and is *any char*
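
Putting the pieces together, the following is a small sketch of the use case mentioned at the start of this item: indexing files through the file system while returning URLs in search results. The document root and host name are assumptions chosen for illustration, and the rules rely on being applied in the order written, as described above.

```conf
# Index a local document root (hypothetical path)
IndexDir /usr/local/home/http

# Step 1: strip the local document root from each path
# (the pattern is a regular expression, but contains no special characters)
ReplaceRules remove /usr/local/home/http

# Step 2: put the web server's address in front of what is left
ReplaceRules prepend http://localhost
```

Under these assumptions a file such as F</usr/local/home/http/docs/index.html> should be reported as C<http://localhost/docs/index.html>. Note that the B<remove> pattern is not anchored, so a path containing the same string elsewhere would also be affected; a B<regex> rule with a leading C<^> is the safer choice in that case.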
791    
792    
793     =item IndexContents [TXT|HTML|XML|WML|TXT2|HTML2|XML2] *file extensions*
794    
795     The C<IndexContents> directive assigns one of Swish-e's document parsers
796     to a document, based on its extension. Swish-e currently knows how
797     to parse TXT, HTML, and XML documents.
798    
799     The XML2, HTML2, and TXT2 parsers are currently only available when
800     Swish-e is configured to use libxml2.
801    
802     Documents that are not assigned a parser with C<IndexContents> will, by
803     default, use the HTML parser. The C<DefaultContents> directive may be
804     used to assign a parser to documents that do not match a file extension
805     defined with the C<IndexContents> directive.
806    
807     Example:
808    
809     IndexContents HTML .htm .html .shtml
810     IndexContents TXT .txt .log .text
811     IndexContents XML .xml
812    
813     HTML is the default type for all files, unless otherwise specified
814     (and this default can be changed by the B<DefaultContents> directive).
815     Swish-e parses titles from HTML files, if available, and keeps track
816     of the context of the text for context searching (see C<-t> in
817     L<SWISH-RUN|SWISH-RUN>). HTML and XML files use different tag formats
818     for B<MetaNames> and B<PropertyNames>.
819    
820     If using filters to convert documents you should include those extensions,
821     too. For example, if using a filter to convert .pdf to .html, you need
822     to tell Swish-e that .pdf should be indexed by the internal HTML parser:
823    
824     FileFilter .pdf pdf2html
825     IndexContents HTML .pdf
826    
827     See also L<Document Filter Directives|/"Document Filter Directives">.
828    
829     B<Note:> Some of this may be changed in the future to use content-types
830     instead of file extensions. See L<SWISH-3.0|SWISH-3.0>.
831    
832     =item DefaultContents [TXT|HTML|XML|WML|TXT2|HTML2|XML2]
833    
834     This sets the default parser for documents that are not specified in
835     B<IndexContents>. If not specified the default is HTML.
836    
837     The XML2, HTML2, and TXT2 parsers are currently only available when
838     Swish-e is configured to use libxml2.
839    
840    
841     Example:
842    
843     DefaultContents HTML
844    
845     The C<DefaultContents> directive I<should> be used when spidering,
846     as HTML files may be returned without a file extension (such as when
847     requesting a directory and the default index.html is returned).
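
A spidering configuration along these lines might look like the sketch below. The file name F<spider.conf> and the start URL are hypothetical; the directives are the ones described in this section.

```conf
# spider.conf (hypothetical), run with: swish-e -c spider.conf -S http
IndexDir http://localhost/index.html

# Pick a parser by extension where there is one
IndexContents HTML .htm .html .shtml
IndexContents TXT .txt

# URLs such as http://localhost/docs/ carry no extension, so state
# explicitly which parser handles everything not matched above
DefaultContents HTML
```

Although HTML is already the built-in default, setting C<DefaultContents> explicitly keeps the behavior visible in the configuration file.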
848    
849    
850     =item FileInfoCompression [yes|NO]
851    
852     ** This directive is currently not supported **
853    
854     Setting B<FileInfoCompression> to C<yes> will compress the index file to
855     save disk space. This may result in longer indexing times. The default
856     is C<no>.
857    
858     Also see the C<-e> switch in L<SWISH-RUN|SWISH-RUN> for saving RAM
859     during indexing.
860    
861    
862     =back
863    
864     =head2 Document Contents Directives
865    
866     These directives control what information is extracted from your source
867     documents, and how that information is made available during searching.
868    
869     =over 4
870    
871     =item ConvertHTMLEntities [YES|no]
872    
873     ASCII I<entities> can be converted automatically while indexing documents
874     of type HTML. For performance reasons you may wish to set this to C<no>
875     if your documents do not contain HTML entities. The default is C<yes>.
876    
877     If C<ConvertHTMLEntities> is set to C<no> the entities will be indexed
878     without conversion.
879    
880     B<NOTE:> Entities within XML files and files parsed with libxml2 are
881     converted regardless of this setting.
882    
883     =item MetaNames *list of names*
884    
885     META names are a way to define "fields" in your XML and HTML documents.
886     You can use the META names in your queries to limit the search to just
887     the words contained in that META name of your document. For example,
888     you might have a META tagged field in your documents called C<subjects>
889     and then you can search your documents for the word "foo" but only return
890     documents where "foo" is within the C<subjects> META tag.
891    
892     swish-e -w subjects=foo
893    
894     (See also the C<-t> switch in L<SWISH-RUN|SWISH-RUN> for information
895     about I<context> searching in HTML documents.)
896    
897     The B<MetaNames> directive is a space separated list. For example:
898    
899     MetaNames meta1 meta2 keywords subjects
900    
901     You may also use L<UndefinedMetaTags|/"item_UndefinedMetaTags"> to specify
902     automatic extraction of meta names from your HTML and XML documents,
903     and also to ignore indexing content of meta tags.
904    
905     META tags can have two formats in your B<HTML> source documents:
906    
907     <META NAME="meta1" CONTENT="some content">
908    
909     and (if using the HTML2/libxml2 parser)
910    
911     <meta1>
912     some content
913     </meta1>
914    
915     But this second version is invalid HTML, and will generate a warning if
916     ParserWarningLevel is set (libxml2 only).
917    
918     And in B<XML> documents, use the format:
919    
920     <meta1>
921     Some Content
922     </meta1>
923    
924     Then you can limit your search to just META B<meta1> like this:
925    
926     swish-e -w 'meta1=(apples or oranges)'
927    
928     You may nest the XML and the start/end tag versions:
929    
930     <keywords>
931     <tag1>
932     some content
933     </tag1>
934     <tag2>
935     some other content
936     </tag2>
937     </keywords>
938    
939     Then you can search in both tag1 and tag2 with:
940    
941     swish-e -w 'keywords=(query words)'
942    
943     Swish-e indexes all text as some metaname. The default is
944     C<swishdefault>, so these two queries are the same:
945    
946     swish-e -w foo
947     swish-e -w swishdefault=foo
948    
949     When indexing HTML Swish-e indexes the HTML title as default text, so
950     when searching Swish-e will find matches in both the HTML body and the
951     HTML title. Swish also, by default, indexes content of meta tags. So:
952    
953     swish-e -w foo
954    
955     will find "foo" in the body, the title, or any meta tags.
956    
957     Currently, there's no way to prevent Swish-e from indexing
958     the title contents along with the body contents, but see
959     L<UndefinedMetaTags|/"item_UndefinedMetaTags"> for how to control the
960     indexing of meta tags.
961    
962     If you would like to search just the title text, you may use:
963    
964     MetaNames swishtitle
965    
966     This will index the title text separately under the built-in swish
967     internal meta name "swishtitle". You may then search like
968    
969     swish-e -w foo -- search for "foo" in title, body (and undefined meta tags)
970     swish-e -w swishtitle=foo -- search for "foo" in title only
971    
972     In addition to swishtitle, you can limit searches to documents' path with:
973    
974     MetaNames swishdocpath
975    
976     Then to search for "foo" but also limit searches to documents that include
977     "manual" or "tutorial" in their path:
978    
979     swish-e -w 'foo swishdocpath=(manual or tutorial)'
980    
981     See also L<ExtractPath|/"item_ExtractPath">.
982    
983    
984     =item MetaNameAlias *meta name* *list of aliases*
985    
986     MetaNameAlias assigns aliases for a meta name. For example, if your
987     documents contain meta tags "description", "summary", and "overview"
988     that all give a summary of your documents you could do this:
989    
990     MetaNames summary
991     MetaNameAlias summary description overview
992    
993     Then all three tags will get indexed as meta tag "summary". You can
994     then search all the fields as:
995    
996     -w summary=foo
997    
998     The aliases work at search time, too. So these will also limit the search
999     to the "summary" meta name:
1000    
1001     -w description=foo
1002     -w overview=foo
1003    
1004     =item MetaNamesRank integer *list of meta names*
1005    
1006     * Not implemented yet *
1007    
1008     You can assign a bias to metanames that will affect how ranking is
1009     calculated. The range of values is from -10 to +10, with zero being
1010     no bias.
1011    
1012     MetaNamesRank 4 subject
1013     MetaNamesRank 3 swishdefault
1014     MetaNamesRank 2 author publisher
1015     MetaNamesRank -5 wrongwords
1016    
1017     This feature is not implemented yet
1018    
1019     =item HTMLLinksMetaName *metaname*
1020    
1021     Allows indexing of HTML links. Normally, HTML links (href tags) are
1022     not indexed by Swish-e. This directive defines a metaname, and links
1023     will be indexed under this meta name.
1024    
1025     Example:
1026    
1027     HTMLLinksMetaName links
1028    
1029     Now, to limit searches to files with a link to "home.html" do this:
1030    
1031     -w links='"home.html"'
1032    
1033     The double quotes force a phrase search.
1034    
1035     To make Swish-e index links as normal text, you may use:
1036    
1037     HTMLLinksMetaName swishdefault
1038    
1039     This feature is only available with the libxml2 HTML parser.
1040    
1041     =item ImageLinksMetaName *metaname*
1042    
1043     Allows indexing of image links under a metaname. Normally, image URLs
1044     are not indexed.
1045    
1046     Example:
1047    
1048     ImageLinksMetaName images
1049    
1050     Now, if you would like to find pages that include a nice image of a beach:
1051    
1052     -w images='beach'
1053    
1054     To make Swish-e index links as normal text, you may use:
1055    
1056     ImageLinksMetaName swishdefault
1057    
1058     This feature is only available with the libxml2 HTML parser.
1059    
1060    
1061     =item IndexAltTagMetaName *tagname*|as-text
1062    
1063     Allows indexing of the ALT text of <IMG> tags. Specify either a tag name, which will be
1064     used as a metaname, or the special text "as-text" which says to index the ALT text as
1065     if it were plain text at the current location.
1066    
1067     For example, by specifying a tag name:
1068    
1069     IndexAltTagMetaName bar
1070    
1071     would make this markup:
1072    
1073     <foo>
1074     <img src="/someimage.png" alt="Alt text here">
1075     </foo>
1076    
1077     appear like
1078    
1079     <foo>
1080     <bar>Alt text here</bar>
1081     </foo>
1082    
1083     Then the normal rules (C<MetaNames> and C<PropertyNames>) apply to how that text is indexed.
1084    
1085     If you use the special tag "as-text" then
1086    
1087     <foo>
1088     <img src="/someimage.png" alt="Alt text here">
1089     </foo>
1090    
1091     simply becomes
1092    
1093     <foo>
1094     Alt text here
1095     </foo>
1096    
1097     This feature is only available when using the libxml2 parser (HTML2 and XML2).
1098    
1099    
1100     =item AbsoluteLinks [yes|NO]
1101    
1102     If this is set true then Swish-e will attempt to convert relative URIs
1103     extracted from HTML documents for use with C<HTMLLinksMetaName> and
1104     C<ImageLinksMetaName> into absolute URIs. Swish-e will use any <BASE>
1105     tag found in the document, otherwise it will use the file's pathname.
1106     The pathname used will be the pathname *after* C<ReplaceRules> has been
1107     applied to the document's pathname.
1108    
1109     For example, say you wish to index image links under the metaname
1110     "images".
1111    
1112     ImageLinksMetaName images
1113    
1114     If a document is located at http://localhost/vacations/france/index.html
1115     and C<AbsoluteLinks> is set to no, then an image within that document:
1116    
1117     <img src="beach.jpeg">
1118    
1119     will only index "beach.jpeg".
1120    
1121     But, if you want more detail when searching, you can enable
1122     C<AbsoluteLinks> and Swish-e will index
1123     "http://localhost/vacations/france/beach.jpeg". You can then look for
1124     images of beaches, but only in France:
1125    
1126     -w images=(beach and france)
1127    
1128     This also means you can search for any images within France:
1129    
1130     -w images=(france)
1131    
1132     This feature is only available with the libxml2 HTML parser.
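
The conversion described above behaves much like ordinary relative URI
resolution. The following Python sketch is only an illustration of that idea,
not Swish-e's implementation; the function name C<absolute_link> and the
C<base_href> parameter are hypothetical.

```python
from urllib.parse import urljoin


def absolute_link(document_url, link, base_href=None):
    """Resolve a relative link against a <BASE> href if given, else the document URL."""
    # Assumption: a <BASE> tag, when present, takes priority over the document path.
    base = urljoin(document_url, base_href) if base_href else document_url
    return urljoin(base, link)


# The image from the example above, resolved against the document's own URL.
print(absolute_link("http://localhost/vacations/france/index.html", "beach.jpeg"))
```

With a hypothetical C<base_href> of "http://localhost/images/" the same link
would resolve under that directory instead.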
1133    
1134     =item UndefinedMetaTags [error|ignore|INDEX|auto]
1135    
1136     This directive defines the behavior of Swish-e during indexing when a
1137     meta name is found but is B<not> listed in B<MetaNames>. There are
1138     four choices:
1139    
1140    
1141     =over 2
1142    
1143     =item error
1144    
1145     If a meta name is found that is not listed in B<MetaNames>
1146     then indexing will be halted and an error reported.
1147    
1148     =item ignore
1149    
1150     The contents of the meta tag are ignored and B<not> indexed
1151     unless a metaname has been defined with the C<MetaNames> directive.
1152    
1153     =item index
1154    
1155     The contents of the meta tag are indexed, but placed in the
1156     main index unless there's an enclosing metatag already in force. This
1157     is the default.
1158    
1159     =item auto
1160    
1161     This method creates meta tags automatically for HTML meta names
1162     and XML elements. Using this is the same as specifying all the meta
1163     names explicitly in a B<MetaNames> directive.
1164    
1165     =back
1166    
1167     =item UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]
1168    
1169     This is similar to C<UndefinedMetaTags>, but only applies to XML documents (parsed with libxml2).
1170     This allows indexing of attribute content, and provides a way to index the content under a
1171     metaname. For example, C<UndefinedXMLAttributes> can make
1172    
1173     <person age="23">
1174     John Doe
1175     </person>
1176    
1177     look like the following to swish:
1178    
1179     <person>
1180     <person.age>
1181     23
1182     </person.age>
1183     John Doe
1184     </person>
1185    
1186     What happens to the text "23" will depend on the setting of C<UndefinedXMLAttributes>:
1187    
1188     =over 2
1189    
1190     =item disable
1191    
1192     XML attributes are not parsed and not indexed. This is the default.
1193    
1194     =item error
1195    
1196     If the concatenated meta name (e.g. person.age) is not listed in
1197     B<MetaNames> then indexing will be halted and an error reported.
1198    
1199     =item ignore
1200    
1201     The contents of the meta tag are ignored and B<not> indexed unless a
1202     metaname has been defined with the C<MetaNames> directive.
1203    
1204     =item index
1205    
1206     The contents of the meta tag are indexed, but placed in the main index
1207     unless there's an enclosing metatag already in force.
1208    
1209     =item auto
1210    
1211     This method will create meta tags from the combined element and attributes
1212     (and XML Class name). This option should be used with caution as it can
1213     generate a lot of metaname entries.
1214    
1215     See also the example below under C<XMLClassAttributes>.
1216    
1217    
1218     =back
1219    
1220     =item XMLClassAttributes *list of XML attribute names*
1221    
1222     Combines an XML class name with the element name to make up a metaname.
1223     For example:
1224    
1225     XMLClassAttributes class
1226    
1227     <person class="first">
1228     John
1229     </person>
1230     <person class="last">
1231     Doe
1232     </person>
1233    
1234     Will appear to Swish-e as:
1235    
1236     <person>
1237     <person.first>
1238     John
1239     </person.first>
1240     </person>
1241     <person>
1242     <person.last>
1243     Doe
1244     </person.last>
1245     </person>
1246    
1247     How the data is indexed depends on C<MetaNames> and C<UndefinedMetaTags>.
1248    
1249     Here's an example using the following configuration which combines the
1250     two directives C<XMLClassAttributes> and C<UndefinedXMLAttributes>.
1251    
1252     XMLClassAttributes class
1253     UndefinedMetaTags auto
1254     UndefinedXMLAttributes auto
1255     IndexContents XML2 .xml
1256    
1257     The source XML file looks like:
1258    
1259     <xml>
1260     <person class="student" phone="555-1212" age="102">
1261     John
1262     </person>
1263     <person greeting="howdy">Bill</person>
1264     </xml>
1265    
1266     Swish-e parses as:
1267    
1268     ./swish-e -c 2 -i 1.xml -T parsed_tags parsed_text -v 0
1269     Indexing Data Source: "File-System"
1270    
1271     <xml> (MetaName)
1272    
1273     <person> (MetaName)
1274     <person.student> (MetaName)
1275     <person.student.phone> (MetaName)
1276     555-1212
1277     </person.student.phone>
1278     <person.student.age> (MetaName)
1279     102
1280     </person.student.age>
1281     John
1282     </person>
1283    
1284     <person> (MetaName)
1285     <person.greeting> (MetaName)
1286     howdy
1287     </person.greeting>
1288     Bill
1289     </person>
1290    
1291     </xml>
1292     Indexing done!
1293    
1294     One thing to note is that the first <person> block finds a class name
1295     "student" so all metanames that are created from attributes use the
1296     combined name "person.student". The second <person> block doesn't contain
1297     a "class" attribute, so the attribute name is combined directly with the element
1298     name (e.g. "person.greeting").
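
The naming rule seen in the parsed output above can be sketched in a few lines
of Python. This is an illustration only, inferred from the example output; the
function C<attribute_metanames> is hypothetical and does not claim to mirror
Swish-e's internals.

```python
def attribute_metanames(element, attributes, class_attributes=("class",)):
    """Build metanames for an element's attributes, as in the example output."""
    prefix = element
    # If a class attribute is present, its value is folded into the prefix.
    for name in class_attributes:
        if name in attributes:
            prefix = f"{element}.{attributes[name]}"
            break
    # Every remaining attribute gets a metaname of prefix.attribute.
    return [f"{prefix}.{name}" for name in attributes if name not in class_attributes]


print(attribute_metanames("person", {"class": "student", "phone": "555-1212", "age": "102"}))
print(attribute_metanames("person", {"greeting": "howdy"}))
```

The first call yields the "person.student" names and the second yields
"person.greeting", matching the two <person> blocks above.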
1299    
1300     =item ExtractPath *metaname* [replace|remove|prepend|append|regex]
1301    
1302     This directive can be used to index extracted parts of a document's path.
1303     A common use would be to limit searches to specific areas of your
1304     file tree.
1305    
1306     The extracted string will be indexed under the specified meta name.
1307    
1308     See C<ReplaceRules> for a description of the various pattern replacement
1309     methods, but you will use the I<regex> method.
1310    
1311     For example, say your file system (or web tree) was organized into departments:
1312    
1313     /web/sales/foo...
1314     /web/parts/foo...
1315     /web/accounting/foo...
1316    
1317     And you wanted a way to limit searches to just documents under "sales".
1318    
1319     ExtractPath department regex !^/web/([^/]+)/.*$!$1!
1320    
1321     Which says, extract out the department name (as substring $1) and index
1322     it as meta name C<department>. Then to limit a search to the sales
1323     department:
1324    
1325     swish-e -w foo AND department=sales
1326    
1327     Note that the C<regex> method uses a substitution pattern, so to index
1328     only a sub-string match the I<entire> document path in the regular
1329     expression, as shown above.
1330    
1331     See the C<ExtractPathDefault> option for a way to set a value if no
1332     patterns match.
1333    
1334     Although unlikely, you may use more than one C<ExtractPath> directive.
1335     More than one directive of the I<same> meta name will operate successively
1336     (in order listed in the configuration file) on the path. This allows
1337     you to use regular expressions on the results of the previous pattern
1338     substitution (as if piping the output from one expression to the pattern
1339     of the next).
1340    
1341     ExtractPath foo regex !^(...).+$!$1!
1342     ExtractPath foo regex !^.+(.)$!$1!
1343    
1344     So, the third letter is indexed as meta name "foo" if both patterns match.
1345    
1346     ExtractPath foo regex !^X(...).+$!$1!
1347     ExtractPath foo regex !^.+(.)$!$1!
1348    
1349     Now (note the "X"), if the first pattern doesn't match, the last character
1350     of the path name is indexed. You must be clear on this behavior if you
1351     are using more than one C<ExtractPath> directive with the same metaname.
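
To make the successive behavior concrete, here is a small Python sketch that
mimics it with Python's C<re> module. It is an approximation under stated
assumptions: the helper C<extract_path> is hypothetical, Python writes the
substring as C<\1> rather than C<$1>, and a pattern that does not match is
assumed to leave the path unchanged.

```python
import re


def extract_path(path, rules, default=None):
    """Apply (pattern, replacement) rules in order, each on the previous result."""
    value = path
    matched = False
    for pattern, replacement in rules:
        # subn reports how many substitutions were made (0 means no match).
        value, count = re.subn(pattern, replacement, value, count=1)
        matched = matched or count > 0
    # Assumption: if no rule matched at all, nothing (or a default) is indexed.
    return value if matched else default


path = "/web/sales/foo.html"
both = [(r"^(...).+$", r"\1"), (r"^.+(.)$", r"\1")]
with_x = [(r"^X(...).+$", r"\1"), (r"^.+(.)$", r"\1")]

print(extract_path(path, both))    # third character of the path
print(extract_path(path, with_x))  # last character of the path
```

For this path the first rule set yields "e" (the third character) and the
second yields "l" (the last character), because the "X" pattern never matches.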
1352    
1353     The document path operated on is the real path swish used to access
1354     the document. That is, the C<ReplaceRules> directive has no effect on
1355     the path used with C<ExtractPath>.
1356    
1357     The full path is used for each meta name if more than one C<ExtractPath>
1358     directive is used. That is, changes to the path used in C<ExtractPath
1359     foo> do not affect the path used by C<ExtractPath bar>.
1360    
1361     =item ExtractPathDefault *metaname* default_value
1362    
1363     This can be used with C<ExtractPath> to set a default string to index
1364     under the given metaname if none of the C<ExtractPath> patterns match.
1365    
1366     For example, say you want to index each document with a metaname
1367     "department" based on the following path examples:
1368    
1369     /web/sales/foo...
1370     /web/parts/foo...
1371     /web/accounting/foo...
1372    
1373     But you are also indexing documents that do not follow that pattern and you want to search those
1374     separately, too.
1375    
1376     ExtractPath department regex !^/web/([^/]+)/.*$!$1!
1377     ExtractPathDefault department other
1378    
1379     Now, you may search like this:
1380    
1381     -w foo department=(sales) - limit searches to the sales documents
1382     -w foo department=(parts) - limit searches to the parts documents
1383     -w foo department=(accounting) - limit searches to the accounting documents
1384     -w foo department=(other) - everything but sales, parts, and accounting.
1385    
1386     This basically is a shortcut for:
1387    
1388     -w foo not department=(sales or parts or accounting)
1389    
1390     but you don't need to keep track of what was extracted.
1391    
1392     =item PropertyNames *list of meta names*
1393    
1394     =item PropertyNamesCompareCase *list of meta names*
1395    
1396     =item PropertyNamesIgnoreCase *list of meta names*
1397    
1398     Swish-e allows you to specify certain META tags that can be used as
1399     B<document properties>. The contents of any META tag that has been
1400     identified as a document property can be returned as part of the search
1401     results along with the rank, file name, title, and document size (see
1402     the C<-p> and C<-x> switches in L<SWISH-RUN|SWISH-RUN>).
1403    
1404     Properties are useful for returning additional data from documents in
1405     search results -- this saves the effort of reading and parsing the source
1406     files while reading Swish-e search results, and is especially useful
1407     when the source documents are no longer available or slow to access
1408     (e.g. over http).
1409    
1410     Another feature of properties is that Swish-e can use the PropertyNames
1411     for sorting the search results (see the C<-s> switch).
1412    
1413     PropertyNames author subjects
1414    
1415     Two variations are available. C<PropertyNamesCompareCase> and
1416     C<PropertyNamesIgnoreCase>. These tell Swish-e to either ignore or
1417     compare case when sorting results. The default for C<PropertyNames>
1418     is to ignore the case.
1419    
1420     PropertyNamesIgnoreCase subject
1421     PropertyNamesCompareCase keyword
1422    
1423     The defaults for "internal" properties are:
1424    
1425     swishtitle -- ignore the case
1426     swishdocpath -- compare case
1427     swishdescription -- compare case
1428    
1429     These can be overridden with C<PropertyNamesCompareCase> and
1430     C<PropertyNamesIgnoreCase>.
1431    
1432     PropertyNamesCompareCase swishtitle
1433    
1434     Use of PropertyNames will increase the size of your index files,
1435     sometimes significantly. Properties will be compressed if Swish-e is
1436     compiled with zlib as described in the L<INSTALL|INSTALL> manual page.
1437    
1438     If Swish-e finds more than one property of the same name in a document
1439     the property's contents will be concatenated for strings, and a warning
1440     is issued for numeric (or date) properties.
1441    
1442    
1443     =item PropertyNamesNumeric
1444    
1445     This directive is similar to C<PropertyNames>, but it flags the property
1446     as being a string of digits (integer value) that will be stored as binary data instead
1447     of a string. This allows the C<-s> (sort) and C<-L> (limit) switches
1448     to order and compare the property numerically.
1449    
1450     Swish-e uses C<strtoul(3)> to convert the string into an unsigned long
1451     integer. Therefore, only positive integers can be stored.
1452    
1453     Future versions of Swish-e may be able to store different property types
1454     (such as negative integers and real numbers). This directive may change
1455     in future releases of Swish.
1456    
1457     =item PropertyNamesDate
1458    
1459     This directive is exactly like C<PropertyNamesNumeric>, but it also
1460     flags the number as a machine timestamp (seconds since Epoch), and
1461     will print a formatted date when returning this property. See C<-x>
1462     in L<SWISH-RUN|SWISH-RUN>.
1463    
1464     Swish-e will not parse dates when indexing; you must use a timestamp.
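
Because the timestamp has to be produced before indexing, a small conversion
step is usually needed when generating documents. The Python sketch below shows
one way to do that; the function name, the date format, and the choice of UTC
are assumptions made for illustration.

```python
from datetime import datetime, timezone


def to_epoch_seconds(date_text, fmt="%Y-%m-%d"):
    """Convert a date string to seconds since the Epoch, treating it as UTC."""
    parsed = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# The resulting integer is what would be placed in the property's content.
print(to_epoch_seconds("2002-09-20"))
```

The printed integer, rather than the original date string, is the value to
store in the property.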
1465    
1466     =item PropertyNameAlias *property name* *list of aliases*
1467    
1468     This allows aliases for a property name. For example, if you are indexing
1469     HTML files, plus XML files that are written in English, German, and
1470     Spanish and thus use the tags "title", "titel", and "título" you can use:
1471    
1472     PropertyNameAlias swishtitle title titel título titulo
1473    
1474     Note that "swishtitle" is the built-in property used to store the title of
1475     a document, and therefore you do not need to specify it as a PropertyName
1476     before use.
1477    
1478     =item PropertyNamesMaxLength integer *list of meta names*
1479    
1480     This option will set the max length of the text stored in a property.
1481     You must specify a number between 0 and the max integer size on your
1482     platform, and a list of properties. The properties specified must not
1483     be aliases.
1484    
1485     If any of the property names do not exist they will be created (e.g. you
1486     do not need to define the property with PropertyNames first).
1487    
1488     In general, this feature will only be useful when parsing HTML or XML
1489     with the libxml2 parser.
1490    
1491     For example:
1492    
1493     PropertyNamesMaxLength 1000 swishdescription
1494     PropertyNameAlias swishdescription body
1495    
1496     Is somewhat like
1497    
1498     StoreDescription HTML <body> 1000
1499     StoreDescription XML <body> 1000
1500     StoreDescription HTML2 <body> 1000
1501     StoreDescription XML2 <body> 1000
1502    
1503     but StoreDescription allows setting the tag for each parser type.
1504    
1505     PropertyNamesMaxLength 1000 headings
1506     PropertyNameAlias headings h1 h2 h3 h4
1507    
1508     collects all the heading text into a single property called "headings", not
1509     to exceed 1000 characters.
1510    
1511    
1512     =item PreSortedIndex *list of property names*
1513    
1514     By default Swish-e generates presorted tables while indexing for each
1515     property name. This allows faster sorting when generating results.
1516     On large document collections this presorting may add to the indexing
1517     time, and also adds to the total size of the index. This directive can
1518     be used to customize exactly which properties will be presorted.
1519    
1520     If C<PreSortedIndex> is I<not> present in the config file (default
1521     action), all the properties will be presorted at indexing time. If it
1522     is present without any parameter, no properties will be presorted.
1523     Otherwise, only the property names specified will be presorted.
1524    
1525     For example, if you only wish to sort results by a property called
1526     C<title>:
1527    
1528     PropertyNames title age time
1529     PreSortedIndex title
1530    
1531    
1532     =item StoreDescription [XML <tag> size|HTML <meta> size|TXT size]
1533    
1534     B<StoreDescription> allows you to store a document description in the
1535     index file, and this description can be returned in your search results
1536     when the C<-x> switch is used to include the I<swishdescription> for
1537     extended results.
1538    
1539     For text documents you specify the type C<TXT> and the number of I<characters> to capture.
1540    
1541     StoreDescription TXT 20
1542    
1543     The above stores only the first twenty characters from the text file in the Swish-e index
1544     file.
1545    
1546     For HTML and XML file types, specify the tag to use for the
1547     description, and optionally the number of characters to capture. If the
1548     size is not specified, Swish-e will capture the entire contents of the tag.
1549    
1550     StoreDescription HTML <body> 20000
1551     StoreDescription XML <desc> 40
1552    
1553     Note that documents must be assigned a document type with C<IndexContents>
1554     or C<DefaultContents> to use this feature.
1555    
1556     Swish-e will compress the descriptions (or any other large property)
1557     if compiled to use zlib (see L<INSTALL|INSTALL>). This is recommended when using
1558     StoreDescription and a large number of documents. Compression of 30% to 50% is
1559     not uncommon with HTML files.
1560    
1561     =item PropCompressionLevel [0-9]
1562    
1563     This directive sets the compression level used when storing properties
1564     to disk. A setting of zero is no compression, and a setting of nine is
1565     the most compression.
1566    
1567     The default depends on the default setting compiled into zlib, but is
1568     typically six.
1569    
1570     This option is useful when using C<StoreDescription> to store a large
1571     amount of text in properties (or if using C<PropertyNames> with large
1572     property sizes).
1573    
1574     Properties must be larger than a size defined in F<config.h> (100 is the
1575     default) before compression will be attempted. Swish-e will never store
1576     the results of the compression if the compressed data is larger than
1577     the original data.
1578    
1579     This option is only available when Swish-e is compiled with zlib support.
1580    
1581    
1582     =item TruncateDocSize *number of characters*
1583    
1584     TruncateDocSize limits the size of a document while indexing documents
1585     and/or using filters. If a document is larger than the specified size,
1586     Swish-e reads only the specified number of bytes and ignores the rest of
1587     the document.
1588    
1589     Example:
1590    
1591     TruncateDocSize 10000000
1592    
1593     The default is zero, which means read all data.
1594    
1595    
1596     Warning: If you use TruncateDocSize, use it with care! TruncateDocSize
1597     is only a safety belt, e.g. to limit filter output when accessing
1598     databases, or to limit "runaway" filters. Truncating document input may
1599     destroy document structures for Swish-e (e.g. swish may miss closing
1600     tags for XML or HTML documents).
1601    
1602     TruncateDocSize does not currently work with the C<prog> input source
1603     method.
1604    
1605     =item FuzzyIndexingMode NONE|Stemming|Soundex|Metaphone|DoubleMetaphone
1606    
1607     Selects the type of index to create. Only one type of index may be created.
1608    
1609     It's a good idea to create both a normal index and a fuzzy index and
1610     allow your search interface to select which index to use. Many people find the
1611     fuzzy searches to be too fuzzy.
1612    
1613     The available fuzzy indexing options are:
1614    
1615     =over 4
1616    
1617     =item None
1618    
1619     Words are stored in the index without any conversion. This is the default.
1620    
1621     =item Stemming
1622    
1623     Words are converted using the Porter stemming algorithm.
1624    
1625     From: http://www.tartarus.org/~martin/PorterStemmer/
1626    
1627     The Porter stemming algorithm (or ‘Porter stemmer’) is a
1628     process for removing the commoner morphological and inflexional
1629     endings from words in English. Its main use is as part of a
1630     term normalisation process that is usually done when setting up
1631     Information Retrieval systems.
1632    
1633    
1634     This will help a search for "running" to also find "run" and "runs", for example.
1635    
1636     The stemming function does not convert words to their root, but rather
1637     programmatically removes endings on words in an attempt to make similar
1638     words with different endings stem to the same string of characters.
1639     It's not a perfect system, and searches on stemmed indexes often return
1640     curious results. For example, two entirely different words may stem to
1641     the same word.
1642    
1643     Stemming also can be confusing when used with a wildcard (truncation).
1644     For example, you might expect to find the word "running" by searching for
1645     "runn*". But this fails when using a stemmed index, as "running" stems to
1646     "run", yet searching for "runn*" looks for words that start with "runn".
1647    
1648     =item Soundex
1649    
1650     Soundex was developed in the 1880s so records for people with similar
1651     sounding names could be found more readily. Soundex is a coded surname
1652     based on the way a surname sounds rather than spelling. Surnames that
1653     sound similar, like Smith and Smyth, are filed together under the same
1654     Soundex code. This is mostly useful for US English.
1655    
1656     Soundex should not be used to search for sound-alike words. Metaphone
1657     would be more appropriate for generic sound matching of words. Soundex
1658     should only be used where you need to search multiple documents for
1659     proper names which sound similar. This is primarily used for indexing
1660     genealogical records. This may be useful for indexing other collections
1661     of data consisting mostly of names. Many common name variations are
1662     matched by Soundex. The only notable exception is the first letter of
1663     the name. The first letter is not matched for sound.
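
The following Python sketch of the common American Soundex rules shows why
"Smith" and "Smyth" share a code while names that differ in their first letter
do not. It is a generic illustration and may differ in detail from the Soundex
code Swish-e actually uses.

```python
# Letter groups and their Soundex digits; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return a four-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code
    return (first + "".join(digits) + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))  # same code
print(soundex("Kathy"), soundex("Cathy"))  # differ only in the first letter
```

The first letter is kept literally, which is why the last two names receive
different codes even though they sound alike.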
1664    
1665     =item Metaphone and DoubleMetaphone
1666    
1667     Words are transformed into a short series of letters representing the sound of the word (in English).
1668     Metaphone algorithms are often used for looking up mis-spelled words in dictionary programs.
1669    
1670     From: http://aspell.sourceforge.net/metaphone/
1671    
1672     Lawrence Philips' Metaphone Algorithm is an algorithm which returns
1673     the rough approximation of how an English word sounds.
1674    
1675     The C<DoubleMetaphone> mode will sometimes generate two different metaphones for the same word.
1676     This is supposed to be useful when a word may be pronounced more than one way.
1677    
1678     A metaphone index should give results somewhere in between Soundex and Stemming.
1679    
1680     =back
1681    
1682     =item UseStemming [yes|NO]
1683    
1684     Set to C<yes> to apply the word stemming algorithm during indexing; otherwise set to C<no>.
1685    
1686     UseStemming no
1687     UseStemming yes
1688    
1689     When UseStemming is set to C<yes> every word is stemmed before placing
1690     it into the index.
1691    
1692     This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.
1693    
1694     =item UseSoundex [yes|NO]
1695    
1696     When UseSoundex is set to C<yes> every word is converted to a Soundex
1697     code before placing it into the index.
1698    
1699     This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.
1700    
1701     =item IgnoreTotalWordCountWhenRanking [YES|no]
1702    
1703     Set to yes to ignore the total number of words in the file when calculating
1704     ranking. This is often better with merges and small files. Default is yes.
1705    
1706     IgnoreTotalWordCountWhenRanking no
1707    
1708     The default was changed from no to yes in version 2.2.
1709    
1710     =item MinWordLimit *integer*
1711    
1712     Set the minimum length of a word. Shorter words will not be indexed.
1713     The default is 1 (as defined in F<src/config.h>).
1714    
1715     MinWordLimit 5
1716    
1717     =item MaxWordLimit *integer*
1718    
1719     Set the maximum length of an indexable word. Longer words will not
1720     be indexed. The default is 40 (as defined in F<src/config.h>).
1721    
1722     =item WordCharacters *string of characters*
1723    
1724     =item IgnoreFirstChar *string of characters*
1725    
1726     =item IgnoreLastChar *string of characters*
1727    
1728     =item BeginCharacters *string of characters*
1729    
1730     =item EndCharacters *string of characters*
1731    
1732    
1733     These settings define what a word consists of to the Swish-e indexing engine.
1734     Compiled in defaults are in F<src/config.h>.
1735    
1736     When indexing, Swish-e uses B<WordCharacters> to split up the document
1737     into words. Words are defined by any string of non-blank characters
1738     that contains only the characters listed in WordCharacters. If a string
1739     of characters includes a character that is not in WordCharacters then
1740     the word will be split into two or more separate words.
1741    
1742     For example:
1743    
1744     WordCharacters abde
1745    
1746     Would turn "abcde" into two words "ab" and "de".
1747    
1748     Next, of these words, any characters defined in B<IgnoreFirstChar> are
1749     stripped off the start of the word, and B<IgnoreLastChar> characters
1750     are stripped off the end of the word. This allows, for example,
1751     periods within a word (www.slashdot.com), but not at the end of
1752     a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in
1753     WordCharacters.
1754    
1755     Finally, the resulting words MUST begin with one of the characters
1756     listed in B<BeginCharacters> and end with one of the characters listed in
1757     B<EndCharacters>. BeginCharacters and EndCharacters must be a subset of
1758     the characters in WordCharacters. Often, WordCharacters, BeginCharacters
1759     and EndCharacters will all be the same.
1760    
1761     Note that the same process applies to the query while searching.
1762    
1763     Getting these settings correct will take careful consideration and
1764     practice. It's helpful to create an index of a single test file, and
1765     then look at the words that are placed in the index (see the C<-v 4>,
1766     C<-D> and C<-k> searching switches).
1767    
1768     Currently there is only support for eight-bit characters.
1769    
1770     Example:
1771    
1772     WordCharacters .abcdefghijklmnopqrstuvwxyz
1773     BeginCharacters abcdefghijklmnopqrstuvwxyz
1774     EndCharacters abcdefghijklmnopqrstuvwxyz
1775     IgnoreFirstChar .
1776     IgnoreLastChar .
1777    
1778     So the string
1779    
1780     Please visit http://www.example.com/path/to/file.html.
1781    
1782     will be indexed as the following words:
1783    
1784     please
1785     visit
1786     http
1787     www.example.com
1788     path
1789     to
1790     file.html
1791    
1792     Which means that you can search for C<www.example.com> as a single word,
1793     but searching for just C<example> will not find the document.
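The splitting steps described above can be approximated in a few lines
of Python. This is only a rough sketch for experimenting with settings,
not Swish-e's actual implementation: the function name C<split_words>
and its parameters are made up for illustration, lowercasing is assumed
so that the result mirrors the example output, and limits such as
C<MinWordLimit> are not modeled.

```python
def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    """Approximate the WordCharacters / Ignore*Char / Begin/EndCharacters steps."""
    words = []
    current = []
    # A trailing space guarantees that the final word is flushed.
    for ch in text.lower() + " ":
        if ch in word_chars:
            current.append(ch)
            continue
        if not current:
            continue
        # Step 2: strip IgnoreFirstChar / IgnoreLastChar characters.
        word = "".join(current).lstrip(ignore_first).rstrip(ignore_last)
        current = []
        # Step 3: the word must begin and end with allowed characters.
        if word and word[0] in begin_chars and word[-1] in end_chars:
            words.append(word)
    return words


letters = "abcdefghijklmnopqrstuvwxyz"
print(split_words("Please visit http://www.example.com/path/to/file.html.",
                  "." + letters, letters, letters, ".", "."))
```

With the settings from the example this prints the same seven words
listed above, and C<split_words("abcde", "abde", "abde", "abde")> gives
"ab" and "de".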
1794    
1795     Note: when indexing HTML documents HTML entities are converted to their
1796     character equivalents before being processed with these directives.
1797     This is a change from previous versions of Swish-e where you were
1798     required to include the characters C<0123456789&#;> to index entities.
1799     See also L<ConvertHTMLEntities|/"item_ConvertHTMLEntities">
1800    
1801     =item Buzzwords [*list of buzzwords*|File: path]
1802    
1803     The Buzzwords option allows you to specify words that will be indexed
1804     regardless of WordCharacters, BeginCharacters, EndCharacters, stemming,
1805     soundex and many of the other checks done on words while indexing.
1806    
1807     Buzzwords are case insensitive.
1808    
1809     Buzzwords should be separated by spaces and may span multiple directives.
1810     If the special format C<File:filename> is used then the Buzzwords will
1811     be read from an external file during indexing.
1812    
1813     Examples:
1814    
1815     Buzzwords C++ TCP/IP
1816    
1817     Buzzwords File: ./buzzwords.lst
1818    
1819     If a Buzzword contains search operator characters they must be backslashed
1820     when searching. For example:
1821    
1822     Buzzwords C++ TCP/IP web=http
1823    
1824     ./swish-e -w 'web\=http'
1825    
1826     Buzzwords are found by splitting the text on whitespace, removing
1827     C<IgnoreFirstChar> and C<IgnoreLastChar> characters from the word,
1828     and then comparing with the list of C<Buzzwords>. Therefore, if
1829     adding C<Buzzwords> to an index you will probably want to define
1830     C<IgnoreFirstChar> and C<IgnoreLastChar> settings.
1831    
1832     Note: Buzzwords specific settings for C<IgnoreFirstChar> and
1833     C<IgnoreLastChar> may be used in the future.
1834    
1835    
1836     =item IgnoreWords [*list of stop words*|File: path]
1837    
1838     The IgnoreWords option allows you to specify words to ignore, called
1839     I<stopwords>. The default is to not use any stopwords.
1840    
1841     Words should be separated by spaces and may span multiple directives.
1842     If the special format C<File:filename> is used then the stop words will
1843     be read from an external file during indexing.
1844    
1845     In previous versions of Swish-e you could use the directive
1846    
1847     IgnoreWords swishdefault - obsolete!
1848    
1849     to include a default list of compiled in stopwords. This keyword is no
1850     longer supported.
1851    
1852     Examples:
1853    
1854     IgnoreWords www http a an the of and or
1855    
1856     IgnoreWords File: ./stopwords.de
1857    
1858     =item UseWords [*list of words*|File: path]
1859    
1860     UseWords defines the words that Swish-e will index. B<Only> the words
1861     listed will be indexed.
1862    
1863     You can specify a list of words following the directive (you may specify
1864     more than one C<UseWords> directive in a config file), and/or use the
1865     C<File:> form to specify a path to a file containing the words:
1866    
1867     UseWords perl python pascal fortran basic cobol php
1868     UseWords File: /path/to/my/wordlist
1869    
1870     Please drop the Swish-e list a note if you actually use this feature.
1871     It may be removed from future versions.
1872    
1873     =item IgnoreLimit *integer integer*
1874    
1875     This automatically omits words that appear too often in the files (these
1876     words are called stopwords). Specify a whole percentage and a number,
1877     such as "80 256". This omits words that occur in over 80% of the files
1878     and appear in over 256 files. Comment out to turn off auto-stopwording.
1879    
1880     IgnoreLimit 50 1000
1881    
1882     Swish-e must do extra processing to adjust the entire index when this
1883     feature is used. Instead of using this feature, it is recommended
1884     that you decide which words are stopwords and add them to B<IgnoreWords>
1885     in your configuration file. To do this, use IgnoreLimit one time and
1886     note the stop words that are found while indexing. Add this list to
1887     IgnoreWords, and then remove IgnoreLimit from the configuration file.
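The two IgnoreLimit thresholds combine as a logical AND. The following
Python sketch illustrates that rule only. The function name and
arguments are hypothetical, and reading "over" as strictly greater than
is an assumption rather than something taken from the Swish-e source.

```python
def is_auto_stopword(files_with_word, total_files, percent, min_files):
    """Return True if a word would be dropped under 'IgnoreLimit percent min_files'."""
    # Integer arithmetic avoids rounding problems with the percentage test.
    over_percent = files_with_word * 100 > percent * total_files
    over_count = files_with_word > min_files
    return over_percent and over_count


# IgnoreLimit 80 256, with an index of 350 files
print(is_auto_stopword(300, 350, 80, 256))  # in about 86% of files and in over 256 files
print(is_auto_stopword(200, 210, 80, 256))  # in about 95% of files, but only 200 files
```

The second call shows why the file count matters: in a small collection
a word can exceed the percentage without being dropped.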
1888    
1889     =item IgnoreMetaTags *list of names*
1890    
1891     C<IgnoreMetaTags> defines a list of metatags to ignore while indexing
1892     XML files (and HTML files if using libxml2 for parsing HTML). All text
1893     within the tags will be ignored -- both for indexing (C<MetaNames>)
1894     and properties (C<PropertyNames>). To still parse properties, yet
1895     not index the text, see L<UndefinedMetaTags|/"item_UndefinedMetaTags">.
1896    
1897     This option is useful to avoid indexing specific data from a file.
1898     For example:
1899    
1900     <person>
1901     <first_name>
1902     William
1903     </first_name> <last_name>
1904     Shakespeare
1905     </last_name> <updated_date>
1906     April 25, 1999
1907     </updated_date>
1908     </person>
1909    
1910     In the above example you might B<not> want to index the updated date,
1911     and therefore prevent finding this record by searching
1912    
1913     -w 'person=(April)'
1914    
1915     This is solved by:
1916    
1917     IgnoreMetaTags updated_date
1918    
1919    
1920     See also L<UndefinedMetaTags|/"item_UndefinedMetaTags">.
1921    
1922     =item IgnoreNumberChars *list of characters*
1923    
1924     Experimental Feature
1925    
1926     This experimental feature can be used to define a set of characters
1927     that describe a number. If a word is found to contain only those
1928     characters it will not be indexed. The characters listed must be part
1929     of C<WordCharacters> settings. In other words, the "word" checked is
1930     a word that Swish-e would otherwise index.
1931    
1932     For example,
1933    
1934     IgnoreNumberChars 0123456789$.,
1935    
1936     Then Swish-e would not index the following:
1937    
1938     123
1939     123,456.78
1940     $123.45
1941    
1942     You might be tempted to avoid indexing hex numbers with:
1943    
1944     IgnoreNumberChars 0123456789abcdef
1945    
1946     which will not index 0D31, but will also not index the word "bad".
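The check itself is simple, which is why it catches words like "bad".
A minimal Python sketch of the rule follows. The function name is
hypothetical, and lowercasing the word before the test is an assumption
made so that "0D31" behaves as described above.

```python
def is_ignored_number(word, number_chars):
    """True if every character of the word is listed in IgnoreNumberChars."""
    word = word.lower()
    return bool(word) and all(ch in number_chars for ch in word)


for word in ["123", "123,456.78", "$123.45"]:
    print(word, is_ignored_number(word, "0123456789$.,"))

# The hex example: both of these consist only of listed characters.
for word in ["0D31", "bad", "good"]:
    print(word, is_ignored_number(word, "0123456789abcdef"))
```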
1947    
1948     This is an experimental feature that may change in future versions.
1949     One possible change is to use regular expressions instead.
1950    
1951    
1952     =item IndexComments [NO|yes]
1953    
1954     This option lets you decide whether to index the contents of HTML
1955     comments. Default is no. Set to yes if comment indexing is required.
1956    
1957     IndexComments yes
1958    
1959     Note: This is a change from the default behavior prior to version 2.2.
1960    
1961     =item TranslateCharacters [*string1 string2*|:ascii7:]
1962    
1963     The TranslateCharacters directive maps the characters in string1 to the
1964     characters listed in string2.
1965    
1966     For example:
1967    
1968     # This will index a_b as a-b and ámo as amo
1969     TranslateCharacters _á -a
1970    
1971     C<TranslateCharacters :ascii7:> is a predefined set of characters that
1972     will translate eight bit characters to ascii7 characters. Using the
1973     :ascii7: rule will translate "Ääç" to "aac". This means: searching
1974     "Çelik", "çelik" or "celik" will all match the same word.
1975    
1976     TranslateCharacters is done early in the indexing process, after
1977     converting HTML entities but before splitting the input text into words
1978     based on B<WordCharacters>. So characters you are translating I<from>
1979     do not need to be listed in word characters.
1980    
1981     The same character translations take place when searching.
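The positional mapping from string1 to string2 works like a character
translation table. As a rough illustration only (the function name is
hypothetical and this is not how Swish-e implements it), the first
example above can be reproduced in Python:

```python
def translate_characters(text, string1, string2):
    """Map each character of string1 to the character at the same position in string2."""
    return text.translate(str.maketrans(string1, string2))


# "\u00e1" is the letter a with an acute accent, as in the example above.
print(translate_characters("a_b \u00e1mo", "_\u00e1", "-a"))
```

Both strings must have the same length, since each character in
string1 needs a counterpart in string2.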
1982    
1983     =item BumpPositionCounterCharacters *string*
1984    
1985     When indexing Swish-e assigns a word position to each word. This enables
1986     phrase searching. There may be cases where you would like to prevent
1987     phrase matching. The BumpPositionCounterCharacters directive allows
1988     you to specify a set of characters that when found in the text will
1989     increment the word position -- effectively preventing phrase matches
1990     across that character.
1991    
1992     For example, if you have a tag:
1993    
1994     <subjects>
1995     computer programming | apple computers
1996     </subjects>
1997    
1998     You might want to prevent matching "programming apple" in that meta name.
1999    
2000     BumpPositionCounterCharacters |
2001    
2002     There is no default, and you may list a string of characters.
2003    
2004     =item DontBumpPositionOnEndTags *list of names*
2005    
2006     =item DontBumpPositionOnStartTags *list of names*
2007    
2008     Since metatags are typically separate data fields, the word position
2009     counter is automatically bumped between metatags (actually, bumped when a
2010     start tag is found and when an end tag is found). This prevents matching
2011     a phrase that spans more than one metaname. C<DontBumpPositionOnEndTags>
2012     and C<DontBumpPositionOnStartTags> disable this feature for the listed
2013     metanames.
2014    
2015     For example,
2016    
2017     <person>
2018     <first_name>
2019     William
2020     </first_name>
2021     <last_name>
2022     Shakespeare
2023     </last_name>
2024     <updated_date>
2025     April 25, 1999
2026     </updated_date>
2027     </person>
2028    
2029     In the configuration file:
2030    
2031     DontBumpPositionOnEndTags first_name
2032     DontBumpPositionOnStartTags last_name
2033    
2034     This configuration allows this phrase search
2035    
2036     -w 'person=("william shakespeare")'
2037    
2038     but this phrase search will fail
2039    
2040     -w 'person=("shakespeare april")'
2041    
2042    
2043    
2044     =back
2045    
2046    
2047     =head2 Directives for the File Access method only
2048    
2049     Some directives have different uses depending on the source of the
2050     documents. These directives are only valid when using the B<File system>
2051     method of indexing.
2052    
2053     =over 4
2054    
2055     =item IndexOnly *list of file suffixes*
2056    
2057     This directive specifies the allowable file suffixes (extensions) while
2058     indexing. The default is to index all files specified in B<IndexDir>.
2059    
2060     # Only index .html .htm and .q files
2061     IndexOnly .html .htm .q
2062    
2063     C<IndexOnly> checks that the file name ends in the characters listed. It does
2064     not check "extensions". C<IndexOnly> is tested right before C<FileRules>
2065     is processed.
2066    
2067     =item FollowSymLinks [yes|NO]
2068    
2069     Set to "yes" to follow symbolic links while indexing, otherwise "no". Default is no.
2070    
2071     FollowSymLinks no
2072     FollowSymLinks yes
2073    
2074     Note that when set to C<no> extra stat(2) system calls must be made for
2075     each file. For a large number of files you may see a small reduction in
2076     indexing time by setting this to C<yes>.
2077    
2078     See also the C<-l> switch in L<SWISH-RUN|SWISH-RUN>.
2079    
2080     =item FileRules [type] [contains|is|regex] *regular expression*
2081    
2082     =item FileMatch [type] [contains|is|regex] *regular expression*
2083    
2084     FileRules and FileMatch are used to, respectively, exclude and include
2085     files and directories to index. Since, by default, Swish-e indexes all
2086     files and recurses all directories (but see also C<FollowSymLinks>) you
2087     will typically only use C<FileRules> to exclude files or directories.
2088     C<FileMatch> is useful in a few cases, for example, to override the
2089     behavior of C<IndexOnly>. Some examples are included below.
2090    
2091     Except for C<FileRules title ...>, this feature is only available for
2092     the file access method (-S fs), which is the default indexing mode. Also,
2093     any pathname modification with C<ReplaceRules> happens after the check
2094     for C<FileRules>. (It's unlikely that you would exclude files with
2095     C<FileRules> based on text you added with C<ReplaceRules>!)
2096    
2097     The regular expression is a C regex.h extended regular expression.
2098     You may supply more than one regular expression per line, or use
2099     separate directives. Preceding the regular expression with the word
2100     "not" negates the match.
2101    
2102     The regular expression is compared against B<[type]> as described below.
2103    
2104     For historical reasons, you can specify C<contains> or C<is>. C<is>
2105     simply forces the regular expression to match at the start and end
2106     of the string (by internally prepending "^" and appending "$" to the
2107     regular expression).
2108    
2109     The C<regex> option requires delimiter characters:
2110    
2111     FileRules title regex /^private/i
2112    
2113     The only advantage of C<regex> is if you want to do case insensitive
2114     matches, or simply like your regular expressions to look like perl
2115     regular expressions. You must use matching delimiters; (), {}, and [],
2116     are not currently supported for no good reason other than laziness.
2117    
2118     Use quotes (" or ') around a pattern if it contains any white space.
2119     Note that the backslash character becomes the escape character within
2120     quotes.
2121    
2122     For example, these sets generate the same regular expressions.
2123    
2124     FileRules title is hello
2125     FileRules title contains ^hello$
2126     FileRules title regex /^hello$/
2127    
2128     These all need quotes due to the included space character
2129    
2130     FileRules title is "hello there"
2131     FileRules title contains "^hello there$"
2132     FileRules title regex "!^hello there$!"
2133    
2134     These show how the backslash must be doubled inside of quotes.
2135     Swish-e converts a double-backslash into a single backslash, and then
2136     passes that single onto the regular expression compiler.
2137    
2138     FileRules filename regex /\.pdf/
2139     FileRules filename regex "/\\.pdf/"
2140    
2141     FileRules filename regex !hello\\there! # need double for real backslash
2142     FileRules filename regex "!hello\\\\there!" # need double-double inside of quotes
2143    
2144    
2145     B<Matching Types>
2146    
2147     The following types of match strings may be supplied:
2148    
2149     FileRules pathname
2150     FileRules dirname
2151     FileRules filename
2152     FileRules directory
2153     FileRules title
2154    
2155     FileMatch pathname
2156     FileMatch filename
2157     FileMatch dirname
2158     FileMatch directory
2159    
2160     B<pathname> matches the regular expression against the current pathname.
2161     The pathname may or may not be absolute depending on what you supplied
2162     to C<IndexDir>.
2163    
2164     Example:
2165    
2166     # Don't index paths that contain private or hidden
2167     FileRules pathname contains (private|hidden)
2168    
2169     # Same thing
2170     FileRules pathname regex /(private|hidden)/
2171    
2172     # Don't index exe files
2173     FileRules pathname contains \.exe$
2174    
2175     B<dirname> and B<filename> split the path name by the last delimiter
2176     character into a directory name, and a file name. Then these are compared
2177     against the patterns supplied. Directory names do B<not> have a trailing
2178     slash. All path names use the forward slash as a delimiter within Swish-e.
2179    
2180     Example:
2181    
2182     # Same as last example - don't index *.exe files.
2183     FileRules filename contains \.exe$
2184    
2185     # Don't index any files called test.html
2186     FileRules filename contains ^test\.html$
2187    
2188     # Same thing
2189     FileRules filename is test\.html
2190    
2191     # Don't index any directories that contain "old" (/usr/local/myold/docs)
2192     FileRules dirname contains old
2193    
2194     # Don't index any directories that contain the path segment "old" (/usr/local/old/foo)
2195     FileRules dirname contains /old/
2196    
2197     # Index only .htm, .html, plus any all-digit file names
2198     IndexOnly .htm .html
2199     FileMatch filename contains ^[0-9]+$
2200    
2201     # Same as previous, but maybe a little slower
2202     FileRules filename regex not !\.(htm|html)$!
2203     FileMatch filename contains ^[0-9]+$
2204    
2205     Swish-e checks these settings in the order of C<pathname>, C<dirname>, and
2206     C<filename>, and C<FileMatch> patterns are checked before C<FileRules>,
2207     in general. This allows you to exclude most files with C<FileRules>,
2208     yet allow in a few special cases with C<FileMatch>. For example:
2209    
2210     # Exclude all files of .exe, .bin, and .bat
2211     FileRules filename contains \.(exe|bin|bat)$
2212     # But, let these two in
2213     FileMatch filename is baseball\.bat incoming_mail\.bin
2214    
2215     # Same, but as a single pattern
2216     FileMatch filename is (baseball\.bat|incoming_mail\.bin)
2217    
2218     The C<directory> type is somewhat unique. When Swish-e recurses into a
2219     directory it will compare all the I<files> in the directory with the
2220     pattern and then decide if that entire directory should or should not
2221     be indexed (or recursed). Note that you are matching against file names
2222     in a directory -- and some of those names may be directory names.
2223    
2224     A C<FileRules directory> match will cause Swish-e to ignore all files and
2225     sub-directories in the current directory.
2226    
2227     Warning: A match with C<FileMatch directory> says to index B<everything>
2228     in the *current* directory and B<ignore> any FileRules for this directory.
2229    
2230    
2231     Example:
2232    
2233     # Don't index any directories (and sub directories) that contain
2234     # a file (or sub-directory) called "index.skip"
2235     FileRules directory contains ^index\.skip$
2236    
2237     # Don't index directories that contain a .htaccess file.
2238     FileRules directory contains ^\.htaccess
2239    
2240     Note: While I<processing> directories, Swish-e will ignore any files
2241     or directories that begin with a dot ("."). You may index files
2242     or directories that begin with a dot by specifying their name with
2243     C<IndexDir> or C<-i>.
2244    
2245     C<title> checks for a pattern match in an HTML title.
2246    
2247     Example:
2248    
2249     FileRules title contains construction example pointers
2250    
2251     # This example says to ignore case
2252     FileRules title regex "/^Internal document/i"
2253    
2254     Note: C<FileRules title> works for any input method (fs, prog, or http)
2255     that is parsed as HTML, and where a title was found in the document.
2256    
2257     In case all this seems a bit confusing, processing a directory happens
2258     in the following order.
2259    
2260     First the directory name is checked:
2261    
2262     FileRules dirname - reject entire directory if matches
2263    
2264     Next the directory is scanned and each file name (which might be the
2265     name of a sub-directory) is checked:
2266    
2267     FileRules directory - reject entire dir if any files match
2268     FileMatch directory - accept *entire* dir if any files match
2269    
2270     Then, unless C<FileMatch directory> matched, each file is tested with
2271     FileMatch. A match says to index the file without further testing
2272     (i.e. overrides FileRules and IndexOnly):
2273    
2274     FileMatch pathname \
2275     FileMatch dirname - file is accepted if any match
2276     FileMatch filename /
2277    
2278     otherwise
2279    
2280     IndexOnly - file is checked for the correct file extension
2281    
2282     FileRules pathname \
2283     FileRules dirname - file is rejected if any match
2284     FileRules filename /
2285    
2286     finally, the file is indexed.
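The per-file part of this order (FileMatch, then IndexOnly, then
FileRules) can be sketched in Python as follows. This is an
illustration of the decision order only: the function name and argument
layout are hypothetical, the directory-level C<dirname> and
C<directory> checks are left out, and Python's C<re> module is not the
C regex.h library that Swish-e uses, so pattern syntax may differ.

```python
import re


def should_index(path, file_match=(), index_only=(), file_rules=()):
    """Decide whether one file is indexed; rules are (type, pattern) pairs."""
    dirname, _, filename = path.rpartition("/")
    targets = {"pathname": path, "dirname": dirname, "filename": filename}

    # 1. Any FileMatch hit accepts the file without further testing.
    if any(re.search(pat, targets[kind]) for kind, pat in file_match):
        return True
    # 2. IndexOnly: the path must end in one of the listed suffixes.
    if index_only and not path.endswith(tuple(index_only)):
        return False
    # 3. Any FileRules hit rejects the file.
    if any(re.search(pat, targets[kind]) for kind, pat in file_rules):
        return False
    return True


rules = [("filename", r"\.(exe|bin|bat)$")]
match = [("filename", r"^(baseball\.bat|incoming_mail\.bin)$")]
for path in ["docs/baseball.bat", "docs/setup.exe", "docs/readme.html"]:
    print(path, should_index(path, file_match=match, file_rules=rules))
```

The example mirrors the C<baseball.bat> case shown earlier: the
FileMatch pattern lets that one file through even though the FileRules
pattern would otherwise exclude it.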
2287    
2288     Files (not directories) listed with C<IndexDir> or C<-i> are processed
2289     in a similar way:
2290    
2291     FileMatch pathname \
2292     FileMatch dirname - file is accepted if any match
2293     FileMatch filename /
2294    
2295     otherwise, the file is rejected if it doesn't have the correct extension
2296     or a FileRules matches.
2297    
2298     IndexOnly - file is checked for the correct file extension
2299    
2300     FileRules pathname \
2301     FileRules dirname - file is rejected if any match
2302     FileRules filename /
2303    
2304     Note: If things are not indexing as you expect, create a directory
2305     with some test files and use the C<-T regex> trace option to see how
2306     file names are checked. Start with very simple tests!
2307    
2308    
2309     =back
2310    
2311     =head2 Directives for the HTTP Access Method Only
2312    
2313     These directives are available when using the HTTP Access Method of indexing.
2314    
2315     =over 4
2316    
2317     =item MaxDepth *integer*
2318    
2319     MaxDepth defines how many links the spider should follow before stopping.
2320     A value of 0 configures the spider to traverse all links. The default
2321     is MaxDepth 5.
2322    
2323     MaxDepth 5
2324    
2325     =item Delay *seconds*
2326    
2327     The number of seconds to wait between issuing requests to a server.
2328     This setting allows for more friendly spidering of remote sites.
2329     The default is 60 seconds.
2330    
2331     Delay 1
2332    
2333     =item TmpDir *path*
2334    
2335     The location of a writable temp directory on your system. The HTTP
2336     access method tells the Perl helper to place its files in this location,
2337     and the C<-e> switch causes Swish-e to use this directory while indexing.
2338     There is no default.
2339    
2340     TmpDir /tmp/swish
2341    
2342     If this directory does not exist or is not writable Swish-e will fail
2343     with an error during indexing.
2344    
2345     Note, the environment variables of C<TMPDIR>, C<TMP>, and C<TEMP>
2346     (in that order) will B<override> this setting.
2347    
2348     =item SpiderDirectory *path*
2349    
2350     The location of the Perl helper script called F<swishspider>. If you
2351     use a relative directory, it is relative to your directory when you run
2352     Swish-e, not to the directory that Swish-e is in. The default is C<./>
2353    
2354     SpiderDirectory /usr/local/swish
2355    
2356     =item EquivalentServer *server alias*
2357    
2358     Often the same site may be referred to by different names.
2359     A common example is that http://www.some-server.com and
2360     http://some-server.com are the same. Each line should have a list of
2361     all the method/names that should be considered equivalent. Multiple
2362     EquivalentServer directives may be used. Each directive defines its
2363     own set of equivalent servers.
2364    
2365     EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
2366     EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
2367    
2368     =back
2369    
2370     =head2 Directives for the prog Access Method Only
2371    
2372     This section details the directives that are only available for the
2373     "prog" document source feature of Swish-e. The "prog" access method runs
2374     an external program that "feeds" documents to Swish-e. This allows indexing
2375     and filtering of documents from any source.
2376    
2377     See L<prog - general purpose access method|SWISH-RUN/"item_prog"> in
2378     the SWISH-RUN man page for more information.
2379    
2380    
2381     A number of example programs for use with the "prog" access method are
2382     provided in the F<prog-bin> directory. Please see those examples if you
2383     have questions about implementing a "prog" input program.
2384    
2385     =over 4
2386    
2387     =item SwishProgParameters *list of parameters*
2388    
2389     This is a list of parameters that will be sent to the external program
2390     when running with the "prog" document source method.
2391    
2392     SwishProgParameters /path/to/config hello there
2393     IndexDir /path/to/program.pl
2394    
2395     Then running:
2396    
2397     swish-e -c config -S prog
2398    
2399     Swish-e will execute C</path/to/program.pl> and pass C</path/to/config
2400     hello there> as three command line arguments to the program. This
2401     directive makes it easy to pass settings from the Swish-e configuration
2402     file to the external program.
2403    
2404     For example, the C<spider.pl> program (included in the C<prog-bin>
2405     directory) uses the C<SwishProgParameters> to specify what file to read
2406     for configuration information.
2407    
2408     SwishProgParameters spider.config
2409     IndexDir ./spider.pl
2410    
2411     The C<spider.pl> program also has a default action so you can avoid
2412     using a configuration file:
2413    
2414     SwishProgParameters default http://www.swishe.org/ http://some.other.site/
2415     IndexDir ./spider.pl
2416    
2417     And the spider program will use default settings for spidering those sites.
2418    
2419     =back
2420    
2421     B<Notes when using MS Windows>
2422    
2423     You should use unix style path separators to specify your external
2424     program. Swish-e will convert forward slashes to backslashes before
2425     calling the external program. This is only true for the program name
2426     specified with C<IndexDir> or the C<-i> command line option.
2427    
2428     In addition, Swish-e will make sure the program specified actually exists,
2429     which means you need to use the full name of the program.
2430    
2431     For example, to run the perl spider program F<spider.pl> you would need
2432     a Swish-e configuration file such as:
2433    
2434     IndexDir e:/perl/bin/perl.exe
2435     SwishProgParameters prog-bin/spider.pl default http://swish-e.org
2436    
2437     and run indexing with the command:
2438    
2439     swish-e -c swish.cfg -S prog -v 9
2440    
2441     The C<IndexDir> command tells Swish-e the name of the program to run.
2442     Under unix you can just specify the name of the script, since unix will
2443     figure out the program from the first line of the script.
2444    
2445     The C<SwishProgParameters> are the parameters passed to the program
2446     specified by C<IndexDir> (perl.exe in this case). The first parameter
2447     is the perl script to run (F<prog-bin/spider.pl>). Perl passes the rest
2448     of the parameters directly to the perl script. The second parameter
2449     F<default> tells the F<spider.pl> program to use default settings for
2450     spidering (or you could specify a spider config file -- see C<perldoc
2451     spider.pl> for details), and lastly, the URL is passed into the spider
2452     program.
2453    
2454    
2455     =head2 Document Filter Directives
2456    
2457     Internally, Swish-e knows how to parse only text, HTML, and XML documents.
2458     With Swish-e filters you can index other types of documents. For example,
2459     if all your web pages are in gzip format a filter can uncompress these
2460     on the fly for indexing.
2461    
2462     A filter is an external program that Swish-e executes while processing
2463     a document of a given type. Swish-e will execute the filter program
2464     for each file that matches the file suffix (extension) set in the
2465     B<FileFilter> or B<FileFilterMatch> directives. B<FileFilterMatch>
2466     matches using regular expressions and is described below.
2467    
2468     Swish-e calls the external program passing as B<default> arguments:
2469    
2470     =over 4
2471    
2472     =item $0
2473    
2474     the name of the filter program
2475    
2476     =item $1
2477    
2478     the physical path name of the file to read. This may be a temporary
2479     file location if indexing by the http method.
2480    
2481     =item $2
2482    
2483     When indexing under the file system this will be the same as $1 (the
2484     path to the source file), but when indexing under the http method this
2485     will be the URL of the source document.
2486    
2487     =back
2488    
2489     Swish-e can also pass other parameters to the filter program. These
2490     parameters can be defined using the B<FileFilter> or B<FileFilterMatch>
2491     directives. See Filter Options below.
2492    
2493     The filter program must open the file, process its contents, and return
2494     it to Swish-e by printing to STDOUT.
2495    
2496     Note that this can add a significant amount of time to the indexing
2497     process if your external program is a perl or shell script. If you
2498     have many files to filter you should consider writing your filter in C
2499     instead of a shell or perl script, or using the "prog" Access Method.
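
The filter contract described above (read the file named by the first argument, print the result to STDOUT) can be sketched as follows. This is only an illustration, not a filter shipped with Swish-e: the script name F<gunzip_filter.py> and the function C<run_filter> are hypothetical, and the script assumes it is called with the default arguments.

```python
#!/usr/bin/env python3
"""Hypothetical filter sketch: decompress a gzip file and write it to STDOUT."""
import gzip
import shutil
import sys


def run_filter(work_path, out_stream):
    """Copy the decompressed contents of work_path to out_stream (binary)."""
    with gzip.open(work_path, "rb") as src:
        shutil.copyfileobj(src, out_stream)


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the work path ($1), possibly a temporary file.
    # sys.argv[2], if present, is the document path or URL ($2); unused here.
    run_filter(sys.argv[1], sys.stdout.buffer)
```

Such a script could be mapped with a line like C<FileFilter .html.gz /some/path/gunzip_filter.py> (the path is an assumption). As noted above, a script started once per document is slow for large collections.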
2500    
2501     =over 4
2502    
2503     =item FilterDir *path-to-directory*
2504    
2505     This is the path to a directory where the filter programs are stored.
2506     Swish-e looks in this directory to find the filter specified in the
2507     B<FileFilter> directive. If this directive is omitted, you must
2508     specify the full path to the filter script in each B<FileFilter> directive.
2509    
2510     This feature does *not* apply to the C<FileFilterMatch> directive.
2511    
2512     Example:
2513    
2514     FilterDir /usr/local/swish/filters
2515    
2516     =item FileFilter *suffix* "filter-prog" ["filter-options"]
2517    
2518     This maps a file suffix (extension) to a filter program. If I<filter-prog>
2519     starts with a directory delimiter (absolute path), Swish-e doesn't use
2520     the FilterDir settings, but uses the given I<filter-prog> path directly.
2521    
2522     Filter options:
2523    
2524     Filter options are a string passed as arguments to the I<filter-prog>.
2525     Filter options can contain variables, which Swish-e replaces. If you omit
2526     I<filter-options>, Swish-e passes the default arguments described
2527     above.
2528    
2529     Default: "'%p' '%P'"
2530     Which means: pass "workfile path" and "documentfile path" to filter (each quoted).
2531    
2532     Variables in filter options:
2533    
2534     %% = %
2535     %P = Full document pathname (e.g. URL, or path on filesystem)
2536     %p = Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
2537     %F = Filename stripped from full document pathname
2538     %f = Filename stripped from "work" pathname
2539     %D = Directoryname stripped from full document pathname
2540     %d = Directoryname stripped from full "work" pathname
2541    
2542     Examples of strings passed:
2543    
2544     %P = document pathname: http://myserver/path1/mydoc.txt
2545     %p = work pathname: /tmp/tmp.1234.mydoc.txt
2546     %F = mydoc.txt
2547     %f = tmp.1234.mydoc.txt
2548     %D = http://myserver/path1
2549     %d = /tmp
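
The relationship between the variables can be reproduced with a short sketch. It is not Swish-e's implementation; the function names C<filter_variables> and C<expand> are made up for illustration, and the sketch assumes forward slashes as the path delimiter.

```python
import posixpath
import re


def filter_variables(document_path, work_path):
    """Build the substitution table from the two path names."""
    return {
        "%%": "%",
        "%P": document_path,
        "%p": work_path,
        "%F": posixpath.basename(document_path),
        "%f": posixpath.basename(work_path),
        "%D": posixpath.dirname(document_path),
        "%d": posixpath.dirname(work_path),
    }


def expand(options, variables):
    """Replace each known %-variable in a single left-to-right pass."""
    return re.sub(r"%[%PpFfDd]", lambda m: variables[m.group(0)], options)


table = filter_variables("http://myserver/path1/mydoc.txt",
                         "/tmp/tmp.1234.mydoc.txt")
print(expand("'%p' '%P'", table))
```

With the paths from the example above, the default option string expands to the quoted work path followed by the quoted document URL.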
2550    
2551     Important hint for security:
2552    
2553     When using variable substitution, use quotes to ensure filename integrity.
2554    
2555     e.g. "'%f'" --> 'file name with spaces.doc'.
2556    
2557     If you don't use this, your system security may be compromised, or
2558     filtering may not work for these files.
2559    
2560     B<Notes when using MS Windows>
2561    
2562     Windows uses double quotes to escape shell metacharacters, so reverse
2563     the quotes in the examples above. e.g.:
2564    
2565     '"%f"' --> "file name with spaces.doc"
2566    
2567     You can specify the filter program using forward slashes (unix style).
2568     Swish will convert the slashes to backslashes before running your program.
2569    
2570     FileFilter .mydoc c:/some/path/mydocfilter.exe '-d "%d" -example -url "%P" "%f"'
2571    
2572    
2573     Examples of filters:
2574    
2575     FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
2576     FileFilter .pdf pdftotext "'%p' -"
2577     FileFilter .html.gz gzip "-c '%p'"
2578     FileFilter .mydoc "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"
2579    
2580     The above examples are running a I<binary> filter program. For more
2581     complicated filtering needs you may use a scripting language such as
2582     Perl or a shell script. Here are some examples of calling a shell script
2583     and a Perl script:
2584    
2585     FileFilter .pdf pdf2html.sh
2586     FileFilter .ps ghostscript-filter.pl
2587    
2588     Using a scripting language (or any language that has a large startup
2589     cost) can B<greatly increase the indexing time>. For small indexing
2590     jobs, this may not be an issue, but for large collections of files that
2591     require processing by a scripting language, you may be better off using
2592     the C<-S prog> access method where the script will only be compiled once,
2593     instead of for each document.
2594    
2595     Filters are probably easier to write than a C<-S prog> program. Which you
2596     decide to use depends on your requirements. Examples of filter scripts
2597     can be found in the F<filter-bin> directory, and examples of C<-S prog>
2598     programs can be found in the F<prog-bin> directory.
2599    
2600     =item FileFilterMatch *filter-prog* *filter-options* *regex* [*regex* ...]
2601    
2602     This is similar to C<FileFilter> except that it uses regular expressions to
2603     match against the file name. *filter-prog* is the path to the program.
2604     Unlike C<FileFilter> this does B<not> use the C<FilterDir> option.
2605     Also unlike C<FileFilter> you B<must> specify the *filter-options*.
2606    
2607     Examples:
2608    
2609     FileFilterMatch ./pdftotext "'%p' -" /\.pdf$/
2610    
2611     Note that this will also match a file called ".pdf", so you may want to use
2612     something that requires a filename that has more than just an extension.
2613     For example:
2614    
2615     FileFilterMatch ./pdftotext "'%p' -" /.\.pdf$/
2616    
2617     To specify more than one extension:
2618    
2619     FileFilterMatch ./check_title.pl "%p" /\.html$/ /\.htm$/
2620    
2621     Or a few ways to do the same thing:
2622    
2623     FileFilterMatch ./check_title.pl %p /\.(html|htm)$/
2624     FileFilterMatch ./check_title.pl %p /\.html?$/
2625    
2626     And to ignore case:
2627    
2628     FileFilterMatch ./check_title.pl %p /\.html?$/i
2629    
2630     You may also precede an expression with "not" to negate the regular expressions
2631     that follow. For example, to match files that do not have an extension:
2632    
2633     FileFilterMatch ./convert "%p %P" not /\..+$/
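
To check a pattern before indexing, the matching rules used in the examples above can be approximated outside Swish-e. The sketch below uses Python's C<re> module, which is an assumption: Swish-e's own regular expression engine may differ for more complex patterns. The helper name C<matches> is hypothetical, and only a single pattern is tested at a time.

```python
import re


def matches(filename, pattern, ignore_case=False, negate=False):
    """Return True if filename is selected by pattern.

    ignore_case mimics the trailing "i" flag; negate mimics a leading "not".
    """
    flags = re.IGNORECASE if ignore_case else 0
    found = re.search(pattern, filename, flags) is not None
    return found != negate


print(matches(".pdf", r"\.pdf$"))             # bare ".pdf" is matched
print(matches(".pdf", r".\.pdf$"))            # requires a character before the dot
print(matches("INDEX.HTM", r"\.html?$", ignore_case=True))
print(matches("README", r"\..+$", negate=True))
```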
2634    
2635     =back
2636    
2637     =head1 Document Info
2638    
2639     $Id: SWISH-CONFIG.pod,v 1.60 2002/08/28 14:30:23 whmoseley Exp $
2640    
2641     .
