
Contents of /mitgcm.org/devel/buildweb/pkg/swish-e/pod/SWISH-CONFIG.pod



Revision 1.1.1.1 - (show annotations) (download) (vendor branch)
Fri Sep 20 19:47:29 2002 UTC (22 years, 10 months ago) by adcroft
Branch: Import, MAIN
CVS Tags: baseline, HEAD
Changes since 1.1: +0 -0 lines
Importing web-site building process.

1 =head1 NAME
2
3 SWISH-CONFIG - Configuration File Directives
4
5 =head1 Swish-e CONFIGURATION FILE
6
7 What files Swish-e indexes and how they are indexed, and where the index
8 is written can be controlled by a configuration file.
9
10 The configuration file is a text file composed of comments, blank
11 lines, and B<configuration directives>. The order of the directives
12 is not important. Some directives may be used more than once in the
13 configuration file, while others can only be used once (if repeated, a
14 later directive overwrites the preceding one). Case of the directive
15 is not important -- you may use upper, lower, or mixed case.
16
17 A comment is any line that begins with a "#".
18
19 # This is a comment
20
21 Directives may take more than one parameter. Enclose single parameters
22 that include whitespace in quotes (single or double). Inside of quotes
23 the backslash escapes the next character.
24
25 ReplaceRules append "foo bar" <- define "foo bar" as a single parameter
26
27 If you need to include a quote character in the value either use a
28 backslash to escape it, or enclose it in quotes of the other type.
29
30 For example, under unix you can use quotes to include white space in a
31 single parameter. Here, to protect against path names (%p) that might
32 have white space embedded use single quotes (this also protects against
33 shell expansion or metacharacters):
34
35 FileFilter .foo foofilter "'%p'" <- parameter passed through the shell in single quotes
36 FileFilter .foo foofilter '"%p"' <- windows uses double-quotes
37 FileFilter .foo foofilter '\'%p\'' <- silly example
38
39
40 Backslashes also have special meaning in regular expressions.
41
42 FileFilterMatch pdftotext "'%p' -" /\.pdf$/
43
44 This says that the dot is a real dot (instead of matching any character).
45 If you place the regular expression in quotes then you must use
46 double-backslashes.
47
48 FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"
49
50 Swish-e will convert the double backslash into a single backslash before
51 passing the parameter to the regular expression compiler.
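
The quoting rules above can be modeled in a few lines of Python. This is only a sketch of the rules as stated in this section, not Swish-e's actual parser. The function name C<split_directive> is hypothetical, and keeping a backslash literally when it appears outside quotes is an assumption drawn from the unquoted C<FileFilterMatch> example.

```python
def split_directive(line):
    """Split one configuration line into parameters.

    Models only the documented rules: whitespace separates parameters,
    single or double quotes group one parameter, and inside quotes a
    backslash escapes the next character. Outside quotes a backslash is
    kept as-is (assumption based on the unquoted regex example).
    """
    params, current, quote, started = [], [], None, False
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])  # escaped character, taken literally
                i += 1
            elif ch == quote:
                quote = None                 # closing quote ends the quoted part
            else:
                current.append(ch)
        elif ch in "'\"":
            quote, started = ch, True        # opening quote starts a parameter
        elif ch.isspace():
            if started:
                params.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:
        params.append("".join(current))
    return params


if __name__ == "__main__":
    for line in (
        r'''FileFilterMatch pdftotext "'%p' -" /\.pdf$/''',
        r'''FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"''',
    ):
        print(split_directive(line))
```

Under this model the unquoted and the quoted form of the PDF pattern yield the same final parameter, which is what the regular expression compiler would then receive.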
52
53 Commented example configuration files are included in the F<conf>
54 directory of the Swish-e distribution.
55
56 Some command line arguments can override directives specified in the
57 configuration file. Please see also the L<SWISH-RUN|SWISH-RUN> page for
58 instructions on running Swish-e, and the L<SWISH-SEARCH|SWISH-SEARCH>
59 page for information and examples on how to search your index.
60
61 The configuration file is specified to Swish-e by the C<-c> switch.
62 For example,
63
64 swish-e -c myconfig.conf
65
66 You may also split your directives up into different configuration files.
67 This allows you to have a master configuration file used for many
68 different indexes, and smaller configuration files for each separate
69 index. You can specify the different configuration files when running
70 from the command line with the C<-c> switch (see L<SWISH-RUN|SWISH-RUN>),
71 or you may include other configuration files with the B<IncludeConfigFile>
72 directive below.
73
74 Typically, in a configuration file the directives are grouped together in
75 some logical order -- that is, directives that control the source of the
76 documents would be grouped together first, and directives that control
77 how each document is filtered or its words indexed in another group of
78 directives. (The directives listed below are grouped in this order).
79
80 The configuration file directives are listed below in these groups:
81
82 =over 4
83
84 =item *
85
86 L<Administrative Headers Directives|/"Administrative Headers Directives">
87 -- You may add administrative information to the header of the index file.
88
89 =item *
90
91 L<Document Source Directives|/"Document Source Directives"> -- Directives
92 for selecting the source documents and the location of the index file.
93
94 =item *
95
96 L<Document Contents Directives|/"Document Contents Directives"> --
97 Directives that control how a document's content is indexed.
98
99 =item *
100
101 L<Directives for the File Access method only|/"Directives for the File
102 Access method only"> -- These directives are only applicable to the File
103 Access indexing method.
104
105 =item *
106
107 L<Directives for the HTTP Access Method Only|/"Directives for the HTTP
108 Access Method Only"> -- Likewise, these only apply to the HTTP Access
109 method.
110
111 =item *
112
113 L<Directives for the prog Access Method Only|/"Directives for the prog
114 Access Method Only"> -- These only apply to the prog Access method.
115
116 =item *
117
118 L<Document Filter Directives|/"Document Filter Directives"> -- This is
119 a special section that describes using document filters with Swish-e.
120
121 =back
122
123 =head2 Alphabetical Listing of Directives
124
125 =over 4
126
127 =item *
128
129 L<AbsoluteLinks|/"item_AbsoluteLinks"> [yes|NO]
130
131 =item *
132
133 L<BeginCharacters|/"item_BeginCharacters"> *string of characters*
134
135 =item *
136
137 L<BumpPositionCounterCharacters|/"item_BumpPositionCounterCharacters"> *string*
138
139 =item *
140
141 L<Buzzwords|/"item_Buzzwords"> [*list of buzzwords*|File: path]
142
143
144 =item *
145
146 L<ConvertHTMLEntities|/"item_ConvertHTMLEntities"> [YES|no]
147
148 =item *
149
150 L<DefaultContents|/"item_DefaultContents"> [TXT|HTML|XML|WML]
151
152 =item *
153
154 L<Delay|/"item_Delay"> *seconds*
155
156 =item *
157
158 L<DontBumpPositionOnEndTags|/"item_DontBumpPositionOnEndTags"> *list of names*
159
160 =item *
161
162 L<DontBumpPositionOnStartTags|/"item_DontBumpPositionOnStartTags"> *list of names*
163
164 =item *
165
166 L<EnableAltSearchSyntax|/"item_EnableAltSearchSyntax"> [yes|NO]
167
168 =item *
169
170 L<EndCharacters|/"item_EndCharacters"> *string of characters*
171
172 =item *
173
174 L<EquivalentServer|/"item_EquivalentServer"> *server alias*
175
176 =item *
177
178 L<ExtractPath|/"item_ExtractPath"> *metaname* [replace|remove|prepend|append|regex]
179
180 =item *
181
182 L<FileFilter|/"item_FileFilter"> *suffix* *program* [options]
183
184 =item *
185
186 L<FileFilterMatch|/"item_FileFilterMatch"> *program* *options* *regex* [*regex* ...]
187
188 =item *
189
190 L<FileInfoCompression|/"item_FileInfoCompression"> [yes|NO]
191
192 =item *
193
194 L<FileMatch|/"item_FileMatch"> [contains|is|regex] *regular expression*
195
196 =item *
197
198 L<FileRules|/"item_FileRules"> [contains|is|regex] *regular expression*
199
200 =item *
201
202 L<FuzzyIndexingMode|/"item_FuzzyIndexingMode"> [NONE|Stemming|Soundex|Metaphone|DoubleMetaphone]
203
204 =item *
205
206 L<FollowSymLinks|/"item_FollowSymLinks"> [yes|NO]
207
208 =item *
209
210 L<HTMLLinksMetaName|/"item_HTMLLinksMetaName"> *metaname*
211
212 =item *
213
214 L<IgnoreFirstChar|/"item_IgnoreFirstChar"> *string of characters*
215
216 =item *
217
218 L<IgnoreLastChar|/"item_IgnoreLastChar"> *string of characters*
219
220 =item *
221
222 L<IgnoreLimit|/"item_IgnoreLimit"> *integer integer*
223
224 =item *
225
226 L<IgnoreMetaTags|/"item_IgnoreMetaTags"> *list of names*
227
228 =item *
229
230 L<IgnoreNumberChars|/"item_IgnoreNumberChars"> *list of characters*
231
232 =item *
233
234 L<IgnoreTotalWordCountWhenRanking|/"item_IgnoreTotalWordCountWhenRanking"> [YES|no]
235
236 =item *
237
238 L<IgnoreWords|/"item_IgnoreWords"> [*list of stop words*|File: path]
239
240 =item *
241
242 L<ImageLinksMetaName|/"item_ImageLinksMetaName"> *metaname*
243
244 =item *
245
246 L<IncludeConfigFile|/"item_IncludeConfigFile">
247
248 =item *
249
250 L<IndexAdmin|/"item_IndexAdmin"> *text*
251
252 =item *
253
254 L<IndexAltTagMetaName|/"item_IndexAltTagMetaName"> *tagname*|as-text
255
256 =item *
257
258 L<IndexComments|/"item_IndexComments"> [YES|no]
259
260 =item *
261
262 L<IndexContents|/"item_IndexContents"> [TXT|HTML|XML|WML|TXT2|HTML2|XML2] *file extensions*
263
264 =item *
265
266 L<IndexDescription|/"item_IndexDescription"> *text*
267
268 =item *
269
270 L<IndexDir|/"item_IndexDir"> [URL|directories or files]
271
272 =item *
273
274 L<IndexFile|/"item_IndexFile"> *path*
275
276 =item *
277
278 L<IndexName|/"item_IndexName"> *text*
279
280 =item *
281
282 L<IndexOnly|/"item_IndexOnly"> *list of file suffixes*
283
284 =item *
285
286 L<IndexPointer|/"item_IndexPointer"> *text*
287
288 =item *
289
290 L<IndexReport|/"item_IndexReport"> [0|1|2|3]
291
292 =item *
293
294 L<MaxDepth|/"item_MaxDepth"> *integer*
295
296 =item *
297
298 L<MaxWordLimit|/"item_MaxWordLimit"> *integer*
299
300 =item *
301
302 L<MetaNameAlias|/"item_MetaNameAlias"> *meta name* *list of aliases*
303
304 =item *
305
306 L<MetaNames|/"item_MetaNames"> *list of names*
307
308 =item *
309
310 L<MinWordLimit|/"item_MinWordLimit"> *integer*
311
312 =item *
313
314 L<NoContents|/"item_NoContents"> *list of file suffixes*
315
316 =item *
317
318 L<obeyRobotsNoIndex|/"item_obeyRobotsNoIndex"> [yes|NO]
319
320 =item *
321
322 L<ParserWarnLevel|/"item_ParserWarnLevel"> [0|1|2|3]
323
324 =item *
325
326 L<PreSortedIndex|/"item_PreSortedIndex"> *list of property names*
327
328 =item *
329
330 L<PropCompressionLevel|/"item_PropCompressionLevel"> [0-9]
331
332 =item *
333
334 L<PropertyNameAlias|/"item_PropertyNameAlias"> *property name* *list of aliases*
335
336 =item *
337
338 L<PropertyNames|/"item_PropertyNames"> *list of meta names*
339
340 =item *
341
342 L<PropertyNamesCompareCase|/"item_PropertyNamesCompareCase"> *list of meta names*
343
344 =item *
345
346 L<PropertyNamesIgnoreCase|/"item_PropertyNamesIgnoreCase"> *list of meta names*
347
348 =item *
349
350 L<PropertyNamesDate|/"item_PropertyNamesDate"> *list of meta names*
351
352 =item *
353
354 L<PropertyNamesNumeric|/"item_PropertyNamesNumeric"> *list of meta names*
355
356 =item *
357
358 L<PropertyNamesMaxLength|/"item_PropertyNamesMaxLength"> integer *list of meta names*
359
360 =item *
361
362 L<ReplaceRules|/"item_ReplaceRules"> [replace|remove|prepend|append|regex]
363
364 =item *
365
366 L<ResultExtFormatName|/"item_ResultExtFormatName"> name -x format string
367
368 =item *
369
370 L<SpiderDirectory|/"item_SpiderDirectory"> *path*
371
372 =item *
373
374 L<StoreDescription|/"item_StoreDescription"> [XML <tag>|HTML <meta>|TXT size]
375
376 =item *
377
378 L<SwishProgParameters|/"item_SwishProgParameters"> *list of parameters*
379
380 =item *
381
382 L<SwishSearchDefaultRule|/"item_SwishSearchDefaultRule"> [<AND-WORD>|<or-word>]
383
384 =item *
385
386 L<SwishSearchOperators|/"item_SwishSearchOperators"> <and-word> <or-word> <not-word>
387
388 =item *
389
390 L<TmpDir|/"item_TmpDir"> *path*
391
392 =item *
393
394 L<TranslateCharacters|/"item_TranslateCharacters"> [*string1 string2*|:ascii7:]
395
396 =item *
397
398 L<TruncateDocSize|/"item_TruncateDocSize">
399 *number of characters*
400
401 =item *
402
403 L<UndefinedMetaTags|/"item_UndefinedMetaTags"> [error|ignore|INDEX|auto]
404
405 =item *
406
407 L<UndefinedXMLAttributes|/"item_UndefinedXMLAttributes"> [DISABLE| error|ignore|index|auto]
408
409 =item *
410
411 L<UseStemming|/"item_UseStemming"> [yes|NO]
412
413 =item *
414
415 L<UseSoundex|/"item_UseSoundex"> [yes|NO]
416
417 =item *
418
419 L<UseWords|/"item_UseWords"> [*list of words*|File: path]
420
421 =item *
422
423 L<WordCharacters|/"item_WordCharacters"> *string of characters*
424
425 =item *
426
427 L<XMLClassAttributes|/"item_XMLClassAttributes"> *list of XML attribute names*
428
429 =back
430
431 =head2 Directives that Control Swish
432
433 These configuration directives control the general behavior of Swish-e.
434
435 =over 4
436
437 =item IncludeConfigFile *path to config file*
438
439 This directive can be used to include configuration directives located
440 in another file.
441
442 IncludeConfigFile /usr/local/swish/conf/site_config.config
443
444 =item IndexReport [0|1|2|3]
445
446 This sets how detailed the reporting is while indexing. You can specify
447 numbers 0 to 3. 0 is totally silent, 3 is the most verbose. The default
448 is 1.
449
450 This may be overridden from the command line via the C<-v> switch (see
451 L<SWISH-RUN|SWISH-RUN>).
452
453 =item ParserWarnLevel [0|1|2|3]
454
455 Sets the error level when using the libxml2 parser for XML and HTML.
456 libxml2 will point out structural errors in your documents.
457
458 0 = no report
459 1 = fatal errors
460 2 = errors
461 3 = warnings
462
463 The exception is that UTF-8 to Latin-1 conversion errors are reported at
464 level 1. This is because words may be indexed incorrectly in these cases.
465
466 Note that unlike other errors generated by Swish-e, these errors are
467 sent to stderr.
468
469 =item IndexFile *path*
470
471 IndexFile specifies the location of the generated index file. If not
472 specified, Swish-e will create the file F<index.swish-e> in the current
473 directory.
474
475 IndexFile /usr/local/swish/site.index
476
477 =item obeyRobotsNoIndex [yes|NO]
478
479 When enabled, Swish-e will not index any HTML file that contains:
480
481 <meta name="robots" content="noindex">
482
483 The default is to ignore these meta tags and index the document.
484 This tag is described at http://www.robotstxt.org/wc/exclusion.html.
485
486 Note: This feature is only available with the libxml2 HTML parser.
487
488 Also, if you are using the libxml2 parser (HTML2 and XML2) then you can use the following
489 comments in your documents to prevent indexing:
490
491 <!-- SwishCommand noindex -->
492 <!-- SwishCommand index -->
493
494 and/or these may be used also:
495
496 <!-- noindex -->
497 <!-- index -->
498
499 For example, these are very helpful to prevent indexing of common headers, footers, and menus.
500
501
502 =back
503
504 B<NOTE>: The following items are currently not available. These items
505 require Swish-e to parse the configuration file while searching.
506
507
508 =over 4
509
510 =item EnableAltSearchSyntax [yes|NO]
511
512 B<NOTE>: This item is currently not available.
513
514 Enable alternate search syntax. Allows the usage of a basic
515 "Altavista(c)", "Lycos(c)", etc. like search syntax. This means a search
516 query can contain "+" and "-" as syntax parameter.
517
518 Example:
519
520 swish-e -w "+word1 +word2 -word3 word4 word5"
521 "+" = following word has to be in all found documents
522 "-" = following word may not be in any document found
523 " " = following word will be searched in documents
524
525 =item SwishSearchOperators <and-word> <or-word> <not-word>
526
527 B<NOTE>: This item is currently not available.
528
529 Using this config directive you can change the boolean search operators of
530 Swish-e, e.g. to adapt these to your language.
531 The default is: AND OR NOT
532
533 Example (german):
534
535 SwishSearchOperators UND ODER NICHT
536
537 =item SwishSearchDefaultRule [<AND-WORD>|<or-word>]
538
539 B<NOTE>: This item is currently not available.
540
541 C<SwishSearchDefaultRule> defines the default Boolean operator to use
542 if none is specified between words or phrases. The default is C<AND>.
543
544 The word you specify must match one of the available
545 C<SwishSearchOperators>.
546
547 Example:
548
549 SwishSearchOperators UND ODER NICHT
550 # Make it act like a web search engine
551 SwishSearchDefaultRule ODER
552
553 =item ResultExtFormatName name -x format string
554
555 B<NOTE>: This item is currently not available.
556
557 The output of Swish-e can be defined by specifying a format string with
558 the C<-x> command line argument. Using C<ResultExtFormatName> you can
559 assign a predefined format string to a name.
560
561 Examples:
562
563 ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"
564
565 Then when searching you can specify the format string's name:
566
567 swish-e ... -x moreinfo ...
568
569 See the C<-x> switch in L<SWISH-RUN|SWISH-RUN> for more information
570 about output formats.
571
572 =back
573
574
575 =head2 Administrative Headers Directives
576
577 Swish-e stores configuration information in the header of the index file.
578 This information can be retrieved while searching or by functions in
579 the Swish-e C library. There are a number of fields available for your
580 own use. None of these fields are required:
581
582 =over 4
583
584 =item IndexName *text*
585
586 =item IndexDescription *text*
587
588 =item IndexPointer *text*
589
590 =item IndexAdmin *text*
591
592 These variables specify information that goes into index files to help
593 users and administrators. IndexName should be the name of your index,
594 like a book title. IndexDescription is a short description of the index
595 or a URL pointing to a more full description. IndexPointer should be
596 a pointer to the original information, most likely a URL. IndexAdmin
597 should be the name of the index maintainer and can include name and email
598 information. These values should not be more than 70 or so characters
599 and should be contained in quotes. Note that the automatically generated
600 date in index files is in D/M/Y and 24-hour format.
601
602 Examples:
603
604 IndexName "Linux Documentation"
605 IndexDescription "This is an index of /usr/doc on our Linux machine."
606 IndexPointer http://localhost/swish/linux/index.html
607 IndexAdmin webmaster
608
609
610 =back
611
612 =head2 Document Source Directives
613
614 These directives control I<what> documents are indexed and I<how>
615 they are accessed. See also L<Directives for the File Access method
616 only|/"Directives for the File Access method only"> and L<Directives for
617 the HTTP Access Method Only|/"Directives for the HTTP Access Method Only">
618 for directives that are specific to those access methods.
619
620
621 =over 4
622
623 =item IndexDir [directories or files|URL|external program]
624
625 IndexDir defines the source of the documents for Swish-e. Swish-e
626 currently supports three file access methods: B<File system>, B<HTTP>
627 (also called B<spidering>), and B<prog> for reading files from an
628 external program.
629
630 The C<-S> command line argument is used to select the file access method.
631
632 swish-e -c swish.config -S fs - file system
633 swish-e -c swish.config -S http - internal http spider
634 swish-e -c swish.config -S prog - external program of any type
635
636 For the B<fs> method of access B<IndexDir> is a space-separated
637 list of files and directories to index. Use a forward slash as the path
638 separator in MS Windows.
639
640 For the B<http> method the B<IndexDir> setting is a list of space-separated
641 URLs.
642
643 For the B<prog> method the B<IndexDir> setting is a list of space-separated
644 programs to run (which generate documents for swish to index).
645
646 You may specify more than one B<IndexDir> directive.
647
648 Any sub-directories of any listed directory will also be indexed.
649
650 Note: While I<processing> directories, Swish-e will ignore any files
651 or directories that begin with a dot ("."). You may index files
652 or directories that begin with a dot by specifying their name with
653 C<IndexDir> or C<-i>.
654
655 Examples:
656
657 # Index this directory and any subdirectories
658 IndexDir /usr/local/home/http
659
660 # Index the docs directory in current directory
661 IndexDir ./docs
662
663 # Index these files in the current directory
664 IndexDir ./index.html ./page1.html ./page2.html
665 # and index this directory, too
666 IndexDir ../public_html
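
The dot-file note above can be illustrated with a short Python sketch. It is a rough model of the stated rule only (names beginning with a dot are skipped while walking a directory, but not when given explicitly), not Swish-e's own traversal code. The function name C<files_to_index> is hypothetical, and directives such as C<IndexOnly>, C<FileRules> and C<FollowSymLinks> are deliberately left out.

```python
import os


def files_to_index(index_dirs):
    """List the files a file-system run would visit for the given IndexDir values."""
    found = []
    for start in index_dirs:
        if os.path.isfile(start):
            # A file named explicitly is taken even if it begins with a dot.
            found.append(start)
            continue
        for root, dirs, files in os.walk(start):
            # Prune dot-directories in place so os.walk does not descend into them.
            dirs[:] = sorted(d for d in dirs if not d.startswith("."))
            for name in sorted(files):
                if not name.startswith("."):
                    found.append(os.path.join(root, name))
    return found


if __name__ == "__main__":
    print(files_to_index(["./docs"]))
```

Because only names met I<while walking> are filtered, passing a dot-directory itself as a starting point still lists its files, matching the note that such names can be indexed by specifying them directly.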
667
668 For the B<HTTP> method of access specify the URLs from which
669 you want the spidering to begin.
670
671 Example:
672
673 IndexDir http://www.my-site.com/index.html
674 IndexDir http://localhost/index.html
675
676 Obviously, using the B<HTTP> method to index is B<much> slower than
677 indexing local files. Be well aware that some sites do not appreciate
678 spidering and may block your IP address. You may wish to contact the
679 remote site before spidering their web site. More information about
680 spidering can be found in L<Directives for the HTTP Access Method
681 Only|/"Directives for the HTTP Access Method Only"> below.
682
683 For the L<prog|SWISH-RUN/"item_prog"> method of access B<IndexDir>
684 specifies the path to the program(s) to execute. The external program
685 must correctly format the documents being passed back to Swish-e.
686 Examples of external programs are provided in the F<prog-bin> directory.
687
688 IndexDir ./myprogram.pl
689
690 See L<prog|SWISH-RUN/"item_prog"> for details.
691
692
693 Note: Not all directives work with all methods.
694
695 =item NoContents *list of file suffixes*
696
697 Files with these suffixes will B<not> have their contents indexed.
698
699 If the file's type is HTML (as set by C<IndexContents> or
700 C<DefaultContents>) then the file will be parsed for a HTML title and
701 that title will be indexed. Note that you must set the file's type:
702 C<.html> and C<.htm> are NOT type HTML by default.
703
704 If a title is found, it will still be checked for C<FileRules title>,
705 and the file will be skipped if a match is found. See C<FileRules>.
706
707 If the file's type is not HTML, or it is HTML and no title is found,
708 then the file's path will be indexed. For example, you might wish to
709 search for image files by file name.
710
711 Example:
712
713 NoContents .gif .xbm .au .mov .mpg .pdf .ps
714
715 Note: Using this directive will not cause files with those suffixes
716 to be indexed. That is, if you use C<IndexOnly> to limit the types of
717 files that are indexed, then you must specify in C<IndexOnly> the same
718 suffixes listed in C<NoContents>.
719
720 A C<-S prog> program may set the C<No-Contents:> header (to anything)
721 to enable this feature for a specific document (although it would be
722 smarter for the C<-S prog> program to simply send only the pathname or
723 title to be indexed).
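
The choice between title and path described for C<NoContents> can be summarized as a small decision function. This is a sketch of the rule as written above, with a hypothetical function name; the C<FileRules title> check is not modeled, and treating only the type string "HTML" as HTML is an assumption.

```python
def no_contents_text(path, doc_type, title=None):
    """Return the text that stands in for a file whose suffix is in NoContents.

    doc_type is the parser type assigned by IndexContents or DefaultContents.
    """
    if doc_type == "HTML" and title:
        return title   # an HTML title was found, so the title is indexed
    return path        # otherwise the file's path is indexed


if __name__ == "__main__":
    print(no_contents_text("images/beach.gif", "TXT"))
    print(no_contents_text("about.html", "HTML", "About Us"))
```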
724
725 =item ReplaceRules [replace|remove|prepend|append|regex]
726
727 ReplaceRules allows you to make changes to file pathnames before
728 they're indexed. These changed file names or URLs will be returned in
729 search results.
730
731 For example, you may index your files locally (with the File system
732 indexing method), yet return a URL in search results. This directive can
733 be used to map the file names to their respective URLs on your web server.
734
735 There are five operations you can specify: B<replace>, B<append>,
736 B<remove>, B<prepend>, and B<regex>. They are applied to the pathname in
737 the order in which they appear in the configuration file.
738
739 This directive uses C library regex.h regular expressions.
740
741 replace "the string you want replaced" "what to change it to"
742 remove "a string to remove"
743 prepend "a string to add before the result"
744 append "a string to add after the result"
745 regex "/search string/replace string/options"
746
747 Remember, quotes are needed if an expression contains white space,
748 and backslashes have special meaning.
749
750 Regex is an Extended Regular Expression. The first character found is
751 the delimiter (but it's not smart enough to use matched chars such as [],
752 (), and {}).
753
754 The B<replace> string may use substitution variables:
755
756 $0 the entire matched (sub)string
757 $1-$9 returns patterns captured in "(" ")" pairs
758 $` the string before the matched pattern
759 $' the string after the matched pattern
760
761 The B<options> change the behavior of expression:
762
763 i ignore the case when matching
764 g repeat the substitution for the entire pattern
765
766 Examples:
767
768 ReplaceRules replace testdir/ anotherdir/
769 ReplaceRules replace [a-z_0-9]*_m.*\.html index.html
770
771 ReplaceRules remove testdir/
772
773 ReplaceRules prepend http://localhost/
774 ReplaceRules append .html
775
776 ReplaceRules regex !^/web/([^/]+)/!http://$1.domain.com/!
777 replaces a file path:
778 /web/search/foo/index.html
779 with
780 http://search.domain.com/foo/index.html
781
782 ReplaceRules regex #^#http://localhost/www#
783 ReplaceRules prepend http://localhost/www (same thing)
784
785 # Remove all extensions from C source files
786 ReplaceRules remove .c # ERROR! That "." is *any char*
787 ReplaceRules remove \.c # much better...
788
789 ReplaceRules remove "\\.c" # if in quotes you need double-backslash!
790 ReplaceRules remove "\.c" # ERROR! "\." -> "." and is *any char*
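
The B<regex> operation can be approximated in Python to see what a rule does to a path before indexing. This is a sketch under stated assumptions: Python's C<re> module stands in for the C library's regex.h (the two differ in details), the function name C<apply_regex_rule> is hypothetical, and only C<$0> to C<$9> are expanded (the C<$`> and C<$'> variables are not modeled). The capture uses C<[^/]+> so that it stops at the first slash; a greedy C<.+> would also swallow the following path component.

```python
import re


def apply_regex_rule(path, rule):
    """Apply one rule of the form <d>search<d>replace<d>options to a path.

    The first character of the rule is the delimiter. Option "i" ignores
    case and option "g" repeats the substitution; without "g" only the
    first match is replaced.
    """
    delim = rule[0]
    pattern, replacement, options = rule[1:].split(delim)
    flags = re.IGNORECASE if "i" in options else 0
    count = 0 if "g" in options else 1   # count=0 means "replace all" in re.sub

    def expand(match):
        # Replace $0..$9 in the replacement text with the matched groups.
        return re.sub(
            r"\$(\d)",
            lambda var: match.group(int(var.group(1))) or "",
            replacement,
        )

    return re.sub(pattern, expand, path, count=count, flags=flags)


if __name__ == "__main__":
    print(apply_regex_rule("/web/search/foo/index.html",
                           "!^/web/([^/]+)/!http://$1.domain.com/!"))
    print(apply_regex_rule("/docs/a.html", "#^#http://localhost/www#"))
```

The sketch assumes the delimiter does not occur inside the search or replacement text, which is why the examples pick C<!> and C<#> for rules that contain slashes.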
791
792
793 =item IndexContents [TXT|HTML|XML|WML|TXT2|HTML2|XML2] *file extensions*
794
795 The C<IndexContents> directive assigns one of Swish-e's document parsers
796 to a document, based on its extension. Swish-e currently knows how
797 to parse TXT, HTML, and XML documents.
798
799 The XML2, HTML2, and TXT2 parsers are currently only available when
800 Swish-e is configured to use libxml2.
801
802 Documents that are not assigned a parser with C<IndexContents> will, by
803 default, use the HTML parser. The C<DefaultContents> directive may be
804 used to assign a parser to documents that do not match a file extension
805 defined with the C<IndexContents> directive.
806
807 Example:
808
809 IndexContents HTML .htm .html .shtml
810 IndexContents TXT .txt .log .text
811 IndexContents XML .xml
812
813 HTML is the default type for all files, unless otherwise specified
814 (and this default can be changed by the B<DefaultContents> directive).
815 Swish-e parses titles from HTML files, if available, and keeps track
816 of the context of the text for context searching (see C<-t> in
817 L<SWISH-RUN|SWISH-RUN>). HTML and XML files use different tag formats
818 for B<MetaNames> and B<PropertyNames>.
819
820 If using filters to convert documents you should include those extensions,
821 too. For example, if using a filter to convert .pdf to .html, you need
822 to tell Swish-e that .pdf should be indexed by the internal HTML parser:
823
824 FileFilter .pdf pdf2html
825 IndexContents HTML .pdf
826
827 See also L<Document Filter Directives|/"Document Filter Directives">.
828
829 B<Note:> Some of this may be changed in the future to use content-types
830 instead of file extensions. See L<SWISH-3.0|SWISH-3.0>.
831
832 =item DefaultContents [TXT|HTML|XML|WML|TXT2|HTML2|XML2]
833
834 This sets the default parser for documents that are not specified in
835 B<IndexContents>. If not specified the default is HTML.
836
837 The XML2, HTML2, and TXT2 parsers are currently only available when
838 Swish-e is configured to use libxml2.
839
840
841 Example:
842
843 DefaultContents HTML
844
845 The C<DefaultContents> directive I<should> be used when spidering,
846 as HTML files may be returned without a file extension (such as when
847 requesting a directory and the default index.html is returned).
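
How C<IndexContents> and C<DefaultContents> combine can be pictured as a lookup with a fallback. The following Python sketch is illustrative only: the function name C<choose_parser> is hypothetical, and matching by "path ends with the suffix" (case-sensitive) is an assumption about how extensions are compared.

```python
def choose_parser(path, index_contents, default_contents="HTML"):
    """Pick a parser name for a document path or URL.

    index_contents is a list of (parser, suffixes) pairs, one per
    IndexContents directive; default_contents models DefaultContents.
    """
    for parser, suffixes in index_contents:
        if path.endswith(tuple(suffixes)):
            return parser
    return default_contents   # no suffix matched, fall back to the default


# Mirrors the IndexContents example given for that directive.
INDEX_CONTENTS = [
    ("HTML", [".htm", ".html", ".shtml"]),
    ("TXT", [".txt", ".log", ".text"]),
    ("XML", [".xml"]),
]

if __name__ == "__main__":
    for name in ("notes.txt", "feed.xml", "http://localhost/docs/"):
        print(name, "->", choose_parser(name, INDEX_CONTENTS))
```

The last case is the spidering situation mentioned above: a URL that ends in a directory has no extension to match, so the result depends entirely on the default.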
848
849
850 =item FileInfoCompression [yes|NO]
851
852 ** This directive is currently not supported **
853
854 Setting B<FileInfoCompression> to C<yes> will compress the index file to
855 save disk space. This may result in longer indexing times. The default
856 is C<no>.
857
858 Also see the C<-e> switch in L<SWISH-RUN|SWISH-RUN> for saving RAM
859 during indexing.
860
861
862 =back
863
864 =head2 Document Contents Directives
865
866 These directives control what information is extracted from your source
867 documents, and how that information is made available during searching.
868
869 =over 4
870
871 =item ConvertHTMLEntities [YES|no]
872
873 ASCII I<entities> can be converted automatically while indexing documents
874 of type HTML. For performance reasons you may wish to set this to C<no>
875 if your documents do not contain HTML entities. The default is C<yes>.
876
877 If C<ConvertHTMLEntities> is set C<no> the entities will be indexed
878 without conversion.
879
880 B<NOTE:> Entities within XML files and files parsed with libxml2 are
881 converted regardless of this setting.
882
883 =item MetaNames *list of names*
884
885 META names are a way to define "fields" in your XML and HTML documents.
886 You can use the META names in your queries to limit the search to just
887 the words contained in that META name of your document. For example,
888 you might have a META tagged field in your documents called C<subjects>
889 and then you can search your documents for the word "foo" but only return
890 documents where "foo" is within the C<subjects> META tag.
891
892 swish-e -w subjects=foo
893
894 (See also the C<-t> switch in L<SWISH-RUN|SWISH-RUN> for information
895 about I<context> searching in HTML documents.)
896
897 The B<MetaNames> directive is a space separated list. For example:
898
899 MetaNames meta1 meta2 keywords subjects
900
901 You may also use L<UndefinedMetaTags|/"item_UndefinedMetaTags"> to specify
902 automatic extraction of meta names from your HTML and XML documents,
903 and also to ignore indexing content of meta tags.
904
905 META tags can have two formats in your B<HTML> source documents:
906
907 <META NAME="meta1" CONTENT="some content">
908
909 and (if using the HTML2/libxml2 parser)
910
911 <meta1>
912 some content
913 </meta1>
914
915 But this second version is invalid HTML, and will generate a warning if
916 ParserWarnLevel is set (libxml2 only).
917
918 And in B<XML> documents, use the format:
919
920 <meta1>
921 Some Content
922 </meta1>
923
924 Then you can limit your search to just META B<meta1> like this:
925
926 swish-e -w 'meta1=(apples or oranges)'
927
928 You may nest the XML and the start/end tag versions:
929
930 <keywords>
931 <tag1>
932 some content
933 </tag1>
934 <tag2>
935 some other content
936 </tag2>
937 </keywords>
938
939 Then you can search in both tag1 and tag2 with:
940
941 swish-e -w 'keywords=(query words)'
942
943 Swish-e indexes all text as some metaname. The default is
944 C<swishdefault>, so these two queries are the same:
945
946 swish-e -w foo
947 swish-e -w swishdefault=foo
948
949 When indexing HTML Swish-e indexes the HTML title as default text, so
950 when searching Swish-e will find matches in both the HTML body and the
951 HTML title. Swish also, by default, indexes content of meta tags. So:
952
953 swish-e -w foo
954
955 will find "foo" in the body, the title, or any meta tags.
956
957 Currently, there's no way to prevent Swish-e from indexing
958 the title contents along with the body contents, but see
959 L<UndefinedMetaTags|/"item_UndefinedMetaTags"> for how to control the
960 indexing of meta tags.
961
962 If you would like to search just the title text, you may use:
963
964 MetaNames swishtitle
965
966 This will index the title text separately under the built-in swish
967 internal meta name "swishtitle". You may then search like
968
969 swish-e -w foo -- search for "foo" in title, body (and undefined meta tags)
970 swish-e -w swishtitle=foo -- search for "foo" in title only
971
972 In addition to swishtitle, you can limit searches to documents' path with:
973
974 MetaNames swishdocpath
975
976 Then to search for "foo" but also limit searches to documents that include
977 "manual" or "tutorial" in their path:
978
979 swish-e -w 'foo swishdocpath=(manual or tutorial)'
980
981 See also L<ExtractPath|/"item_ExtractPath">.
982
983
984 =item MetaNameAlias *meta name* *list of aliases*
985
986 MetaNameAlias assigns aliases for a meta name. For example, if your
987 documents contain meta tags "description", "summary", and "overview"
988 that all give a summary of your documents you could do this:
989
990 MetaNames summary
991 MetaNameAlias summary description overview
992
993 Then all three tags will get indexed as meta tag "summary". You can
994 then search all the fields as:
995
996 -w summary=foo
997
998 The aliases work at search time, too, so these will also limit the search
999 to the "summary" meta name:
1000
1001 -w description=foo
1002 -w overview=foo
1003
1004 =item MetaNamesRank integer *list of meta names*
1005
1006 * Not implemented yet *
1007
1008 You can assign a bias to metanames that will affect how ranking is
1009 calculated. The range of values is from -10 to +10, with zero being
1010 no bias.
1011
1012 MetaNamesRank 4 subject
1013 MetaNamesRank 3 swishdefault
1014 MetaNamesRank 2 author publisher
1015 MetaNamesRank -5 wrongwords
1016
1017 This feature is not implemented yet
1018
1019 =item HTMLLinksMetaName *metaname*
1020
1021 Allows indexing of HTML links. Normally, HTML links (href attributes) are
1022 not indexed by Swish-e. This directive defines a metaname, and links
1023 will be indexed under this meta name.
1024
1025 Example:
1026
1027 HTMLLinksMetaName links
1028
1029 Now, to limit searches to files with a link to "home.html" do this:
1030
1031 -w links='"home.html"'
1032
1033 The double quotes force a phrase search.
1034
1035 To make Swish-e index links as normal text, you may use:
1036
1037 HTMLLinksMetaName swishdefault
1038
1039 This feature is only available with the libxml2 HTML parser.
1040
1041 =item ImageLinksMetaName *metaname*
1042
1043 Allows indexing of image links under a metaname. Normally, image URLs
1044 are not indexed.
1045
1046 Example:
1047
1048 ImageLinksMetaName images
1049
1050 Now, if you would like to find pages that include a nice image of a beach:
1051
1052 -w images='beach'
1053
1054 To make Swish-e index image links as normal text, you may use:
1055
1056 ImageLinksMetaName swishdefault
1057
1058 This feature is only available with the libxml2 HTML parser.
1059
1060
1061 =item IndexAltTagMetaName *tagname*|as-text
1062
1063 Allows indexing of the ALT text of <IMG> tags. Specify either a tag name which will be
1064 used as a metaname, or the special text "as-text" which says to index the ALT text as
1065 if it were plain text at the current location.
1066
1067 For example, by specifying a tag name:
1068
1069 IndexAltTagMetaName bar
1070
1071 would make this markup:
1072
1073 <foo>
1074 <img src="/someimage.png" alt="Alt text here">
1075 </foo>
1076
1077 appear like
1078
1079 <foo>
1080 <bar>Alt text here</bar>
1081 </foo>
1082
1083 Then the normal rules (C<MetaNames> and C<PropertyNames>) apply to how that text is indexed.
1084
1085 If you use the special tag "as-text" then
1086
1087 <foo>
1088 <img src="/someimage.png" alt="Alt text here">
1089 </foo>
1090
1091 simply becomes
1092
1093 <foo>
1094 Alt text here
1095 </foo>
1096
1097 This feature is only available when using the libxml2 parser (HTML2 and XML2).
1098
1099
1100 =item AbsoluteLinks [yes|NO]
1101
1102 If this is set true then Swish-e will attempt to convert relative URIs
1103 extracted from HTML documents for use with C<HTMLLinksMetaName> and
1104 C<ImageLinksMetaName> into absolute URIs. Swish-e will use any <BASE>
1105 tag found in the document, otherwise it will use the file's pathname.
1106 The pathname used will be the pathname *after* C<ReplaceRules> has been
1107 applied to the document's pathname.
1108
1109 For example, say you wish to index image links under the metaname
1110 "images".
1111
1112 ImageLinksMetaName images
1113
1114 If a document is located at http://localhost/vacations/france/index.html
1115 and C<AbsoluteLinks> is set to no, then an image within that document:
1116
1117 <img src="beach.jpeg">
1118
1119 will only index "beach.jpeg".
1120
1121 But, if you want more detail when searching, you can enable
1122 C<AbsoluteLinks> and Swish-e will index
1123 "http://localhost/vacations/france/beach.jpeg". You can then look for
1124 images of beaches, but only in France:
1125
1126 -w images=(beach and france)
1127
1128 This also means you can search for any images within France:
1129
1130 -w images=(france)
1131
1132 This feature is only available with the libxml2 HTML parser.
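
The conversion described above follows the usual rules for resolving a
relative URI against a base. As a rough illustration only, the sketch below
uses Python's standard C<urllib.parse.urljoin>; it is not Swish-e's own code,
and the helper name C<absolute_link> is hypothetical.

```python
from urllib.parse import urljoin


def absolute_link(doc_path, link, base_href=None):
    """Resolve a link against the <BASE> href if one exists, else the document path."""
    base = base_href if base_href else doc_path
    return urljoin(base, link)


doc = "http://localhost/vacations/france/index.html"

# No <BASE> tag: the document's own path is the base.
print(absolute_link(doc, "beach.jpeg"))

# A <BASE> tag (hypothetical value) takes precedence over the document path.
print(absolute_link(doc, "beach.jpeg", "http://example.com/img/"))
```

With the first call the relative link resolves to the France directory, which
is why a search on C<images=(beach and france)> can then match.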
1133
1134 =item UndefinedMetaTags [error|ignore|INDEX|auto]
1135
1136 This directive defines the behavior of Swish-e during indexing when a
1137 meta name is found but is B<not> listed in B<MetaNames>. There are
1138 four choices:
1139
1140
1141 =over 2
1142
1143 =item error
1144
1145 If a meta name is found that is not listed in B<MetaNames>
1146 then indexing will be halted and an error reported.
1147
1148 =item ignore
1149
1150 The contents of the meta tag are ignored and B<not> indexed
1151 unless a metaname has been defined with the C<MetaNames> directive.
1152
1153 =item index
1154
1155 The contents of the meta tag are indexed, but placed in the
1156 main index unless there's an enclosing metatag already in force. This
1157 is the default.
1158
1159 =item auto
1160
1161 This method creates meta tags automatically for HTML meta names
1162 and XML elements. Using this is the same as specifying all the meta
1163 names explicitly in a B<MetaNames> directive.
1164
1165 =back
1166
1167 =item UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]
1168
1169 This is similar to C<UndefinedMetaTags>, but only applies to XML documents (parsed with libxml2).
1170 This allows indexing of attribute content, and provides a way to index the content under a
1171 metaname. For example, C<UndefinedXMLAttributes> can make
1172
1173 <person age="23">
1174 John Doe
1175 </person>
1176
1177 look like the following to swish:
1178
1179 <person>
1180 <person.age>
1181 23
1182 </person.age>
1183 John Doe
1184 </person>
1185
1186 What happens to the text "23" will depend on the setting of C<UndefinedXMLAttributes>:
1187
1188 =over 2
1189
1190 =item disable
1191
1192 XML attributes are not parsed and not indexed. This is the default.
1193
1194 =item error
1195
1196 If the concatenated meta name (e.g. person.age) is not listed in
1197 B<MetaNames> then indexing will be halted and an error reported.
1198
1199 =item ignore
1200
1201 The contents of the meta tag are ignored and B<not> indexed unless a
1202 metaname has been defined with the C<MetaNames> directive.
1203
1204 =item index
1205
1206 The contents of the meta tag are indexed, but placed in the main index
1207 unless there's an enclosing metatag already in force.
1208
1209 =item auto
1210
1211 This method will create meta tags from the combined element and attributes
1212 (and XML class name). This option should be used with caution as it can
1213 generate a lot of metaname entries.
1214
1215 See also the example below under C<XMLClassAttributes>.
1216
1217
1218 =back
1219
1220 =item XMLClassAttributes *list of XML attribute names*
1221
1222 Combines an XML class name with the element name to make up a metaname.
1223 For example:
1224
1225 XMLClassAttributes class
1226
1227 <person class="first">
1228 John
1229 </person>
1230 <person class="last">
1231 Doe
1232 </person>
1233
1234 Will appear to Swish-e as:
1235
1236 <person>
1237 <person.first>
1238 John
1239 </person.first>
1240 </person>
1241 <person>
1242 <person.last>
1243 Doe
1244 </person.last>
1245 </person>
1246
1247 How the data is indexed depends on C<MetaNames> and C<UndefinedMetaTags>.
1248
1249 Here's an example using the following configuration which combines the
1250 two directives C<XMLClassAttributes> and C<UndefinedXMLAttributes>.
1251
1252 XMLClassAttributes class
1253 UndefinedMetaTags auto
1254 UndefinedXMLAttributes auto
1255 IndexContents XML2 .xml
1256
1257 The source XML file looks like:
1258
1259 <xml>
1260 <person class="student" phone="555-1212" age="102">
1261 John
1262 </person>
1263 <person greeting="howdy">Bill</person>
1264 </xml>
1265
1266 Swish-e parses as:
1267
1268 ./swish-e -c 2 -i 1.xml -T parsed_tags parsed_text -v 0
1269 Indexing Data Source: "File-System"
1270
1271 <xml> (MetaName)
1272
1273 <person> (MetaName)
1274 <person.student> (MetaName)
1275 <person.student.phone> (MetaName)
1276 555-1212
1277 </person.student.phone>
1278 <person.student.age> (MetaName)
1279 102
1280 </person.student.age>
1281 John
1282 </person>
1283
1284 <person> (MetaName)
1285 <person.greeting> (MetaName)
1286 howdy
1287 </person.greeting>
1288 Bill
1289 </person>
1290
1291 </xml>
1292 Indexing done!
1293
1294 One thing to note is that the first <person> block finds a class name
1295 "student" so all metanames that are created from attributes use the
1296 combined name "person.student". The second <person> block doesn't contain
1297 a "class" attribute, so the attribute name is combined directly with the element
1298 name (e.g. "person.greeting").
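
The naming rule shown in the parsed output above can be mimicked in a few
lines. This is only a sketch of the naming convention, assuming Python's
standard C<xml.etree.ElementTree>; the function name C<attribute_metanames>
is hypothetical, and Swish-e's libxml2-based parser is not implemented
this way.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <person class="student" phone="555-1212" age="102">John</person>
  <person greeting="howdy">Bill</person>
</xml>"""


def attribute_metanames(xml_text, class_attrs=("class",)):
    """Return (metaname, value) pairs built from element, class and attribute names."""
    pairs = []
    for elem in ET.fromstring(xml_text).iter():
        prefix = elem.tag
        # A class attribute, when present, is folded into the prefix.
        for name in class_attrs:
            if name in elem.attrib:
                prefix = f"{elem.tag}.{elem.attrib[name]}"
                break
        # Every other attribute becomes prefix.attribute.
        for attr, value in elem.attrib.items():
            if attr not in class_attrs:
                pairs.append((f"{prefix}.{attr}", value))
    return pairs


for metaname, value in attribute_metanames(SAMPLE):
    print(metaname, value)
```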
1299
1300 =item ExtractPath *metaname* [replace|remove|prepend|append|regex]
1301
1302 This directive can be used to index extracted parts of a document's path.
1303 A common use would be to limit searches to specific areas of your
1304 file tree.
1305
1306 The extracted string will be indexed under the specified meta name.
1307
1308 See C<ReplaceRules> for a description of the various pattern replacement
1309 methods, but you will use the I<regex> method.
1310
1311 For example, say your file system (or web tree) was organized into departments:
1312
1313 /web/sales/foo...
1314 /web/parts/foo...
1315 /web/accounting/foo...
1316
1317 And you wanted a way to limit searches to just documents under "sales".
1318
1319 ExtractPath department regex !^/web/([^/]+)/.*$!$1!
1320
1321 Which says, extract out the department name (as substring $1) and index
1322 it as meta name C<department>. Then to limit a search to the sales
1323 department:
1324
1325 swish-e -w foo AND department=sales
1326
1327 Note that the C<regex> method uses a substitution pattern, so to index
1328 only a sub-string, match the I<entire> document path in the regular
1329 expression, as shown above.
1330
1331 See the C<ExtractPathDefault> option for a way to set a value if no
1332 patterns match.
1333
1334 Although it is rarely needed, you may use more than one C<ExtractPath> directive.
1335 More than one directive of the I<same> meta name will operate successively
1336 (in order listed in the configuration file) on the path. This allows
1337 you to use regular expressions on the results of the previous pattern
1338 substitution (as if piping the output from one expression to the pattern
1339 of the next).
1340
1341 ExtractPath foo regex !^(...).+$!$1!
1342 ExtractPath foo regex !^.+(.)$!$1!
1343
1344 So, the third letter is indexed as meta name "foo" if both patterns match.
1345
1346 ExtractPath foo regex !^X(...).+$!$1!
1347 ExtractPath foo regex !^.+(.)$!$1!
1348
1349 Now (note the "X"), if the first pattern doesn't match, the last character
1350 of the path name is indexed. You must be clear on this behavior if you
1351 are using more than one C<ExtractPath> directive with the same metaname.
1352
1353 The document path operated on is the real path swish used to access
1354 the document. That is, the C<ReplaceRules> directive has no effect on
1355 the path used with C<ExtractPath>.
1356
1357 The full path is used for each meta name if more than one C<ExtractPath>
1358 directive is used. That is, changes to the path used in C<ExtractPath
1359 foo> do not affect the path used by C<ExtractPath bar>.
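
The successive substitution described above can be tried out before
committing patterns to a config file. The sketch below uses Python's C<re>
module as a stand-in, so C<$1> becomes C<\1>. The helper C<extract_path> is
hypothetical, and returning C<None> when nothing matched is an assumption
drawn from the C<ExtractPathDefault> description, not from Swish-e's source.

```python
import re


def extract_path(path, rules):
    """Apply (pattern, replacement) rules in order to the running result.

    A rule that does not match leaves the string unchanged.
    Returns None when no rule matched at all.
    """
    matched = False
    for pattern, replacement in rules:
        path, count = re.subn(pattern, replacement, path, count=1)
        matched = matched or count > 0
    return path if matched else None


path = "/web/sales/foo.html"

# Both patterns match: the third character of the path is the result.
print(extract_path(path, [(r"^(...).+$", r"\1"), (r"^.+(.)$", r"\1")]))

# The first pattern requires a leading "X" and fails: the last character is the result.
print(extract_path(path, [(r"^X(...).+$", r"\1"), (r"^.+(.)$", r"\1")]))

# The department example from above.
print(extract_path(path, [(r"^/web/([^/]+)/.*$", r"\1")]))
```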
1360
1361 =item ExtractPathDefault *metaname* default_value
1362
1363 This can be used with C<ExtractPath> to set a default string to index
1364 under the given metaname if none of the C<ExtractPath> patterns match.
1365
1366 For example, say you want to index each document with a metaname
1367 "department" based on the following path examples:
1368
1369 /web/sales/foo...
1370 /web/parts/foo...
1371 /web/accounting/foo...
1372
1373 But you are also indexing documents that do not follow that pattern and you want to search those
1374 separately, too.
1375
1376 ExtractPath department regex !^/web/([^/]+)/.*$!$1!
1377 ExtractPathDefault department other
1378
1379 Now, you may search like this:
1380
1381 -w foo department=(sales) - limit searches to the sales documents
1382 -w foo department=(parts) - limit searches to the parts documents
1383 -w foo department=(accounting) - limit searches to the accounting documents
1384 -w foo department=(other) - everything but sales, parts, and accounting.
1385
1386 This basically is a shortcut for:
1387
1388 -w foo not department=(sales or parts or accounting)
1389
1390 but you don't need to keep track of what was extracted.
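
The fallback behaviour can be pictured as follows. This is a minimal sketch
in Python, assuming the same pattern as the example above; the function name
C<department_for> and the sample paths are made up for illustration.

```python
import re

# Same pattern as the ExtractPath example, written for Python's re module.
DEPARTMENT = re.compile(r"^/web/([^/]+)/.*$")


def department_for(path, default="other"):
    """Return the extracted department, or the default when the pattern fails."""
    match = DEPARTMENT.match(path)
    return match.group(1) if match else default


print(department_for("/web/sales/foo.html"))
print(department_for("/home/docs/readme.txt"))
```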
1391
1392 =item PropertyNames *list of meta names*
1393
1394 =item PropertyNamesCompareCase *list of meta names*
1395
1396 =item PropertyNamesIgnoreCase *list of meta names*
1397
1398 Swish-e allows you to specify certain META tags that can be used as
1399 B<document properties>. The contents of any META tag that has been
1400 identified as a document property can be returned as part of the search
1401 results along with the rank, file name, title, and document size (see
1402 the C<-p> and C<-x> switches in L<SWISH-RUN|SWISH-RUN>).
1403
1404 Properties are useful for returning additional data from documents in
1405 search results -- this saves the effort of reading and parsing the source
1406 files while reading Swish-e search results, and is especially useful
1407 when the source documents are no longer available or slow to access
1408 (e.g. over http).
1409
1410 Another feature of properties is that Swish-e can use the PropertyNames
1411 for sorting the search results (see the C<-s> switch).
1412
1413 PropertyNames author subjects
1414
1415 Two variations are available. C<PropertyNamesCompareCase> and
1416 C<PropertyNamesIgnoreCase>. These tell Swish-e to either ignore or
1417 compare case when sorting results. The default for C<PropertyNames>
1418 is to ignore the case.
1419
1420 PropertyNamesIgnoreCase subject
1421 PropertyNamesCompareCase keyword
1422
1423 The defaults for "internal" properties are:
1424
1425 swishtitle -- ignore the case
1426 swishdocpath -- compare case
1427 swishdescription -- compare case
1428
1429 These can be overridden with C<PropertyNamesCompareCase> and
1430 C<PropertyNamesIgnoreCase>.
1431
1432 PropertyNamesCompareCase swishtitle
1433
1434 Use of PropertyNames will increase the size of your index files,
1435 sometimes significantly. Properties will be compressed if Swish-e is
1436 compiled with zlib as described in the L<INSTALL|INSTALL> manual page.
1437
1438 If Swish-e finds more than one property of the same name in a document
1439 the property's contents will be concatenated for strings, and a warning
1440 issued for numeric (or date) properties.
1441
1442
1443 =item PropertyNamesNumeric
1444
1445 This directive is similar to C<PropertyNames>, but it flags the property
1446 as being a string of digits (integer value) that will be stored as binary data instead
1447 of a string. This allows the C<-s> and C<-L> switches to sort and limit
1448 on the property as a number rather than as text.
1449
1450 Swish-e uses C<strtoul(3)> to convert the string into an unsigned long
1451 integer. Therefore, only positive integers can be stored.
1452
1453 Future versions of Swish-e may be able to store different property types
1454 (such as negative integers and real numbers). This directive may change
1455 in future releases of Swish.
1456
1457 =item PropertyNamesDate
1458
1459 This directive is exactly like C<PropertyNamesNumeric>, but it also
1460 flags the number as a machine timestamp (seconds since Epoch), and
1461 will print a formatted date when returning this property. See C<-x>
1462 in L<SWISH-RUN|SWISH-RUN>.
1463
1464 Swish-e will not parse dates when indexing; you must use a timestamp.
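
Because the value must already be a timestamp, dates usually need converting
before the document reaches Swish-e (for example in a filter or a program
that generates the documents). A minimal sketch of such a conversion in
Python follows. The function name C<to_epoch> and the date format are
assumptions, and the date is interpreted as UTC.

```python
from datetime import datetime, timezone


def to_epoch(date_text, fmt="%Y-%m-%d"):
    """Convert a date string (interpreted as UTC) to whole seconds since the Epoch."""
    parsed = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# The resulting integer is what would go inside the tag named in PropertyNamesDate.
print(to_epoch("2002-09-20"))
```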
1465
1466 =item PropertyNameAlias *property name* *list of aliases*
1467
1468 This allows aliases for a property name. For example, if you are indexing
1469 HTML files, plus XML files that are written in English, German, and
1470 Spanish and thus use the tags "title", "titel", and "título" you can use:
1471
1472 PropertyNameAlias swishtitle title titel título titulo
1473
1474 Note that "swishtitle" is the built-in property used to store the title of
1475 a document, and therefore you do not need to specify it as a PropertyName
1476 before use.
1477
1478 =item PropertyNamesMaxLength integer *list of meta names*
1479
1480 This option will set the max length of the text stored in a property.
1481 You must specify a number between 0 and the max integer size on your
1482 platform, and a list of properties. The properties specified must not
1483 be aliases.
1484
1485 If any of the property names do not exist they will be created (e.g. you
1486 do not need to define the property with PropertyNames first).
1487
1488 In general, this feature will only be useful when parsing HTML or XML
1489 with the libxml2 parser.
1490
1491 For example:
1492
1493 PropertyNamesMaxLength 1000 swishdescription
1494 PropertyNameAlias swishdescription body
1495
1496 Is somewhat like
1497
1498 StoreDescription HTML <body> 1000
1499 StoreDescription XML <body> 1000
1500 StoreDescription HTML2 <body> 1000
1501 StoreDescription XML2 <body> 1000
1502
1503 but StoreDescription allows setting the tag for each parser type.
1504
1505 PropertyNamesMaxLength 1000 headings
1506 PropertyNameAlias headings h1 h2 h3 h4
1507
1508 collects all the heading text into a single property called "headings", not
1509 to exceed 1000 characters.
1510
1511
1512 =item PreSortedIndex *list of property names*
1513
1514 By default Swish-e generates presorted tables while indexing for each
1515 property name. This allows faster sorting when generating results.
1516 On large document collections this presorting may add to the indexing
1517 time, and also adds to the total size of the index. This directive can
1518 be used to customize exactly which properties will be presorted.
1519
1520 If C<PreSortedIndex> is I<not> present in the config file (default
1521 action), all the properties will be presorted at indexing time. If it
1522 is present without any parameter, no properties will be presorted.
1523 Otherwise, only the property names specified will be presorted.
1524
1525 For example, if you only wish to sort results by a property called
1526 C<title>:
1527
1528 PropertyNames title age time
1529 PreSortedIndex title
1530
1531
1532 =item StoreDescription [XML <tag> size|HTML <meta> size|TXT size]
1533
1534 B<StoreDescription> allows you to store a document description in the
1535 index file, and this description can be returned in your search results
1536 when the C<-x> switch is used to include the I<swishdescription> for
1537 extended results.
1538
1539 For text documents you specify the type C<TXT> and the number of I<characters> to capture.
1540
1541 StoreDescription TXT 20
1542
1543 The above stores only the first twenty characters from the text file in the Swish-e index
1544 file.
1545
1546 For HTML and XML file types, specify the tag to use for the
1547 description, and optionally the number of characters to capture. If the
1548 size is not specified, Swish-e will capture the entire contents of the tag.
1549
1550 StoreDescription HTML <body> 20000
1551 StoreDescription XML <desc> 40
1552
1553 Note that documents must be assigned a document type with C<IndexContents>
1554 or C<DefaultContents> to use this feature.
1555
1556 Swish-e will compress the descriptions (or any other large property)
1557 if compiled to use zlib (see L<INSTALL|INSTALL>). This is recommended when using
1558 StoreDescription and a large number of documents. Compression of 30% to 50% is
1559 not uncommon with HTML files.
1560
1561 =item PropCompressionLevel [0-9]
1562
1563 This directive sets the compression level used when storing properties
1564 to disk. A setting of zero is no compression, and a setting of nine is
1565 the most compression.
1566
1567 The default depends on the default setting compiled with zlib, but is
1568 typically six.
1569
1570 This option is useful when using C<StoreDescription> to store a large
1571 amount of text in properties (or if using C<PropertyNames> with large
1572 property sizes).
1573
1574 Properties must be longer than a size defined in F<config.h> (100 is the
1575 default) before compression will be attempted. Swish-e will never store
1576 the results of the compression if the compressed data is larger than
1577 the original data.
1578
1579 This option is only available when Swish-e is compiled with zlib support.
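
The rules above (a minimum size, and never keeping a result that did not
shrink) can be sketched with Python's C<zlib> module. This is an
illustration, not Swish-e's storage code: the name C<pack_property> and the
constant C<MIN_COMPRESS_SIZE> are hypothetical, and the threshold of 100 is
taken from the default mentioned above.

```python
import zlib

MIN_COMPRESS_SIZE = 100  # assumed threshold, mirroring the default mentioned above


def pack_property(data: bytes, level: int = 6):
    """Return (is_compressed, payload) for a property value."""
    # Level zero means no compression; short values are stored as they are.
    if level == 0 or len(data) <= MIN_COMPRESS_SIZE:
        return False, data
    packed = zlib.compress(data, level)
    # Keep the original when compression did not make the value smaller.
    if len(packed) >= len(data):
        return False, data
    return True, packed


description = b"swish " * 200
compressed, payload = pack_property(description, level=9)
print(compressed, len(description), len(payload))
```

Higher levels trade indexing time for smaller property files, so it can be
worth trying a few levels on a sample of your own documents.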
1580
1581
1582 =item TruncateDocSize *number of characters*
1583
1584 TruncateDocSize limits the size of a document while indexing documents
1585 and/or using filters. This config directive truncates the number of
1586 bytes read from a document to the specified size. That is, if a document
1587 is larger, only the specified number of bytes is read.
1588
1589 Example:
1590
1591 TruncateDocSize 10000000
1592
1593 The default is zero, which means read all data.
1594
1595
1596 Warning: If you use TruncateDocSize, use it with care! TruncateDocSize
1597 is only a safety belt, e.g. to limit filter output when accessing
1598 databases, or to limit "runaway" filters. Truncating doc input may
1599 destroy document structures for Swish-e (e.g. swish may miss closing
1600 tags for XML or HTML documents).
1601
1602 TruncateDocSize does not currently work with the C<prog> input source
1603 method.
1604
1605 =item FuzzyIndexingMode NONE|Stemming|Soundex|Metaphone|DoubleMetaphone
1606
1607 Selects the type of index to create. Only one type of index may be created.
1608
1609 It's a good idea to create both a normal index and a fuzzy index and
1610 allow your search interface to select which index to use. Many people find the
1611 fuzzy searches to be too fuzzy.
1612
1613 The available fuzzy indexing options are:
1614
1615 =over 4
1616
1617 =item None
1618
1619 Words are stored in the index without any conversion. This is the default.
1620
1621 =item Stemming
1622
1623 Words are converted using the Porter stemming algorithm.
1624
1625 From: http://www.tartarus.org/~martin/PorterStemmer/
1626
1627 The Porter stemming algorithm (or ‘Porter stemmer’) is a
1628 process for removing the commoner morphological and inflexional
1629 endings from words in English. Its main use is as part of a
1630 term normalisation process that is usually done when setting up
1631 Information Retrieval systems.
1632
1633
1634 This will help a search for "running" to also find "run" and "runs", for example.
1635
1636 The stemming function does not convert words to their root, but rather
1637 programmatically removes endings on words in an attempt to make similar
1638 words with different endings stem to the same string of characters.
1639 It's not a perfect system, and searches on stemmed indexes often return
1640 curious results. For example, two entirely different words may stem to
1641 the same word.
1642
1643 Stemming also can be confusing when used with a wildcard (truncation).
1644 For example, you might expect to find the word "running" by searching for
1645 "runn*". But this fails when using a stemmed index, as "running" stems to
1646 "run", yet searching for "runn*" looks for words that start with "runn".
1647
1648 =item Soundex
1649
1650 Soundex was developed in the early 1900s (it was patented in 1918) so that records for people with similar
1651 sounding names could be found more readily. Soundex is a coded surname
1652 based on the way a surname sounds rather than spelling. Surnames that
1653 sound similar, like Smith and Smyth, are filed together under the same
1654 Soundex code. This is mostly useful for US English.
1655
1656 Soundex should not be used to search for sound-alike words. Metaphone
1657 would be more appropriate for generic sound matching of words. Soundex
1658 should only be used where you need to search multiple documents for
1659 proper names which sound similar. This is primarily used for indexing
1660 genealogical records. This may be useful for indexing other collections
1661 of data consisting mostly of names. Many common name variations are
1662 matched by Soundex. The only notable exception is the first letter of
1663 the name. The first letter is not matched for sound.
1664
1665 =item Metaphone and DoubleMetaphone
1666
1667 Words are transformed into a short series of letters representing the sound of the word (in English).
1668 Metaphone algorithms are often used for looking up misspelled words in dictionary programs.
1669
1670 From: http://aspell.sourceforge.net/metaphone/
1671
1672 Lawrence Philips' Metaphone Algorithm is an algorithm which returns
1673 the rough approximation of how an English word sounds.
1674
1675 The C<DoubleMetaphone> mode will sometimes generate two different metaphones for the same word.
1676 This is supposed to be useful when a word may be pronounced more than one way.
1677
1678 A metaphone index should give results somewhere in between Soundex and Stemming.
1679
1680 =back
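
To make the Soundex description above concrete, here is a small sketch of
the common American Soundex coding in Python. It is offered as an
illustration of the technique and is not taken from the Swish-e sources, so
details such as the handling of "h" and "w" may differ from what Swish-e
does.

```python
# Letter groups that share a Soundex digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()          # the first letter is kept as is
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "hw":              # h and w do not separate equal codes
            previous = digit
    return (code + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))
print(soundex("Kraft"), soundex("Craft"))
```

The second line of output shows the exception noted above: names that sound
alike but start with different letters get different codes.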
1681
1682 =item UseStemming [yes|NO]
1683
1684 Set to yes to apply the word stemming algorithm during indexing, otherwise no.
1685
1686 UseStemming no
1687 UseStemming yes
1688
1689 When UseStemming is set to C<yes> every word is stemmed before placing
1690 it into the index.
1691
1692 This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.
1693
1694 =item UseSoundex [yes|NO]
1695
1696 When UseSoundex is set to C<yes> every word is converted to a Soundex
1697 code before placing it into the index.
1698
1699 This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.
1700
1701 =item IgnoreTotalWordCountWhenRanking [YES|no]
1702
1703 Set to yes to ignore the total number of words in the file when calculating
1704 ranking. This is often better with merges and small files. The default is yes.
1705
1706 IgnoreTotalWordCountWhenRanking no
1707
1708 The default was changed from no to yes in version 2.2.
1709
1710 =item MinWordLimit *integer*
1711
1712 Set the minimum length of a word. Shorter words will not be indexed.
1713 The default is 1 (as defined in F<src/config.h>).
1714
1715 MinWordLimit 5
1716
1717 =item MaxWordLimit *integer*
1718
1719 Set the maximum length of an indexable word. Longer words will not
1720 be indexed. The default is 40 (as defined in F<src/config.h>).
1721
1722 =item WordCharacters *string of characters*
1723
1724 =item IgnoreFirstChar *string of characters*
1725
1726 =item IgnoreLastChar *string of characters*
1727
1728 =item BeginCharacters *string of characters*
1729
1730 =item EndCharacters *string of characters*
1731
1732
1733 These settings define what a word consists of to the Swish-e indexing engine.
1734 Compiled in defaults are in F<src/config.h>.
1735
1736 When indexing Swish-e uses B<WordCharacters> to split up the document
1737 into words. Words are defined by any string of non-blank characters
1738 that contain only the characters listed in WordCharacters. If a string
1739 of characters includes a character that is not in WordCharacters then
1740 the word will be split into two or more separate words.
1741
1742 For example:
1743
1744 WordCharacters abde
1745
1746 Would turn "abcde" into two words "ab" and "de".
1747
1748 Next, of these words, any characters defined in B<IgnoreFirstChar> are
1749 stripped off the start of the word, and B<IgnoreLastChar> characters
1750 are stripped off the end of the word. This allows, for example,
1751 periods within a word (www.slashdot.com), but not at the end of
1752 a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in
1753 WordCharacters.
1754
1755 Finally, the resulting words MUST begin with one of the characters
1756 listed in B<BeginCharacters> and end with one of the characters listed in
1757 B<EndCharacters>. BeginCharacters and EndCharacters must be a subset of
1758 the characters in WordCharacters. Often, WordCharacters, BeginCharacters
1759 and EndCharacters will all be the same.
1760
1761 Note that the same process applies to the query while searching.
1762
1763 Getting these settings correct will take careful consideration and
1764 practice. It's helpful to create an index of a single test file, and
1765 then look at the words that are placed in the index (see the C<-v 4>,
1766 C<-D> and C<-k> searching switches).
1767
1768 Currently there is only support for eight-bit characters.
1769
1770 Example:
1771
1772 WordCharacters .abcdefghijklmnopqrstuvwxyz
1773 BeginCharacters abcdefghijklmnopqrstuvwxyz
1774 EndCharacters abcdefghijklmnopqrstuvwxyz
1775 IgnoreFirstChar .
1776 IgnoreLastChar .
1777
1778 So the string
1779
1780 Please visit http://www.example.com/path/to/file.html.
1781
1782 will be indexed as the following words:
1783
1784 please
1785 visit
1786 http
1787 www.example.com
1788 path
1789 to
1790 file.html
1791
1792 Which means that you can search for C<www.example.com> as a single word,
1793 but searching for just C<example> will not find the document.
1794
1795 Note: when indexing HTML documents HTML entities are converted to their
1796 character equivalents before being processed with these directives.
1797 This is a change from previous versions of Swish-e where you were
1798 required to include the characters C<0123456789&#;> to index entities.
1799 See also L<ConvertHTMLEntities|/"item_ConvertHTMLEntities">
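
The word-splitting steps described in this section can be experimented with
outside of Swish-e. The Python sketch below follows the three stages as
described (split on C<WordCharacters>, strip C<IgnoreFirstChar> and
C<IgnoreLastChar>, then check C<BeginCharacters> and C<EndCharacters>). The
function name C<split_words> is hypothetical, and lowercasing the input is
an assumption made so that the output matches the example above.

```python
import re
import string


def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    """Split text into index words using the five character-set settings."""
    words = []
    # Stage 1: a word is a run of characters that are all in word_chars.
    pattern = "[" + re.escape(word_chars) + "]+"
    for candidate in re.findall(pattern, text.lower()):
        # Stage 2: strip ignorable characters from both ends.
        word = candidate.lstrip(ignore_first).rstrip(ignore_last)
        # Stage 3: the word must begin and end with permitted characters.
        if word and word[0] in begin_chars and word[-1] in end_chars:
            words.append(word)
    return words


letters = string.ascii_lowercase
print(split_words("Please visit http://www.example.com/path/to/file.html.",
                  word_chars="." + letters, begin_chars=letters,
                  end_chars=letters, ignore_first=".", ignore_last="."))
print(split_words("abcde", "abde", "abde", "abde"))
```

Running a sample sentence through something like this is a quick way to
reason about a setting, but the words Swish-e actually stores should still
be checked with the switches mentioned above.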
1800
1801 =item Buzzwords [*list of buzzwords*|File: path]
1802
1803 The Buzzwords option allows you to specify words that will be indexed
1804 regardless of WordCharacters, BeginCharacters, EndCharacters, stemming,
1805 soundex and many of the other checks done on words while indexing.
1806
1807 Buzzwords are case insensitive.
1808
1809 Buzzwords should be separated by spaces and may span multiple directives.
1810 If the special format C<File:filename> is used then the Buzzwords will
1811 be read from an external file during indexing.
1812
1813 Examples:
1814
1815 Buzzwords C++ TCP/IP
1816
1817 Buzzwords File: ./buzzwords.lst
1818
1819 If a Buzzword contains search operator characters they must be backslashed
1820 when searching. For example:
1821
1822 Buzzwords C++ TCP/IP web=http
1823
1824 ./swish-e -w 'web\=http'
1825
1826 Buzzwords are found by splitting the text on whitespace, removing
1827 C<IgnoreFirstChar> and C<IgnoreLastChar> characters from the word,
1828 and then comparing with the list of C<Buzzwords>. Therefore, if
1829 adding C<Buzzwords> to an index you will probably want to define
1830 C<IgnoreFirstChar> and C<IgnoreLastChar> settings.
1831
1832 Note: Buzzwords specific settings for C<IgnoreFirstChar> and
1833 C<IgnoreLastChar> may be used in the future.
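
The matching procedure described above (split on whitespace, strip the
ignore characters, compare without regard to case) is simple enough to
sketch. The Python below only illustrates that description. The name
C<find_buzzwords> and the sample sentence are made up, and it shows why the
C<IgnoreFirstChar> and C<IgnoreLastChar> settings matter.

```python
def find_buzzwords(text, buzzwords, ignore_first="", ignore_last=""):
    """Return the buzzwords found in text, in order of appearance."""
    wanted = {word.lower() for word in buzzwords}
    found = []
    for token in text.split():
        # Strip ignorable characters, then compare case-insensitively.
        word = token.lstrip(ignore_first).rstrip(ignore_last).lower()
        if word in wanted:
            found.append(word)
    return found


sentence = "We use (C++) over TCP/IP."
print(find_buzzwords(sentence, ["C++", "TCP/IP"], ignore_first="(", ignore_last=")."))

# Without the ignore settings the surrounding punctuation hides both buzzwords.
print(find_buzzwords(sentence, ["C++", "TCP/IP"]))
```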
1834
1835
1836 =item IgnoreWords [*list of stop words*|File: path]
1837
1838 The IgnoreWords option allows you to specify words to ignore, called
1839 I<stopwords>. The default is to not use any stopwords.
1840
1841 Words should be separated by spaces and may span multiple directives.
1842 If the special format C<File:filename> is used then the stop words will
1843 be read from an external file during indexing.
1844
1845 In previous versions of Swish-e you could use the directive
1846
1847 IgnoreWords swishdefault - obsolete!
1848
1849 to include a default list of compiled in stopwords. This keyword is no
1850 longer supported.
1851
1852 Examples:
1853
1854 IgnoreWords www http a an the of and or
1855
1856 IgnoreWords File: ./stopwords.de
1857
1858 =item UseWords [*list of words*|File: path]
1859
1860 UseWords defines the words that Swish-e will index. B<Only> the words
1861 listed will be indexed.
1862
1863 You can specify a list of words following the directive (you may specify
1864 more than one C<UseWords> directive in a config file), and/or use the
1865 C<File:> form to specify a path to a file containing the words:
1866
1867 UseWords perl python pascal fortran basic cobol php
1868 UseWords File: /path/to/my/wordlist
1869
1870 Please drop the Swish-e list a note if you actually use this feature.
1871 It may be removed from future versions.
1872
1873 =item IgnoreLimit *integer integer*
1874
1875 This automatically omits words that appear too often in the files (these
1876 words are called stopwords). Specify a whole percentage and a number,
1877 such as "80 256". This omits words that occur in over 80% of the files
1878 and appear in over 256 files. Comment out to turn off auto-stopwording.
1879
1880 IgnoreLimit 50 1000
1881
1882 Swish-e must do extra processing to adjust the entire index when this
1883 feature is used. Instead of relying on this feature, it is recommended
1884 that you decide which words are stopwords and add them to B<IgnoreWords>
1885 in your configuration file. To do this, use IgnoreLimit one time and
1886 note the stop words that are found while indexing. Add this list to
1887 IgnoreWords, and then remove IgnoreLimit from the configuration file.
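
The two-threshold test described above can be pictured with a short Python sketch. This is only an illustration of the rule as stated here (over the percentage B<and> over the file count); the function name C<auto_stopwords> and the sample data are hypothetical, and Swish-e's internal implementation may differ.

```python
from collections import Counter

def auto_stopwords(docs, percent, min_files):
    """Return words found in more than `percent`% of the files
    AND in more than `min_files` files (hypothetical helper)."""
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per file
    total = len(docs)
    return sorted(
        word for word, count in doc_freq.items()
        if count * 100 > percent * total and count > min_files
    )

# Four tiny "files"; "the" appears in 3 of 4 (75%).
docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "fish"], ["a", "bird"]]
print(auto_stopwords(docs, 50, 2))  # ['the']
```

Both conditions must hold, so on a small collection a large second number effectively disables auto-stopwording.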
1888
1889 =item IgnoreMetaTags *list of names*
1890
1891 C<IgnoreMetaTags> defines a list of metatags to ignore while indexing
1892 XML files (and HTML files if using libxml2 for parsing HTML). All text
1893 within the tags will be ignored -- both for indexing (C<MetaNames>)
1894 and properties (C<PropertyNames>). To still store properties but
1895 not index the text, see L<UndefinedMetaTags|/"item_UndefinedMetaTags">.
1896
1897 This option is useful to avoid indexing specific data from a file.
1898 For example:
1899
1900 <person>
1901 <first_name>
1902 William
1903 </first_name> <last_name>
1904 Shakespeare
1905 </last_name> <updated_date>
1906 April 25, 1999
1907 </updated_date>
1908 </person>
1909
1910 In the above example you might B<not> want to index the updated date,
1911 and therefore prevent finding this record by searching
1912
1913 -w 'person=(April)'
1914
1915 This is solved by:
1916
1917 IgnoreMetaTags updated_date
1918
1919
1920 See also L<UndefinedMetaTags|/"item_UndefinedMetaTags">.
1921
1922 =item IgnoreNumberChars *list of characters*
1923
1924 Experimental Feature
1925
1926 This experimental feature can be used to define a set of characters
1927 that describe a number. If a word is found to contain only those
1928 characters it will not be indexed. The characters listed must be part
1929 of C<WordCharacters> settings. In other words, the "word" checked is
1930 a word that Swish-e would otherwise index.
1931
1932 For example,
1933
1934 IgnoreNumberChars 0123456789$.,
1935
1936 Then Swish-e would not index the following:
1937
1938 123
1939 123,456.78
1940 $123.45
1941
1942 You might be tempted to avoid indexing hex numbers with:
1943
1944 IgnoreNumberChars 0123456789abcdef
1945
1946 which will not index 0D31, but will also not index the word "bad".
1947
1948 This is an experimental feature that may change in future versions.
1949 One possible change is to use regular expressions instead.
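
As a rough illustration of the test described above, the following Python sketch rejects a word when every one of its characters is in the configured set. The function name is hypothetical, and comparing in lower case is an assumption made so that "0D31" is caught by a lower-case hex set, as in the example above.

```python
def is_ignored_number(word, number_chars):
    """True if `word` is made up only of characters in `number_chars`.
    Lower-casing the word first is an assumption of this sketch."""
    return bool(word) and all(ch in number_chars for ch in word.lower())

decimal = "0123456789$.,"
hexchars = "0123456789abcdef"

print(is_ignored_number("123,456.78", decimal))  # True
print(is_ignored_number("0D31", hexchars))       # True
print(is_ignored_number("bad", hexchars))        # True (the unwanted side effect)
print(is_ignored_number("badge", hexchars))      # False
```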
1950
1951
1952 =item IndexComments [NO|yes]
1953
1954 This option lets you decide whether to index the contents of HTML
1955 comments. Default is no. Set to yes if comment indexing is required.
1956
1957 IndexComments yes
1958
1959 Note: This is a change from the default behavior of versions prior to 2.2.
1960
1961 =item TranslateCharacters [*string1 string2*|:ascii7:]
1962
1963 The TranslateCharacters directive maps the characters in string1 to the
1964 characters listed in string2.
1965
1966 For example:
1967
1968 # This will index a_b as a-b and ámo as amo
1969 TranslateCharacters _á -a
1970
1971 C<TranslateCharacters :ascii7:> is a predefined set of characters that
1972 will translate eight bit characters to ascii7 characters. Using the
1973 :ascii7: rule will translate "Ääç" to "aac". This means: searching
1974 "Çelik", "çelik" or "celik" will all match the same word.
1975
1976 TranslateCharacters is done early in the indexing process, after
1977 converting HTML entities but before splitting the input text into words
1978 based on B<WordCharacters>. So characters you are translating I<from>
1979 do not need to be listed in word characters.
1980
1981 The same character translations take place when searching.
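
The position-by-position mapping of string1 onto string2 behaves much like a character translation table. The Python sketch below shows the idea with the standard C<str.maketrans>; the helper name is hypothetical, and the length check is an assumption of the sketch rather than documented Swish-e behavior.

```python
def make_translator(string1, string2):
    """Build a table mapping each character of string1 to the
    character at the same position in string2."""
    if len(string1) != len(string2):
        raise ValueError("string1 and string2 must be the same length")
    return str.maketrans(string1, string2)

# Same mapping as "TranslateCharacters _á -a" (\u00e1 is "á").
table = make_translator("_\u00e1", "-a")
print("a_b \u00e1mo".translate(table))  # a-b amo
```

Because the same table is applied to both the indexed text and the query, a search for either spelling finds the same word.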
1982
1983 =item BumpPositionCounterCharacters *string*
1984
1985 When indexing Swish-e assigns a word position to each word. This enables
1986 phrase searching. There may be cases where you would like to prevent
1987 phrase matching. The BumpPositionCounterCharacters directive allows
1988 you to specify a set of characters that when found in the text will
1989 increment the word position -- effectively preventing phrase matches
1990 across that character.
1991
1992 For example, if you have a tag:
1993
1994 <subjects>
1995 computer programming | apple computers
1996 </subjects>
1997
1998 You might want to prevent matching "programming apple" in that meta name.
1999
2000 BumpPositionCounterCharacters |
2001
2002 There is no default, and you may list a string of characters.
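
To see why bumping the position counter blocks a phrase match, consider this small Python model. It is a simplified sketch, not Swish-e's indexer: words get consecutive positions, a bump character adds one extra step, and a phrase matches only when its words sit at consecutive positions. All names are hypothetical.

```python
import re

def word_positions(text, bump_chars=""):
    """Assign a position to each word; a bump character skips one position."""
    positions, pos = [], 0
    for token in re.findall(r"\w+|\S", text):
        if re.match(r"\w", token):
            pos += 1
            positions.append((token.lower(), pos))
        elif token in bump_chars:  # token is a single character here
            pos += 1
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    for i in range(len(positions) - len(words) + 1):
        window = positions[i:i + len(words)]
        same_words = [w for w, _ in window] == words
        adjacent = all(window[k + 1][1] - window[k][1] == 1
                       for k in range(len(window) - 1))
        if same_words and adjacent:
            return True
    return False

text = "computer programming | apple computers"
print(phrase_matches(word_positions(text), "programming apple"))       # True
print(phrase_matches(word_positions(text, "|"), "programming apple"))  # False
print(phrase_matches(word_positions(text, "|"), "apple computers"))    # True
```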
2003
2004 =item DontBumpPositionOnEndTags *list of names*
2005
2006 =item DontBumpPositionOnStartTags *list of names*
2007
2008 Since metatags are typically separate data fields, the word position
2009 counter is automatically bumped between metatags (actually, bumped when a
2010 start tag is found and when an end tag is found). This prevents matching
2011 a phrase that spans more than one metaname. C<DontBumpPositionOnEndTags>
2012 and C<DontBumpPositionOnStartTags> disable this feature for the listed
2013 metanames.
2014
2015 For example,
2016
2017 <person>
2018 <first_name>
2019 William
2020 </first_name>
2021 <last_name>
2022 Shakespeare
2023 </last_name>
2024 <updated_date>
2025 April 25, 1999
2026 </updated_date>
2027 </person>
2028
2029 In the configuration file:
2030
2031 DontBumpPositionOnEndTags first_name
2032 DontBumpPositionOnStartTags last_name
2033
2034 This configuration allows this phrase search
2035
2036 -w 'person=("william shakespeare")'
2037
2038 but this phrase search will fail
2039
2040 -w 'person=("shakespeare april")'
2041
2042
2043
2044 =back
2045
2046
2047 =head2 Directives for the File Access method only
2048
2049 Some directives have different uses depending on the source of the
2050 documents. These directives are only valid when using the B<File system>
2051 method of indexing.
2052
2053 =over 4
2054
2055 =item IndexOnly *list of file suffixes*
2056
2057 This directive specifies the allowable file suffixes (extensions) while
2058 indexing. The default is to index all files specified in B<IndexDir>.
2059
2060 # Only index .html .htm and .q files
2061 IndexOnly .html .htm .q
2062
2063 C<IndexOnly> checks that the file name ends in the characters listed. It does
2064 not check "extensions". C<IndexOnly> is tested right before C<FileRules>
2065 is processed.
2066
2067 =item FollowSymLinks [yes|NO]
2068
2069 Set to "yes" to follow symbolic links while indexing, otherwise "no". Default is no.
2070
2071 FollowSymLinks no
2072 FollowSymLinks yes
2073
2074 Note that when set to C<no> extra stat(2) system calls must be made for
2075 each file. For large number of files you may see a small reduction in
2076 indexing time by setting this to C<yes>.
2077
2078 See also the C<-l> switch in L<SWISH-RUN|SWISH-RUN>.
2079
2080 =item FileRules [type] [contains|is|regex] *regular expression*
2081
2082 =item FileMatch [type] [contains|is|regex] *regular expression*
2083
2084 FileRules and FileMatch are used to, respectively, exclude and include
2085 files and directories to index. Since, by default, Swish-e indexes all
2086 files and recurses all directories (but see also C<FollowSymLinks>) you
2087 will typically only use C<FileRules> to exclude files or directories.
2088 C<FileMatch> is useful in a few cases, for example, to override the
2089 behavior of C<IndexOnly>. Some examples are included below.
2090
2091 Except for C<FileRules title ...>, this feature is only available for
2092 file access method (-S fs), which is the default indexing mode. Also,
2093 any pathname modification with C<ReplaceRules> happens after the check
2094 for C<FileRules>. (It's unlikely that you would exclude files with
2095 C<FileRules> based on text you added with C<ReplaceRules>!)
2096
2097 The regular expression is a C regex.h extended regular expression.
2098 You may supply more than one regular expression per line, or use
2099 separate directives. Preceding the regular expression with the word
2100 "not" negates the match.
2101
2102 The regular expression is compared against B<[type]> as described below.
2103
2104 For historical reasons, you can specify C<contains> or C<is>. C<is>
2105 simply forces the regular expression to match at the start and end
2106 of the string (by internally prepending "^" and appending "$" to the
2107 regular expression).
2108
2109 The C<regex> option requires delimiter characters:
2110
2111 FileRules title regex /^private/i
2112
2113 The only advantage of C<regex> is if you want to do case insensitive
2114 matches, or simply like your regular expressions to look like perl
2115 regular expressions. You must use matching delimiters; (), {}, and [],
2116 are not currently supported for no good reason other than laziness.
2117
2118 Use quotes (" or ') around a pattern if it contains any white space.
2119 Note that the backslash character becomes the escape character within
2120 quotes.
2121
2122 For example, these sets generate the same regular expressions.
2123
2124 FileRules title is hello
2125 FileRules title contains ^hello$
2126 FileRules title regex /^hello$/
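
The equivalence of the three forms can be sketched in Python. Note the assumptions: Python's C<re> module stands in for the C regex.h extended syntax (the two differ in details), only the C<i> flag is handled, and the helper name C<compile_rule> is hypothetical.

```python
import re

def compile_rule(kind, pattern):
    """Turn an is/contains/regex parameter into a compiled pattern."""
    if kind == "is":
        # "is" anchors the expression at both ends.
        return re.compile("^" + pattern + "$")
    if kind == "contains":
        return re.compile(pattern)
    if kind == "regex":
        delim = pattern[0]
        body, sep, flags = pattern[1:].rpartition(delim)
        if not sep:
            raise ValueError("missing closing delimiter")
        return re.compile(body, re.IGNORECASE if "i" in flags else 0)
    raise ValueError("unknown match kind: " + kind)

forms = [("is", "hello"), ("contains", "^hello$"), ("regex", "/^hello$/")]
print({compile_rule(kind, pat).pattern for kind, pat in forms})  # {'^hello$'}
```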
2127
2128 These all need quotes due to the included space character
2129
2130 FileRules title is "hello there"
2131 FileRules title contains "^hello there$"
2132 FileRules title regex "!^hello there$!"
2133
2134 These show how the backslash must be doubled inside of quotes.
2135 Swish-e converts a double-backslash into a single backslash, and then
2136 passes that single onto the regular expression compiler.
2137
2138 FileRules filename regex /\.pdf/
2139 FileRules filename regex "/\\.pdf/"
2140
2141 FileRules filename regex !hello\\there! # need double for real backslash
2142 FileRules filename regex "!hello\\\\there!" # need double-double inside of quotes
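
The doubling rule follows from how parameters are split: a backslash is only special inside quotes. The tokenizer below is a minimal Python sketch of that behavior, assuming the rules stated in this document (quotes group a parameter, and inside quotes a backslash escapes the next character). It does not handle trailing comments, and C<split_params> is a hypothetical name.

```python
def split_params(line):
    """Split a directive line into parameters, honoring quotes."""
    params, current, quote, in_token = [], [], None, False
    chars = iter(line)
    for ch in chars:
        if quote:
            if ch == "\\":
                current.append(next(chars, ""))  # keep the escaped character
            elif ch == quote:
                quote = None                     # closing quote
            else:
                current.append(ch)
        elif ch in "\"'":
            quote, in_token = ch, True           # opening quote
        elif ch.isspace():
            if in_token:
                params.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)                   # backslash is literal here
            in_token = True
    if in_token:
        params.append("".join(current))
    return params

unquoted = split_params(r'FileRules filename regex /\.pdf/')
quoted = split_params(r'FileRules filename regex "/\\.pdf/"')
print(unquoted[-1] == quoted[-1])  # True
```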
2143
2144
2145 B<Matching Types>
2146
2147 The following types of match strings may be supplied:
2148
2149 FileRules pathname
2150 FileRules dirname
2151 FileRules filename
2152 FileRules directory
2153 FileRules title
2154
2155 FileMatch pathname
2156 FileMatch filename
2157 FileMatch dirname
2158 FileMatch directory
2159
2160 B<pathname> matches the regular expression against the current pathname.
2161 The pathname may or may not be absolute depending on what you supplied
2162 to C<IndexDir>.
2163
2164 Example:
2165
2166 # Don't index paths that contain private or hidden
2167 FileRules pathname contains (private|hidden)
2168
2169 # Same thing
2170 FileRules pathname regex /(private|hidden)/
2171
2172 # Don't index exe files
2173 FileRules pathname contains \.exe$
2174
2175 B<dirname> and B<filename> split the path name by the last delimiter
2176 character into a directory name, and a file name. Then these are compared
2177 against the patterns supplied. Directory names do B<not> have a trailing
2178 slash. All path names use the forward slash as a delimiter within Swish-e.
2179
2180 Example:
2181
2182 # Same as last example - don't index *.exe files.
2183 FileRules filename contains \.exe$
2184
2185 # Don't index any file called test.html
2186 FileRules filename contains ^test\.html$
2187
2188 # Same thing
2189 FileRules filename is test\.html
2190
2191 # Don't index any directories that contain "old" (/usr/local/myold/docs)
2192 FileRules dirname contains old
2193
2194 # Don't index any directories that contain the path segment "old" (/usr/local/old/foo)
2195 FileRules dirname contains /old/
2196
2197 # Index only .htm, .html, plus any all-digit file names
2198 IndexOnly .htm .html
2199 FileMatch filename contains ^\d+$
2200
2201 # Same as previous, but maybe a little slower
2202 FileRules filename regex not !\.(htm|html)$!
2203 FileMatch filename contains ^\d+$
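
The C<dirname> and C<filename> split described above can be pictured in Python as a split on the last forward slash. This is a sketch under the stated rule that directory names carry no trailing slash; the paths and the helper name are made up for illustration.

```python
import re

def split_path(pathname):
    """Split on the last "/" into (dirname, filename); dirname has no trailing slash."""
    dirname, _, filename = pathname.rpartition("/")
    return dirname, filename

dirname, filename = split_path("/usr/local/old/foo/report.html")
print(dirname)   # /usr/local/old/foo
print(filename)  # report.html

# "contains old" matches any dirname with "old" in it ...
print(bool(re.search(r"old", "/usr/local/myold/docs")))  # True
# ... while "/old/" needs the segment followed by a slash.
print(bool(re.search(r"/old/", dirname)))                # True
print(bool(re.search(r"/old/", "/usr/local/old")))       # False
```

The last line shows a consequence of the missing trailing slash: in this sketch a pattern such as C</old/> does not match files that sit directly in a directory named F<old>.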
2204
2205 Swish-e checks these settings in the order of C<pathname>, C<dirname>, and
2206 C<filename>, and C<FileMatch> patterns are checked before C<FileRules>,
2207 in general. This allows you to exclude most files with C<FileRules>,
2208 yet allow in a few special cases with C<FileMatch>. For example:
2209
2210 # Exclude all files of .exe, .bin, and .bat
2211 FileRules filename contains \.(exe|bin|bat)$
2212 # But, let these two in
2213 FileMatch filename is baseball\.bat incoming_mail\.bin
2214
2215 # Same, but as a single pattern
2216 FileMatch filename is (baseball\.bat|incoming_mail\.bin)
2217
2218 The C<directory> type is somewhat unique. When Swish-e recurses into a
2219 directory it will compare all the I<files> in the directory with the
2220 pattern and then decide if that entire directory should or should not
2221 be indexed (or recursed). Note that you are matching against file names
2222 in a directory -- and some of those names may be directory names.
2223
2224 A C<FileRules directory> match will cause Swish-e to ignore all files and
2225 sub-directories in the current directory.
2226
2227 Warning: A match with C<FileMatch directory> says to index B<everything>
2228 in the *current* directory and B<ignore> any FileRules for this directory.
2229
2230
2231 Example:
2232
2233 # Don't index any directories (and sub directories) that contain
2234 # a file (or sub-directory) called "index.skip"
2235 FileRules directory contains ^index\.skip$
2236
2237 # Don't index directories that contain a .htaccess file.
2238 FileRules directory contains ^\.htaccess
2239
2240 Note: While I<processing> directories, Swish-e will ignore any files
2241 or directories that begin with a dot ("."). You may index files
2242 or directories that begin with a dot by specifying their name with
2243 C<IndexDir> or C<-i>.
2244
2245 C<title> checks for a pattern match in an HTML title.
2246
2247 Example:
2248
2249 FileRules title contains construction example pointers
2250
2251 # This example says to ignore case
2252 FileRules title regex "/^Internal document/i"
2253
2254 Note: C<FileRules title> works for any input method (fs, prog, or http)
2255 that is parsed as HTML, and where a title was found in the document.
2256
2257 In case all this seems a bit confusing, processing a directory happens
2258 in the following order.
2259
2260 First the directory name is checked:
2261
2262 FileRules dirname - reject entire directory if matches
2263
2264 Next the directory is scanned and each file name (which might be the
2265 name of a sub-directory) is checked:
2266
2267 FileRules directory - reject entire dir if any files match
2268 FileMatch directory - accept *entire* dir if any files match
2269
2270 Then, unless C<FileMatch directory> matched, each file is tested with
2271 FileMatch. A match says to index the file without further testing
2272 (i.e. overrides FileRules and IndexOnly):
2273
2274 FileMatch pathname \
2275 FileMatch dirname - file is accepted if any match
2276 FileMatch filename /
2277
2278 otherwise
2279
2280 IndexOnly - file is checked for the correct file extension
2281
2282 FileRules pathname \
2283 FileRules dirname - file is rejected if any match
2284 FileRules filename /
2285
2286 finally, the file is indexed.
2287
2288 Files (not directories) listed with C<IndexDir> or C<-i> are processed
2289 in a similar way:
2290
2291 FileMatch pathname \
2292 FileMatch dirname - file is accepted if any match
2293 FileMatch filename /
2294
2295 otherwise, the file is rejected if it doesn't have the correct extension
2296 or a FileRules matches.
2297
2298 IndexOnly - file is checked for the correct file extension
2299
2300 FileRules pathname \
2301 FileRules dirname - file is rejected if any match
2302 FileRules filename /
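
The per-file order above (FileMatch first, then IndexOnly, then FileRules) can be summarized in a few lines of Python. This is a simplified sketch: it leaves out the directory-level checks, uses Python's C<re> in place of regex.h, and all names and sample paths are hypothetical.

```python
import re

def should_index(pathname, file_match=(), file_rules=(), index_only=()):
    """Decide whether one file is indexed.
    file_match / file_rules are lists of (type, regex) pairs where
    type is "pathname", "dirname" or "filename"."""
    dirname, _, filename = pathname.rpartition("/")
    subject = {"pathname": pathname, "dirname": dirname, "filename": filename}
    # 1. Any FileMatch hit accepts the file without further testing.
    if any(re.search(rx, subject[kind]) for kind, rx in file_match):
        return True
    # 2. IndexOnly: the name must end in one of the listed suffixes.
    if index_only and not pathname.endswith(tuple(index_only)):
        return False
    # 3. Any FileRules hit rejects the file.
    if any(re.search(rx, subject[kind]) for kind, rx in file_rules):
        return False
    return True

rules = [("filename", r"\.(exe|bin|bat)$")]
match = [("filename", r"^(baseball\.bat|incoming_mail\.bin)$")]
print(should_index("/games/baseball.bat", match, rules))  # True
print(should_index("/games/setup.exe", match, rules))     # False
```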
2303
2304 Note: If things are not indexing as you expect, create a directory
2305 with some test files and use the C<-T regex> trace option to see how
2306 file names are checked. Start with very simple tests!
2307
2308
2309 =back
2310
2311 =head2 Directives for the HTTP Access Method Only
2312
2313 These directives are available when using the HTTP Access Method of indexing.
2314
2315 =over 4
2316
2317 =item MaxDepth *integer*
2318
2319 MaxDepth defines how many links the spider should follow before stopping.
2320 A value of 0 configures the spider to traverse all links. The default
2321 is MaxDepth 5.
2322
2323 MaxDepth 5
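
As a rough picture of a depth limit, the Python sketch below walks an in-memory link graph breadth-first. It assumes that depth means the number of links followed from the starting page and that 0 means no limit; how F<swishspider> counts depth exactly is not specified here, and the graph and function name are made up.

```python
from collections import deque

def crawl(links, start, max_depth):
    """Return pages reachable within max_depth links (0 = unlimited)."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth and depth >= max_depth:
            continue  # do not follow links from this page
        for nxt in links.get(url, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

links = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"]}
print(crawl(links, "/", 1))  # ['/', '/a']
print(crawl(links, "/", 0))  # ['/', '/a', '/b', '/c']
```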
2324
2325 =item Delay *seconds*
2326
2327 The number of seconds to wait between issuing requests to a server.
2328 This setting allows for more friendly spidering of remote sites.
2329 The default is 60 seconds.
2330
2331 Delay 1
2332
2333 =item TmpDir *path*
2334
2335 The location of a writable temp directory on your system. The HTTP
2336 access method tells the Perl helper to place its files in this location,
2337 and the C<-e> switch causes Swish-e to use this directory while indexing.
2338 There is no default.
2339
2340 TmpDir /tmp/swish
2341
2342 If this directory does not exist or is not writable Swish-e will fail
2343 with an error during indexing.
2344
2345 Note, the environment variables of C<TMPDIR>, C<TMP>, and C<TEMP>
2346 (in that order) will B<override> this setting.
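
The override order can be expressed as a simple lookup. The following Python sketch checks the three environment variables in the order given above and falls back to the configured C<TmpDir>; treating an empty variable as unset is an assumption of the sketch, and the function name is hypothetical.

```python
import os

def effective_tmpdir(config_tmpdir, env=None):
    """Return the temp directory after applying environment overrides."""
    env = os.environ if env is None else env
    for name in ("TMPDIR", "TMP", "TEMP"):
        if env.get(name):  # first non-empty variable wins
            return env[name]
    return config_tmpdir

print(effective_tmpdir("/tmp/swish", {}))                                  # /tmp/swish
print(effective_tmpdir("/tmp/swish", {"TMP": "/var/tmp", "TEMP": "/x"}))  # /var/tmp
```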
2347
2348 =item SpiderDirectory *path*
2349
2350 The location of the Perl helper script called F<swishspider>. If you
2351 use a relative directory, it is relative to your directory when you run
2352 Swish-e, not to the directory that Swish-e is in. The default is C<./>
2353
2354 SpiderDirectory /usr/local/swish
2355
2356 =item EquivalentServer *server alias*
2357
2358 The same site may often be referred to by different names.
2359 A common example is that http://www.some-server.com and
2360 http://some-server.com are the same. Each line should have a list of
2361 all the method/names that should be considered equivalent. Multiple
2362 EquivalentServer directives may be used. Each directive defines its
2363 own set of equivalent servers.
2364
2365 EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
2366 EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
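
One way to think about equivalent servers is as a table that rewrites every alias to a single name, so that the same page is not indexed twice. The Python sketch below uses the two directives above; choosing the first name in each directive as the canonical one is an assumption of the sketch, not documented behavior, and the helper names are hypothetical.

```python
def build_alias_map(directives):
    """Map every server in a directive to the first server listed."""
    alias_map = {}
    for servers in directives:
        for server in servers:
            alias_map[server] = servers[0]
    return alias_map

def canonical_url(url, alias_map):
    """Rewrite the scheme://host[:port] prefix if it is a known alias."""
    for alias, canonical in alias_map.items():
        if url == alias or url.startswith(alias + "/"):
            return canonical + url[len(alias):]
    return url

alias_map = build_alias_map([
    ["http://library.berkeley.edu", "http://www.lib.berkeley.edu"],
    ["http://sunsite.berkeley.edu:2000", "http://sunsite.berkeley.edu"],
])
print(canonical_url("http://www.lib.berkeley.edu/help.html", alias_map))
# http://library.berkeley.edu/help.html
```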
2367
2368 =back
2369
2370 =head2 Directives for the prog Access Method Only
2371
2372 This section details the directives that are only available for the
2373 "prog" document source feature of Swish-e. The "prog" access method runs
2374 an external program that "feeds" documents to Swish-e. This allows indexing
2375 and filtering of documents from any source.
2376
2377 See L<prog - general purpose access method|SWISH-RUN/"item_prog"> in
2378 the SWISH-RUN man page for more information.
2379
2380
2381 A number of example programs for use with the "prog" access method are
2382 provided in the F<prog-bin> directory. Please see those examples if you
2383 have questions about implementing a "prog" input program.
2384
2385 =over 4
2386
2387 =item SwishProgParameters *list of parameters*
2388
2389 This is a list of parameters that will be sent to the external program
2390 when running with the "prog" document source method.
2391
2392 SwishProgParameters /path/to/config hello there
2393 IndexDir /path/to/program.pl
2394
2395 Then running:
2396
2397 swish-e -c config -S prog
2398
2399 Swish-e will execute C</path/to/program.pl> and pass C</path/to/config
2400 hello there> as three command line arguments to the program. This
2401 directive makes it easy to pass settings from the Swish-e configuration
2402 file to the external program.
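
In other words, the command that gets run is the C<IndexDir> program followed by the C<SwishProgParameters> values. A minimal Python sketch of that assembly, assuming plain whitespace splitting (quoted parameters are not handled) and a hypothetical helper name:

```python
def build_command(index_dir, swish_prog_parameters):
    """Program name first, then each parameter as its own argument."""
    return [index_dir] + swish_prog_parameters.split()

command = build_command("/path/to/program.pl", "/path/to/config hello there")
print(command)
# ['/path/to/program.pl', '/path/to/config', 'hello', 'there']
print(len(command) - 1)  # 3
```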
2403
2404 For example, the C<spider.pl> program (included in the C<prog-bin>
2405 directory) uses the C<SwishProgParameters> to specify what file to read
2406 for configuration information.
2407
2408 SwishProgParameters spider.config
2409 IndexDir ./spider.pl
2410
2411 The C<spider.pl> program also has a default action so you can avoid
2412 using a configuration file:
2413
2414 SwishProgParameters default http://www.swishe.org/ http://some.other.site/
2415 IndexDir ./spider.pl
2416
2417 And the spider program will use default settings for spidering those sites.
2418
2419 =back
2420
2421 B<Notes when using MS Windows>
2422
2423 You should use unix style path separators to specify your external
2424 program. Swish will convert forward slashes to backslashes before
2425 calling the external program. This is only true for the program name
2426 specified with C<IndexDir> or the C<-i> command line option.
2427
2428 In addition, Swish-e will make sure the program specified actually exists,
2429 which means you need to use the full name of the program.
2430
2431 For example, to run the perl spider program F<spider.pl> you would need
2432 a Swish-e configuration file such as:
2433
2434 IndexDir e:/perl/bin/perl.exe
2435 SwishProgParameters prog-bin/spider.pl default http://swish-e.org
2436
2437 and run indexing with the command:
2438
2439 swish-e -c swish.cfg -S prog -v 9
2440
2441 The C<IndexDir> command tells Swish-e the name of the program to run.
2442 Under unix you can just specify the name of the script, since unix will
2443 figure out the program from the first line of the script.
2444
2445 The C<SwishProgParameters> are the parameters passed to the program
2446 specified by C<IndexDir> (perl.exe in this case). The first parameter
2447 is the perl script to run (F<prog-bin/spider.pl>). Perl passes the rest
2448 of the parameters directly to the perl script. The second parameter
2449 F<default> tells the F<spider.pl> program to use default settings for
2450 spidering (or you could specify a spider config file -- see C<perldoc
2451 spider.pl> for details), and lastly, the URL is passed into the spider
2452 program.
2453
2454
2455 =head2 Document Filter Directives
2456
2457 Internally, Swish-e knows how to parse only text, HTML, and XML documents.
2458 With Swish-e filters you can index other types of documents. For example,
2459 if all your web pages are in gzip format a filter can uncompress these
2460 on the fly for indexing.
2461
2462 A filter is an external program that Swish-e executes while processing
2463 a document of a given type. Swish-e will execute the filter program
2464 for each file that matches the file suffix (extension) set in the
2465 B<FileFilter> or B<FileFilterMatch> directives. B<FileFilterMatch>
2466 matches using regular expressions and is described below.
2467
2468 Swish-e calls the external program passing as B<default> arguments:
2469
2470 =over 4
2471
2472 =item $0
2473
2474 the name of the filter program
2475
2476 =item $1
2477
2478 the physical path name of the file to read. This may be a temporary
2479 file location if indexing by the http method.
2480
2481 =item $2
2482
2483 When indexing under the file system this will be the same as $1 (the
2484 path to the source file), but when indexing under the http method this
2485 will be the URL of the source document.
2486
2487 =back
2488
2489 Swish-e can also pass other parameters to the filter program. These
2490 parameters can be defined using the B<FileFilter> or B<FileFilterMatch>
2491 directives. See Filter Options below.
2492
2493 The filter program must open the file, process its contents, and return
2494 it to Swish-e by printing to STDOUT.
2495
2496 Note that this can add a significant amount of time to the indexing
2497 process if your external program is a perl or shell script. If you
2498 have many files to filter you should consider writing your filter in C
2499 instead of a shell or perl script, or using the "prog" Access Method.
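
To make the contract concrete, here is a minimal sketch of a filter written in Python that uncompresses a gzip file. It only relies on what is described above: the first argument is the path of the file to read, and the result goes to STDOUT. The function names are hypothetical, and as noted above a script-based filter pays a startup cost for every file.

```python
import gzip
import sys

def filter_gzip(work_path):
    """Read the gzip-compressed work file and return its contents."""
    with gzip.open(work_path, "rb") as handle:
        return handle.read()

def main(argv):
    """argv[1] is the work file path; argv[2] (the document path) is unused."""
    if len(argv) < 2:
        print("usage: filter <work-path> [<document-path>]", file=sys.stderr)
        return 1
    sys.stdout.buffer.write(filter_gzip(argv[1]))
    return 0
```

A real filter script would end by calling C<sys.exit(main(sys.argv))>.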
2500
2501 =over 4
2502
2503 =item FilterDir *path-to-directory*
2504
2505 This is the path to a directory where the filter programs are stored.
2506 Swish-e looks in this directory to find the filter specified in the
2507 B<FileFilter> directive. If this directive is omitted, you have to
2508 specify the full path to the filter script on each FileFilter directive.
2509
2510 This feature does *not* apply to the C<FileFilterMatch> directive.
2511
2512 Example:
2513
2514 FilterDir /usr/local/swish/filters
2515
2516 =item FileFilter *suffix* "filter-prog" ["filter-options"]
2517
2518 This maps a file suffix (extension) to a filter program. If I<filter-prog>
2519 starts with a directory delimiter (absolute path), Swish-e doesn't use
2520 the FilterDir settings, but uses the given I<filter-prog> path directly.
2521
2522 Filter options:
2523
2524 Filter options are a string passed as arguments to the I<filter-prog>.
2525 Filter options can contain variables, replaced by Swish-e. If you omit
2526 I<filter-options> Swish-e will use default parameters for the options
2527 listed above.
2528
2529 Default: "'%p' '%P'"
2530 Which means: pass "workfile path" and "documentfile path" to filter (each quoted).
2531
2532 Variables in filter options:
2533
2534 %% = %
2535 %P = Full document pathname (e.g. URL, or path on filesystem)
2536 %p = Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
2537 %F = Filename stripped from full document pathname
2538 %f = Filename stripped from "work" pathname
2539 %D = Directoryname stripped from full document pathname
2540 %d = Directoryname stripped from full "work" pathname
2541
2542 Examples of strings passed:
2543
2544 %P = document pathname: http://myserver/path1/mydoc.txt
2545 %p = work pathname: /tmp/tmp.1234.mydoc.txt
2546 %F = mydoc.txt
2547 %f = tmp.1234.mydoc.txt
2548 %D = http://myserver/path1
2549 %d = /tmp
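
The substitution can be reproduced with a short Python sketch, using the example paths above. It assumes names are split at the last forward slash and that unknown C<%x> sequences are left unchanged; neither detail is confirmed here, and the function name is hypothetical.

```python
import re

def expand_filter_options(options, doc_path, work_path):
    """Replace %P %p %F %f %D %d and %% in a filter option string."""
    doc_dir, _, doc_file = doc_path.rpartition("/")
    work_dir, _, work_file = work_path.rpartition("/")
    values = {
        "%": "%",
        "P": doc_path, "p": work_path,
        "F": doc_file, "f": work_file,
        "D": doc_dir, "d": work_dir,
    }
    return re.sub(r"%(.)", lambda m: values.get(m.group(1), m.group(0)), options)

doc = "http://myserver/path1/mydoc.txt"
work = "/tmp/tmp.1234.mydoc.txt"
print(expand_filter_options("'%p' '%P'", doc, work))
# '/tmp/tmp.1234.mydoc.txt' 'http://myserver/path1/mydoc.txt'
```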
2550
2551 Important hint for security:
2552
2553 When using variable substitution, use quotes to ensure filename integrity.
2554
2555 e.g. "'%f'" --> 'file name with spaces.doc'.
2556
2557 If you don't use this, your system security may be compromised, or
2558 filtering may not work for these files.
2559
2560 B<Notes when using MS Windows>
2561
2562 Windows uses double quotes to escape shell metacharacters, so reverse
2563 the quotes in the examples above. e.g.:
2564
2565 '"%f"' --> "file name with spaces.doc"
2566
2567 You can specify the filter program using forward slashes (unix style).
2568 Swish will convert the slashes to backslashes before running your program.
2569
2570 FileFilter .mydoc c:/some/path/mydocfilter.exe '-d "%d" -example -url "%P" "%f"'
2571
2572
2573 Examples of filters:
2574
2575 FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
2576 FileFilter .pdf pdftotext "'%p' -"
2577 FileFilter .html.gz gzip "-c '%p'"
2578 FileFilter .mydoc "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"
2579
2580 The above examples are running a I<binary> filter program. For more
2581 complicated filtering needs you may use a scripting language such as
2582 Perl or a shell script. Here are some examples of calling a shell and
2583 perl script:
2584
2585 FileFilter .pdf pdf2html.sh
2586 FileFilter .ps ghostscript-filter.pl
2587
2588 Using a scripting language (or any language that has a large startup
2589 cost) can B<greatly increase the indexing time>. For small indexing
2590 jobs, this may not be an issue, but for large collections of files that
2591 require processing by a scripting language, you may be better off using
2592 the C<-S prog> access method where the script will only be compiled once,
2593 instead of for each document.
2594
2595 Filters are probably easier to write than a C<-S prog> program. Which you
2596 decide to use depends on your requirements. Examples of filter scripts
2597 can be found in the F<filter-bin> directory, and examples of C<-S prog>
2598 programs can be found in the F<prog-bin> directory.
2599
2600 =item FileFilterMatch *filter-prog* *filter-options* *regex* [*regex* ...]
2601
2602 This is similar to C<FileFilter> except that it uses regular expressions to
2603 match against the file name. *filter-prog* is the path to the program.
2604 Unlike C<FileFilter> this does B<not> use the C<FilterDir> option.
2605 Also unlike C<FileFilter> you B<must> specify the *filter-options*.
2606
2607 Examples:
2608
2609 FileFilterMatch ./pdftotext "'%p' -" /\.pdf$/
2610
2611 Note that this will also match a file called ".pdf", so you may want to use
2612 something that requires a filename that has more than just an extension.
2613 For example:
2614
2615 FileFilterMatch ./pdftotext "'%p' -" /.\.pdf$/
2616
2617 To specify more than one extension:
2618
2619 FileFilterMatch ./check_title.pl "%p" /\.html$/ /\.htm$/
2620
2621 Or a few ways to do the same thing:
2622
2623 FileFilterMatch ./check_title.pl %p /\.(html|htm)$/
2624 FileFilterMatch ./check_title.pl %p /\.html?$/
2625
2626 And to ignore case:
2627
2628 FileFilterMatch ./check_title.pl %p /\.html?$/i
2629
2630 You may also precede an expression with "not" to negate the regular expressions
2631 that follow. For example, to match files that do not have an extension:
2632
2633 FileFilterMatch ./convert "%p %P" not /\..+$/
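
The behavior of the patterns in these examples can be checked with a few lines of Python. This is only an approximation, since Python's C<re> module is used here in place of the C regex.h library, and the file names and the C<matches> helper are made up.

```python
import re

pdf_loose = re.compile(r"\.pdf$")              # /\.pdf$/
pdf_strict = re.compile(r".\.pdf$")            # /.\.pdf$/
html = re.compile(r"\.html?$", re.IGNORECASE)  # /\.html?$/i
has_ext = re.compile(r"\..+$")                 # /\..+$/

def matches(regex, name, negate=False):
    """True if the name matches, or does not match when negate is set ("not")."""
    return (regex.search(name) is not None) != negate

print(matches(pdf_loose, ".pdf"))               # True
print(matches(pdf_strict, ".pdf"))              # False
print(matches(html, "INDEX.HTM"))               # True
print(matches(has_ext, "README", negate=True))  # True
```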
2634
2635 =back
2636
2637 =head1 Document Info
2638
2639 $Id: SWISH-CONFIG.pod,v 1.60 2002/08/28 14:30:23 whmoseley Exp $
2640
2641 .
