=head1 NAME

SWISH-CONFIG - Configuration File Directives

=head1 Swish-e CONFIGURATION FILE

A configuration file controls which files Swish-e indexes, how they are
indexed, and where the index is written.

The configuration file is a text file composed of comments, blank
lines, and B<configuration directives>. The order of the directives
is not important. Some directives may be used more than once in the
configuration file, while others can only be used once (that is, a later
occurrence overwrites an earlier one). The case of the directive name
is not important -- you may use upper, lower, or mixed case.

A comment is any line that begins with a "#".

    # This is a comment

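The file layout described above (comments, blank lines, directive names
in any case, directives that may repeat) can be sketched in Python. This is
only an illustrative model, not Swish-e's own parser: C<parse_config> is a
hypothetical helper, and treating indented "#" lines as comments is an
assumption.

```python
def parse_config(text):
    """Collect directives from configuration text.

    Returns a dict that maps the lower-cased directive name to a list of
    raw argument strings, one entry per occurrence in the file.
    """
    directives = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (assumption: leading
        # whitespace before "#" is allowed).
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        name = parts[0].lower()  # directive names are case-insensitive
        args = parts[1] if len(parts) > 1 else ""
        directives.setdefault(name, []).append(args)
    return directives
```

The sketch keeps every occurrence of a directive. Whether a repeated
directive accumulates or overwrites depends on the directive, so a caller
would take the last entry for the single-use ones.
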
Directives may take more than one parameter. Enclose single parameters
that include whitespace in quotes (single or double). Inside quotes
the backslash escapes the next character.

    ReplaceRules append "foo bar"   <- define "foo bar" as a single parameter

If you need to include a quote character in the value either use a
backslash to escape it, or enclose it in quotes of the other type.

For example, under unix you can use quotes to include white space in a
single parameter. Here, to protect against path names (%p) that might
have white space embedded use single quotes (this also protects against
shell expansion or metacharacters):

    FileFilter .foo foofilter "'%p'"    <- parameter passed through the shell in single quotes
    FileFilter .foo foofilter '"%p"'    <- windows uses double-quotes
    FileFilter .foo foofilter '\'%p\''  <- silly example

Backslashes also have special meaning in regular expressions.

    FileFilterMatch pdftotext "'%p' -" /\.pdf$/

This says that the dot is a real dot (instead of matching any character).
If you place the regular expression in quotes then you must use
double-backslashes.

    FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"

Swish-e will convert the double backslash into a single backslash before
passing the parameter to the regular expression compiler.

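The quoting rules above can be modeled with a small Python sketch. It is
an approximation written from the description in this section, not
Swish-e's actual code, and C<split_directive> is a hypothetical name. It
assumes that a backslash is special only inside quotes.

```python
def split_directive(line):
    """Split one directive line into its parameters.

    Whitespace separates parameters, single or double quotes group a
    parameter, and inside quotes a backslash escapes the next character.
    """
    params, current, quote, in_param = [], [], None, False
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])  # keep the escaped character
                i += 1
            elif ch == quote:
                quote = None                 # closing quote
            else:
                current.append(ch)
        elif ch in "\"'":
            quote, in_param = ch, True       # opening quote
        elif ch.isspace():
            if in_param:                     # whitespace ends a parameter
                params.append("".join(current))
                current, in_param = [], False
        else:
            current.append(ch)               # unquoted backslash stays literal
            in_param = True
        i += 1
    if in_param:
        params.append("".join(current))
    return params
```

With this model the quoted and unquoted C<FileFilterMatch> lines shown
above both yield the same final parameter, C</\.pdf$/>.
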
Commented example configuration files are included in the F<conf>
directory of the Swish-e distribution.

Some command line arguments can override directives specified in the
configuration file. Please see also the L<SWISH-RUN|SWISH-RUN> page for
instructions on running Swish-e, and the L<SWISH-SEARCH|SWISH-SEARCH>
page for information and examples on how to search your index.

The configuration file is specified to Swish-e by the C<-c> switch.
For example,

    swish-e -c myconfig.conf

You may also split your directives up into different configuration files.
This allows you to have a master configuration file used for many
different indexes, and smaller configuration files for each separate
index. You can specify the different configuration files when running
from the command line with the C<-c> switch (see L<SWISH-RUN|SWISH-RUN>),
or you may include other configuration files with the B<IncludeConfigFile>
directive below.

Typically, in a configuration file the directives are grouped together in
some logical order -- that is, directives that control the source of the
documents would be grouped together first, and directives that control
how each document is filtered or its words indexed in another group of
directives. (The directives listed below are grouped in this order.)

The configuration file directives are listed below in these groups:

=over 4

=item *

L<Administrative Headers Directives|/"Administrative Headers Directives">
-- You may add administrative information to the header of the index file.

=item *

L<Document Source Directives|/"Document Source Directives"> -- Directives
for selecting the source documents and the location of the index file.

=item *

L<Document Contents Directives|/"Document Contents Directives"> --
Directives that control how a document's content is indexed.

=item *

L<Directives for the File Access method only|/"Directives for the File
Access method only"> -- These directives are only applicable to the File
Access indexing method.

=item *

L<Directives for the HTTP Access Method Only|/"Directives for the HTTP
Access Method Only"> -- Likewise, these only apply to the HTTP Access
method.

=item *

L<Directives for the prog Access Method Only|/"Directives for the prog
Access Method Only"> -- These only apply to the prog Access method.

=item *

L<Document Filter Directives|/"Document Filter Directives"> -- This is
a special section that describes using document filters with Swish-e.

=back

=head2 Alphabetical Listing of Directives

=over 4

=item *

L<AbsoluteLinks|/"item_AbsoluteLinks"> [yes|NO]

=item *

L<BeginCharacters|/"item_BeginCharacters"> *string of characters*

=item *

L<BumpPositionCounterCharacters|/"item_BumpPositionCounterCharacters"> *string*

=item *

L<Buzzwords|/"item_Buzzwords"> [*list of buzzwords*|File: path]

=item *

L<ConvertHTMLEntities|/"item_ConvertHTMLEntities"> [YES|no]

=item *

L<DefaultContents|/"item_DefaultContents"> [TXT|HTML|XML|WML]

=item *

L<Delay|/"item_Delay"> *seconds*

=item *

L<DontBumpPositionOnEndTags|/"item_DontBumpPositionOnEndTags"> *list of names*

=item *

L<DontBumpPositionOnStartTags|/"item_DontBumpPositionOnStartTags"> *list of names*

=item *

L<EnableAltSearchSyntax|/"item_EnableAltSearchSyntax"> [yes|NO]

=item *

L<EndCharacters|/"item_EndCharacters"> *string of characters*

=item *

L<EquivalentServer|/"item_EquivalentServer"> *server alias*

=item *

L<ExtractPath|/"item_ExtractPath"> *metaname* [replace|remove|prepend|append|regex]

=item *

L<FileFilter|/"item_FileFilter"> *suffix* *program* [options]

=item *

L<FileFilterMatch|/"item_FileFilterMatch"> *program* *options* *regex* [*regex* ...]

=item *

L<FileInfoCompression|/"item_FileInfoCompression"> [yes|NO]

=item *

L<FileMatch|/"item_FileMatch"> [contains|is|regex] *regular expression*

=item *

L<FileRules|/"item_FileRules"> [contains|is|regex] *regular expression*

=item *

L<FuzzyIndexingMode|/"item_FuzzyIndexingMode"> [NONE|Stemming|Soundex|Metaphone|DoubleMetaphone]

=item *

L<FollowSymLinks|/"item_FollowSymLinks"> [yes|NO]

=item *

L<HTMLLinksMetaName|/"item_HTMLLinksMetaName"> *metaname*

=item *

L<IgnoreFirstChar|/"item_IgnoreFirstChar"> *string of characters*

=item *

L<IgnoreLastChar|/"item_IgnoreLastChar"> *string of characters*

=item *

L<IgnoreLimit|/"item_IgnoreLimit"> *integer integer*

=item *

L<IgnoreMetaTags|/"item_IgnoreMetaTags"> *list of names*

=item *

L<IgnoreNumberChars|/"item_IgnoreNumberChars"> *list of characters*

=item *

L<IgnoreTotalWordCountWhenRanking|/"item_IgnoreTotalWordCountWhenRanking"> [YES|no]

=item *

L<IgnoreWords|/"item_IgnoreWords"> [*list of stop words*|File: path]

=item *

L<ImageLinksMetaName|/"item_ImageLinksMetaName"> *metaname*

=item *

L<IncludeConfigFile|/"item_IncludeConfigFile"> *path to config file*

=item *

L<IndexAdmin|/"item_IndexAdmin"> *text*

=item *

L<IndexAltTagMetaName|/"item_IndexAltTagMetaName"> *tagname*|as-text

=item *

L<IndexComments|/"item_IndexComments"> [YES|no]

=item *

L<IndexContents|/"item_IndexContents"> [TXT|HTML|XML|WML|TXT2|HTML2|XML2] *file extensions*

=item *

L<IndexDescription|/"item_IndexDescription"> *text*

=item *

L<IndexDir|/"item_IndexDir"> [URL|directories or files]

=item *

L<IndexFile|/"item_IndexFile"> *path*

=item *

L<IndexName|/"item_IndexName"> *text*

=item *

L<IndexOnly|/"item_IndexOnly"> *list of file suffixes*

=item *

L<IndexPointer|/"item_IndexPointer"> *text*

=item *

L<IndexReport|/"item_IndexReport"> [0|1|2|3]

=item *

L<MaxDepth|/"item_MaxDepth"> *integer*

=item *

L<MaxWordLimit|/"item_MaxWordLimit"> *integer*

=item *

L<MetaNameAlias|/"item_MetaNameAlias"> *meta name* *list of aliases*

=item *

L<MetaNames|/"item_MetaNames"> *list of names*

=item *

L<MinWordLimit|/"item_MinWordLimit"> *integer*

=item *

L<NoContents|/"item_NoContents"> *list of file suffixes*

=item *

L<obeyRobotsNoIndex|/"item_obeyRobotsNoIndex"> [yes|NO]

=item *

L<ParserWarnLevel|/"item_ParserWarnLevel"> [0|1|2|3]

=item *

L<PreSortedIndex|/"item_PreSortedIndex"> *list of property names*

=item *

L<PropCompressionLevel|/"item_PropCompressionLevel"> [0-9]

=item *

L<PropertyNameAlias|/"item_PropertyNameAlias"> *property name* *list of aliases*

=item *

L<PropertyNames|/"item_PropertyNames"> *list of meta names*

=item *

L<PropertyNamesCompareCase|/"item_PropertyNamesCompareCase"> *list of meta names*

=item *

L<PropertyNamesIgnoreCase|/"item_PropertyNamesIgnoreCase"> *list of meta names*

=item *

L<PropertyNamesDate|/"item_PropertyNamesDate"> *list of meta names*

=item *

L<PropertyNamesNumeric|/"item_PropertyNamesNumeric"> *list of meta names*

=item *

L<PropertyNamesMaxLength|/"item_PropertyNamesMaxLength"> integer *list of meta names*

=item *

L<ReplaceRules|/"item_ReplaceRules"> [replace|remove|prepend|append|regex]

=item *

L<ResultExtFormatName|/"item_ResultExtFormatName"> name -x format string

=item *

L<SpiderDirectory|/"item_SpiderDirectory"> *path*

=item *

L<StoreDescription|/"item_StoreDescription"> [XML <tag>|HTML <meta>|TXT size]

=item *

L<SwishProgParameters|/"item_SwishProgParameters"> *list of parameters*

=item *

L<SwishSearchDefaultRule|/"item_SwishSearchDefaultRule"> [<AND-WORD>|<or-word>]

=item *

L<SwishSearchOperators|/"item_SwishSearchOperators"> <and-word> <or-word> <not-word>

=item *

L<TmpDir|/"item_TmpDir"> *path*

=item *

L<TranslateCharacters|/"item_TranslateCharacters"> [*string1 string2*|:ascii7:]

=item *

L<TruncateDocSize|/"item_TruncateDocSize"> *number of characters*

=item *

L<UndefinedMetaTags|/"item_UndefinedMetaTags"> [error|ignore|INDEX|auto]

=item *

L<UndefinedXMLAttributes|/"item_UndefinedXMLAttributes"> [DISABLE|error|ignore|index|auto]

=item *

L<UseStemming|/"item_UseStemming"> [yes|NO]

=item *

L<UseSoundex|/"item_UseSoundex"> [yes|NO]

=item *

L<UseWords|/"item_UseWords"> [*list of words*|File: path]

=item *

L<WordCharacters|/"item_WordCharacters"> *string of characters*

=item *

L<XMLClassAttributes|/"item_XMLClassAttributes"> *list of XML attribute names*

=back

=head2 Directives that Control Swish

These configuration directives control the general behavior of Swish-e.

=over 4

=item IncludeConfigFile *path to config file*

This directive can be used to include configuration directives located
in another file.

    IncludeConfigFile /usr/local/swish/conf/site_config.config

=item IndexReport [0|1|2|3]

This sets how detailed the reporting is while indexing. You can specify
a number from 0 to 3: 0 is totally silent, 3 is the most verbose. The
default is 1.

This may be overridden from the command line via the C<-v> switch (see
L<SWISH-RUN|SWISH-RUN>).

=item ParserWarnLevel [0|1|2|3]

Sets the error level when using the libxml2 parser for XML and HTML.
libxml2 will point out structural errors in your documents.

    0 = no report
    1 = fatal errors
    2 = errors
    3 = warnings

The exception is that UTF-8 to Latin-1 conversion errors are reported at
level 1. This is because words may be indexed incorrectly in these cases.

Note that unlike other errors generated by Swish-e, these errors are
sent to stderr.

=item IndexFile *path*

IndexFile specifies the location of the generated index file. If not
specified, Swish-e will create the file F<index.swish-e> in the current
directory.

    IndexFile /usr/local/swish/site.index

=item obeyRobotsNoIndex [yes|NO]

When enabled, Swish-e will not index any HTML file that contains:

    <meta name="robots" content="noindex">

The default is to ignore these meta tags and index the document.
This tag is described at http://www.robotstxt.org/wc/exclusion.html.

Note: This feature is only available with the libxml2 HTML parser.

Also, if you are using the libxml2 parser (HTML2 and XML2) then you can
use the following comments in your documents to prevent indexing:

    <!-- SwishCommand noindex -->
    <!-- SwishCommand index -->

and/or these may be used also:

    <!-- noindex -->
    <!-- index -->

For example, these are very helpful to prevent indexing of common
headers, footers, and menus.

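The effect of the comment pairs can be pictured with a short Python
sketch that drops everything between a C<noindex> comment and the next
C<index> comment. This is an illustration only, not how the libxml2-based
parser is implemented. C<indexable_text> is a hypothetical name, and two
behaviors are assumptions: whitespace inside the comment is flexible, and
an unclosed C<noindex> runs to the end of the document.

```python
import re

# Matches from a "noindex" comment up to the next "index" comment, or to
# the end of the document when no "index" comment follows (assumption).
NOINDEX_SPAN = re.compile(
    r"<!--\s*(?:SwishCommand\s+)?noindex\s*-->"
    r".*?"
    r"(?:<!--\s*(?:SwishCommand\s+)?index\s*-->|\Z)",
    re.DOTALL,
)


def indexable_text(html):
    """Return the document with noindex ... index spans removed."""
    return NOINDEX_SPAN.sub("", html)
```

For a page whose menu and footer are each wrapped in such a pair, only
the body markup between them remains.
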
=back

B<NOTE>: The following items are currently not available. These items
require Swish-e to parse the configuration file while searching.

=over 4

=item EnableAltSearchSyntax [yes|NO]

B<NOTE>: This item is currently not available.

Enable alternate search syntax. Allows the usage of a basic
"Altavista(c)", "Lycos(c)", etc. like search syntax. This means a search
query can contain "+" and "-" as syntax parameters.

Example:

    swish-e -w "+word1 +word2 -word3 word4 word5"
    "+" = following word has to be in all found documents
    "-" = following word may not be in any document found
    " " = following word will be searched in documents

=item SwishSearchOperators <and-word> <or-word> <not-word>

B<NOTE>: This item is currently not available.

Using this config directive you can change the boolean search operators of
Swish-e, e.g. to adapt these to your language.
The default is: AND OR NOT

Example (German):

    SwishSearchOperators UND ODER NICHT

=item SwishSearchDefaultRule [<AND-WORD>|<or-word>]

B<NOTE>: This item is currently not available.

C<SwishSearchDefaultRule> defines the default Boolean operator to use
if none is specified between words or phrases. The default is C<AND>.

The word you specify must match one of the available
C<SwishSearchOperators>.

Example:

    SwishSearchOperators UND ODER NICHT
    # Make it act like a web search engine
    SwishSearchDefaultRule ODER

=item ResultExtFormatName name -x format string

B<NOTE>: This item is currently not available.

The output of Swish-e can be defined by specifying a format string with
the C<-x> command line argument. Using C<ResultExtFormatName> you can
assign a predefined format string to a name.

Examples:

    ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"

Then when searching you can specify the format string's name:

    swish-e ... -x moreinfo ...

See the C<-x> switch in L<SWISH-RUN|SWISH-RUN> for more information
about output formats.

=back

=head2 Administrative Headers Directives

Swish-e stores configuration information in the header of the index file.
This information can be retrieved while searching or by functions in
the Swish-e C library. There are a number of fields available for your
own use. None of these fields are required:

=over 4

=item IndexName *text*

=item IndexDescription *text*

=item IndexPointer *text*

=item IndexAdmin *text*

These variables specify information that goes into index files to help
users and administrators. IndexName should be the name of your index,
like a book title. IndexDescription is a short description of the index
or a URL pointing to a fuller description. IndexPointer should be
a pointer to the original information, most likely a URL. IndexAdmin
should be the name of the index maintainer and can include name and email
information. These values should not be more than 70 or so characters
and should be contained in quotes. Note that the automatically generated
date in index files is in D/M/Y and 24-hour format.

Examples:

    IndexName "Linux Documentation"
    IndexDescription "This is an index of /usr/doc on our Linux machine."
    IndexPointer http://localhost/swish/linux/index.html
    IndexAdmin webmaster

=back

=head2 Document Source Directives

These directives control I<what> documents are indexed and I<how>
they are accessed. See also L<Directives for the File Access method
only|/"Directives for the File Access method only"> and L<Directives for
the HTTP Access Method Only|/"Directives for the HTTP Access Method Only">
for directives that are specific to those access methods.

=over 4

=item IndexDir [directories or files|URL|external program]

IndexDir defines the source of the documents for Swish-e. Swish-e
currently supports three file access methods: B<File system>, B<HTTP>
(also called B<spidering>), and B<prog> for reading files from an
external program.

The C<-S> command line argument is used to select the file access method.

    swish-e -c swish.config -S fs   - file system
    swish-e -c swish.config -S http - internal http spider
    swish-e -c swish.config -S prog - external program of any type

For the B<fs> method of access B<IndexDir> is a space-separated
list of files and directories to index. Use a forward slash as the path
separator in MS Windows.

For the B<http> method the B<IndexDir> setting is a list of space-separated
URLs.

For the B<prog> method the B<IndexDir> setting is a list of space-separated
programs to run (which generate documents for Swish-e to index).

You may specify more than one B<IndexDir> directive.

Any sub-directories of any listed directory will also be indexed.

Note: While I<processing> directories, Swish-e will ignore any files
or directories that begin with a dot ("."). You may index files
or directories that begin with a dot by specifying their name with
C<IndexDir> or C<-i>.

Examples:

    # Index this directory and any subdirectories
    IndexDir /usr/local/home/http

    # Index the docs directory in current directory
    IndexDir ./docs

    # Index these files in the current directory
    IndexDir ./index.html ./page1.html ./page2.html
    # and index this directory, too
    IndexDir ../public_html

For the B<HTTP> method of access specify the URLs from which
you want the spidering to begin.

Example:

    IndexDir http://www.my-site.com/index.html
    IndexDir http://localhost/index.html

Using the B<HTTP> method to index is B<much> slower than
indexing local files. Be aware that some sites do not appreciate
spidering and may block your IP address. You may wish to contact the
remote site before spidering their web site. More information about
spidering can be found in L<Directives for the HTTP Access Method
Only|/"Directives for the HTTP Access Method Only"> below.

For the L<prog|SWISH-RUN/"item_prog"> method of access B<IndexDir>
specifies the path to the program(s) to execute. The external program
must correctly format the documents being passed back to Swish-e.
Examples of external programs are provided in the F<prog-bin> directory.

    IndexDir ./myprogram.pl

See L<prog|SWISH-RUN/"item_prog"> for details.

Note: Not all directives work with all methods.

=item NoContents *list of file suffixes*

Files with these suffixes will B<not> have their contents indexed.

If the file's type is HTML (as set by C<IndexContents> or
C<DefaultContents>) then the file will be parsed for an HTML title and
that title will be indexed. Note that you must set the file's type:
C<.html> and C<.htm> are NOT type HTML by default.

If a title is found, it will still be checked for C<FileRules title>,
and the file will be skipped if a match is found. See C<FileRules>.

If the file's type is not HTML, or it is HTML and no title is found,
then the file's path will be indexed. For example, you might wish to
search for image files by file name.

Example:

    NoContents .gif .xbm .au .mov .mpg .pdf .ps

Note: Using this directive will not cause files with those suffixes
to be indexed. That is, if you use C<IndexOnly> to limit the types of
files that are indexed, then you must specify in C<IndexOnly> the same
suffixes listed in C<NoContents>.

A C<-S prog> program may set the C<No-Contents:> header (to anything)
to enable this feature for a specific document (although it would be
smarter for the C<-S prog> program to simply only send the pathname or
title to be indexed).

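The decision described above can be summarized in a few lines of Python.
This is a reading of the prose, not Swish-e's code: C<what_is_indexed> is
a hypothetical helper, the C<FileRules title> check is left out, and
suffix matching is assumed to be a plain case-sensitive comparison.

```python
def what_is_indexed(path, doc_type, title, no_contents):
    """Return (kind, text) describing what is indexed for one file.

    doc_type is the parser type assigned to the file (for example "HTML"
    or "TXT"), title is the HTML title or None, and no_contents is the
    list of suffixes given to the NoContents directive.
    """
    if not path.endswith(tuple(no_contents)):
        return ("contents", None)  # suffix not listed: index the contents
    if doc_type == "HTML" and title:
        return ("title", title)    # HTML file with a title
    return ("path", path)          # otherwise only the path is indexed
```

An image such as F<img/logo.gif> therefore ends up searchable by its
path, while a listed HTML file with a title is searchable by that title.
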
=item ReplaceRules [replace|remove|prepend|append|regex]

ReplaceRules allows you to make changes to file pathnames before
they're indexed. These changed file names or URLs will be returned in
search results.

For example, you may index your files locally (with the File system
indexing method), yet return a URL in search results. This directive can
be used to map the file names to their respective URLs on your web server.

There are five operations you can specify: B<replace>, B<append>,
B<remove>, B<prepend>, and B<regex>. They are applied to the pathname in
the order in which they appear in the configuration file.

This directive uses C library regex.h regular expressions.

    replace "the string you want replaced" "what to change it to"
    remove  "a string to remove"
    prepend "a string to add before the result"
    append  "a string to add after the result"
    regex   "/search string/replace string/options"

Remember, quotes are needed if an expression contains white space,
and backslashes have special meaning.

Regex is an Extended Regular Expression. The first character found is
the delimiter (but it's not smart enough to use matched chars such as [],
(), and {}).

The B<replace> string may use substitution variables:

    $0      the entire matched (sub)string
    $1-$9   returns patterns captured in "(" ")" pairs
    $`      the string before the matched pattern
    $'      the string after the matched pattern

The B<options> change the behavior of the expression:

    i       ignore the case when matching
    g       repeat the substitution for the entire pattern

Examples:

    ReplaceRules replace testdir/ anotherdir/
    ReplaceRules replace [a-z_0-9]*_m.*\.html index.html

    ReplaceRules remove testdir/

    ReplaceRules prepend http://localhost/
    ReplaceRules append .html

    ReplaceRules regex !^/web/([^/]+)/!http://$1.domain.com/!
    replaces a file path:
         /web/search/foo/index.html
    with
         http://search.domain.com/foo/index.html

    ReplaceRules regex #^#http://localhost/www#
    ReplaceRules prepend http://localhost/www  (same thing)

    # Remove all extensions from C source files
    ReplaceRules remove .c    # ERROR! That "." is *any char*
    ReplaceRules remove \.c   # much better...

    ReplaceRules remove "\\.c"  # if in quotes you need double-backslash!
    ReplaceRules remove "\.c"   # ERROR! "\." -> "." and is *any char*

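To see why the order of the rules and the escaping of "." matter, the
four simple operations can be imitated in Python. This is a sketch under
stated assumptions, not Swish-e's implementation: it uses Python's C<re>
module rather than the C library's regex.h, it replaces every match,
C<apply_replace_rules> is a hypothetical name, and the B<regex> operation
is left out.

```python
import re


def apply_replace_rules(path, rules):
    """Apply (operation, argument...) tuples to a path in the order given."""
    for op, *args in rules:
        if op == "replace":
            pattern, replacement = args
            # A function replacement keeps backslashes in the text literal.
            path = re.sub(pattern, lambda match: replacement, path)
        elif op == "remove":
            path = re.sub(args[0], "", path)
        elif op == "prepend":
            path = args[0] + path
        elif op == "append":
            path = path + args[0]
        else:
            raise ValueError(f"unknown operation: {op}")
    return path
```

Applying C<remove testdir/>, C<prepend http://localhost/> and
C<append .html> in that order turns F<testdir/page> into
C<http://localhost/page.html>. An unescaped C<.c> pattern removes far more
than the file extension, because the dot matches any character.
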
=item IndexContents [TXT|HTML|XML|WML|TXT2|HTML2|XML2] *file extensions*

The C<IndexContents> directive assigns one of Swish-e's document parsers
to a document, based on its extension. Swish-e currently knows how
to parse TXT, HTML, and XML documents.

The XML2, HTML2, and TXT2 parsers are currently only available when
Swish-e is configured to use libxml2.

Documents that are not assigned a parser with C<IndexContents> will, by
default, use the HTML parser. The C<DefaultContents> directive may be
used to assign a parser to documents that do not match a file extension
defined with the C<IndexContents> directive.

Example:

    IndexContents HTML .htm .html .shtml
    IndexContents TXT  .txt .log .text
    IndexContents XML  .xml

HTML is the default type for all files, unless otherwise specified
(and this default can be changed by the B<DefaultContents> directive).
Swish-e parses titles from HTML files, if available, and keeps track
of the context of the text for context searching (see C<-t> in
L<SWISH-RUN|SWISH-RUN>). HTML and XML files use different tag formats
for B<MetaNames> and B<PropertyNames>.

If using filters to convert documents you should include those extensions,
too. For example, if using a filter to convert .pdf to .html, you need
to tell Swish-e that .pdf should be indexed by the internal HTML parser:

    FileFilter .pdf pdf2html
    IndexContents HTML .pdf

See also L<Document Filter Directives|/"Document Filter Directives">.

B<Note:> Some of this may be changed in the future to use content-types
instead of file extensions. See L<SWISH-3.0|SWISH-3.0>.

=item DefaultContents [TXT|HTML|XML|WML|TXT2|HTML2|XML2]

This sets the default parser for documents that are not specified in
B<IndexContents>. If not specified the default is HTML.

The XML2, HTML2, and TXT2 parsers are currently only available when
Swish-e is configured to use libxml2.

Example:

    DefaultContents HTML

The C<DefaultContents> directive I<should> be used when spidering,
as HTML files may be returned without a file extension (such as when
requesting a directory and the default index.html is returned).

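How C<IndexContents> and C<DefaultContents> work together can be shown
with a small lookup in Python. It is a simplified model, not Swish-e's
code: C<choose_parser> is a hypothetical name, and comparing the end of
the path case-sensitively is an assumption.

```python
def choose_parser(path, index_contents, default_contents="HTML"):
    """Pick a parser name for a document path.

    index_contents maps a parser name to the extensions assigned to it
    with IndexContents; default_contents plays the role of the
    DefaultContents directive.
    """
    for parser, extensions in index_contents.items():
        if path.endswith(tuple(extensions)):
            return parser
    return default_contents  # no extension matched
```

A spidered URL such as C<http://localhost/manual/> has no extension, so
it falls through to the default parser, which is why setting
C<DefaultContents> matters when spidering.
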
=item FileInfoCompression [yes|NO]

** This directive is currently not supported **

Setting B<FileInfoCompression> to C<yes> will compress the index file to
save disk space. This may result in longer indexing times. The default
is C<no>.

Also see the C<-e> switch in L<SWISH-RUN|SWISH-RUN> for saving RAM
during indexing.

=back

=head2 Document Contents Directives

These directives control what information is extracted from your source
documents, and how that information is made available during searching.

=over 4

=item ConvertHTMLEntities [YES|no]

ASCII I<entities> can be converted automatically while indexing documents
of type HTML. For performance reasons you may wish to set this to C<no>
if your documents do not contain HTML entities. The default is C<yes>.

If C<ConvertHTMLEntities> is set to C<no> the entities will be indexed
without conversion.

B<NOTE:> Entities within XML files and files parsed with libxml2 are
converted regardless of this setting.

883 |
=item MetaNames *list of names* |
884 |
|
885 |
META names are a way to define "fields" in your XML and HTML documents. |
886 |
You can use the META names in your queries to limit the search to just |
887 |
the words contained in that META name of your document. For example, |
888 |
you might have a META tagged field in your documents called C<subjects> |
889 |
and then you can search your documents for the word "foo" but only return |
890 |
documents where "foo" is within the C<subjects> META tag. |
891 |
|
892 |
swish-e -w subjects=foo |
893 |
|
894 |
(See also the C<-t> switch in L<SWISH-RUN|SWISH-RUN> for information |
895 |
about I<context> searching in HTML documents.) |
896 |
|
897 |
The B<MetaNames> directive is a space separated list. For example: |
898 |
|
899 |
MetaNames meta1 meta2 keywords subjects |
900 |
|
901 |
You may also use L<UndefinedMetaTags|/"item_UndefinedMetaTags"> to specify |
902 |
automatic extraction of meta names from your HTML and XML documents, |
903 |
and also to ignore indexing content of meta tags. |
904 |
|
905 |
META tags can have two formats in your B<HTML> source documents: |
906 |
|
907 |
<META NAME="meta1" CONTENT="some content"> |
908 |
|
909 |
and (if using the HTML2/libxml2 parser) |
910 |
|
911 |
<meta1> |
912 |
some content |
913 |
</meta1> |
914 |
|
915 |
But this second version is invalid HTML, and will generate a warning if |
916 |
ParserWarningLevel is set (libxml2 only). |
917 |
|
918 |
And in B<XML> documents, use the format: |
919 |
|
920 |
<meta1> |
921 |
Some Content |
922 |
</meta1> |
923 |
|
924 |
Then you can limit your search to just META B<meta1> like this: |
925 |
|
926 |
swish-e -w 'meta1=(apples or oranges)' |
927 |
|
928 |
You may nest the XML and the start/end tag versions: |
929 |
|
930 |
<keywords> |
931 |
<tag1> |
932 |
some content |
933 |
</tag1> |
934 |
<tag2> |
935 |
some other content |
936 |
</tag2> |
937 |
</keywords>
938 |
|
939 |
Then you can search in both tag1 and tag2 with:
940 |
|
941 |
swish-e -w 'keywords=(query words)' |
942 |
|
943 |
Swish-e indexes all text as some metaname. The default is |
944 |
C<swishdefault>, so these two queries are the same: |
945 |
|
946 |
swish-e -w foo |
947 |
swish-e -w swishdefault=foo |
948 |
|
949 |
When indexing HTML Swish-e indexes the HTML title as default text, so |
950 |
when searching Swish-e will find matches in both the HTML body and the |
951 |
HTML title. Swish also, by default, indexes content of meta tags. So: |
952 |
|
953 |
swish-e -w foo |
954 |
|
955 |
will find "foo" in the body, the title, or any meta tags. |
956 |
|
957 |
Currently, there's no way to prevent Swish-e from indexing |
958 |
the title contents along with the body contents, but see |
959 |
L<UndefinedMetaTags|/"item_UndefinedMetaTags"> for how to control the |
960 |
indexing of meta tags. |
961 |
|
962 |
If you would like to search just the title text, you may use: |
963 |
|
964 |
MetaNames swishtitle |
965 |
|
966 |
This will index the title text separately under the built-in swish |
967 |
internal meta name "swishtitle". You may then search like |
968 |
|
969 |
swish-e -w foo -- search for "foo" in title, body (and undefined meta tags) |
970 |
swish-e -w swishtitle=foo -- search for "foo" in title only |
971 |
|
972 |
In addition to swishtitle, you can limit searches to documents' path with: |
973 |
|
974 |
MetaNames swishdocpath |
975 |
|
976 |
Then to search for "foo" but also limit searches to documents that include
"manual" or "tutorial" in their path:
978 |
|
979 |
swish-e -w 'foo swishdocpath=(manual or tutorial)'
980 |
|
981 |
See also L<ExtractPath|/"item_ExtractPath">. |
982 |
|
983 |
|
984 |
=item MetaNameAlias *meta name* *list of aliases* |
985 |
|
986 |
MetaNameAlias assigns aliases for a meta name. For example, if your |
987 |
documents contain meta tags "description", "summary", and "overview" |
988 |
that all give a summary of your documents you could do this: |
989 |
|
990 |
MetaNames summary |
991 |
MetaNameAlias summary description overview |
992 |
|
993 |
Then all three tags will get indexed as meta tag "summary". You can |
994 |
then search all the fields as: |
995 |
|
996 |
-w summary=foo |
997 |
|
998 |
Aliases also work at search time, so these will also limit the search
to the "summary" meta name:
1000 |
|
1001 |
-w description=foo |
1002 |
-w overview=foo |
1003 |
|
1004 |
=item MetaNamesRank integer *list of meta names* |
1005 |
|
1006 |
* Not implemented yet * |
1007 |
|
1008 |
You can assign a bias to metanames that will affect how ranking is |
1009 |
calculated. The range of values is from -10 to +10, with zero being |
1010 |
no bias. |
1011 |
|
1012 |
MetaNamesRank 4 subject |
1013 |
MetaNamesRank 3 swishdefault |
1014 |
MetaNamesRank 2 author publisher |
1015 |
MetaNamesRank -5 wrongwords |
1016 |
|
1017 |
This feature is not implemented yet |
1018 |
|
1019 |
=item HTMLLinksMetaName *metaname* |
1020 |
|
1021 |
Allows indexing of HTML links. Normally, HTML links (href attributes) are
not indexed by Swish-e. This directive defines a metaname, and links
will be indexed under this meta name.
1024 |
|
1025 |
Example: |
1026 |
|
1027 |
HTMLLinksMetaName links |
1028 |
|
1029 |
Now, to limit searches to files with a link to "home.html" do this: |
1030 |
|
1031 |
-w links='"home.html"' |
1032 |
|
1033 |
The double quotes force a phrase search. |
1034 |
|
1035 |
To make Swish-e index links as normal text, you may use: |
1036 |
|
1037 |
HTMLLinksMetaName swishdefault |
1038 |
|
1039 |
This feature is only available with the libxml2 HTML parser. |
1040 |
|
1041 |
=item ImageLinksMetaName *metaname* |
1042 |
|
1043 |
Allows indexing of image links under a metaname. Normally, image URLs |
1044 |
are not indexed. |
1045 |
|
1046 |
Example: |
1047 |
|
1048 |
ImageLinksMetaName images
1049 |
|
1050 |
Now, if you would like to find pages that include a nice image of a beach: |
1051 |
|
1052 |
-w images='beach' |
1053 |
|
1054 |
To make Swish-e index image links as normal text, you may use:
1055 |
|
1056 |
ImageLinksMetaName swishdefault |
1057 |
|
1058 |
This feature is only available with the libxml2 HTML parser. |
1059 |
|
1060 |
|
1061 |
=item IndexAltTagMetaName *tagname*|as-text |
1062 |
|
1063 |
Allows indexing of the ALT attribute text of <IMG> tags. Specify either a tag name, which will be
used as a metaname, or the special text "as-text", which says to index the ALT text as
if it were plain text at the current location.
1066 |
|
1067 |
For example, by specifying a tag name: |
1068 |
|
1069 |
IndexAltTagMetaName bar |
1070 |
|
1071 |
would make this markup: |
1072 |
|
1073 |
<foo> |
1074 |
<img src="/someimage.png" alt="Alt text here"> |
1075 |
</foo> |
1076 |
|
1077 |
appear like |
1078 |
|
1079 |
<foo> |
1080 |
<bar>Alt text here</bar> |
1081 |
</foo> |
1082 |
|
1083 |
Then the normal rules (C<MetaNames> and C<PropertyNames>) apply to how that text is indexed. |
1084 |
|
1085 |
If you use the special tag "as-text" then |
1086 |
|
1087 |
<foo> |
1088 |
<img src="/someimage.png" alt="Alt text here"> |
1089 |
</foo> |
1090 |
|
1091 |
simply becomes |
1092 |
|
1093 |
<foo> |
1094 |
Alt text here |
1095 |
</foo> |
1096 |
|
1097 |
This feature is only available when using the libxml2 parser (HTML2 and XML2). |
1098 |
|
1099 |
|
1100 |
=item AbsoluteLinks [yes|NO] |
1101 |
|
1102 |
If this is set true then Swish-e will attempt to convert relative URIs |
1103 |
extracted from HTML documents for use with C<HTMLLinksMetaName> and |
1104 |
C<ImageLinksMetaName> into absolute URIs. Swish-e will use any <BASE> |
1105 |
tag found in the document, otherwise it will use the file's pathname. |
1106 |
The pathname used will be the pathname *after* C<ReplaceRules> has been |
1107 |
applied to the document's pathname. |
1108 |
|
1109 |
For example, say you wish to index image links under the metaname |
1110 |
"images". |
1111 |
|
1112 |
ImageLinksMetaName images |
1113 |
|
1114 |
If a document is located at http://localhost/vacations/france/index.html
and C<AbsoluteLinks> is set to no, then an image within that document:
1116 |
|
1117 |
<img src="beach.jpeg"> |
1118 |
|
1119 |
will be indexed only as "beach.jpeg".
1120 |
|
1121 |
But, if you want more detail when searching, you can enable
1122 |
C<AbsoluteLinks> and Swish-e will index |
1123 |
"http://localhost/vacations/france/beach.jpeg". You can then look for |
1124 |
images of beaches, but only in France: |
1125 |
|
1126 |
-w images=(beach and france) |
1127 |
|
1128 |
This also means you can search for any images within France: |
1129 |
|
1130 |
-w images=(france) |
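
As a rough illustration of the relative-to-absolute conversion described
above, the following Python sketch resolves the example image link against
the document's URL. It uses Python's standard C<urllib.parse.urljoin> purely
for illustration; it is not Swish-e's own code, and the optional C<base_href>
argument is a hypothetical stand-in for a <BASE> tag found in the document.

```python
from urllib.parse import urljoin

def absolute_link(doc_url, link, base_href=None):
    """Resolve a link against a base URL (illustrative only).

    base_href stands in for the href of a <BASE> tag, if the document has
    one; otherwise the document's own URL is used as the base.
    """
    return urljoin(base_href or doc_url, link)

doc = "http://localhost/vacations/france/index.html"
print(absolute_link(doc, "beach.jpeg"))
# prints http://localhost/vacations/france/beach.jpeg
```

With the absolute form indexed, both "beach" and "france" are available as
search terms for the same link.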
1131 |
|
1132 |
This feature is only available with the libxml2 HTML parser. |
1133 |
|
1134 |
=item UndefinedMetaTags [error|ignore|INDEX|auto] |
1135 |
|
1136 |
This directive defines the behavior of Swish-e during indexing when a |
1137 |
meta name is found but is B<not> listed in B<MetaNames>. There are |
1138 |
four choices: |
1139 |
|
1140 |
|
1141 |
=over 2 |
1142 |
|
1143 |
=item error |
1144 |
|
1145 |
If a meta name is found that is not listed in B<MetaNames> |
1146 |
then indexing will be halted and an error reported. |
1147 |
|
1148 |
=item ignore |
1149 |
|
1150 |
The contents of the meta tag are ignored and B<not> indexed |
1151 |
unless a metaname has been defined with the C<MetaNames> directive. |
1152 |
|
1153 |
=item index |
1154 |
|
1155 |
The contents of the meta tag are indexed, but placed in the |
1156 |
main index unless there's an enclosing metatag already in force. This |
1157 |
is the default. |
1158 |
|
1159 |
=item auto |
1160 |
|
1161 |
This method creates meta tags automatically for HTML meta names
and XML elements. Using this is the same as specifying all the meta
names explicitly in a B<MetaNames> directive.
1164 |
|
1165 |
=back |
1166 |
|
1167 |
=item UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]
1168 |
|
1169 |
This is similar to C<UndefinedMetaTags>, but only applies to XML documents (parsed with libxml2). |
1170 |
This allows indexing of attribute content, and provides a way to index the content under a |
1171 |
metaname. For example, C<UndefinedXMLAttributes> can make |
1172 |
|
1173 |
<person age="23"> |
1174 |
John Doe |
1175 |
</person> |
1176 |
|
1177 |
look like the following to swish: |
1178 |
|
1179 |
<person> |
1180 |
<person.age> |
1181 |
23 |
1182 |
</person.age> |
1183 |
John Doe |
1184 |
</person> |
1185 |
|
1186 |
What happens to the text "23" will depend on the setting of C<UndefinedXMLAttributes>: |
1187 |
|
1188 |
=over 2 |
1189 |
|
1190 |
=item disable |
1191 |
|
1192 |
XML attributes are not parsed and not indexed. This is the default. |
1193 |
|
1194 |
=item error |
1195 |
|
1196 |
If the concatenated meta name (e.g. person.age) is not listed in |
1197 |
B<MetaNames> then indexing will be halted and an error reported. |
1198 |
|
1199 |
=item ignore |
1200 |
|
1201 |
The contents of the meta tag are ignored and B<not> indexed unless a |
1202 |
metaname has been defined with the C<MetaNames> directive. |
1203 |
|
1204 |
=item index |
1205 |
|
1206 |
The contents of the meta tag are indexed, but placed in the main index |
1207 |
unless there's an enclosing metatag already in force. |
1208 |
|
1209 |
=item auto |
1210 |
|
1211 |
This method will create meta tags from the combined element and attributes
(and XML class name). This option should be used with caution as it can
generate a lot of metaname entries.

See also the example under C<XMLClassAttributes> below.
1216 |
|
1217 |
|
1218 |
=back |
1219 |
|
1220 |
=item XMLClassAttributes *list of XML attribute names* |
1221 |
|
1222 |
Combines an XML class name with the element name to make up a metaname. |
1223 |
For example: |
1224 |
|
1225 |
XMLClassAttributes class |
1226 |
|
1227 |
<person class="first"> |
1228 |
John |
1229 |
</person> |
1230 |
<person class="last"> |
1231 |
Doe |
1232 |
</person> |
1233 |
|
1234 |
Will appear to Swish-e as: |
1235 |
|
1236 |
<person> |
1237 |
<person.first> |
1238 |
John |
1239 |
</person.first> |
1240 |
</person> |
1241 |
<person> |
1242 |
<person.last> |
1243 |
Doe |
1244 |
</person.last> |
1245 |
</person> |
1246 |
|
1247 |
How the data is indexed depends on C<MetaNames> and C<UndefinedMetaTags>. |
1248 |
|
1249 |
Here's an example using the following configuration, which combines the
two directives C<XMLClassAttributes> and C<UndefinedXMLAttributes>:
1251 |
|
1252 |
XMLClassAttributes class |
1253 |
UndefinedMetaTags auto |
1254 |
UndefinedXMLAttributes auto |
1255 |
IndexContents XML2 .xml |
1256 |
|
1257 |
The source XML file looks like: |
1258 |
|
1259 |
<xml> |
1260 |
<person class="student" phone="555-1212" age="102"> |
1261 |
John |
1262 |
</person> |
1263 |
<person greeting="howdy">Bill</person> |
1264 |
</xml> |
1265 |
|
1266 |
Swish-e parses as: |
1267 |
|
1268 |
./swish-e -c 2 -i 1.xml -T parsed_tags parsed_text -v 0 |
1269 |
Indexing Data Source: "File-System" |
1270 |
|
1271 |
<xml> (MetaName) |
1272 |
|
1273 |
<person> (MetaName) |
1274 |
<person.student> (MetaName) |
1275 |
<person.student.phone> (MetaName) |
1276 |
555-1212 |
1277 |
</person.student.phone> |
1278 |
<person.student.age> (MetaName) |
1279 |
102 |
1280 |
</person.student.age> |
1281 |
John |
1282 |
</person> |
1283 |
|
1284 |
<person> (MetaName) |
1285 |
<person.greeting> (MetaName) |
1286 |
howdy |
1287 |
</person.greeting> |
1288 |
Bill |
1289 |
</person> |
1290 |
|
1291 |
</xml> |
1292 |
Indexing done! |
1293 |
|
1294 |
One thing to note is that the first <person> block finds a class name
"student", so all metanames that are created from attributes use the
combined name "person.student". The second <person> block doesn't contain
a "class", so the attribute name is combined directly with the element
name (e.g. "person.greeting").
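
The naming rule seen in this output can be sketched in Python. This is only
an approximation inferred from the example above, not Swish-e's
implementation; the function name and its arguments are made up for
illustration.

```python
def attribute_metanames(element, attributes, class_attributes=("class",)):
    """Build the metanames created from an element's attributes.

    If one of the class attributes is present, its value is inserted
    between the element name and each remaining attribute name.
    """
    class_name = next(
        (attributes[name] for name in class_attributes if name in attributes),
        None,
    )
    prefix = element if class_name is None else f"{element}.{class_name}"
    return [
        f"{prefix}.{name}"
        for name in attributes
        if name not in class_attributes
    ]

print(attribute_metanames(
    "person", {"class": "student", "phone": "555-1212", "age": "102"}))
# prints ['person.student.phone', 'person.student.age']
print(attribute_metanames("person", {"greeting": "howdy"}))
# prints ['person.greeting']
```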
1299 |
|
1300 |
=item ExtractPath *metaname* [replace|remove|prepend|append|regex] |
1301 |
|
1302 |
This directive can be used to index extracted parts of a document's path.
A common use would be to limit searches to specific areas of your
file tree.
1305 |
|
1306 |
The extracted string will be indexed under the specified meta name. |
1307 |
|
1308 |
See C<ReplaceRules> for a description of the various pattern replacement |
1309 |
methods, but you will use the I<regex> method. |
1310 |
|
1311 |
For example, say your file system (or web tree) was organized into departments: |
1312 |
|
1313 |
/web/sales/foo... |
1314 |
/web/parts/foo... |
1315 |
/web/accounting/foo... |
1316 |
|
1317 |
And you wanted a way to limit searches to just documents under "sales". |
1318 |
|
1319 |
ExtractPath department regex !^/web/([^/]+)/.*$!$1! |
1320 |
|
1321 |
Which says, extract out the department name (as substring $1) and index |
1322 |
it as meta name C<department>. Then to limit a search to the sales |
1323 |
department: |
1324 |
|
1325 |
swish-e -w foo AND department=sales |
1326 |
|
1327 |
Note that the C<regex> method uses a substitution pattern, so to index
only a substring, match the I<entire> document path in the regular
expression, as shown above.
1330 |
|
1331 |
See the C<ExtractPathDefault> option for a way to set a value if no
patterns match.
1333 |
|
1334 |
Although unlikely, you may use more than one C<ExtractPath> directive.
More than one directive of the I<same> meta name will operate successively
(in the order listed in the configuration file) on the path. This allows
you to use regular expressions on the results of the previous pattern
substitution (as if piping the output from one expression to the pattern
of the next).
1340 |
|
1341 |
ExtractPath foo regex !^(...).+$!$1! |
1342 |
ExtractPath foo regex !^.+(.)$!$1! |
1343 |
|
1344 |
So, the third letter is indexed as meta name "foo" if both patterns match. |
1345 |
|
1346 |
ExtractPath foo regex !^X(...).+$!$1! |
1347 |
ExtractPath foo regex !^.+(.)$!$1! |
1348 |
|
1349 |
Now (note the "X"), if the first pattern doesn't match, the last character
1350 |
of the path name is indexed. You must be clear on this behavior if you |
1351 |
are using more than one C<ExtractPath> directive with the same metaname. |
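
To make the successive behavior concrete, here is a small Python sketch that
mimics the two examples above with C<re.sub>. It assumes, as the text
describes, that a pattern that does not match leaves the path unchanged for
the next directive. The path F</web/sales/foo.html> and the helper name are
invented for illustration, and Swish-e's regular expression dialect may
differ in detail from Python's.

```python
import re

def extract_path(path, rules):
    """Apply (pattern, replacement) pairs in order, each on the previous result."""
    for pattern, replacement in rules:
        # A pattern that does not match leaves the string untouched.
        path = re.sub(pattern, replacement, path, count=1)
    return path

path = "/web/sales/foo.html"

# ExtractPath foo regex !^(...).+$!$1!   then   !^.+(.)$!$1!
both_match = [(r"^(...).+$", r"\1"), (r"^.+(.)$", r"\1")]
print(extract_path(path, both_match))    # prints e (third character)

# Same rules, but the first now requires a leading "X" and does not match.
first_fails = [(r"^X(...).+$", r"\1"), (r"^.+(.)$", r"\1")]
print(extract_path(path, first_fails))   # prints l (last character)
```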
1352 |
|
1353 |
The document path operated on is the real path swish used to access |
1354 |
the document. That is, the C<ReplaceRules> directive has no effect on |
1355 |
the path used with C<ExtractPath>. |
1356 |
|
1357 |
The full path is used for each meta name if more than one C<ExtractPath> |
1358 |
directive is used. That is, changes to the path used in C<ExtractPath |
1359 |
foo> do not affect the path used by C<ExtractPath bar>. |
1360 |
|
1361 |
=item ExtractPathDefault *metaname* default_value |
1362 |
|
1363 |
This can be used with C<ExtractPath> to set a default string to index |
1364 |
under the given metaname if none of the C<ExtractPath> patterns match. |
1365 |
|
1366 |
For example, say you want to index each document with a metaname
1367 |
"department" based on the following path examples: |
1368 |
|
1369 |
/web/sales/foo... |
1370 |
/web/parts/foo... |
1371 |
/web/accounting/foo... |
1372 |
|
1373 |
But you are also indexing documents that do not follow that pattern and you want to search those
separately, too.
1375 |
|
1376 |
ExtractPath department regex !^/web/([^/]+)/.*$!$1! |
1377 |
ExtractPathDefault department other |
1378 |
|
1379 |
Now, you may search like this: |
1380 |
|
1381 |
-w foo department=(sales) - limit searches to the sales documents |
1382 |
-w foo department=(parts) - limit searches to the parts documents |
1383 |
-w foo department=(accounting) - limit searches to the accounting documents |
1384 |
-w foo department=(other) - everything but sales, parts, and accounting. |
1385 |
|
1386 |
This basically is a shortcut for: |
1387 |
|
1388 |
-w foo not department=(sales or parts or accounting) |
1389 |
|
1390 |
but you don't need to keep track of what was extracted. |
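
A minimal Python sketch of this fallback, assuming a path that matches no
C<ExtractPath> pattern simply receives the default value. The sample paths
and the function name are hypothetical, and the code only models the
behavior described here.

```python
import re

# Python form of the pattern !^/web/([^/]+)/.*$! used above.
DEPARTMENT = re.compile(r"^/web/([^/]+)/.*$")

def department_for(path, default="other"):
    """Return the extracted department, or the default when nothing matches."""
    match = DEPARTMENT.match(path)
    return match.group(1) if match else default

for p in ("/web/sales/q3.html", "/web/parts/list.html", "/archive/old.html"):
    print(p, "->", department_for(p))
```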
1391 |
|
1392 |
=item PropertyNames *list of meta names* |
1393 |
|
1394 |
=item PropertyNamesCompareCase *list of meta names* |
1395 |
|
1396 |
=item PropertyNamesIgnoreCase *list of meta names* |
1397 |
|
1398 |
Swish-e allows you to specify certain META tags that can be used as |
1399 |
B<document properties>. The contents of any META tag that has been |
1400 |
identified as a document property can be returned as part of the search |
1401 |
results along with the rank, file name, title, and document size (see |
1402 |
the C<-p> and C<-x> switches in L<SWISH-RUN|SWISH-RUN>). |
1403 |
|
1404 |
Properties are useful for returning additional data from documents in |
1405 |
search results -- this saves the effort of reading and parsing the source |
1406 |
files while reading Swish-e search results, and is especially useful |
1407 |
when the source documents are no longer available or slow to access |
1408 |
(e.g. over http). |
1409 |
|
1410 |
Another feature of properties is that Swish-e can use the PropertyNames |
1411 |
for sorting the search results (see the C<-s> switch). |
1412 |
|
1413 |
PropertyNames author subjects |
1414 |
|
1415 |
Two variations are available: C<PropertyNamesCompareCase> and
C<PropertyNamesIgnoreCase>. These tell Swish-e to either compare or
ignore case when sorting results. The default for C<PropertyNames>
is to ignore the case.
1419 |
|
1420 |
PropertyNamesIgnoreCase subject |
1421 |
PropertyNamesCompareCase keyword |
1422 |
|
1423 |
The defaults for "internal" properties are: |
1424 |
|
1425 |
swishtitle -- ignore the case |
1426 |
swishdocpath -- compare case |
1427 |
swishdescription -- compare case |
1428 |
|
1429 |
These can be overridden with C<PropertyNamesCompareCase> and |
1430 |
C<PropertyNamesIgnoreCase>. |
1431 |
|
1432 |
PropertyNamesCompareCase swishtitle |
1433 |
|
1434 |
Use of PropertyNames will increase the size of your index files, |
1435 |
sometimes significantly. Properties will be compressed if Swish-e is |
1436 |
compiled with zlib as described in the L<INSTALL|INSTALL> manual page. |
1437 |
|
1438 |
If Swish-e finds more than one property of the same name in a document,
the property's contents will be concatenated for strings, and a warning
is issued for numeric (or date) properties.
1441 |
|
1442 |
|
1443 |
=item PropertyNamesNumeric |
1444 |
|
1445 |
This directive is similar to C<PropertyNames>, but it flags the property |
1446 |
as being a string of digits (integer value) that will be stored as binary data instead |
1447 |
of a string. This allows sorting with C<-s> and limiting with C<-L> |
1448 |
to sort and limit the property correctly. |
1449 |
|
1450 |
Swish-e uses C<strtoul(3)> to convert the string into an unsigned long |
1451 |
integer. Therefore, only positive integers can be stored. |
1452 |
|
1453 |
Future versions of Swish-e may be able to store different property types |
1454 |
(such as negative integers and real numbers). This directive may change |
1455 |
in future releases of Swish. |
1456 |
|
1457 |
=item PropertyNamesDate |
1458 |
|
1459 |
This directive is exactly like C<PropertyNamesNumeric>, but it also |
1460 |
flags the number as a machine timestamp (seconds since Epoch), and |
1461 |
will print a formatted date when returning this property. See C<-x> |
1462 |
in L<SWISH-RUN|SWISH-RUN>. |
1463 |
|
1464 |
Swish-e will not parse dates when indexing; you must use a timestamp. |
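
Because the value must already be a timestamp, dates usually need converting
before the document reaches Swish-e. The Python sketch below shows one way
to do that, assuming ISO-style C<YYYY-MM-DD> dates interpreted as UTC. The
function name and the C<pubdate> meta name are illustrative only.

```python
from datetime import datetime, timezone

def to_timestamp(date_string):
    """Convert a YYYY-MM-DD date (taken as midnight UTC) to seconds since the Epoch."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

stamp = to_timestamp("2004-01-01")
# Emit a meta tag whose content is the timestamp rather than a date string.
print('<meta name="pubdate" content="%d">' % stamp)
# prints <meta name="pubdate" content="1072915200">
```

The hypothetical C<pubdate> name would then be listed in a
C<PropertyNamesDate> directive.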
1465 |
|
1466 |
=item PropertyNameAlias *property name* *list of aliases* |
1467 |
|
1468 |
This allows aliases for a property name. For example, if you are indexing |
1469 |
HTML files, plus XML files that are written in English, German, and |
1470 |
Spanish and thus use the tags "title", "titel", and "título" you can use: |
1471 |
|
1472 |
PropertyNameAlias swishtitle title titel título titulo |
1473 |
|
1474 |
Note that "swishtitle" is the built-in property used to store the title of |
1475 |
a document, and therefore you do not need to specify it as a PropertyName |
1476 |
before use. |
1477 |
|
1478 |
=item PropertyNamesMaxLength integer *list of meta names* |
1479 |
|
1480 |
This option will set the max length of the text stored in a property. |
1481 |
You must specify a number between 0 and the max integer size on your |
1482 |
platform, and a list of properties. The properties specified must not |
1483 |
be aliases. |
1484 |
|
1485 |
If any of the property names do not exist they will be created (e.g. you |
1486 |
do not need to define the property with PropertyNames first). |
1487 |
|
1488 |
In general, this feature will only be useful when parsing HTML or XML |
1489 |
with the libxml2 parser. |
1490 |
|
1491 |
For example: |
1492 |
|
1493 |
PropertyNamesMaxLength 1000 swishdescription |
1494 |
PropertyNameAlias swishdescription body |
1495 |
|
1496 |
Is somewhat like |
1497 |
|
1498 |
StoreDescription HTML <body> 1000 |
1499 |
StoreDescription XML <body> 1000 |
1500 |
StoreDescription HTML2 <body> 1000 |
1501 |
StoreDescription XML2 <body> 1000 |
1502 |
|
1503 |
but StoreDescription allows setting the tag for each parser type. |
1504 |
|
1505 |
PropertyNamesMaxLength 1000 headings |
1506 |
PropertyNameAlias headings h1 h2 h3 h4 |
1507 |
|
1508 |
collects all the heading text into a single property called "headings", not |
1509 |
to exceed 1000 characters. |
1510 |
|
1511 |
|
1512 |
=item PreSortedIndex *list of property names* |
1513 |
|
1514 |
By default Swish-e generates presorted tables while indexing for each |
1515 |
property name. This allows faster sorting when generating results. |
1516 |
On large document collections this presorting may add to the indexing |
1517 |
time, and also adds to the total size of the index. This directive can |
1518 |
be used to customize exactly which properties will be presorted. |
1519 |
|
1520 |
If C<PreSortedIndex> is I<not> present in the config file (default
1521 |
action), all the properties will be presorted at indexing time. If it |
1522 |
is present without any parameter, no properties will be presorted. |
1523 |
Otherwise, only the property names specified will be presorted. |
1524 |
|
1525 |
For example, if you only wish to sort results by a property called |
1526 |
C<title>: |
1527 |
|
1528 |
PropertyNames title age time |
1529 |
PreSortedIndex title |
1530 |
|
1531 |
|
1532 |
=item StoreDescription [XML <tag> size|HTML <meta> size|TXT size] |
1533 |
|
1534 |
B<StoreDescription> allows you to store a document description in the |
1535 |
index file, and this description can be returned in your search results |
1536 |
when the C<-x> switch is used to include the I<swishdescription> for |
1537 |
extended results. |
1538 |
|
1539 |
For text documents you specify the type C<TXT> and the number of I<characters> to capture. |
1540 |
|
1541 |
StoreDescription TXT 20 |
1542 |
|
1543 |
The above stores only the first twenty characters from the text file in the Swish-e index |
1544 |
file. |
1545 |
|
1546 |
For HTML and XML file types, specify the tag to use for the
description, and optionally the number of characters to capture. If the
size is not specified, the entire contents of the tag will be captured.
1549 |
|
1550 |
StoreDescription HTML <body> 20000 |
1551 |
StoreDescription XML <desc> 40 |
1552 |
|
1553 |
Note that documents must be assigned a document type with C<IndexContents> |
1554 |
or C<DefaultContents> to use this feature. |
1555 |
|
1556 |
Swish-e will compress the descriptions (or any other large property) |
1557 |
if compiled to use zlib (see L<INSTALL|INSTALL>). This is recommended when using |
1558 |
StoreDescription and a large number of documents. Compression of 30% to 50% is
not uncommon with HTML files.
1560 |
|
1561 |
=item PropCompressionLevel [0-9] |
1562 |
|
1563 |
This directive sets the compression level used when storing properties |
1564 |
to disk. A setting of zero is no compression, and a setting of nine is |
1565 |
the most compression. |
1566 |
|
1567 |
The default depends on the default setting compiled with zlib, but is
typically six.
1569 |
|
1570 |
This option is useful when using C<StoreDescription> to store a large
amount of text in properties (or if using C<PropertyNames> with large
property sizes).
1573 |
|
1574 |
Properties must be over a value defined in F<config.h> (100 is the |
1575 |
default) before compression will be attempted. Swish-e will never store |
1576 |
the results of the compression if the compressed data is larger than |
1577 |
the original data. |
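
The two rules in the previous paragraph can be sketched with Python's
C<zlib> module. This is an illustration of the described policy, not
Swish-e's code: the threshold of 100 mirrors the default mentioned above
(treated here as a byte count, which is an assumption), and the function
name and return shape are also assumptions.

```python
import zlib

def pack_property(data, level=6, min_size=100):
    """Return (stored_bytes, was_compressed) following the policy described above."""
    if len(data) <= min_size:
        return data, False          # too small: compression is not attempted
    packed = zlib.compress(data, level)
    if len(packed) >= len(data):
        return data, False          # no gain: keep the original bytes
    return packed, True

short = b"a short title"
long_text = b"Swish-e stores document descriptions as properties. " * 40

print(pack_property(short)[1])      # prints False
print(pack_property(long_text)[1])  # prints True
```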
1578 |
|
1579 |
This option is only available when Swish-e is compiled with zlib support. |
1580 |
|
1581 |
|
1582 |
=item TruncateDocSize *number of characters* |
1583 |
|
1584 |
TruncateDocSize limits the size of a document while indexing documents
and/or using filters. This config directive truncates the number of
bytes read from a document to the specified size. That is, if a document
is larger, only the specified number of bytes of the document is read.
1588 |
|
1589 |
Example: |
1590 |
|
1591 |
TruncateDocSize 10000000 |
1592 |
|
1593 |
The default is zero, which means read all data. |
1594 |
|
1595 |
|
1596 |
Warning: If you use TruncateDocSize, use it with care! TruncateDocSize
is a safety belt only, to limit e.g. filter output, when accessing
databases, or to limit "runaway" filters. Truncating doc input may
destroy document structures for Swish-e (e.g. swish may miss closing
tags for XML or HTML documents).
1601 |
|
1602 |
TruncateDocSize does not currently work with the C<prog> input source |
1603 |
method. |
1604 |
|
1605 |
=item FuzzyIndexingMode NONE|Stemming|Soundex|Metaphone|DoubleMetaphone |
1606 |
|
1607 |
Selects the type of index to create. Only one type of index may be created. |
1608 |
|
1609 |
It's a good idea to create both a normal index and a fuzzy index and
allow your search interface to select which index to use. Many people find the
fuzzy searches to be too fuzzy.
1612 |
|
1613 |
The available fuzzy indexing options are: |
1614 |
|
1615 |
=over 4 |
1616 |
|
1617 |
=item None |
1618 |
|
1619 |
Words are stored in the index without any conversion. This is the default. |
1620 |
|
1621 |
=item Stemming |
1622 |
|
1623 |
Words are converted using the Porter stemming algorithm. |
1624 |
|
1625 |
From: http://www.tartarus.org/~martin/PorterStemmer/ |
1626 |
|
1627 |
The Porter stemming algorithm (or ‘Porter stemmer’) is a |
1628 |
process for removing the commoner morphological and inflexional |
1629 |
endings from words in English. Its main use is as part of a |
1630 |
term normalisation process that is usually done when setting up |
1631 |
Information Retrieval systems. |
1632 |
|
1633 |
|
1634 |
This will help a search for "running" to also find "run" and "runs", for example. |
1635 |
|
1636 |
The stemming function does not convert words to their root, but rather
programmatically removes endings on words in an attempt to make similar
1638 |
words with different endings stem to the same string of characters. |
1639 |
It's not a perfect system, and searches on stemmed indexes often return |
1640 |
curious results. For example, two entirely different words may stem to |
1641 |
the same word. |
1642 |
|
1643 |
Stemming also can be confusing when used with a wildcard (truncation). |
1644 |
For example, you might expect to find the word "running" by searching for |
1645 |
"runn*". But this fails when using a stemmed index, as "running" stems to |
1646 |
"run", yet searching for "runn*" looks for words that start with "runn". |
1647 |
|
1648 |
=item Soundex |
1649 |
|
1650 |
Soundex was developed in the 1880s so records for people with similar |
1651 |
sounding names could be found more readily. Soundex is a coded surname |
1652 |
based on the way a surname sounds rather than spelling. Surnames that |
1653 |
sound similar, like Smith and Smyth, are filed together under the same |
1654 |
Soundex code. This is mostly useful for US English. |
1655 |
|
1656 |
Soundex should not be used to search for sound-alike words. Metaphone |
1657 |
would be more appropriate for generic sound matching of words. Soundex |
1658 |
should only be used where you need to search multiple documents for |
1659 |
proper names which sound similar. This is primarily used for indexing |
1660 |
genealogical records. This may be useful for indexing other collections |
1661 |
of data consisting mostly of names. Many common name variations are |
1662 |
matched by Soundex. The only notable exception is the first letter of |
1663 |
the name. The first letter is not matched for sound. |
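
For readers unfamiliar with the coding, here is a compact Python sketch of
the commonly published American Soundex rules. Swish-e's own implementation
may differ in edge cases, so treat this only as an illustration of why
similar surnames share a code and why the first letter is kept as-is.

```python
def soundex(name):
    """Return a four-character American Soundex code (illustrative sketch)."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""

    result = name[0].upper()          # the first letter is kept, not coded
    previous = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue                  # h and w do not separate equal codes
        code = codes.get(ch, "")      # vowels and y have no code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]       # pad or truncate to four characters

for surname in ("Smith", "Smyth", "Catherine", "Katherine"):
    print(surname, soundex(surname))
```

"Smith" and "Smyth" both produce S530, while "Catherine" and "Katherine"
produce C365 and K365, which differ only in the unmatched first letter.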
1664 |
|
1665 |
=item Metaphone and DoubleMetaphone |
1666 |
|
1667 |
Words are transformed into a short series of letters representing the sound of the word (in English). |
1668 |
Metaphone algorithms are often used for looking up mis-spelled words in dictionary programs. |
1669 |
|
1670 |
From: http://aspell.sourceforge.net/metaphone/ |
1671 |
|
1672 |
Lawrence Philips' Metaphone Algorithm is an algorithm which returns |
1673 |
the rough approximation of how an English word sounds. |
1674 |
|
1675 |
The C<DoubleMetaphone> mode will sometimes generate two different metaphones for the same word. |
1676 |
This is supposed to be useful when a word may be pronounced more than one way. |
1677 |
|
1678 |
A metaphone index should give results somewhere in between Soundex and Stemming. |
1679 |
|
1680 |
=back |
1681 |
|
1682 |
=item UseStemming [yes|NO] |
1683 |
|
1684 |
Set to yes to apply the word stemming algorithm during indexing, otherwise no.
1685 |
|
1686 |
UseStemming no |
1687 |
UseStemming yes |
1688 |
|
1689 |
When UseStemming is set to C<yes>, every word is stemmed before placing
it into the index.

This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.
1693 |
|
1694 |
=item UseSoundex [yes|NO] |
1695 |
|
1696 |
When UseSoundex is set to C<yes>, every word is converted to a Soundex
code before placing it into the index.

This option is deprecated. It has been superseded by C<FuzzyIndexingMode>.
1700 |
|
1701 |
=item IgnoreTotalWordCountWhenRanking [YES|no] |
1702 |
|
1703 |
Set to yes to ignore the total number of words in the file when calculating
ranking. Often better with merges and small files. The default is yes.
1705 |
|
1706 |
IgnoreTotalWordCountWhenRanking no |
1707 |
|
1708 |
The default was changed from no to yes in version 2.2. |
1709 |
|
1710 |
=item MinWordLimit *integer* |
1711 |
|
1712 |
Set the minimum length of a word. Shorter words will not be indexed.
1713 |
The default is 1 (as defined in F<src/config.h>). |
1714 |
|
1715 |
MinWordLimit 5 |
1716 |
|
1717 |
=item MaxWordLimit *integer* |
1718 |
|
1719 |
Set the maximum length of an indexable word. Longer words will not |
1720 |
be indexed. The default is 40 (as defined in F<src/config.h>). |
1721 |
|
1722 |
=item WordCharacters *string of characters* |
1723 |
|
1724 |
=item IgnoreFirstChar *string of characters* |
1725 |
|
1726 |
=item IgnoreLastChar *string of characters* |
1727 |
|
1728 |
=item BeginCharacters *string of characters* |
1729 |
|
1730 |
=item EndCharacters *string of characters* |
1731 |
|
1732 |
|
1733 |
These settings define what a word consists of to the Swish-e indexing engine. |
1734 |
Compiled in defaults are in F<src/config.h>. |
1735 |
|
1736 |
When indexing Swish-e uses B<WordCharacters> to split up the document |
1737 |
into words. Words are defined by any string of non-blank characters |
1738 |
that contain only the characters listed in WordCharacters. If a string |
1739 |
of characters includes a character that is not in WordCharacters then |
1740 |
the word will be split into two or more separate words. |
1741 |
|
1742 |
For example: |
1743 |
|
1744 |
WordCharacters abde |
1745 |
|
1746 |
Would turn "abcde" into two words "ab" and "de". |
1747 |
|
1748 |
Next, of these words, any characters defined in B<IgnoreFirstChar> are |
1749 |
stripped off the start of the word, and B<IgnoreLastChar> characters |
1750 |
are stripped off the end of the word. This allows, for example, |
1751 |
periods within a word (www.slashdot.com), but not at the end of |
1752 |
a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in |
1753 |
WordCharacters. |
1754 |
|
1755 |
Finally, the resulting words MUST begin with one of the characters |
1756 |
listed in B<BeginCharacters> and end with one of the characters listed in |
1757 |
B<EndCharacters>. BeginCharacters and EndCharacters must be a subset of |
1758 |
the characters in WordCharacters. Often, WordCharacters, BeginCharacters |
1759 |
and EndCharacters will all be the same. |
1760 |
|
1761 |
Note that the same process applies to the query while searching. |
1762 |
|
1763 |
Getting these settings correct will take careful consideration and |
1764 |
practice. It's helpful to create an index of a single test file, and |
1765 |
then look at the words that are placed in the index (see the C<-v 4>, |
1766 |
C<-D> and C<-k> searching switches). |
1767 |
|
1768 |
Currently there is only support for eight-bit characters. |
1769 |
|
1770 |
Example: |
1771 |
|
1772 |
WordCharacters .abcdefghijklmnopqrstuvwxyz |
1773 |
BeginCharacters abcdefghijklmnopqrstuvwxyz |
1774 |
EndCharacters abcdefghijklmnopqrstuvwxyz |
1775 |
IgnoreFirstChar . |
1776 |
IgnoreLastChar . |
1777 |
|
1778 |
So the string |
1779 |
|
1780 |
Please visit http://www.example.com/path/to/file.html. |
1781 |
|
1782 |
will be indexed as the following words: |
1783 |
|
1784 |
please |
1785 |
visit |
1786 |
http |
1787 |
www.example.com |
1788 |
path |
1789 |
to |
1790 |
file.html |
1791 |
|
1792 |
Which means that you can search for C<www.example.com> as a single word, |
1793 |
but searching for just C<example> will not find the document. |
1794 |
|
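The steps above can be sketched in Python. This is only a model of the documented rules, not Swish-e's own C code: the function name C<split_words> and its parameters are hypothetical, and lowercasing every word is an assumption based on the sample output shown above.

```python
def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    """Illustrative model of the documented word-splitting rules."""
    words = []
    current = ""

    def flush(candidate):
        # Strip IgnoreFirstChar / IgnoreLastChar characters from the ends.
        word = candidate.lstrip(ignore_first).rstrip(ignore_last)
        # Keep the word only if it begins and ends with allowed characters.
        if word and word[0] in begin_chars and word[-1] in end_chars:
            words.append(word)

    for ch in text.lower():          # assumption: words are lowercased
        if ch in word_chars:
            current += ch            # still inside a run of WordCharacters
        elif current:
            flush(current)           # any other character ends the word
            current = ""
    if current:
        flush(current)
    return words


letters = "abcdefghijklmnopqrstuvwxyz"
result = split_words(
    "Please visit http://www.example.com/path/to/file.html.",
    word_chars="." + letters, begin_chars=letters, end_chars=letters,
    ignore_first=".", ignore_last=".")
print(result)
```

Running the sketch on the sample sentence yields the same seven words listed above, with C<www.example.com> and C<file.html> kept as single words.
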
1795 |
Note: when indexing HTML documents HTML entities are converted to their |
1796 |
character equivalents before being processed with these directives. |
1797 |
This is a change from previous versions of Swish-e where you were |
1798 |
required to include the characters C<0123456789&#;> to index entities. |
1799 |
See also L<ConvertHTMLEntities|/"item_ConvertHTMLEntities"> |
1800 |
|
1801 |
=item Buzzwords [*list of buzzwords*|File: path] |
1802 |
|
1803 |
The Buzzwords option allows you to specify words that will be indexed |
1804 |
regardless of WordCharacters, BeginCharacters, EndCharacters, stemming, |
1805 |
soundex, and many of the other checks done on words while indexing. |
1806 |
|
1807 |
Buzzwords are case insensitive. |
1808 |
|
1809 |
Buzzwords should be separated by spaces and may span multiple directives. |
1810 |
If the special format C<File:filename> is used then the Buzzwords will |
1811 |
be read from an external file during indexing. |
1812 |
|
1813 |
Examples: |
1814 |
|
1815 |
Buzzwords C++ TCP/IP |
1816 |
|
1817 |
Buzzwords File: ./buzzwords.lst |
1818 |
|
1819 |
If a Buzzword contains search operator characters they must be backslashed |
1820 |
when searching. For example: |
1821 |
|
1822 |
Buzzwords C++ TCP/IP web=http |
1823 |
|
1824 |
./swish-e -w 'web\=http' |
1825 |
|
1826 |
Buzzwords are found by splitting the text on whitespace, removing |
1827 |
C<IgnoreFirstChar> and C<IgnoreLastChar> characters from the word, |
1828 |
and then comparing with the list of C<Buzzwords>. Therefore, if |
1829 |
adding C<Buzzwords> to an index you will probably want to define |
1830 |
C<IgnoreFirstChar> and C<IgnoreLastChar> settings. |
1831 |
|
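The matching procedure just described can be modelled with a short Python sketch. The function name C<find_buzzwords> and the sample sentence are hypothetical; the sketch only mirrors the three documented steps (split on whitespace, strip the ignore characters, compare case-insensitively).

```python
def find_buzzwords(text, buzzwords, ignore_first="", ignore_last=""):
    """Return the buzzwords found in text, following the documented steps."""
    wanted = {word.lower() for word in buzzwords}   # case insensitive
    found = []
    for token in text.split():                      # split on whitespace
        candidate = token.lstrip(ignore_first).rstrip(ignore_last).lower()
        if candidate in wanted:
            found.append(candidate)
    return found


sentence = "We ship C++ code over TCP/IP."
with_strip = find_buzzwords(sentence, ["C++", "TCP/IP"], ignore_last=".,")
without_strip = find_buzzwords(sentence, ["C++", "TCP/IP"])
print(with_strip, without_strip)
```

Without the C<ignore_last> characters the trailing period stays attached to C<TCP/IP.>, so that buzzword is missed; this is why the ignore settings matter here.
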
1832 |
Note: Buzzwords specific settings for C<IgnoreFirstChar> and |
1833 |
C<IgnoreLastChar> may be used in the future. |
1834 |
|
1835 |
|
1836 |
=item IgnoreWords [*list of stop words*|File: path] |
1837 |
|
1838 |
The IgnoreWords option allows you to specify words to ignore, called |
1839 |
I<stopwords>. The default is to not use any stopwords. |
1840 |
|
1841 |
Words should be separated by spaces and may span multiple directives. |
1842 |
If the special format C<File:filename> is used then the stop words will |
1843 |
be read from an external file during indexing. |
1844 |
|
1845 |
In previous versions of Swish-e you could use the directive |
1846 |
|
1847 |
IgnoreWords swishdefault - obsolete! |
1848 |
|
1849 |
to include a default list of compiled in stopwords. This keyword is no |
1850 |
longer supported. |
1851 |
|
1852 |
Examples: |
1853 |
|
1854 |
IgnoreWords www http a an the of and or |
1855 |
|
1856 |
IgnoreWords File: ./stopwords.de |
1857 |
|
1858 |
=item UseWords [*list of words*|File: path] |
1859 |
|
1860 |
UseWords defines the words that Swish-e will index. B<Only> the words |
1861 |
listed will be indexed. |
1862 |
|
1863 |
You can specify a list of words following the directive (you may specify |
1864 |
more than one C<UseWords> directive in a config file), and/or use the |
1865 |
C<File:> form to specify a path to a file containing the words: |
1866 |
|
1867 |
UseWords perl python pascal fortran basic cobol php |
1868 |
UseWords File: /path/to/my/wordlist |
1869 |
|
1870 |
Please drop the Swish-e list a note if you actually use this feature. |
1871 |
It may be removed from future versions. |
1872 |
|
1873 |
=item IgnoreLimit *integer integer* |
1874 |
|
1875 |
This automatically omits words that appear too often in the files (these |
1876 |
words are called stopwords). Specify a whole percentage and a number, |
1877 |
such as "80 256". This omits words that occur in over 80% of the files |
1878 |
and appear in over 256 files. Comment out to turn off auto-stopwording. |
1879 |
|
1880 |
IgnoreLimit 50 1000 |
1881 |
|
1882 |
Swish-e must do extra processing to adjust the entire index when this |
1883 |
feature is used. It is recommended that instead of using this feature |
1884 |
you decide which words are stopwords and add them to B<IgnoreWords> |
1885 |
in your configuration file. To do this, use IgnoreLimit one time and |
1886 |
note the stop words that are found while indexing. Add this list to |
1887 |
IgnoreWords, and then remove IgnoreLimit from the configuration file. |
1888 |
|
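As a rough illustration of the two thresholds, the Python sketch below applies them to a hypothetical table of per-word file counts. The function name C<auto_stopwords> and the sample numbers are assumptions; only the "over the percentage and over the file count" test comes from the description above.

```python
def auto_stopwords(file_counts, total_files, percent, min_files):
    """Return words that exceed both IgnoreLimit thresholds."""
    stopwords = []
    for word, count in file_counts.items():
        share = 100.0 * count / total_files   # percentage of files with the word
        if share > percent and count > min_files:
            stopwords.append(word)
    return sorted(stopwords)


# Hypothetical counts: number of files in which each word appears.
counts = {"the": 1900, "swish": 1200, "metaphone": 12}
print(auto_stopwords(counts, total_files=2000, percent=50, min_files=1000))
```
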
1889 |
=item IgnoreMetaTags *list of names* |
1890 |
|
1891 |
C<IgnoreMetaTags> defines a list of metatags to ignore while indexing |
1892 |
XML files (and HTML files if using libxml2 for parsing HTML). All text |
1893 |
within the tags will be ignored -- both for indexing (C<MetaNames>) |
1894 |
and properties (C<PropertyNames>). To still parse properties but |
1895 |
not index the text, see L<UndefinedMetaTags|/"item_UndefinedMetaTags">. |
1896 |
|
1897 |
This option is useful to avoid indexing specific data from a file. |
1898 |
For example: |
1899 |
|
1900 |
<person> |
1901 |
<first_name> |
1902 |
William |
1903 |
</first_name> <last_name> |
1904 |
Shakespeare |
1905 |
</last_name> <updated_date> |
1906 |
April 25, 1999 |
1907 |
</updated_date> |
1908 |
</person> |
1909 |
|
1910 |
In the above example you might B<not> want to index the updated date, |
1911 |
and therefore prevent finding this record by searching |
1912 |
|
1913 |
-w 'person=(April)' |
1914 |
|
1915 |
This is solved by: |
1916 |
|
1917 |
IgnoreMetaTags updated_date |
1918 |
|
1919 |
|
1920 |
See also L<UndefinedMetaTags|/"item_UndefinedMetaTags">. |
1921 |
|
1922 |
=item IgnoreNumberChars *list of characters* |
1923 |
|
1924 |
Experimental Feature |
1925 |
|
1926 |
This experimental feature can be used to define a set of characters |
1927 |
that describe a number. If a word is found to contain only those |
1928 |
characters it will not be indexed. The characters listed must be part |
1929 |
of C<WordCharacters> settings. In other words, the "word" checked is |
1930 |
a word that Swish-e would otherwise index. |
1931 |
|
1932 |
For example, |
1933 |
|
1934 |
IgnoreNumberChars 0123456789$., |
1935 |
|
1936 |
Then Swish-e would not index the following: |
1937 |
|
1938 |
123 |
1939 |
123,456.78 |
1940 |
$123.45 |
1941 |
|
1942 |
You might be tempted to avoid indexing hex numbers with: |
1943 |
|
1944 |
IgnoreNumberChars 0123456789abcdef |
1945 |
|
1946 |
which will not index 0D31, but will also not index the word "bad". |
1947 |
|
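The check amounts to "every character of the word is in the list". A minimal Python sketch follows, with a hypothetical function name and assuming the word has already been lowercased the way indexed words are:

```python
def is_ignored_number(word, number_chars):
    """True if the word consists only of IgnoreNumberChars characters."""
    word = word.lower()   # assumption: comparison is done on lowercased words
    return bool(word) and all(ch in number_chars for ch in word)


for sample in ("123", "123,456.78", "$123.45"):
    print(sample, is_ignored_number(sample, "0123456789$.,"))

# The hex pitfall: ordinary words made of a-f letters are dropped too.
print(is_ignored_number("0D31", "0123456789abcdef"))
print(is_ignored_number("bad", "0123456789abcdef"))
```
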
1948 |
This is an experimental feature that may change in future versions. |
1949 |
One possible change is to use regular expressions instead. |
1950 |
|
1951 |
|
1952 |
=item IndexComments [NO|yes] |
1953 |
|
1954 |
This option lets the user decide whether to index the contents of HTML |
1955 |
comments. Default is no. Set to yes if comment indexing is required. |
1956 |
|
1957 |
IndexComments yes |
1958 |
|
1959 |
Note: This is a change from the default behavior prior to version 2.2. |
1960 |
|
1961 |
=item TranslateCharacters [*string1 string2*|:ascii7:] |
1962 |
|
1963 |
The TranslateCharacters directive maps the characters in string1 to the |
1964 |
characters listed in string2. |
1965 |
|
1966 |
For example: |
1967 |
|
1968 |
# This will index a_b as a-b and ámo as amo |
1969 |
TranslateCharacters _á -a |
1970 |
|
1971 |
C<TranslateCharacters :ascii7:> is a predefined set of characters that |
1972 |
will translate eight bit characters to ascii7 characters. Using the |
1973 |
:ascii7: rule will translate "Ääç" to "aac". This means: searching |
1974 |
"Çelik", "çelik" or "celik" will all match the same word. |
1975 |
|
1976 |
TranslateCharacters is done early in the indexing process, after |
1977 |
converting HTML entities but before splitting the input text into words |
1978 |
based on B<WordCharacters>. So characters you are translating I<from> |
1979 |
do not need to be listed in word characters. |
1980 |
|
1981 |
The same character translations take place when searching. |
1982 |
|
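The character-for-character mapping behaves much like Python's C<str.maketrans>. The sketch below is an analogy only: C<translate_characters> is a hypothetical helper, and Python works on Unicode strings whereas Swish-e handles eight-bit characters.

```python
def translate_characters(text, string1, string2):
    """Map each character of string1 to the character at the same
    position in string2, as TranslateCharacters is described to do."""
    if len(string1) != len(string2):
        raise ValueError("string1 and string2 must have the same length")
    return text.translate(str.maketrans(string1, string2))


print(translate_characters("a_b", "_á", "-a"))
print(translate_characters("ámo", "_á", "-a"))
```
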
1983 |
=item BumpPositionCounterCharacters *string* |
1984 |
|
1985 |
When indexing Swish-e assigns a word position to each word. This enables |
1986 |
phrase searching. There may be cases where you would like to prevent |
1987 |
phrase matching. The BumpPositionCounterCharacters directive allows |
1988 |
you to specify a set of characters that when found in the text will |
1989 |
increment the word position -- effectively preventing phrase matches |
1990 |
across that character. |
1991 |
|
1992 |
For example, if you have a tag: |
1993 |
|
1994 |
<subjects> |
1995 |
computer programming | apple computers |
1996 |
</subjects> |
1997 |
|
1998 |
You might want to prevent matching "programming apple" in that meta name. |
1999 |
|
2000 |
BumpPositionCounterCharacters | |
2001 |
|
2002 |
There is no default, and you may list a string of characters. |
2003 |
|
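To see why bumping the position blocks a phrase match, consider this Python sketch. It is a simplified model, not Swish-e's indexer: the helper names are hypothetical, words are taken to be runs of alphanumeric characters, and a bump is assumed to add exactly one to the position counter.

```python
def word_positions(text, bump_chars=""):
    """Assign a position to each word; bump characters add an extra step."""
    positions = []
    position = 0
    word = ""
    for ch in text + " ":            # trailing space flushes the last word
        if ch.isalnum():
            word += ch
            continue
        if word:
            position += 1
            positions.append((word.lower(), position))
            word = ""
        if ch in bump_chars:
            position += 1            # leave a gap so phrases cannot span it
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase words occupy consecutive positions."""
    wanted = phrase.lower().split()
    if not wanted:
        return False
    indexed = set(positions)
    return any(
        all((w, start + offset) in indexed for offset, w in enumerate(wanted))
        for word, start in positions if word == wanted[0])


text = "computer programming | apple computers"
plain = word_positions(text)
bumped = word_positions(text, bump_chars="|")
print(phrase_matches(plain, "programming apple"))
print(phrase_matches(bumped, "programming apple"))
print(phrase_matches(bumped, "apple computers"))
```
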
2004 |
=item DontBumpPositionOnEndTags *list of names* |
2005 |
|
2006 |
=item DontBumpPositionOnStartTags *list of names* |
2007 |
|
2008 |
Since metatags are typically separate data fields, the word position |
2009 |
counter is automatically bumped between metatags (actually, bumped when a |
2010 |
start tag is found and when an end tag is found). This prevents matching |
2011 |
a phrase that spans more than one metaname. C<DontBumpPositionOnEndTags> |
2012 |
and C<DontBumpPositionOnStartTags> disable this feature for the listed |
2013 |
metanames. |
2014 |
|
2015 |
For example, |
2016 |
|
2017 |
<person> |
2018 |
<first_name> |
2019 |
William |
2020 |
</first_name> |
2021 |
<last_name> |
2022 |
Shakespeare |
2023 |
</last_name> |
2024 |
<updated_date> |
2025 |
April 25, 1999 |
2026 |
</updated_date> |
2027 |
</person> |
2028 |
|
2029 |
In the configuration file: |
2030 |
|
2031 |
DontBumpPositionOnEndTags first_name |
2032 |
DontBumpPositionOnStartTags last_name |
2033 |
|
2034 |
This configuration allows this phrase search |
2035 |
|
2036 |
-w 'person=("william shakespeare")' |
2037 |
|
2038 |
but this phrase search will fail |
2039 |
|
2040 |
-w 'person=("shakespeare april")' |
2041 |
|
2042 |
|
2043 |
|
2044 |
=back |
2045 |
|
2046 |
|
2047 |
=head2 Directives for the File Access method only |
2048 |
|
2049 |
Some directives have different uses depending on the source of the |
2050 |
documents. These directives are only valid when using the B<File system> |
2051 |
method of indexing. |
2052 |
|
2053 |
=over 4 |
2054 |
|
2055 |
=item IndexOnly *list of file suffixes* |
2056 |
|
2057 |
This directive specifies the allowable file suffixes (extensions) while |
2058 |
indexing. The default is to index all files specified in B<IndexDir>. |
2059 |
|
2060 |
# Only index .html .htm and .q files |
2061 |
IndexOnly .html .htm .q |
2062 |
|
2063 |
C<IndexOnly> checks that the file name ends in the characters listed. It does |
2064 |
not check "extensions". C<IndexOnly> is tested right before C<FileRules> |
2065 |
is processed. |
2066 |
|
2067 |
=item FollowSymLinks [yes|NO] |
2068 |
|
2069 |
Put "yes" to follow symbolic links in indexing, else "no". Default is no. |
2070 |
|
2071 |
FollowSymLinks no |
2072 |
FollowSymLinks yes |
2073 |
|
2074 |
Note that when set to C<no> extra stat(2) system calls must be made for |
2075 |
each file. For a large number of files you may see a small reduction in |
2076 |
indexing time by setting this to C<yes>. |
2077 |
|
2078 |
See also the C<-l> switch in L<SWISH-RUN|SWISH-RUN>. |
2079 |
|
2080 |
=item FileRules [type] [contains|is|regex] *regular expression* |
2081 |
|
2082 |
=item FileMatch [type] [contains|is|regex] *regular expression* |
2083 |
|
2084 |
FileRules and FileMatch are used to, respectively, exclude and include |
2085 |
files and directories to index. Since, by default, Swish-e indexes all |
2086 |
files and recurses all directories (but see also C<FollowSymLinks>) you |
2087 |
will typically only use C<FileRules> to exclude files or directories. |
2088 |
C<FileMatch> is useful in a few cases, for example, to override the |
2089 |
behavior of C<IndexOnly>. Some examples are included below. |
2090 |
|
2091 |
Except for C<FileRules title ...>, this feature is only available for |
2092 |
file access method (-S fs), which is the default indexing mode. Also, |
2093 |
any pathname modification with C<ReplaceRules> happens after the check |
2094 |
for C<FileRules>. (It's unlikely that you would exclude files with |
2095 |
C<FileRules> based on text you added with C<ReplaceRules>!) |
2096 |
|
2097 |
The regular expression is a C regex.h extended regular expression. |
2098 |
You may supply more than one regular expression per line, or use |
2099 |
separate directives. Preceding the regular expression with the word |
2100 |
"not" negates the match. |
2101 |
|
2102 |
The regular expression is compared against B<[type]> as described below. |
2103 |
|
2104 |
For historical reasons, you can specify C<contains> or C<is>. C<is> |
2105 |
simply forces the regular expression to match at the start and end |
2106 |
of the string (by internally prepending "^" and appending "$" to the |
2107 |
regular expression). |
2108 |
|
2109 |
The C<regex> option requires delimiter characters: |
2110 |
|
2111 |
FileRules title regex /^private/i |
2112 |
|
2113 |
The only advantage of C<regex> is if you want to do case insensitive |
2114 |
matches, or simply like your regular expressions to look like perl |
2115 |
regular expressions. You must use matching delimiters; (), {}, and [], |
2116 |
are not currently supported for no good reason other than laziness. |
2117 |
|
2118 |
Use quotes (" or ') around a pattern if it contains any white space. |
2119 |
Note that the backslash character becomes the escape character within |
2120 |
quotes. |
2121 |
|
2122 |
For example, these sets generate the same regular expressions. |
2123 |
|
2124 |
FileRules title is hello |
2125 |
FileRules title contains ^hello$ |
2126 |
FileRules title regex /^hello$/ |
2127 |
|
2128 |
These all need quotes due to the included space character |
2129 |
|
2130 |
FileRules title is "hello there" |
2131 |
FileRules title contains "^hello there$" |
2132 |
FileRules title regex "!^hello there$!" |
2133 |
|
2134 |
These show how the backslash must be doubled inside of quotes. |
2135 |
Swish-e converts a double-backslash into a single backslash, and then |
2136 |
passes that single onto the regular expression compiler. |
2137 |
|
2138 |
FileRules filename regex /\.pdf/ |
2139 |
FileRules filename regex "/\\.pdf/" |
2140 |
|
2141 |
FileRules filename regex !hello\\there! # need double for real backslash |
2142 |
FileRules filename regex "!hello\\\\there!" # need double-double inside of quotes |
2143 |
|
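The relationship between C<is>, C<contains> and C<regex> can be sketched in Python. This is an illustration under stated assumptions: C<compile_rule> is a hypothetical helper, it expects the pattern after Swish-e's quote and backslash processing, and Python's C<re> module stands in for the C regex.h library, whose syntax is similar but not identical.

```python
import re


def compile_rule(kind, pattern):
    """Turn an is / contains / regex pattern into a compiled expression."""
    if kind == "is":
        # "is" anchors the pattern at both ends.
        return re.compile("^" + pattern + "$")
    if kind == "contains":
        return re.compile(pattern)
    if kind == "regex":
        delimiter = pattern[0]              # e.g. "/" or "!"
        end = pattern.rfind(delimiter)
        if end == 0:
            raise ValueError("missing closing delimiter")
        body, flags = pattern[1:end], pattern[end + 1:]
        return re.compile(body, re.IGNORECASE if "i" in flags else 0)
    raise ValueError("unknown match kind: " + kind)


same = [compile_rule("is", "hello").pattern,
        compile_rule("contains", "^hello$").pattern,
        compile_rule("regex", "/^hello$/").pattern]
print(same)
print(bool(compile_rule("regex", "/^private/i").search("Private notes")))
```
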
2144 |
|
2145 |
B<Matching Types> |
2146 |
|
2147 |
The following types of match strings may be supplied: |
2148 |
|
2149 |
FileRules pathname |
2150 |
FileRules dirname |
2151 |
FileRules filename |
2152 |
FileRules directory |
2153 |
FileRules title |
2154 |
|
2155 |
FileMatch pathname |
2156 |
FileMatch filename |
2157 |
FileMatch dirname |
2158 |
FileMatch directory |
2159 |
|
2160 |
B<pathname> matches the regular expression against the current pathname. |
2161 |
The pathname may or may not be absolute depending on what you supplied |
2162 |
to C<IndexDir>. |
2163 |
|
2164 |
Example: |
2165 |
|
2166 |
# Don't index paths that contain private or hidden |
2167 |
FileRules pathname contains (private|hidden) |
2168 |
|
2169 |
# Same thing |
2170 |
FileRules pathname regex /(private|hidden)/ |
2171 |
|
2172 |
# Don't index exe files |
2173 |
FileRules pathname contains \.exe$ |
2174 |
|
2175 |
B<dirname> and B<filename> split the path name by the last delimiter |
2176 |
character into a directory name, and a file name. Then these are compared |
2177 |
against the patterns supplied. Directory names do B<not> have a trailing |
2178 |
slash. All path names use the forward slash as a delimiter within Swish-e. |
2179 |
|
2180 |
Example: |
2181 |
|
2182 |
# Same as last example - don't index *.exe files. |
2183 |
FileRules filename contains \.exe$ |
2184 |
|
2185 |
# Don't index any file called test.html |
2186 |
FileRules filename contains ^test\.html$ |
2187 |
|
2188 |
# Same thing |
2189 |
FileRules filename is test\.html |
2190 |
|
2191 |
# Don't index any directories that contain "old" (/usr/local/myold/docs) |
2192 |
FileRules dirname contains old |
2193 |
|
2194 |
# Don't index any directories that contain the path segment "old" (/usr/local/old/foo) |
2195 |
FileRules dirname contains /old/ |
2196 |
|
2197 |
# Index only .htm, .html, plus any all-digit file names |
2198 |
IndexOnly .htm .html |
2199 |
FileMatch filename contains ^[0-9]+$ |
2200 |
|
2201 |
# Same as previous, but maybe a little slower |
2202 |
FileRules filename regex not !\.(htm|html)$! |
2203 |
FileMatch filename contains ^[0-9]+$ |
2204 |
|
2205 |
Swish-e checks these settings in the order of C<pathname>, C<dirname>, and |
2206 |
C<filename>, and C<FileMatch> patterns are checked before C<FileRules>, |
2207 |
in general. This allows you to exclude most files with C<FileRules>, |
2208 |
yet allow in a few special cases with C<FileMatch>. For example: |
2209 |
|
2210 |
# Exclude all files of .exe, .bin, and .bat |
2211 |
FileRules filename contains \.(exe|bin|bat)$ |
2212 |
# But, let these two in |
2213 |
FileMatch filename is baseball\.bat incoming_mail\.bin |
2214 |
|
2215 |
# Same, but as a single pattern |
2216 |
FileMatch filename is (baseball\.bat|incoming_mail\.bin) |
2217 |
|
2218 |
The C<directory> type is somewhat unique. When Swish-e recurses into a |
2219 |
directory it will compare all the I<files> in the directory with the |
2220 |
pattern and then decide if that entire directory should or should not |
2221 |
be indexed (or recursed). Note that you are matching against file names |
2222 |
in a directory -- and some of those names may be directory names. |
2223 |
|
2224 |
A C<FileRules directory> match will cause Swish-e to ignore all files and |
2225 |
sub-directories in the current directory. |
2226 |
|
2227 |
Warning: A match with C<FileMatch directory> says to index B<everything> |
2228 |
in the *current* directory and B<ignore> any FileRules for this directory. |
2229 |
|
2230 |
|
2231 |
Example: |
2232 |
|
2233 |
# Don't index any directories (and sub directories) that contain |
2234 |
# a file (or sub-directory) called "index.skip" |
2235 |
FileRules directory contains ^index\.skip$ |
2236 |
|
2237 |
# Don't index directories that contain a .htaccess file. |
2238 |
FileRules directory contains ^\.htaccess |
2239 |
|
2240 |
Note: While I<processing> directories, Swish-e will ignore any files |
2241 |
or directories that begin with a dot ("."). You may index files |
2242 |
or directories that begin with a dot by specifying their name with |
2243 |
C<IndexDir> or C<-i>. |
2244 |
|
2245 |
C<title> checks for a pattern match in an HTML title. |
2246 |
|
2247 |
Example: |
2248 |
|
2249 |
FileRules title contains construction example pointers |
2250 |
|
2251 |
# This example says to ignore case |
2252 |
FileRules title regex "/^Internal document/i" |
2253 |
|
2254 |
Note: C<FileRules title> works for any input method (fs, prog, or http) |
2255 |
that is parsed as HTML, and where a title was found in the document. |
2256 |
|
2257 |
In case all this seems a bit confusing, processing a directory happens |
2258 |
in the following order. |
2259 |
|
2260 |
First the directory name is checked: |
2261 |
|
2262 |
FileRules dirname - reject entire directory if matches |
2263 |
|
2264 |
Next the directory is scanned and each file name (which might be the |
2265 |
name of a sub-directory) is checked: |
2266 |
|
2267 |
FileRules directory - reject entire dir if any files match |
2268 |
FileMatch directory - accept *entire* dir if any files match |
2269 |
|
2270 |
Then, unless C<FileMatch directory> matched, each file is tested with |
2271 |
FileMatch. A match says to index the file without further testing |
2272 |
(i.e. overrides FileRules and IndexOnly): |
2273 |
|
2274 |
FileMatch pathname \ |
2275 |
FileMatch dirname - file is accepted if any match |
2276 |
FileMatch filename / |
2277 |
|
2278 |
otherwise |
2279 |
|
2280 |
IndexOnly - file is checked for the correct file extension |
2281 |
|
2282 |
FileRules pathname \ |
2283 |
FileRules dirname - file is rejected if any match |
2284 |
FileRules filename / |
2285 |
|
2286 |
finally, the file is indexed. |
2287 |
|
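The per-file part of this order (FileMatch first, then IndexOnly, then FileRules) can be summarised in a Python sketch. It is a simplified model, not Swish-e's implementation: C<should_index> and its arguments are hypothetical, the directory-level checks are left out, and Python's C<re> module stands in for regex.h.

```python
import re


def should_index(path, file_match=(), index_only=(), file_rules=()):
    """Decide whether one file is indexed, following the documented order.

    file_match and file_rules are lists of (type, regex) pairs, where
    type is "pathname", "dirname" or "filename".
    """
    # Split on the last "/" into directory name and file name.
    dirname, _, filename = path.rpartition("/")
    targets = {"pathname": path, "dirname": dirname, "filename": filename}

    # 1. Any FileMatch hit accepts the file without further testing.
    if any(re.search(rx, targets[kind]) for kind, rx in file_match):
        return True
    # 2. IndexOnly: the name must end in one of the listed suffixes.
    if index_only and not path.endswith(tuple(index_only)):
        return False
    # 3. Any FileRules hit rejects the file.
    if any(re.search(rx, targets[kind]) for kind, rx in file_rules):
        return False
    return True


rules = [("filename", r"\.(exe|bin|bat)$")]
match = [("filename", r"^(baseball\.bat|incoming_mail\.bin)$")]
for name in ("docs/baseball.bat", "docs/setup.bat", "docs/readme.txt"):
    print(name, should_index(name, file_match=match, file_rules=rules))
```
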
2288 |
Files (not directories) listed with C<IndexDir> or C<-i> are processed |
2289 |
in a similar way: |
2290 |
|
2291 |
FileMatch pathname \ |
2292 |
FileMatch dirname - file is accepted if any match |
2293 |
FileMatch filename / |
2294 |
|
2295 |
otherwise, the file is rejected if it doesn't have the correct extension |
2296 |
or a FileRules matches. |
2297 |
|
2298 |
IndexOnly - file is checked for the correct file extension |
2299 |
|
2300 |
FileRules pathname \ |
2301 |
FileRules dirname - file is rejected if any match |
2302 |
FileRules filename / |
2303 |
|
2304 |
Note: If things are not indexing as you expect, create a directory |
2305 |
with some test files and use the C<-T regex> trace option to see how |
2306 |
file names are checked. Start with very simple tests! |
2307 |
|
2308 |
|
2309 |
=back |
2310 |
|
2311 |
=head2 Directives for the HTTP Access Method Only |
2312 |
|
2313 |
These directives are available when using the HTTP Access Method of indexing. |
2314 |
|
2315 |
=over 4 |
2316 |
|
2317 |
=item MaxDepth *integer* |
2318 |
|
2319 |
MaxDepth defines how many links the spider should follow before stopping. |
2320 |
A value of 0 configures the spider to traverse all links. The default |
2321 |
is MaxDepth 5. |
2322 |
|
2323 |
MaxDepth 5 |
2324 |
|
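As a hedged illustration of a depth limit, the Python sketch below walks a small in-memory link graph breadth first. It is not the spider's code: C<crawl> and the sample graph are hypothetical, and the exact way Swish-e counts depth is an assumption here. Only "0 means no limit" is taken from the text above.

```python
from collections import deque


def crawl(start, links, max_depth=5):
    """Visit pages reachable from start, following at most max_depth links.

    links maps a URL to the URLs it points to; max_depth 0 means no limit.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth and depth >= max_depth:
            continue                      # do not follow links any deeper
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl("a", graph, max_depth=2))
print(crawl("a", graph, max_depth=0))
```
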
2325 |
=item Delay *seconds* |
2326 |
|
2327 |
The number of seconds to wait between issuing requests to a server. |
2328 |
This setting allows for more friendly spidering of remote sites. |
2329 |
The default is 60 seconds. |
2330 |
|
2331 |
Delay 1 |
2332 |
|
2333 |
=item TmpDir *path* |
2334 |
|
2335 |
The location of a writable temp directory on your system. The HTTP |
2336 |
access method tells the Perl helper to place its files in this location, |
2337 |
and the C<-e> switch causes Swish-e to use this directory while indexing. |
2338 |
There is no default. |
2339 |
|
2340 |
TmpDir /tmp/swish |
2341 |
|
2342 |
If this directory does not exist or is not writable Swish-e will fail |
2343 |
with an error during indexing. |
2344 |
|
2345 |
Note, the environment variables of C<TMPDIR>, C<TMP>, and C<TEMP> |
2346 |
(in that order) will B<override> this setting. |
2347 |
|
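The precedence described in the note can be written out as a small Python sketch. The function C<resolve_tmpdir> is hypothetical and only restates the documented order; treating an empty environment variable as unset is an assumption.

```python
def resolve_tmpdir(config_tmpdir, environ):
    """Pick the temp directory: TMPDIR, TMP, TEMP, then the TmpDir setting."""
    for name in ("TMPDIR", "TMP", "TEMP"):
        value = environ.get(name)
        if value:                 # assumption: empty values count as unset
            return value
    return config_tmpdir


print(resolve_tmpdir("/tmp/swish", {"TMP": "/var/tmp"}))
print(resolve_tmpdir("/tmp/swish", {}))
```
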
2348 |
=item SpiderDirectory *path* |
2349 |
|
2350 |
The location of the Perl helper script called F<swishspider>. If you |
2351 |
use a relative directory, it is relative to your directory when you run |
2352 |
Swish-e, not to the directory that Swish-e is in. The default is C<./> |
2353 |
|
2354 |
SpiderDirectory /usr/local/swish |
2355 |
|
2356 |
=item EquivalentServer *server alias* |
2357 |
|
2358 |
Often the same site may be referred to by different names. |
2359 |
A common example is that often http://www.some-server.com and |
2360 |
http://some-server.com are the same. Each line should have a list of |
2361 |
all the method/names that should be considered equivalent. Multiple |
2362 |
EquivalentServer directives may be used. Each directive defines its |
2363 |
own set of equivalent servers. |
2364 |
|
2365 |
EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu |
2366 |
EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu |
2367 |
|
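One way to picture equivalent servers is as groups of names that are rewritten to a single form before URLs are compared. The Python sketch below is only a model: the helper names are hypothetical, and choosing the first name in each directive as the canonical one is an assumption, not documented behavior.

```python
def build_alias_map(groups):
    """Map every server name in a group to the group's first name."""
    alias = {}
    for group in groups:
        canonical = group[0]      # assumption: first name is canonical
        for name in group:
            alias[name] = canonical
    return alias


def canonical_url(url, alias):
    """Rewrite the server part of url to its canonical equivalent."""
    for name, canonical in alias.items():
        if url == name or url.startswith(name + "/"):
            return canonical + url[len(name):]
    return url


alias = build_alias_map([
    ["http://library.berkeley.edu", "http://www.lib.berkeley.edu"],
    ["http://sunsite.berkeley.edu:2000", "http://sunsite.berkeley.edu"],
])
print(canonical_url("http://www.lib.berkeley.edu/help.html", alias))
print(canonical_url("http://sunsite.berkeley.edu/index.html", alias))
```
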
2368 |
=back |
2369 |
|
2370 |
=head2 Directives for the prog Access Method Only |
2371 |
|
2372 |
This section details the directives that are only available for the |
2373 |
"prog" document source feature of Swish-e. The "prog" access method runs |
2374 |
an external program that "feeds" documents to Swish-e. This allows indexing |
2375 |
and filtering of documents from any source. |
2376 |
|
2377 |
See L<prog - general purpose access method|SWISH-RUN/"item_prog"> in |
2378 |
the SWISH-RUN man page for more information. |
2379 |
|
2380 |
|
2381 |
A number of example programs for use with the "prog" access method are |
2382 |
provided in the F<prog-bin> directory. Please see those examples if you |
2383 |
have questions about implementing a "prog" input program. |
2384 |
|
2385 |
=over 4 |
2386 |
|
2387 |
=item SwishProgParameters *list of parameters* |
2388 |
|
2389 |
This is a list of parameters that will be sent to the external program |
2390 |
when running with the "prog" document source method. |
2391 |
|
2392 |
SwishProgParameters /path/to/config hello there |
2393 |
IndexDir /path/to/program.pl |
2394 |
|
2395 |
Then running: |
2396 |
|
2397 |
swish-e -c config -S prog |
2398 |
|
2399 |
Swish-e will execute C</path/to/program.pl> and pass C</path/to/config |
2400 |
hello there> as three command line arguments to the program. This |
2401 |
directive makes it easy to pass settings from the Swish-e configuration |
2402 |
file to the external program. |
2403 |
|
2404 |
For example, the C<spider.pl> program (included in the C<prog-bin> |
2405 |
directory) uses the C<SwishProgParameters> to specify what file to read |
2406 |
for configuration information. |
2407 |
|
2408 |
SwishProgParameters spider.config |
2409 |
IndexDir ./spider.pl |
2410 |
|
2411 |
The C<spider.pl> program also has a default action so you can avoid |
2412 |
using a configuration file: |
2413 |
|
2414 |
SwishProgParameters default http://www.swishe.org/ http://some.other.site/ |
2415 |
IndexDir ./spider.pl |
2416 |
|
2417 |
And the spider program will use default settings for spidering those sites. |
2418 |
|
2419 |
=back |
2420 |
|
2421 |
B<Notes when using MS Windows> |
2422 |
|
2423 |
You should use unix style path separators to specify your external |
2424 |
program. Swish will convert forward slashes to backslashes before |
2425 |
calling the external program. This is only true for the program name |
2426 |
specified with C<IndexDir> or the C<-i> command line option. |
2427 |
|
2428 |
In addition, Swish-e will make sure the program specified actually exists, |
2429 |
which means you need to use the full name of the program. |
2430 |
|
2431 |
For example, to run the perl spider program F<spider.pl> you would need |
2432 |
a Swish-e configuration file such as: |
2433 |
|
2434 |
IndexDir e:/perl/bin/perl.exe |
2435 |
SwishProgParameters prog-bin/spider.pl default http://swish-e.org |
2436 |
|
2437 |
and run indexing with the command: |
2438 |
|
2439 |
swish-e -c swish.cfg -S prog -v 9 |
2440 |
|
2441 |
The C<IndexDir> command tells Swish-e the name of the program to run. |
2442 |
Under unix you can just specify the name of the script, since unix will |
2443 |
figure out the program from the first line of the script. |
2444 |
|
2445 |
The C<SwishProgParameters> are the parameters passed to the program |
2446 |
specified by C<IndexDir> (perl.exe in this case). The first parameter |
2447 |
is the perl script to run (F<prog-bin/spider.pl>). Perl passes the rest |
2448 |
of the parameters directly to the perl script. The second parameter |
2449 |
F<default> tells the F<spider.pl> program to use default settings for |
2450 |
spidering (or you could specify a spider config file -- see C<perldoc |
2451 |
spider.pl> for details), and lastly, the URL is passed into the spider |
2452 |
program. |
2453 |
|
2454 |
|
2455 |
=head2 Document Filter Directives |
2456 |
|
2457 |
Internally, Swish-e knows how to parse only text, HTML, and XML documents. |
2458 |
With Swish-e filters you can index other types of documents. For example, |
2459 |
if all your web pages are in gzip format a filter can uncompress these |
2460 |
on the fly for indexing. |
2461 |
|
A filter is an external program that Swish-e executes while processing
a document of a given type.  Swish-e will execute the filter program
for each file that matches the file suffix (extension) set in the
B<FileFilter> directive, or the regular expressions set in the
B<FileFilterMatch> directive.  Both directives are described below.

By B<default>, Swish-e calls the external program with the following
arguments:

=over 4

=item $0

The name of the filter program.

=item $1

The physical path name of the file to read.  This may be a temporary
file location if indexing by the http method.

=item $2

When indexing under the file system this will be the same as $1 (the
path to the source file), but when indexing under the http method this
will be the URL of the source document.

=back

Swish-e can also pass other parameters to the filter program.  These
parameters can be defined using the B<FileFilter> or B<FileFilterMatch>
directives.  See Filter Options below.

The filter program must open the file, process its contents, and return
the result to Swish-e by printing it to STDOUT.

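As a rough illustration of the contract described above, the following
shell sketch shows what a very small filter might look like.  It assumes
the default arguments ($1 is the work file, $2 the document path or URL).
The function name C<filter_main> is hypothetical, and stripping carriage
returns with C<tr> only stands in for real conversion work.

```shell
#!/bin/sh
# Hypothetical filter sketch: reads the work file, writes converted text to STDOUT.
# Assumes Swish-e's default arguments: $1 = work file path, $2 = document path/URL.

filter_main() {
    work_path=$1      # file to read (may be a temporary file)
    doc_path=$2       # original path or URL; used only in messages here

    if [ ! -r "$work_path" ]; then
        # Diagnostics go to STDERR so they never end up in the indexed text.
        echo "cannot read '$work_path' (document: $doc_path)" >&2
        return 1
    fi

    # Stand-in for real conversion work: drop carriage returns.
    tr -d '\r' < "$work_path"
}

# Run only when arguments are supplied, as they are when Swish-e calls the filter.
if [ "$#" -gt 0 ]; then
    filter_main "$@"
fi
```

Keeping all diagnostics on STDERR is a deliberate choice in this sketch:
everything printed to STDOUT is treated as document content.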
Note that this can add a significant amount of time to the indexing
process if your external program is a Perl or shell script.  If you
have many files to filter you should consider writing your filter in C
instead of a shell or Perl script, or using the "prog" Access Method.

=over 4

=item FilterDir *path-to-directory*

This is the path to a directory where the filter programs are stored.
Swish-e looks in this directory to find the filter specified in the
B<FileFilter> directive.  If this directive is omitted, you have to
specify the full path to the filter script on each B<FileFilter> directive.

This feature does *not* apply to the C<FileFilterMatch> directive.

Example:

    FilterDir /usr/local/swish/filters

=item FileFilter *suffix* "filter-prog" ["filter-options"]

This maps a file suffix (extension) to a filter program.  If I<filter-prog>
starts with a directory delimiter (absolute path), Swish-e doesn't use
the FilterDir settings, but uses the given I<filter-prog> path directly.

Filter options:

Filter options are a string passed as arguments to the I<filter-prog>.
Filter options can contain variables, which Swish-e replaces before
running the filter.  If you omit I<filter-options>, Swish-e passes the
default arguments listed above.

    Default:      "'%p' '%P'"
    Which means:  pass "workfile path" and "documentfile path" to filter (each quoted).

Variables in filter options:

    %%  =  %
    %P  =  Full document pathname (e.g. URL, or path on filesystem)
    %p  =  Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
    %F  =  Filename stripped from full document pathname
    %f  =  Filename stripped from "work" pathname
    %D  =  Directoryname stripped from full document pathname
    %d  =  Directoryname stripped from full "work" pathname

Examples of strings passed:

    %P  =  document pathname: http://myserver/path1/mydoc.txt
    %p  =  work pathname:     /tmp/tmp.1234.mydoc.txt
    %F  =  mydoc.txt
    %f  =  tmp.1234.mydoc.txt
    %D  =  http://myserver/path1
    %d  =  /tmp

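To make the relationship between the variables concrete, the sketch below
derives the file name and directory parts from the two example paths above
using plain shell parameter expansion.  It only illustrates the mapping:
splitting at the last "/" is an assumption based on the examples, not a
description of Swish-e's internal code, and the function names are
hypothetical.

```shell
#!/bin/sh
# Illustration only: split a path at its last "/" the way %F/%f and %D/%d
# appear to be derived in the examples above. Not Swish-e's actual code.

file_part() { printf '%s\n' "${1##*/}"; }   # text after the last "/"
dir_part()  { printf '%s\n' "${1%/*}"; }    # text before the last "/"

doc_path='http://myserver/path1/mydoc.txt'  # what %P would hold
work_path='/tmp/tmp.1234.mydoc.txt'         # what %p would hold

echo "%F = $(file_part "$doc_path")"
echo "%f = $(file_part "$work_path")"
echo "%D = $(dir_part "$doc_path")"
echo "%d = $(dir_part "$work_path")"
```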
Important hint for security:

When using variable substitution, use quotes to ensure filename integrity.

    e.g.  "'%f'"  -->  'file name with spaces.doc'.

If you don't quote the variables, your system security may be compromised,
or filtering may not work for these files.

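The effect of the quoting can be tried directly in a shell.  The sketch
below assumes that the filter command line is handed to a shell as a single
string.  The helper name C<run_like_filter> and the sample file names are
made up for the demonstration.

```shell
#!/bin/sh
# Demonstration of why substituted file names should be quoted.
# run_like_filter is a hypothetical stand-in: it passes a command string to
# the shell and prints each argument the command receives, one per line.

run_like_filter() {
    sh -c "printf '[%s]\n' $1"
}

name='file name with spaces.doc'
run_like_filter "$name"        # unquoted: the name splits into four arguments
run_like_filter "'$name'"      # single-quoted: one argument

evil='x.doc; echo INJECTED'
run_like_filter "$evil"        # unquoted: the text after ";" runs as a command
run_like_filter "'$evil'"      # single-quoted: treated as one odd file name
```

Single quotes do not help if the file name itself contains a single quote,
so quoting reduces the risk rather than removing it.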
B<Notes when using MS Windows>

Windows uses double quotes to escape shell metacharacters, so reverse
the quotes in the examples above.  e.g.:

    '"%f"'  -->  "file name with spaces.doc"

You can specify the filter program using forward slashes (unix style).
Swish-e will convert the slashes to backslashes before running your program.

    FileFilter .mydoc c:/some/path/mydocfilter.exe '-d "%d" -example -url "%P" "%f"'

Examples of filters:

    FileFilter .doc     /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
    FileFilter .pdf     pdftotext "'%p' -"
    FileFilter .html.gz gzip "-dc '%p'"
    FileFilter .mydoc   "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"

The above examples are running a I<binary> filter program.  For more
complicated filtering needs you may use a scripting language such as
Perl or a shell script.  Here are some examples of calling a shell script
and a Perl script:

    FileFilter .pdf pdf2html.sh
    FileFilter .ps  ghostscript-filter.pl

Using a scripting language (or any language that has a large startup
cost) can B<greatly increase the indexing time>.  For small indexing
jobs, this may not be an issue, but for large collections of files that
require processing by a scripting language, you may be better off using
the C<-S prog> access method where the script will only be compiled once,
instead of for each document.

Filters are probably easier to write than a C<-S prog> program.  Which you
decide to use depends on your requirements.  Examples of filter scripts
can be found in the F<filter-bin> directory, and examples of C<-S prog>
programs can be found in the F<prog-bin> directory.

=item FileFilterMatch *filter-prog* *filter-options* *regex* [*regex* ...]

This is similar to C<FileFilter> except that it uses regular expressions
to match against the file name.  *filter-prog* is the path to the program.
Unlike C<FileFilter> this does B<not> use the C<FilterDir> option.
Also unlike C<FileFilter> you B<must> specify the *filter-options*.

Examples:

    FileFilterMatch ./pdftotext "'%p' -" /\.pdf$/

Note that this will also match a file called ".pdf", so you may want to use
something that requires a filename that has more than just an extension.
For example:

    FileFilterMatch ./pdftotext "'%p' -" /.\.pdf$/

To specify more than one extension:

    FileFilterMatch ./check_title.pl "%p" /\.html$/ /\.htm$/

Or a few ways to do the same thing:

    FileFilterMatch ./check_title.pl %p /\.(htm|html)$/
    FileFilterMatch ./check_title.pl %p /\.html?$/

And to ignore case:

    FileFilterMatch ./check_title.pl %p /\.html?$/i

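The patterns above can be tried out against sample file names before
running an index.  The sketch below uses C<grep -E> as a stand-in.  It
assumes the expressions behave like POSIX extended regular expressions,
which may not match Swish-e's regular expression handling in every detail.
The helper name C<matches> is hypothetical.

```shell
#!/bin/sh
# Try file names against the patterns used in the examples above.
# grep -E is only a stand-in for Swish-e's own regular expression matching.

# matches NAME REGEX [i]  -> prints "yes" or "no"; a third argument "i" ignores case
matches() {
    if [ "$3" = "i" ]; then
        flags='-Eqi'
    else
        flags='-Eq'
    fi
    if printf '%s\n' "$1" | grep "$flags" -- "$2"; then
        echo yes
    else
        echo no
    fi
}

matches '.pdf'       '\.pdf$'          # yes: a bare ".pdf" matches
matches '.pdf'       '.\.pdf$'         # no: needs a character before the dot
matches 'report.pdf' '.\.pdf$'         # yes
matches 'index.htm'  '\.(htm|html)$'   # yes
matches 'index.HTML' '\.html?$'        # no: case differs
matches 'index.HTML' '\.html?$' i      # yes: case ignored
```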
You may also precede an expression with "not" to negate the regular
expressions that follow.  For example, to match files that do not have
an extension:

    FileFilterMatch ./convert "%p %P" not /\..+$/

=back

=head1 Document Info

$Id: SWISH-CONFIG.pod,v 1.60 2002/08/28 14:30:23 whmoseley Exp $

.