
Contents of /mitgcm.org/devel/buildweb/pkg/swish-e/pod/SWISH-FAQ.pod



Revision 1.1.1.1 - (show annotations) (download) (vendor branch)
Fri Sep 20 19:47:29 2002 UTC (22 years, 10 months ago) by adcroft
Branch: Import, MAIN
CVS Tags: baseline, HEAD
Changes since 1.1: +0 -0 lines
Importing web-site building process.

1 =head1 NAME
2
3 The Swish-e FAQ - Answers to Common Questions
4
5 =head1 Frequently Asked Questions
6
7 =head2 General Questions
8
9 =head3 What is Swish-e?
10
11 Swish-e is B<S>imple B<W>eb B<I>ndexing B<S>ystem for B<H>umans -
12 B<E>nhanced. With it, you can quickly and easily index directories of
13 files or remote web sites and search the generated indexes for words
14 and phrases.
15
16 =head3 So, is Swish-e a search engine?
17
18 Well, yes. Probably the most common use of Swish-e is to provide a search
19 engine for web sites. The Swish-e distribution includes CGI scripts that
20 can be used with it to add a I<search engine> for your web site. The CGI
21 scripts can be found in the F<example> directory of the distribution
22 package. See the F<README> file for information about the scripts.
23
24 But Swish-e can also be used to index all sorts of data, such as email
25 messages, data stored in a relational database management system,
26 XML documents, or documents such as Word and PDF documents -- or any
27 combination of those sources at the same time. Searches can be limited
28 to fields or I<MetaNames> within a document, or limited to areas within
29 an HTML document (e.g. body, title). Programs other than CGI applications
30 can use Swish-e, as well.
31
32 =head3 Should I upgrade if I'm already running a previous version
33 of Swish-e?
34
35 A large number of bug fixes, feature additions, and logic corrections were
36 made in version 2.2. In addition, indexing speed has been drastically
37 improved (reports of indexing times changing from four hours to five
38 minutes), and major parts of the indexing and search parsers have been
39 rewritten. There are better debugging options, enhanced output formats,
40 more document metadata (e.g. last modified date, document summary),
41 options for indexing from external data sources, and faster spidering,
42 just to name a few changes. (See the CHANGES file for more information.)
43
44 Since so much effort has gone into version 2.2, support for previous
45 versions will probably be limited.
46
47 =head3 Are there binary distributions available for Swish-e on platform foo?
48
49 Foo? Well, yes there are some binary distributions available. Please see
50 the Swish-e web site for a list at http://swish-e.org/.
51
52 In general, it is recommended that you build Swish-e from source,
53 if possible.
54
55 =head3 Do I need to reindex my site each time I upgrade to a new Swish-e
56 version?
57
58 At times it might not strictly be necessary, but since you don't really
59 know if anything in the index has changed, it is a good rule to reindex.
60
61 =head3 What's the advantage of using the libxml2 library for parsing HTML?
62
63 Swish-e may be linked with libxml2, a library for working with HTML and XML
64 documents. Swish-e can use libxml2 for parsing HTML and XML documents.
65
66 The libxml2 parser is a better parser than Swish-e's built-in HTML
67 parser. It offers more features, and it does a much better job at
68 extracting out the text from a web page. In addition, you can use the
69 C<ParserWarningLevel> configuration setting to find structural errors
70 in your documents that could (and would with Swish-e's HTML parser)
71 cause documents to be indexed incorrectly.
72
73 Libxml2 is not required, but is strongly recommended for parsing HTML
74 documents. It's also recommended for parsing XML, as it offers many
75 more features than the internal Expat xml.c parser.
76
77 The internal HTML parser will have limited support, and does have a
78 number of bugs. For example, HTML entities may not always be correctly
79 converted and properties do not have entities converted. The internal
80 parser tends to get confused when invalid HTML is parsed where the libxml2
81 parser doesn't get confused as often. The structure is better detected
82 with the libxml2 parser.
83
84 If you are using the Perl module (the C interface to the Swish-e
85 library) you may wish to build two versions of Swish-e, one with the
86 libxml2 library linked in the binary, and one without, and build the
87 Perl module against the library without the libxml2 code. This is to
88 save space in the library. Hopefully, the library will someday soon be
89 split into indexing and searching code (volunteers welcome).
90
91 =head3 Does Swish-e include a CGI interface?
92
93 An example CGI script is included in the C<example> directory.
94 (Type C<perldoc swish.cgi> in the C<example> directory for instructions.)
95
96 Please be careful when picking a CGI script to use with Swish-e. Quite a
97 few of the scripts that have been available for it are insecure and
98 should not be used.
99
100 The included example CGI script was designed with security in mind.
101 Regardless, you are encouraged to have your local Perl expert review it
102 (and all other CGI scripts you use) before placing it into production.
103 This is just a good policy to follow.
104
105 =head3 How secure is Swish-e?
106
107 We know of no security issues with using Swish-e. Careful attention
108 has been paid to common security problems such as buffer
109 overruns when programming Swish-e.
110
111 The most likely security issue with Swish-e is when it is run via
112 a poorly written CGI interface. This is not limited to CGI scripts
113 written in Perl, as it's just as easy to write an insecure CGI script
114 in C, Java, PHP, or Python. A good source of information is included
115 with the Perl distribution. Type C<perldoc perlsec> at your local
116 prompt for more information. Another must-read document is located at
117 C<http://www.w3.org/Security/faq/wwwsf4.html>.
118
119 Note that there are many I<free> yet insecure and poorly written CGI
120 scripts available -- even some designed for use with Swish-e. Please
121 carefully review any CGI script you use. Free is not such a good price
122 when you get your server hacked...
123
124 =head3 Should I run Swish-e as the superuser (root)?
125
126 No. Never.
127
128 =head3 What files does Swish-e write?
129
130 Swish writes the index file, of course. This is specified with the
131 C<IndexFile> configuration directive or by the C<-f> command line switch.
132
133 The index file is actually a collection of files, but all start with
134 the file name specified with the C<IndexFile> directive or the C<-f>
135 command line switch.
136
137 For example, the file ending in F<.prop> contains the document properties.
138
139 When creating the index files Swish-e appends the extension F<.temp>
140 to the index file names. When indexing is complete Swish-e renames the
141 F<.temp> files to the index files specified by C<IndexFile> or C<-f>.
142 This is done so that existing indexes remain untouched until it completes
143 indexing.
144
145 Swish-e also writes temporary files in some cases during indexing
146 (e.g. C<-S http>, C<-S prog> with filters, when merging, and when
147 using C<-e>). Temporary files are created with the mkstemp(3) function
148 (with 0600 permission on unix-like operating systems).
149
150 The temporary files are created in the directory specified by the
151 environment variables C<TMPDIR> and C<TMP>, in that order. If neither
152 is set, then Swish-e uses the configuration setting
153 L<TmpDir|SWISH-CONFIG/"item_TmpDir">. Otherwise, the temporary file
154 will be located in the current directory.
155
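The lookup order described above can be sketched as a small shell function.
This is only an illustration: C<swish_tmpdir> is a hypothetical helper, not
part of Swish-e, and it assumes that an empty variable counts as not set.

```shell
# Hypothetical helper (not part of Swish-e) that mirrors the documented
# lookup order: $TMPDIR, then $TMP, then the TmpDir config value, then ".".
# Assumption: an empty variable is treated the same as an unset one.
swish_tmpdir() {
    conf_tmpdir="${1:-}"    # value of TmpDir from the config file, if any
    if [ -n "${TMPDIR:-}" ]; then
        printf '%s\n' "$TMPDIR"
    elif [ -n "${TMP:-}" ]; then
        printf '%s\n' "$TMP"
    elif [ -n "$conf_tmpdir" ]; then
        printf '%s\n' "$conf_tmpdir"
    else
        printf '%s\n' "."
    fi
}
```

Calling C<swish_tmpdir /scratch> with neither environment variable set prints
C</scratch>; with no argument either, it prints the current directory.
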
156 =head3 Can I index PDF and MS-Word documents?
157
158 Yes, you can use a I<Filter> to convert documents while indexing, or you
159 can use a program that "feeds" documents to Swish-e that have already
160 been converted. See C<Indexing> below.
161
162 =head3 Can I index documents on a web server?
163
164 Yes, Swish-e provides two ways to index (spider) documents on a web
165 server. See C<Spidering> below.
166
167 Swish-e can retrieve documents from a file system or from a remote web
168 server. It can also execute a program that returns documents back
169 to it. This program can retrieve documents from a database, filter
170 compressed document files, convert PDF files, extract data from mail
171 archives, or spider remote web sites.
172
173 =head3 Can I implement keywords in my documents?
174
175 Yes, Swish-e can associate words with I<MetaNames> while indexing,
176 and you can limit your searches to these MetaNames while searching.
177
178 In your HTML files you can put keywords in HTML META tags or in XML blocks.
179
180 META tags can have two formats in your source documents:
181
182 <META NAME="DC.subject" CONTENT="digital libraries">
183
184
185 And in XML format (can also be used in HTML documents when using libxml2):
186
187 <meta2>
188 Some Content
189 </meta2>
190
191
192 Then, to inform Swish-e about the existence of the meta name in your
193 documents, edit the line in your configuration file:
194
195 MetaNames DC.subject meta1 meta2
196
197 When searching you can now limit some or all search terms to that
198 MetaName. For example, you can look for documents that contain the word
199 apple and also have either fruit or cooking in the DC.subject meta tag.
200
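Following the query syntax used elsewhere in this FAQ, a search for apple
restricted by the DC.subject meta name might be written as below. This is
a sketch: it assumes the default index file in the current directory, and
the C<SWISH> variable and C<search_subject> function are names invented
for the example.

```shell
# Path to the swish-e binary; override SWISH if it is not on your PATH.
SWISH="${SWISH:-swish-e}"

# Single quotes stop the shell from interpreting the parentheses
# before swish-e sees the query.
query='apple and DC.subject=(fruit or cooking)'

# Hypothetical wrapper: run one search against the default index.
search_subject() {
    "$SWISH" -w "$1"
}
```

Run it as C<search_subject "$query">. The quoting matters: without it the
shell would try to interpret the parentheses itself.
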
201 =head3 What are document properties?
202
203 A document property is typically data that describes the document.
204 For example, properties might include a document's path name, its last
205 modified date, its title, or its size. Swish-e stores a document's
206 properties in the index file, and they can be reported back in search
207 results.
208
209 Swish-e also uses properties for sorting. You may sort your results by
210 one or more properties, in ascending or descending order.
211
212 Properties can also be defined within your documents. HTML and
213 XML files can specify tags (see previous question) as properties.
214 The I<contents> of these tags can then be returned with search results.
215 These user-defined properties can also be used for sorting search results.
216
217 For example, if you had the following in your documents
218
219 <meta name="creator" content="accounting department">
220
221 and C<creator> is defined as a property (see C<PropertyNames> in
222 L<SWISH-CONFIG|SWISH-CONFIG>) Swish-e can return C<accounting department>
223 with the result for that document.
224
225 swish-e -w foo -p creator
226
227 Or for sorting:
228
229 swish-e -w foo -s creator
230
231 =head3 What's the difference between MetaNames and PropertyNames?
232
233 MetaNames allow keyword searches in your documents. That is, you can
234 use MetaNames to restrict searches to just parts of your documents.
235
236 PropertyNames, on the other hand, define text that can be returned with
237 results, and can be used for sorting.
238
239 Both use I<meta tags> found in your documents (as shown in the above two
240 questions) to define the text you wish to use as a property or meta name.
241
242 You may define a tag as B<both> a property and a meta name. For example:
243
244 <meta name="creator" content="accounting department">
245
246 placed in your documents and then using configuration settings of:
247
248 PropertyNames creator
249 MetaNames creator
250
251 will allow you to limit your searches to documents created by accounting:
252
253 swish-e -w 'foo and creator=(accounting)'
254
255 That will find all documents with the word C<foo> that also have a creator
256 meta tag that contains the word C<accounting>. This is using MetaNames.
257
258 And you can also say:
259
260 swish-e -w foo -p creator
261
262 which will return all documents with the word C<foo>, but the results will
263 also include the contents of the C<creator> meta tag along with results.
264 This is using properties.
265
266 You can use properties and meta names at the same time, too:
267
268 swish-e -w 'creator=(accounting or marketing)' -p creator -s creator
269
270 That searches only in the C<creator> I<meta name> for either of the words
271 C<accounting> or C<marketing>, prints out the contents
272 of the C<creator> I<property>, and sorts the results by the C<creator>
273 I<property name>.
274
275 (See also the C<-x> output format switch in L<SWISH-RUN|SWISH-RUN>.)
276
277 =head3 Can Swish-e index multi-byte characters?
278
279 No. This will require much work to change. But Swish-e works with
280 eight-bit characters, so many character sets can be used. Note that it
281 does call the ANSI C tolower() function, which depends on the current
282 locale setting. See C<locale(7)> for more information.
283
284 =head2 Indexing
285
286 =head3 How do I pass Swish-e a list of files to index?
287
288 Currently, there is not a configuration directive to include a file that
289 contains a list of files to index. But, there is a directive to include
290 another configuration file.
291
292 IncludeConfigFile /path/to/other/config
293
294 And in C</path/to/other/config> you can say:
295
296 IndexDir file1 file2 file3 file4 file5 ...
297 IndexDir file20 file21 file22
298
299 You may also specify more than one configuration file on the command line:
300
301 ./swish-e -c config_one config_two config_three
302
303 Another option is to create a directory with symbolic links of the files
304 to index, and index just that directory.
305
306 =head3 How does Swish-e know which parser to use?
307
308 Swish-e can parse HTML, XML, and text documents. The parser is set by
309 associating a file extension with a parser using the C<IndexContents>
310 directive. You may set the default parser with the C<DefaultContents>
311 directive. If a document is not assigned a parser it will default to
312 the HTML parser (HTML2 if built with libxml2).
313
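A configuration along these lines would assign parsers by extension. The
extensions chosen and the helper name C<write_parser_conf> are illustrative
assumptions; only the directive names and the parser names come from this
FAQ.

```shell
# Write an example parser mapping to the file named by $1.
# Hypothetical helper; the extensions listed are only examples.
write_parser_conf() {
    cat > "$1" <<'EOF'
# Map file extensions to parsers.
IndexContents HTML .htm .html
IndexContents XML  .xml
IndexContents TXT  .txt
# Anything else falls back to the text parser instead of HTML.
DefaultContents TXT
EOF
}
```

After C<write_parser_conf parsers.conf>, the file can be passed alongside
your main configuration, since C<-c> accepts more than one file.
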
314 You may use Filters or an external program to convert documents to HTML,
315 XML, or text.
316
317 =head3 Can I reindex and search at the same time?
318
319 Yes. Starting with version 2.2 Swish-e indexes to temporary files, and then
320 renames the files when indexing is complete. On most systems renames
321 are atomic. But, since Swish-e also generates more than one file during
322 indexing there will be a very short period of time between renaming the
323 various files when the index is out of sync.
324
325 Settings in F<config.h> control some options related to temporary files,
326 and their use during indexing.
327
328 =head3 Can I index phrases?
329
330 Phrases are indexed automatically. To search for a phrase simply place
331 double quotes around the phrase.
332
333 For example:
334
335 swish-e -w 'free and "fast search engine"'
336
337 =head3 How can I prevent phrases from matching across sentences?
338
339 Use the
340 L<BumpPositionCounterCharacters|SWISH-CONFIG/"item_BumpPositionCounterCharacters">
341 configuration directive.
342
343 =head3 Swish-e isn't indexing a certain word or phrase.
344
345 There are a number of configuration parameters that control what Swish-e
346 considers a "word" and it has a debugging feature to help pinpoint
347 any indexing problems.
348
349 Configuration file directives (L<SWISH-CONFIG|SWISH-CONFIG>)
350 C<WordCharacters>, C<BeginCharacters>, C<EndCharacters>,
351 C<IgnoreFirstChar>, and C<IgnoreLastChar> are the main settings that
352 Swish-e uses to define a "word". See L<SWISH-CONFIG|SWISH-CONFIG> and
353 L<SWISH-RUN|SWISH-RUN> for details.
354
355 Swish-e also uses compile-time defaults for many settings. These are
356 located in the F<src/config.h> file.
357
358 The command line arguments C<-k>, C<-v> and C<-T> are useful when
359 debugging these problems. Using C<-T INDEXED_WORDS> while indexing will
360 display each word as it is indexed. You should specify one file when
361 using this feature since it can generate a lot of output.
362
363 ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS
364
365 You may also wish to index a single file that contains words that are or
366 are not indexing as you expect and use -T to output debugging information
367 about the index. A useful command might be:
368
369 ./swish-e -f index.swish-e -T INDEX_FULL
370
371 Once you see how Swish-e is parsing and indexing your words, you can
372 adjust the configuration settings mentioned above to control what words
373 are indexed.
374
375 Another useful command might be:
376
377 ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS
378
379 This will show white-spaced words parsed from the document (PARSED_WORDS),
380 and how those words are split up into separate words for indexing
381 (INDEXED_WORDS).
382
383
384 =head3 How do I keep Swish-e from indexing numbers?
385
386 Swish-e indexes words as defined by the C<WordCharacters> setting, as
387 described above. So to avoid indexing numbers you simply remove digits
388 from the C<WordCharacters> setting.
389
390 There are also some settings in F<config.h> that control what "words"
391 are indexed. You can configure swish to never index words that are all
392 digits, vowels, or consonants, or that contain more than some consecutive
393 number of digits, vowels, or consonants. In general, you won't need to
394 change these settings.
395
396 Also, there's an experimental feature called C<IgnoreNumberChars>
397 which allows you to define a set of characters that describe a number.
398 If a word is made up of B<only> those characters it will not be indexed.
399
400
401 =head3 Swish-e crashes and burns on a certain file. What can I do?
402
403 This shouldn't happen. If it does, please post the details to the Swish-e
404 discussion list so the developers can reproduce the problem.
405
406 In the meantime, you can use a C<FileRules> directive to exclude the
407 particular file name, or pathname, or its title. If there are serious
408 problems in indexing certain types of files, they may not have valid text
409 in them (they may be binary files, for instance). You can use C<NoContents>
410 to keep the contents of that type of file from being indexed.
411
412 Swish-e will issue a warning if an embedded null character is found in a
413 document. This warning will be an indication that you are trying to index
414 binary data. If you need to index binary files try to find a program
415 that will extract out the text (e.g. strings(1), catdoc(1), pdftotext(1)).
416
417 =head3 How do I prevent indexing of some documents?
418
419 When using the file system to index your files you can use the
420 C<FileRules> directive. Other than C<FileRules title>, C<FileRules>
421 only works with the file system (C<-S fs>) indexing method, not with
422 C<-S prog> or C<-S http>.
423
424 If you are spidering, use a F<robots.txt> file in your document root.
425 This is a standard way to exclude files from search engines, and is
426 fully supported by Swish-e. See http://www.robotstxt.org/
427
428 You can also modify the F<spider.pl> spider perl program to skip, index
429 content only, or spider only listed web pages. Type C<perldoc spider.pl>
430 in the C<prog-bin> directory for details.
431
432 If using the libxml2 library for parsing HTML, you may also use the Meta
433 Robots Exclusion in your documents:
434
435 <meta name="robots" content="noindex">
436
437 See the L<obeyRobotsNoIndex|SWISH-CONFIG/"item_obeyRobotsNoIndex"> directive.
438
439 =head3 How do I prevent indexing parts of a document?
440
441 If you want to prevent Swish-e from indexing a common header, footer, or
442 navigation bar, and you are using libxml2 for parsing HTML, you may
443 use a fake HTML tag around the text you wish to ignore and use the
444 C<IgnoreMetaTags> directive. This will generate an error message if
445 the C<ParserWarningLevel> is set, as it's invalid HTML.
446
447 C<IgnoreMetaTags> works with XML documents (and HTML documents when
448 using libxml2 as the parser), but not with documents parsed by the text
449 (TXT) parser.
450
451 If you are using the libxml2 parser (HTML2 and XML2) then you can use the following
452 comments in your documents to prevent indexing:
453
454 <!-- SwishCommand noindex -->
455 <!-- SwishCommand index -->
456
457 and/or these may be used also:
458
459 <!-- noindex -->
460 <!-- index -->
461
462
463 =head3 How do I modify the path or URL of the indexed documents?
464
465 Use the C<ReplaceRules> configuration directive to rewrite path names
466 and URLs. If you are using the C<-S prog> input method you may set the path
467 to any string.
468
469 =head3 How can I index data from a database?
470
471 Use the "prog" document source method of indexing. Write a program to
472 extract out the data from your database, and format it as XML, HTML,
473 or text. See the examples in the C<prog-bin> directory, and the next
474 question.
475
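As a rough sketch of what such a program prints, the function below emits
one record in the header format used by the Perl example later in this FAQ
(Content-Length, Last-Mtime, Path-Name, a blank line, then the content).
C<emit_doc> and its arguments are hypothetical; a real program would loop
over rows fetched from the database and call something like it once per row.

```shell
# Hypothetical helper: print one document for swish-e's -S prog input.
#   $1 = path (any string identifying the record)
#   $2 = last-modified time, in seconds since the epoch
#   $3 = document content (HTML, XML, or text)
emit_doc() {
    # Content-Length is the size of the content in bytes.
    # The arithmetic expansion strips the padding some wc versions add.
    size=$(( $(printf '%s' "$3" | wc -c) ))
    printf 'Content-Length: %s\n' "$size"
    printf 'Last-Mtime: %s\n' "$2"
    printf 'Path-Name: %s\n' "$1"
    printf '\n'
    printf '%s' "$3"
}
```

A script built around this would be run the same way as the other C<prog>
examples, with C<-S prog> and C<-i> naming the script.
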
476 =head3 How do I index my PDF, Word, and compressed documents?
477
478 Swish-e can internally only parse HTML, XML and TXT (text) files by
479 default, but can make use of I<filters> that will convert other types
480 of files such as MS Word documents, PDF, or gzipped files into one of
481 the file types that Swish-e understands.
482
483 The B<FileFilter> config directive is used to define programs to use
484 as filters, based on file extension. For example, you can use the
485 program C<catdoc> to convert MS-Word documents to text for indexing.
486 Please see L<SWISH-CONFIG|SWISH-CONFIG/"Document Filter Directives">
487 and the examples in the C<filter-bin> directory for more information.
488
489 Another option is to use the C<prog> document source input method.
490 In this case you write a program (such as a perl script) that will read
491 and convert your data as needed and then output one of the formats
492 that Swish-e understands. Examples of using the C<prog> input method
493 for filtering are included in the C<prog-bin> directory of the Swish-e
494 distribution.
495
496 The disadvantage of using the C<prog> input method is that you must
497 write a program that reads the documents from the source (e.g. from the
498 file system or via a spider to read files on a web server), and also
499 include the code to filter the documents. It's much easier to use the
500 C<FileFilter> option since the filter can often be implemented with just
501 a single configuration directive.
502
503 On the other hand, the advantage of using the C<prog> input method for
504 indexing is speed. Filtering within a C<prog> input method program
505 will be faster if your filtering program is something like a Perl script
506 (something that has a large start-up cost). This may or may not be an
507 issue for you, depending on how much time your indexing requires.
508
509 You can also use a combination of methods. For example, say you are
510 indexing a directory that contains PDF files using a C<FileFilter>
511 directive. Now you want to index a MySQL database that also contains
512 PDF files. You can write a C<prog> input method program to read your
513 MySQL database and use the same C<FileFilter> configuration parameter
514 (and filter program) to convert the PDF files into one of the native
515 Swish-e formats (TXT, HTML, XML).
516
517 Do note that it will be slower to use the C<FileFilter> method instead
518 of running the filter directly from the C<prog> input method program.
519 When C<FileFilter> is used with the C<prog> input method Swish-e must
520 create a temporary file containing the output from your C<prog> method
521 program, and then execute the filter program.
522
523 In general, use the C<FileFilter> method to filter documents. If indexing
524 speed is an issue, consider writing a C<prog> input method program.
525 If you are already using the C<prog> method, then filtering will probably
526 be best accomplished within that program.
527
528 Here are two examples of how to run a filter program, one using Swish-e's
529 C<FileFilter> directive, another using a C<prog> input method program.
530 Both simply use the program C</bin/cat> as a filter and index only
531 .html files.
532
533 First, using the C<FileFilter> method, here's the entire configuration
534 file (swish.conf):
535
536 IndexDir .
537 IndexOnly .html
538 FileFilter .html "/bin/cat" "'%p'"
539
540 and index with the command
541
542 swish-e -c swish.conf -v 1
543
544 Now, the same thing with using the C<prog> document source input method
545 and a Perl program called catfilter.pl. You can see that it's much
546 more work than using the C<FileFilter> method above, but provides a
547 place to do additional processing. In this example, the C<prog> method
548 is only slightly faster. But if you needed a perl script to run as a
549 FileFilter then C<prog> will be significantly faster.
550
551 #!/usr/local/bin/perl -w
552 use strict;
553 use File::Find; # for recursing a directory tree
554
555 $/ = undef;
556 find(
557 { wanted => \&wanted, no_chdir => 1, },
558 '.',
559 );
560
561 sub wanted {
562 return if -d;
563 return unless /\.html$/;
564
565 my $mtime = (stat)[9];
566
567 my $child = open( FH, '-|' );
568 die "Failed to fork $!" unless defined $child;
569 exec '/bin/cat', $_ unless $child;
570
571 my $content = <FH>;
572 my $size = length $content;
573
574 print <<EOF;
575 Content-Length: $size
576 Last-Mtime: $mtime
577 Path-Name: $_
578
579 EOF
580
581 print $content; # FH was already read in full above, so print the saved copy
582 }
583
584 And index with the command:
585
586 swish-e -S prog -i ./catfilter.pl -v 1
587
588 This example will probably not work under Windows due to the '-|' open.
589 A simple piped open may work just as well:
590
591 That is, replace:
592
593 my $child = open( FH, '-|' );
594 die "Failed to fork $!" unless defined $child;
595 exec '/bin/cat', $_ unless $child;
596
597 with this:
598
599 open( FH, "/bin/cat $_ |" ) or die $!;
600
601 Perl will try to avoid running the command through the shell if meta
602 characters are not passed to the open. See C<perldoc -f open> for
603 more information.
604
605 =head3 Eh, but I just want to know how to index PDF documents!
606
607 See the examples in the F<conf> directory.
608
609 =head3 I'm using the prog method to index PDF documents, but the file
610 contents are not indexed.
611
612 Some of the examples in the F<prog-bin> directory use a module to
613 convert the PDF files into XML. So you must tell Swish-e that you are
614 indexing XML files for the PDF extension.
615
616 IndexContents XML .pdf
617
618 =head3 I'm using Windows and can't get Filters or the prog input method
619 to work!
620
621 Both the C<-S prog> input method and filters use the C<popen()> system
622 call to run the external program. If your external program is, for
623 example, a perl script, you have to tell Swish-e to run perl, instead of
624 the script. Also, you must use backslashes as the path separator in the
625 program name, since C<popen()> runs the command via the shell, and the
626 Windows shell expects backslashes.
627
628 For example, you would need to specify the path to perl as (assuming
629 this is where perl is on your system):
630
631 IndexDir e:\\perl\\bin\\perl.exe
632
633 Or run a filter like:
634
635 FileFilter .foo e:\\perl\\bin\\perl.exe 'myscript.pl "%p"'
636
637
638 =head3 How do I index non-English words?
639
640 Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1
641 character set, and includes many non-English letters (and symbols).
642 As long as they are listed in C<WordCharacters> they will be indexed.
643
644 Actually, you probably can index any 8-bit character set, as long as
645 you don't mix character sets in the same index.
646
647 The C<TranslateCharacters> directive (L<SWISH-CONFIG|SWISH-CONFIG>)
648 can translate characters while indexing and searching. You may
649 specify the mapping of one character to another character with the
650 C<TranslateCharacters> directive.
651
652 C<TranslateCharacters :ascii7:> is a predefined set of characters that
653 will translate eight bit characters to ascii7 characters. Using the
654 C<:ascii7:> rule will, for example, translate "Ääç" to "aac". This means:
655 searching "Çelik", "çelik" or "celik" will all match the same word.
656
657 Note: When using libxml2 for parsing, parsed documents are converted
658 internally (within libxml2) to UTF-8. This is converted to ISO 8859-1
659 Latin-1 when indexing. In cases where a string can not be converted
660 from UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters),
661 the string will be sent to Swish-e in UTF-8 encoding. This will result
662 in some words indexed incorrectly. Setting C<ParserWarningLevel> to 1
663 or more will display warnings when UTF-8 to 8859-1 conversion fails.
664
665 =head3 Can I add/remove files from an index?
666
667 Not really. Swish-e currently has no way to add or remove items from
668 its index.
669
670 About the only way to delete items from the index is to stat(2) all the
671 results to make sure that all the files still exist.
672
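One way to apply that check is to filter the list of result paths before
displaying it. The sketch below assumes the paths have already been
extracted from the search output, one per line; C<existing_only> is a
hypothetical name, not a Swish-e feature.

```shell
# Read file paths on stdin, one per line, and print only those that
# still exist on disk. Hypothetical helper, not part of Swish-e.
existing_only() {
    while IFS= read -r path; do
        if [ -e "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

This hides stale entries from the user but does not shrink the index; the
deleted documents are only dropped at the next full reindex.
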
673 Incremental additions can be handled in a couple of ways, depending on
674 your situation. It's probably easiest to create one main index every
675 night (or every week), and then create an index of just the new files
676 between main indexing jobs and use the C<-f> option to pass both indexes
677 to Swish-e while searching.
678
679 You can merge the indexes into one index (instead of using -f), but it's
680 not clear that this has any advantage over searching multiple indexes.
681 Using C<-f> gives access to the individual headers of both indexes,
682 while C<-M> merges the headers, and merging indexes with different
683 indexing settings (Stemming, WordCharacters) may produce odd results.
684 This is a question for the Swish-e discussion list.
685
686 How does one create the incremental index?
687
688 One method is by using the C<-N> switch to pass a file path to
689 Swish-e when indexing. It will only index files that have a last
690 modification date I<newer> than the file supplied with the C<-N> switch.
691
692 This option has the disadvantage that Swish-e must process every file in
693 every directory as if they were going to be indexed (the test for C<-N>
694 is done last, right before indexing of the file contents begins and after
695 all other tests on the file have been completed) -- all that just to
696 find a few new files.
697
698 Also, if you use the Swish-e index file as the file passed to C<-N> there
699 may be files that were added after indexing was started, but before the
700 index file was written. This could result in a file not being added to
701 the index.
702
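One way around that timing problem is to compare against a separate stamp
file that is created before the full indexing run starts, rather than
against the index file itself. The sketch below assumes a configuration
file named F<swish.conf> and index names F<main.index> and
F<incremental.index>; those names, the C<SWISH> variable, and the function
names are all made up for the example.

```shell
# Path to the swish-e binary; override SWISH if it is not on your PATH.
SWISH="${SWISH:-swish-e}"

# Full reindex. The stamp is taken BEFORE indexing starts, so files added
# while the full index is being built are newer than the stamp and get
# picked up by the next incremental run.
full_index() {
    touch full.stamp.new &&
    "$SWISH" -c swish.conf -f main.index &&
    mv full.stamp.new full.stamp
}

# Incremental index of files modified after the last full run began.
incremental_index() {
    "$SWISH" -c swish.conf -N full.stamp -f incremental.index
}

# Search both indexes together.
search_all() {
    "$SWISH" -w "$1" -f main.index incremental.index
}
```

With this arrangement a file changed during the full run may appear in
both indexes, which trades the missing-file problem for a possible
duplicate.
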
703 Another option is to maintain a parallel directory tree that contains
704 symlinks pointing to the main files. When a new file is added (or
705 changed) to the main directory tree you create a symlink to the real file
706 in the parallel directory tree. Then just index the symlink directory
707 to generate the incremental index.
708
709 This option has the disadvantage that you need to have a central
710 program that creates the new files that can also create the symlinks.
711 But, indexing is quite fast since Swish-e only has to look at the files
712 that need to be indexed. When you run full indexing you simply unlink
713 (delete) all the symlinks.
714
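The bookkeeping for the parallel tree might look like the following. This
is a sketch under assumptions: both function names are hypothetical, the
main root must be given as an absolute path so the links resolve, and
whatever program creates your files would have to call the first function
itself.

```shell
# Hypothetical helper: record a new or changed file by linking it into
# the parallel tree, keeping the same relative path.
#   $1 = absolute path of the main document root
#   $2 = root of the symlink tree
#   $3 = path of the file relative to the main root
link_for_incremental() {
    mkdir -p "$2/$(dirname "$3")" &&
    ln -sf "$1/$3" "$2/$3"
}

# Before a full reindex, remove every symlink from the parallel tree.
# Only links are deleted; the real files are untouched.
clear_incremental_links() {
    find "$1" -type l -exec rm -f {} +
}
```

Whether Swish-e follows the links depends on its configuration (see the
C<FollowSymLinks> directive in L<SWISH-CONFIG|SWISH-CONFIG>), so check that
setting before relying on this approach.
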
715 Both of these methods have issues where files could end up in both
716 indexes, or files being left out of an index. Use of file locks while
717 indexing, and hash lookups during searches can help prevent these
718 problems.
719
720 =head3 I run out of memory trying to index my files.
721
722 It's true that indexing can take up a lot of memory! Swish-e is extremely
723 fast at indexing, but that comes at the cost of memory.
724
725 The best answer is to install more memory.
726
727 Another option is to use the C<-e> switch. This will require less memory,
728 but indexing will take longer as not all data will be stored in memory
729 while indexing. How much less memory and how much more time depends on
730 the documents you are indexing, and the hardware that you are using.
731
732 Here's an example of indexing all .html files in /usr/doc on Linux.
733 This first example is I<without> C<-e> and used about 84M of memory:
734
735 270279 unique words indexed.
736 23841 files indexed. 177640166 total bytes.
737 Elapsed time: 00:04:45 CPU time: 00:03:19
738
739 This is I<with> C<-e>, and used about 26M of memory:
740
741 270279 unique words indexed.
742 23841 files indexed. 177640166 total bytes.
743 Elapsed time: 00:06:43 CPU time: 00:04:12
744
745 You can also build a number of smaller indexes and then merge them together
746 with C<-M>. Merging will use more memory, though, and is not a great option.
747
748 Finally, if you do build a number of smaller indexes, you can specify more
749 than one index when searching by using the C<-f> switch. Sorting large
750 result sets by a property will be slower when specifying multiple index
751 files while searching.
752
753 =head3 My system admin says Swish-e uses too much of the CPU!
754
755 That's a good thing! That expensive CPU is supposed to be busy.
756
757 Indexing takes a lot of work -- to make indexing fast, much of the work is
758 done in memory, which reduces the amount of time Swish-e is waiting on I/O.
759 But there are two things you can try:
760
761 The C<-e> option will run Swish-e in economy mode, which uses the disk
762 to store data while indexing. This makes Swish-e run somewhat slower,
763 but also uses less memory. Since it is writing to disk more often it
764 will be spending more time waiting on I/O and less time in CPU. Maybe.
765
766 The other thing is to simply lower the priority of the job using the
767 nice(1) command:
768
769 /bin/nice -15 swish-e -c search.conf
770
771 If you are concerned about searching time, make sure you are using the C<-b>
772 and C<-m> switches to only return a page at a time. If you know that your result
773 sets will be large, that you wish to return results one page at a
774 time, and that many pages of the same query will often be requested,
775 it may be smart to request all the documents on the first request, and
776 then cache the results to a temporary file. The Perl module File::Cache
777 makes this very simple to accomplish.
778
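As a rough sketch of paging with C<-b> and C<-m>, the helper below turns a page number and a page size into the two switches. The C<page_args> function is made up for this example and is not part of Swish-e; it assumes results are numbered from 1, and the index name and query on the last line are placeholders. The script only prints the command line it would run.

```shell
#!/bin/sh
# Sketch only: page_args is a hypothetical helper, not part of Swish-e.
# Prints the -b/-m arguments for a 1-based page number ($1) and page size ($2).
page_args() {
    page=$1
    size=$2
    # First result on the page, assuming results are numbered from 1.
    begin=$(( (page - 1) * size + 1 ))
    printf '%s\n' "-b $begin -m $size"
}

# Example: show the command line for page 3 at 10 results per page.
# The index name and query are placeholders.
echo "swish-e -f index.swish-e -w foo $(page_args 3 10)"
```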
779 =head2 Spidering
780
781 =head3 How can I index documents on a web server?
782
783 If possible, use the file system method C<-S fs> of indexing to index
784 documents in your web area of the file system. This avoids the overhead
785 of spidering a web server and is much faster. (C<-S fs> is the default
786 method if C<-S> is not specified).
787
788 If this is impossible (the web server is not local, or documents
789 are dynamically generated), Swish-e provides two methods of spidering.
790 First, it includes the http method of indexing C<-S http>. A number
791 of special configuration directives are available that control spidering
792 (see L<Directives for the HTTP Access Method Only|/"Directives for the
793 HTTP Access Method Only">). A perl helper script (swishspider) is
794 included in the F<src> directory to assist with spidering web servers.
795 There are example configurations for spidering in the F<conf> directory.
796
797 As of Swish-e 2.2, there's a general purpose "prog" document source where
798 a program can feed documents to it for indexing. A number of example
799 programs can be found in the C<prog-bin> directory, including a program
800 to spider web servers. The provided spider.pl program is full-featured
801 and is easily customized.
802
803 The advantage of the "prog" document source feature over the "http" method
804 is that the program is only executed one time, whereas the F<swishspider>
805 program used in the "http" method is executed once for every document
806 read from the web server. The forking of Swish-e and compiling of the
807 perl script can be quite expensive, time-wise.
808
809 The other advantage of the C<spider.pl> program is that it's simple and
810 efficient to add filtering (such as for PDF or MS Word docs) right into
811 the spider.pl's configuration, and it includes features such as MD5 checks
812 to prevent duplicate indexing, and options to avoid spidering some files,
813 or to index files without spidering them. And since it's a perl program there's no
814 limit on the features you can add.
815
816 =head3 Why does swish report "./swishspider: not found"?
817
818 Does the file F<swishspider> exist at the location shown in the error message? If not, either
819 set the configuration option L<SpiderDirectory|SWISH-CONFIG/"item_SpiderDir">
820 to point to the directory where the F<swishspider> program is found, or place the
821 F<swishspider> program in the current directory when running swish-e.
822
823 If you are running Windows, make sure "perl" is in your path. Try typing F<perl> from
824 a command prompt.
825
826 If you are not running Windows, make sure that the shebang line (the first line of the
827 swishspider program that starts with #!) points to the correct location of perl.
828 Typically this will be F</usr/bin/perl> or F</usr/local/bin/perl>. Also, make sure that
829 you have execute and read permissions on F<swishspider>.
830
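The non-Windows checks above can be run by hand, or with a small helper along these lines. The C<check_shebang> function is hypothetical and only a sketch: it tests that the script is readable and executable, then reads the interpreter path from the first line and tests that it exists. Run it as C<check_shebang ./swishspider>.

```shell
#!/bin/sh
# Sketch only: check_shebang is a hypothetical helper, not part of Swish-e.
# Succeeds when the script named in $1 is readable and executable, and the
# interpreter named on its shebang line exists and is executable.
check_shebang() {
    script=$1
    [ -r "$script" ] && [ -x "$script" ] || {
        echo "check permissions on $script"
        return 1
    }
    first=$(head -n 1 "$script")
    case $first in
        '#!'*) ;;
        *) echo "no shebang line in $script"; return 1 ;;
    esac
    interp=${first#??}      # drop the leading "#!"
    interp=${interp# }      # drop one optional space after it
    interp=${interp%% *}    # drop any interpreter arguments
    [ -x "$interp" ] || {
        echo "interpreter not found: $interp"
        return 1
    }
    echo "ok: $interp"
}
```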
831 The F<swishspider> perl script is only used with the C<-S http> method of indexing.
832
833 =head3 I'm using the spider.pl program to spider my web site, but some
834 large files are not indexed.
835
836 The C<spider.pl> program has a default limit of 5MB file size. This can
837 be changed with the C<max_size> parameter setting. See C<perldoc
838 spider.pl> for more information.
839
840 =head3 I still don't think all my web pages are being indexed.
841
842 The F<spider.pl> program has a number of debugging switches and can be
843 quite verbose in telling you what's happening, and why. See C<perldoc
844 spider.pl> for instructions.
845
846 =head3 Swish is not spidering Javascript links!
847
848 Swish cannot follow links generated by Javascript, as they are generated
849 by the browser and are not part of the document.
850
851 =head3 How do I spider other websites and combine it with my own
852 (filesystem) index?
853
854 You can either merge two indexes into a single index with C<-M>, or use C<-f>
855 to specify more than one index while searching.
856
857 You will have better results with the C<-f> method.
858
859
860 =head2 Searching
861
862 =head3 How do I limit searches to just parts of the index?
863
864 If you can identify "parts" of your index by the path name you have
865 two options.
866
867 The first option is to index the document path. Add this to your
868 configuration:
869
870 MetaNames swishdocpath
871
872 Now you can search for words or phrases in the path name:
873
874 swish-e -w 'foo AND swishdocpath=(sales)'
875
876 So that will only find documents with the word "foo" and where the file's
877 path contains "sales". That might not work as well as you like, though,
878 as both of these paths will match:
879
880 /web/sales/products/index.html
881 /web/accounting/private/sales_we_messed_up.html
882
883 This can be solved by searching with a phrase (assuming "/" is not
884 a WordCharacter):
885
886 swish-e -w 'foo AND swishdocpath=("/web/sales/")'
887 swish-e -w 'foo AND swishdocpath=("web sales")' (same thing)
888
889
890 The second option is a bit more powerful. With the C<ExtractPath>
891 directive you can use a regular expression to extract out a sub-set of
892 the path and save it as a separate meta name:
893
894 MetaNames department
895 ExtractPath department regex !^/web/([^/]+).+$!$1/
896
897 Which says match a path that starts with "/web/" and extract out
898 everything after that up to, but not including the next "/" and save it in
899 variable $1, and then match everything from the "/" onward. Then replace
900 the entire matched string with $1. And that gets indexed as meta name
901 "department".
902
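To see what that pattern extracts, you can try the same match outside of Swish-e. This is only an approximation: C<sed> stands in for Swish-e's regex engine, it writes the first capture group as C<\1> where the directive uses C<$1>, and the C<extract_department> function is made up for the example. The two paths are the ones used earlier in this answer.

```shell
#!/bin/sh
# Sketch only: sed stands in for Swish-e's regex engine here, and sed writes
# the first capture group as \1 where the ExtractPath directive uses $1.
extract_department() {
    printf '%s\n' "$1" | sed -E 's!^/web/([^/]+).+$!\1!'
}

extract_department /web/sales/products/index.html                    # prints "sales"
extract_department /web/accounting/private/sales_we_messed_up.html   # prints "accounting"
```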
903 Now you can search like:
904
905 swish-e -w 'foo AND department=sales'
906
907 and be sure that you will only match the documents in the /web/sales/*
908 path. Note that you can map completely different areas of your file
909 system to the same metaname:
910
911 # flag the marketing specific pages
912 ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing/
913 ExtractPath department regex !^/internal/marketing/.+$!marketing/
914
915 # flag the technical departments pages
916 ExtractPath department regex !^/web/(tech|bugs)/.+$!tech/
917
918
919 Finally, if you have something more complicated, use C<-S prog> and
920 write a perl program or use a filter to set a meta tag when processing
921 each file.
922
923 =head3 How can I limit searches to the title, body, or comment?
924
925 Use the C<-t> switch.
926
927 =head3 I can't limit searches to title/body/comment.
928
929 Or, I<I can't search with meta names, all the names are indexed as
930 "plain".>
931
932 Check in the config.h file if #define INDEXTAGS is set to 1. If it is,
933 change it to 0, recompile, and index again. When INDEXTAGS is 1, ALL
934 the tags are indexed as plain text, that is you index "title", "h1", and
935 so on, AND they lose their indexing meaning. If INDEXTAGS is set to 0,
936 you will still index meta tags and comments, unless you have indicated
937 otherwise in the user config file with the IndexComments directive.
938
939 Also, check for the C<UndefinedMetaTags> setting in your configuration
940 file.
941
942 =head3 I've tried running the included CGI script and I get an "Internal
943 Server Error"
944
945 Debugging CGI scripts is beyond the scope of this document.
946 Internal Server Error basically means "check the web server's log for
947 an error message", as it can mean a bad shebang (#!) line, a missing
948 perl module, FTP transfer error, or simply an error in the program.
949 The CGI script F<swish.cgi> in the F<example> directory contains some
950 debugging suggestions. Type C<perldoc swish.cgi> for information.
951
952 There are also many, many CGI FAQs available on the Internet. A quick web
953 search should offer help. As a last resort you might ask your webadmin
954 for help...
955
956 =head3 When I try to view the swish.cgi page I see the contents of the
957 Perl program.
958
959 Your web server is not configured to run the program as a CGI script.
960 This problem is described in C<perldoc swish.cgi>.
961
962
963 =head3 How do I make Swish-e highlight words in search results?
964
965 Short answer:
966
967 Use the supplied swish.cgi script located in the F<example> directory.
968
969 Long answer:
970
971 Swish-e can't because it doesn't have access to the source documents when
972 returning results, of course. But a front-end program of your creation
973 can highlight terms. Your program can open up the source documents and
974 then use regular expressions to replace search terms with highlighted
975 or bolded words.
976
977 But, that will fail with all but the most simple source documents.
978 For HTML documents, for example, you must parse the document into words
979 and tags (and comments). A word you wish to highlight may span multiple
980 HTML tags, or be a word in a URL and you wish to highlight the entire
981 link text.
982
983 Perl modules such as HTML::Parser and XML::Parser make word extraction
984 possible. Next, you need to consider that Swish-e uses settings such
985 as WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar,
986 and IgnoreLastChar to define a "word". That is, you can't consider
987 that a string of characters with white space on each side is a word.
988
989 Then things like TranslateCharacters, and HTML Entities may transform a
990 source word into something else, as far as Swish-e is concerned. Finally,
991 searches can be limited by metanames, so you may need to limit your
992 highlighting to only parts of the source document. Throw phrase searches
993 and stopwords into the equation and you can see that it's not a trivial
994 problem to solve.
995
996 All hope is not lost, though, as Swish-e does provide some help.
997 Using the C<-H> option it will return in the headers the current index
998 (or indexes) settings for WordCharacters (and others) required to parse
999 your source documents as it parses them during indexing, and will return a
1000 "Parsed Words:" header that will show how it parsed the query internally.
1001 If you use fuzzy indexing (word stemming, soundex, or metaphone)
1002 then you will also need to stem each word in your
1003 document before comparing with the "Parsed Words:" returned by Swish-e.
1004 The Swish-e stemming code is available either by using the Swish-e
1005 Perl module or C library (included with the swish-e distribution),
1006 or by using the SWISH::Stemmer module available on CPAN. Also on CPAN is
1007 the module Text::DoubleMetaphone.
1008
1009 =head3 Do filters affect the performance during search?
1010
1011 No. Filters (FileFilter or via "prog" method) are only used for building
1012 the search index database. During search requests there will be no
1013 filter calls.
1014
1015
1016 =head2 I have read the FAQ but I still have questions about using Swish-e.
1017
1018 The Swish-e discussion list is the place to go; see http://swish-e.org/.
1019 Please do not email developers directly. The list is the best place to
1020 ask questions.
1021
1022 Before you post please read I<QUESTIONS AND TROUBLESHOOTING> located
1023 in the L<INSTALL|INSTALL> page. You should also search the Swish-e
1024 discussion list archive which can be found on the swish-e web site.
1025
1026 In short, be sure to include the following when asking for help.
1027
1028 =over 4
1029
1030 =item * The swish-e version (./swish-e -V)
1031
1032 =item * What you are indexing (and perhaps a sample), and the number
1033 of files
1034
1035 =item * Your Swish-e configuration file
1036
1037 =item * Any error messages that Swish-e is reporting
1038
1039 =back
1040
1041 =head1 Document Info
1042
1043 $Id: SWISH-FAQ.pod,v 1.24 2002/08/20 22:24:08 whmoseley Exp $
1044
1045 .
