Fri Sep 20 19:47:29 2002 UTC (22 years, 10 months ago) by adcroft

=head1 NAME
2    
3     The Swish-e FAQ - Answers to Common Questions
4    
5     =head1 Frequently Asked Questions
6    
7     =head2 General Questions
8    
9     =head3 What is Swish-e?
10    
11     Swish-e is B<S>imple B<W>eb B<I>ndexing B<S>ystem for B<H>umans -
12     B<E>nhanced. With it, you can quickly and easily index directories of
13     files or remote web sites and search the generated indexes for words
14     and phrases.
15    
16     =head3 So, is Swish-e a search engine?
17    
18     Well, yes. Probably the most common use of Swish-e is to provide a search
19     engine for web sites. The Swish-e distribution includes CGI scripts that
20     can be used with it to add a I<search engine> for your web site. The CGI
21     scripts can be found in the F<example> directory of the distribution
22     package. See the F<README> file for information about the scripts.
23    
24     But Swish-e can also be used to index all sorts of data, such as email
25     messages, data stored in a relational database management system,
26     XML documents, or documents such as Word and PDF documents -- or any
27     combination of those sources at the same time. Searches can be limited
28     to fields or I<MetaNames> within a document, or limited to areas within
29     an HTML document (e.g. body, title). Programs other than CGI applications
30     can use Swish-e, as well.
31    
32     =head3 Should I upgrade if I'm already running a previous version
33     of Swish-e?
34    
A large number of bug fixes, feature additions, and logic corrections were
made in version 2.2. In addition, indexing speed has been drastically
improved (reports of indexing times dropping from four hours to five
minutes), and major parts of the indexing and search parsers have been
rewritten. There are better debugging options, enhanced output formats,
more document metadata (e.g. last modified date, document summary),
options for indexing from external data sources, and faster spidering,
just to name a few changes. (See the CHANGES file for more information.)
43    
44     Since so much effort has gone into version 2.2, support for previous
45     versions will probably be limited.
46    
47     =head3 Are there binary distributions available for Swish-e on platform foo?
48    
49     Foo? Well, yes there are some binary distributions available. Please see
50     the Swish-e web site for a list at http://swish-e.org/.
51    
52     In general, it is recommended that you build Swish-e from source,
53     if possible.
54    
55     =head3 Do I need to reindex my site each time I upgrade to a new Swish-e
56     version?
57    
58     At times it might not strictly be necessary, but since you don't really
59     know if anything in the index has changed, it is a good rule to reindex.
60    
61     =head3 What's the advantage of using the libxml2 library for parsing HTML?
62    
Swish-e may be linked with libxml2, a library for working with HTML and
XML documents, and can then use it to parse both kinds of documents.
65    
66     The libxml2 parser is a better parser than Swish-e's built-in HTML
67     parser. It offers more features, and it does a much better job at
68     extracting out the text from a web page. In addition, you can use the
69     C<ParserWarningLevel> configuration setting to find structural errors
70     in your documents that could (and would with Swish-e's HTML parser)
71     cause documents to be indexed incorrectly.
72    
73     Libxml2 is not required, but is strongly recommended for parsing HTML
74     documents. It's also recommended for parsing XML, as it offers many
75     more features than the internal Expat xml.c parser.
76    
The internal HTML parser will receive only limited support, and it does
have a number of bugs. For example, HTML entities may not always be
correctly converted, and entities in properties are not converted at all.
The internal parser tends to get confused by invalid HTML, whereas the
libxml2 parser copes with it more often. Document structure is also
detected better by the libxml2 parser.
83    
If you are using the Perl module (the Perl interface to the Swish-e C
library) you may wish to build two versions of Swish-e, one with the
libxml2 library linked into the binary and one without, and build the
Perl module against the library without the libxml2 code. This is to
save space in the library. Hopefully, the library will someday soon be
split into indexing and searching code (volunteers welcome).
90    
91     =head3 Does Swish-e include a CGI interface?
92    
93     An example CGI script is included in the C<example> directory.
94     (Type C<perldoc swish.cgi> in the C<example> directory for instructions.)
95    
96     Please be careful when picking a CGI script to use with Swish-e. Quite a
97     few of the scripts that have been available for it are insecure and
98     should not be used.
99    
The included example CGI script was designed with security in mind.
Regardless, you are encouraged to have your local Perl expert review it
(and all other CGI scripts you use) before placing it into production.
This is just a good policy to follow.
104    
105     =head3 How secure is Swish-e?
106    
We know of no security issues with using Swish-e. Careful attention
has been paid to common security problems such as buffer overruns
when programming Swish-e.
110    
111     The most likely security issue with Swish-e is when it is run via
112     a poorly written CGI interface. This is not limited to CGI scripts
113     written in Perl, as it's just as easy to write an insecure CGI script
114     in C, Java, PHP, or Python. A good source of information is included
115     with the Perl distribution. Type C<perldoc perlsec> at your local
116     prompt for more information. Another must-read document is located at
117     C<http://www.w3.org/Security/faq/wwwsf4.html>.
118    
119     Note that there are many I<free> yet insecure and poorly written CGI
120     scripts available -- even some designed for use with Swish-e. Please
121     carefully review any CGI script you use. Free is not such a good price
122     when you get your server hacked...
123    
124     =head3 Should I run Swish-e as the superuser (root)?
125    
126     No. Never.
127    
128     =head3 What files does Swish-e write?
129    
Swish-e writes the index file, of course. This is specified with the
C<IndexFile> configuration directive or by the C<-f> command line switch.
132    
133     The index file is actually a collection of files, but all start with
134     the file name specified with the C<IndexFile> directive or the C<-f>
135     command line switch.
136    
137     For example, the file ending in F<.prop> contains the document properties.
138    
139     When creating the index files Swish-e appends the extension F<.temp>
140     to the index file names. When indexing is complete Swish-e renames the
141     F<.temp> files to the index files specified by C<IndexFile> or C<-f>.
142     This is done so that existing indexes remain untouched until it completes
143     indexing.
144    
Swish-e also writes temporary files in some cases during indexing
(e.g. C<-S http>, C<-S prog> with filters, when merging, and when
using C<-e>). Temporary files are created with the mkstemp(3) function
(with 0600 permission on unix-like operating systems).
149    
The temporary files are created in the directory specified by the
environment variables C<TMPDIR> and C<TMP>, checked in that order. If
neither is set, Swish-e uses the configuration setting
L<TmpDir|SWISH-CONFIG/"item_TmpDir">. Otherwise, the temporary files
are created in the current directory.
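
The lookup order described above can be sketched as a small shell
function. This only illustrates the documented order and is not code
taken from Swish-e: the function name C<swish_tmpdir> and its argument
(standing in for the C<TmpDir> setting) are hypothetical, and treating
an empty variable the same as an unset one is an assumption.

```shell
# swish_tmpdir [TMPDIR_SETTING]
# Print the directory that would be chosen for temporary files, following
# the order given in this FAQ: $TMPDIR, then $TMP, then the TmpDir
# configuration setting (passed as $1), then the current directory.
swish_tmpdir() {
    if [ -n "${TMPDIR:-}" ]; then
        printf '%s\n' "$TMPDIR"
    elif [ -n "${TMP:-}" ]; then
        printf '%s\n' "$TMP"
    elif [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' .
    fi
}

# /var/tmp/swish is a hypothetical TmpDir value.
swish_tmpdir /var/tmp/swish
```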
155    
156     =head3 Can I index PDF and MS-Word documents?
157    
Yes, you can use a I<Filter> to convert documents while indexing, or you
can use a program that converts the documents itself and then "feeds"
them to Swish-e. See C<Indexing> below.
161    
162     =head3 Can I index documents on a web server?
163    
164     Yes, Swish-e provides two ways to index (spider) documents on a web
165     server. See C<Spidering> below.
166    
167     Swish-e can retrieve documents from a file system or from a remote web
168     server. It can also execute a program that returns documents back
169     to it. This program can retrieve documents from a database, filter
170     compressed documents files, convert PDF files, extract data from mail
171     archives, or spider remote web sites.
172    
173     =head3 Can I implement keywords in my documents?
174    
175     Yes, Swish-e can associate words with I<MetaNames> while indexing,
176     and you can limit your searches to these MetaNames while searching.
177    
178     In your HTML files you can put keywords in HTML META tags or in XML blocks.
179    
META tags can have two formats in your source documents:

    <META NAME="DC.subject" CONTENT="digital libraries">

And in XML format (can also be used in HTML documents when using libxml2):

    <meta2>
        Some Content
    </meta2>
190    
191    
192     Then, to inform Swish-e about the existence of the meta name in your
193     documents, edit the line in your configuration file:
194    
    MetaNames DC.subject meta1 meta2
196    
When searching you can now limit some or all search terms to that
MetaName. For example, you can look for documents that contain the word
C<apple> and also have either C<fruit> or C<cooking> in the C<DC.subject>
meta tag.
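
A query of that shape, following the C<-w> syntax used elsewhere in this
FAQ, might look like the sketch below. The index name F<index.swish-e>
is an assumption, and the search only runs if C<swish-e> is found on
the path.

```shell
# Single quotes keep the shell from interpreting the parentheses.
query='apple and DC.subject=(fruit or cooking)'

# index.swish-e is an assumed index file name.
if command -v swish-e >/dev/null 2>&1; then
    swish-e -f index.swish-e -w "$query"
fi
```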
200    
201     =head3 What are document properties?
202    
203     A document property is typically data that describes the document.
204     For example, properties might include a document's path name, its last
205     modified date, its title, or its size. Swish-e stores a document's
206     properties in the index file, and they can be reported back in search
207     results.
208    
209     Swish-e also uses properties for sorting. You may sort your results by
210     one or more properties, in ascending or descending order.
211    
212     Properties can also be defined within your documents. HTML and
213     XML files can specify tags (see previous question) as properties.
214     The I<contents> of these tags can then be returned with search results.
215     These user-defined properties can also be used for sorting search results.
216    
For example, if you had the following in your documents

    <meta name="creator" content="accounting department">

and C<creator> is defined as a property (see C<PropertyNames> in
L<SWISH-CONFIG|SWISH-CONFIG>) Swish-e can return C<accounting department>
with the result for that document.

    swish-e -w foo -p creator

Or for sorting:

    swish-e -w foo -s creator
230    
231     =head3 What's the difference between MetaNames and PropertyNames?
232    
MetaNames allow keyword searches in your documents. That is, you can
use MetaNames to restrict searches to just parts of your documents.
235    
236     PropertyNames, on the other hand, define text that can be returned with
237     results, and can be used for sorting.
238    
239     Both use I<meta tags> found in your documents (as shown in the above two
240     questions) to define the text you wish to use as a property or meta name.
241    
You may define a tag as B<both> a property and a meta name. For example:

    <meta name="creator" content="accounting department">

placed in your documents and then using configuration settings of:

    PropertyNames creator
    MetaNames creator

will allow you to limit your searches to documents created by accounting:

    swish-e -w 'foo and creator=(accounting)'

That will find all documents with the word C<foo> that also have a creator
meta tag that contains the word C<accounting>. This is using MetaNames.

And you can also say:

    swish-e -w foo -p creator

which will return all documents with the word C<foo>, but the results will
also include the contents of the C<creator> meta tag.
This is using properties.

You can use properties and meta names at the same time, too:

    swish-e -w 'creator=(accounting or marketing)' -p creator -s creator

That searches only in the C<creator> I<meta name> for either of the words
C<accounting> or C<marketing>, prints out the contents of the
C<creator> I<property>, and sorts the results by the C<creator>
I<property name>.
274    
275     (See also the C<-x> output format switch in L<SWISH-RUN|SWISH-RUN>.)
276    
277     =head3 Can Swish-e index multi-byte characters?
278    
No. This will require much work to change. But Swish-e works with
eight-bit characters, so many character sets can be used. Note that it
does call the ANSI-C tolower() function, which depends on the current
locale setting. See C<locale(7)> for more information.
283    
284     =head2 Indexing
285    
286     =head3 How do I pass Swish-e a list of files to index?
287    
288     Currently, there is not a configuration directive to include a file that
289     contains a list of files to index. But, there is a directive to include
290     another configuration file.
291    
    IncludeConfigFile /path/to/other/config

And in C</path/to/other/config> you can say:

    IndexDir file1 file2 file3 file4 file5 ...
    IndexDir file20 file21 file22

You may also specify more than one configuration file on the command line:

    ./swish-e -c config_one config_two config_three
302    
303     Another option is to create a directory with symbolic links of the files
304     to index, and index just that directory.
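
The symbolic link approach can be sketched as follows. The file names
F<files.txt> and F<to-index> and the listed paths are hypothetical.
Slashes in each path are turned into underscores so that files with the
same base name do not collide, which is a design choice of this sketch
rather than anything Swish-e requires.

```shell
# Hypothetical list of files to index, one path per line.
cat > files.txt <<'EOF'
docs/intro.html
docs/guide/setup.html
EOF

mkdir -p to-index
while IFS= read -r f; do
    [ -n "$f" ] || continue          # skip blank lines
    case $f in
        /*) target=$f ;;             # already absolute
        *)  target=$PWD/$f ;;        # make relative paths absolute
    esac
    # Flatten the path into the link name: docs/intro.html -> docs_intro.html
    name=$(printf '%s' "$f" | tr '/' '_')
    ln -sf "$target" "to-index/$name"
done < files.txt
```

Then index just that directory, for example with C<IndexDir to-index>
in the configuration file. Swish-e does not follow symbolic links by
default, so the C<FollowSymLinks yes> directive is probably needed as
well; check L<SWISH-CONFIG|SWISH-CONFIG> for your version.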
305    
306     =head3 How does Swish-e know which parser to use?
307    
Swish-e can parse HTML, XML, and text documents. The parser is set by
associating a file extension with a parser using the C<IndexContents>
directive. You may set the default parser with the C<DefaultContents>
directive. If a document is not assigned a parser it will default to
the HTML parser (HTML2 if built with libxml2).
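
As a rough sketch, a configuration that assigns parsers by extension
might look like this. The file name F<swish.conf> and the choice of
extensions are assumptions; the parser names are the ones used in this
FAQ.

```shell
cat > swish.conf <<'EOF'
# Parser names as used in this FAQ: HTML, XML, TXT (HTML2/XML2 with libxml2).
IndexDir .
IndexContents HTML .html .htm
IndexContents XML  .xml
IndexContents TXT  .txt .log
# Anything else is parsed as plain text instead of the HTML default.
DefaultContents TXT
EOF
```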
313    
314     You may use Filters or an external program to convert documents to HTML,
315     XML, or text.
316    
317     =head3 Can I reindex and search at the same time?
318    
319     Yes. Starting with version 2.2 Swish-e indexes to temporary files, and then
320     renames the files when indexing is complete. On most systems renames
321     are atomic. But, since Swish-e also generates more than one file during
322     indexing there will be a very short period of time between renaming the
323     various files when the index is out of sync.
324    
325     Settings in F<config.h> control some options related to temporary files,
326     and their use during indexing.
327    
328     =head3 Can I index phrases?
329    
330     Phrases are indexed automatically. To search for a phrase simply place
331     double quotes around the phrase.
332    
333     For example:
334    
    swish-e -w 'free and "fast search engine"'
336    
337     =head3 How can I prevent phrases from matching across sentences?
338    
Use the
L<BumpPositionCounterCharacters|SWISH-CONFIG/"item_BumpPositionCounterCharacters">
configuration directive.
342    
343     =head3 Swish-e isn't indexing a certain word or phrase.
344    
345     There are a number of configuration parameters that control what Swish-e
346     considers a "word" and it has a debugging feature to help pinpoint
347     any indexing problems.
348    
349     Configuration file directives (L<SWISH-CONFIG|SWISH-CONFIG>)
350     C<WordCharacters>, C<BeginCharacters>, C<EndCharacters>,
351     C<IgnoreFirstChar>, and C<IgnoreLastChar> are the main settings that
352     Swish-e uses to define a "word". See L<SWISH-CONFIG|SWISH-CONFIG> and
353     L<SWISH-RUN|SWISH-RUN> for details.
354    
Swish-e also uses compile-time defaults for many settings. These are
located in the F<src/config.h> file.

The command line arguments C<-k>, C<-v> and C<-T> are useful when
debugging these problems. Using C<-T INDEXED_WORDS> while indexing will
display each word as it is indexed. You should specify one file when
using this feature since it can generate a lot of output.

    ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

You may also wish to index a single file that contains words that are or
are not being indexed as you expect, and use C<-T> to output debugging
information about the index. A useful command might be:

    ./swish-e -f index.swish-e -T INDEX_FULL

Once you see how Swish-e is parsing and indexing your words, you can
adjust the configuration settings mentioned above to control what words
are indexed.

Another useful command might be:

    ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS

This will show the whitespace-delimited words parsed from the document
(PARSED_WORDS), and how those words are split up into separate words for
indexing (INDEXED_WORDS).
382    
383    
384     =head3 How do I keep Swish-e from indexing numbers?
385    
386     Swish-e indexes words as defined by the C<WordCharacters> setting, as
387     described above. So to avoid indexing numbers you simply remove digits
388     from the C<WordCharacters> setting.
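
A minimal sketch of such a setting, written to a hypothetical file
F<nodigits.conf>. Listing only the unaccented letters is an assumption
made for brevity; a real setting would usually keep whatever other
characters your documents need.

```shell
cat > nodigits.conf <<'EOF'
# Letters only: digits are not listed, so they cannot be part of a word.
WordCharacters abcdefghijklmnopqrstuvwxyz
EOF
```

Running the C<-T INDEXED_WORDS> command shown in the previous question
with C<-c nodigits.conf> is a quick way to confirm what is actually
indexed.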
389    
There are also some settings in F<config.h> that control what "words"
are indexed. You can configure Swish-e to never index words that are all
digits, vowels, or consonants, or that contain more than some consecutive
number of digits, vowels, or consonants. In general, you won't need to
change these settings.
395    
396     Also, there's an experimental feature called C<IgnoreNumberChars>
397     which allows you to define a set of characters that describe a number.
398     If a word is made up of B<only> those characters it will not be indexed.
399    
400    
401     =head3 Swish-e crashes and burns on a certain file. What can I do?
402    
This shouldn't happen. If it does, please post the details to the Swish-e
discussion list so the developers can reproduce the problem.

In the meantime, you can use a C<FileRules> directive to exclude the
particular file by file name, path name, or title. If there are serious
problems indexing certain types of files, they may not contain valid text
(they may be binary files, for instance). You can use C<NoContents>
to exclude that type of file.
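
For illustration, the directives mentioned above might be combined as
in this sketch. The file name F<exclude.conf>, the excluded names, and
the list of extensions are all hypothetical; see
L<SWISH-CONFIG|SWISH-CONFIG> for the exact matching rules in your
version.

```shell
cat > exclude.conf <<'EOF'
# Skip one troublesome file by name (hypothetical file name).
FileRules filename is broken-report.html
# Skip anything whose path contains this string (hypothetical).
FileRules pathname contains /scratch/
# Do not index the contents of these binary types.
NoContents .gif .jpg .zip
EOF
```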
411    
412     Swish-e will issue a warning if an embedded null character is found in a
413     document. This warning will be an indication that you are trying to index
414     binary data. If you need to index binary files try to find a program
415     that will extract out the text (e.g. strings(1), catdoc(1), pdftotext(1)).
416    
=head3 How do I prevent indexing of some documents?
418    
419     When using the file system to index your files you can use the
420     C<FileRules> directive. Other than C<FileRules title>, C<FileRules>
421     only works with the file system (C<-S fs>) indexing method, not with
422     C<-S prog> or C<-S http>.
423    
If you are spidering, use a F<robots.txt> file in your document root.
This is a standard way to exclude files from search engines, and is
fully supported by Swish-e. See http://www.robotstxt.org/
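
A minimal F<robots.txt> might look like the following sketch. The two
excluded paths are hypothetical, and the file belongs in the document
root of the web server being spidered.

```shell
cat > robots.txt <<'EOF'
# Applies to every robot that honors the standard.
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
EOF
```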
427    
428     You can also modify the F<spider.pl> spider perl program to skip, index
429     content only, or spider only listed web pages. Type C<perldoc spider.pl>
430     in the C<prog-bin> directory for details.
431    
432     If using the libxml2 library for parsing HTML, you may also use the Meta
433     Robots Exclusion in your documents:
434    
    <meta name="robots" content="noindex">
436    
437     See the L<obeyRobotsNoIndex|SWISH-CONFIG/"item_obeyRobotsNoIndex"> directive.
438    
439     =head3 How do I prevent indexing parts of a document?
440    
If you are using libxml2 for parsing HTML, you can prevent Swish-e from
indexing a common header, footer, or navigation bar by placing a fake
HTML tag around the text you wish to ignore and naming that tag in the
C<IgnoreMetaTags> directive. Because the fake tag is invalid HTML, this
will generate an error message if C<ParserWarningLevel> is set.
446    
447     C<IgnoreMetaTags> works with XML documents (and HTML documents when
448     using libxml2 as the parser), but not with documents parsed by the text
449     (TXT) parser.
450    
If you are using the libxml2 parser (HTML2 and XML2) then you can use the
following comments in your documents to prevent indexing:

    <!-- SwishCommand noindex -->
    <!-- SwishCommand index -->

and/or these may be used also:

    <!-- noindex -->
    <!-- index -->
461    
462    
=head3 How do I modify the path or URL of the indexed documents?
464    
Use the C<ReplaceRules> configuration directive to rewrite path names
and URLs. If you are using the C<-S prog> input method you may set the
path to any string.
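
As a sketch, a rule that turns file system paths into URLs could be
written as below. Both strings and the file name F<replace.conf> are
hypothetical, and the C<replace> form shown is only one of the rule
types described in L<SWISH-CONFIG|SWISH-CONFIG>.

```shell
cat > replace.conf <<'EOF'
# Turn indexed file system paths into URLs (both strings are hypothetical).
ReplaceRules replace "/var/www/htdocs/" "http://www.example.com/"
EOF
```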
468    
469     =head3 How can I index data from a database?
470    
471     Use the "prog" document source method of indexing. Write a program to
472     extract out the data from your database, and format it as XML, HTML,
473     or text. See the examples in the C<prog-bin> directory, and the next
474     question.
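
A minimal sketch of such a program, written as a shell script for
brevity. It prints each record using the same C<Path-Name> and
C<Content-Length> headers as the F<catfilter.pl> example in the next
question. The rows are a stand-in for a real database query, and the
names C<emit_doc> and C<record/N> are hypothetical. HTML escaping of
the data is left out.

```shell
#!/bin/sh
# emit_doc PATH BODY
# Print one document: headers, a blank line, then exactly the body bytes.
emit_doc() {
    path=$1
    body=$2
    # Content-Length is counted in bytes, so measure with wc -c.
    size=$(printf '%s' "$body" | wc -c | tr -d ' ')
    printf 'Path-Name: %s\nContent-Length: %s\n\n%s' "$path" "$size" "$body"
}

# Stand-in for a real database query: "id|title|text" rows (made-up data).
rows='1|First record|Some text from the first row
2|Second record|Some text from the second row'

printf '%s\n' "$rows" | while IFS='|' read -r id title text; do
    emit_doc "record/$id" \
        "<html><head><title>$title</title></head><body>$text</body></html>"
done
```

Saved as, say, F<dbfeed.sh> and made executable, it would be indexed
with C<swish-e -S prog -i ./dbfeed.sh>. The paths carry no file
extension, so the default parser applies.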
475    
476     =head3 How do I index my PDF, Word, and compressed documents?
477    
478     Swish-e can internally only parse HTML, XML and TXT (text) files by
479     default, but can make use of I<filters> that will convert other types
480     of files such as MS Word documents, PDF, or gzipped files into one of
481     the file types that Swish-e understands.
482    
483     The B<FileFilter> config directive is used to define programs to use
484     as filters, based on file extension. For example, you can use the
485     program C<catdoc> to convert MS-Word documents to text for indexing.
486     Please see L<SWISH-CONFIG|SWISH-CONFIG/"Document Filter Directives">
487     and the examples in the C<filter-bin> directory for more information.
488    
489     Another option is to use the C<prog> document source input method.
490     In this case you write a program (such as a perl script) that will read
491     and convert your data as needed and then output one of the formats
492     that Swish-e understands. Examples of using the C<prog> input method
493     for filtering are included in the C<prog-bin> directory of the Swish-e
494     distribution.
495    
496     The disadvantage of using the C<prog> input method is that you must
497     write a program that reads the documents from the source (e.g. from the
498     file system or via a spider to read files on a web server), and also
499     include the code to filter the documents. It's much easier to use the
500     C<FileFilter> option since the filter can often be implemented with just
501     a single configuration directive.
502    
503     On the other hand, the advantage of using the C<prog> input method for
504     indexing is speed. Filtering within a C<prog> input method program
505     will be faster if your filtering program is something like a Perl script
506     (something that has a large start-up cost). This may or may not be an
507     issue for you, depending on how much time your indexing requires.
508    
509     You can also use a combination of methods. For example, say you are
510     indexing a directory that contains PDF files using a C<FileFilter>
511     directive. Now you want to index a MySQL database that also contains
512     PDF files. You can write a C<prog> input method program to read your
513     MySQL database and use the same C<FileFilter> configuration parameter
514     (and filter program) to convert the PDF files into one of the native
515     Swish-e formats (TXT, HTML, XML).
516    
517     Do note that it will be slower to use the C<FileFilter> method instead
518     of running the filter directly from the C<prog> input method program.
519     When C<FileFilter> is used with the C<prog> input method Swish-e must
520     create a temporary file containing the output from your C<prog> method
521     program, and then execute the filter program.
522    
523     In general, use the C<FileFilter> method to filter documents. If indexing
524     speed is an issue, consider writing a C<prog> input method program.
525     If you are already using the C<prog> method, then filtering will probably
526     be best accomplished within that program.
527    
Here are two examples of how to run a filter program, one using Swish-e's
C<FileFilter> directive, another using a C<prog> input method program.
Both simply use the program C</bin/cat> as a filter and index only
.html files.

First, using the C<FileFilter> method, here's the entire configuration
file (swish.conf):

    IndexDir .
    IndexOnly .html
    FileFilter .html "/bin/cat" "'%p'"

and index with the command

    swish-e -c swish.conf -v 1
543    
Now, the same thing using the C<prog> document source input method
and a Perl program called catfilter.pl. You can see that it's much
more work than using the C<FileFilter> method above, but it provides a
place to do additional processing. In this example, the C<prog> method
is only slightly faster. But if you needed a perl script to run as a
FileFilter then C<prog> will be significantly faster.
550    
    #!/usr/local/bin/perl -w
    use strict;
    use File::Find;  # for recursing a directory tree

    $/ = undef;      # slurp mode: read the whole filter output at once
    find(
        { wanted => \&wanted, no_chdir => 1, },
        '.',
    );

    sub wanted {
        return if -d;
        return unless /\.html$/;

        my $mtime = (stat)[9];

        # Fork: the child runs the filter, the parent reads its output.
        my $child = open( FH, '-|' );
        die "Failed to fork $!" unless defined $child;
        exec '/bin/cat', $_ unless $child;

        my $content = <FH>;
        close FH;
        my $size = length $content;

        print <<EOF;
    Content-Length: $size
    Last-Mtime: $mtime
    Path-Name: $_

    EOF

        # Print the slurped content; reading <FH> again would return nothing.
        print $content;
    }
583    
And index with the command:

    swish-e -S prog -i ./catfilter.pl -v 1

This example will probably not work under Windows due to the '-|' open.
A simple piped open may work just as well:

That is, replace:

    my $child = open( FH, '-|' );
    die "Failed to fork $!" unless defined $child;
    exec '/bin/cat', $_ unless $child;

with this:

    open( FH, "/bin/cat $_ |" ) or die $!;
600    
601     Perl will try to avoid running the command through the shell if meta
602     characters are not passed to the open. See C<perldoc -f open> for
603     more information.
604    
605     =head3 Eh, but I just want to know how to index PDF documents!
606    
607     See the examples in the F<conf> directory.
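
For a quick start, a configuration modeled on the C<FileFilter> example
in the previous question might look like this sketch. It assumes the
C<pdftotext> program is installed and on the path; the file name
F<pdf.conf> is hypothetical.

```shell
cat > pdf.conf <<'EOF'
IndexDir .
IndexOnly .pdf
# pdftotext writes to standard output when the output file is "-".
FileFilter .pdf pdftotext "'%p' -"
# The filter emits plain text, so use the text parser for .pdf files.
IndexContents TXT .pdf
EOF
```

Index with C<swish-e -c pdf.conf -v 1>, as in the earlier example.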
608    
609     =head3 I'm using the prog method to index PDF documents, but the file
610     contents are not indexed.
611    
Some of the examples in the F<prog-bin> directory use a module to
convert the PDF files into XML. So you must tell Swish-e that you are
indexing XML files for the PDF extension.

    IndexContents XML .pdf
617    
618     =head3 I'm using Windows and can't get Filters or the prog input method
619     to work!
620    
Both the C<-S prog> input method and filters use the C<popen()> system
call to run the external program. If your external program is, for
example, a perl script, you have to tell Swish-e to run perl, instead of
the script. Also, you must use backslashes as the path separator in the
program name, since C<popen()> runs the command via the shell, which on
Windows expects backslashes.

For example, you would need to specify the path to perl as (assuming
this is where perl is on your system):

    IndexDir e:\\perl\\bin\\perl.exe

Or run a filter like:

    FileFilter .foo e:\\perl\\bin\\perl.exe 'myscript.pl "%p"'
636    
637    
638     =head3 How do I index non-English words?
639    
640     Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1
641     character set, and includes many non-English letters (and symbols).
642     As long as they are listed in C<WordCharacters> they will be indexed.
643    
644     Actually, you probably can index any 8-bit character set, as long as
645     you don't mix character sets in the same index.
646    
647     The C<TranslateCharacters> directive (L<SWISH-CONFIG|SWISH-CONFIG>)
648     can translate characters while indexing and searching. You may
649     specify the mapping of one character to another character with the
650     C<TranslateCharacters> directive.
651    
652     C<TranslateCharacters :ascii7:> is a predefined set of characters that
653     will translate eight bit characters to ascii7 characters. Using the
654     C<:ascii7:> rule will, for example, translate "Ääç" to "aac". This means:
655     searching "Çelik", "çelik" or "celik" will all match the same word.
656    
Note: When using libxml2 for parsing, parsed documents are converted
internally (within libxml2) to UTF-8. This is converted to ISO 8859-1
Latin-1 when indexing. In cases where a string cannot be converted
from UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters),
the string will be sent to Swish-e in UTF-8 encoding. This will result
in some words being indexed incorrectly. Setting C<ParserWarningLevel>
to 1 or more will display warnings when UTF-8 to 8859-1 conversion fails.
664    
665     =head3 Can I add/remove files from an index?
666    
667     Not really. Swish-e currently has no way to add or remove items from
668     its index.
669    
About the only way to hide deleted files is to stat(2) every file
returned in the search results and drop those that no longer exist.
672    
673     Incremental additions can be handled in a couple of ways, depending on
674     your situation. It's probably easiest to create one main index every
675     night (or every week), and then create an index of just the new files
676     between main indexing jobs and use the C<-f> option to pass both indexes
677     to Swish-e while searching.
678    
679     You can merge the indexes into one index (instead of using -f), but it's
680     not clear that this has any advantage over searching multiple indexes.
681     Using C<-f> gives access to the individual headers of both indexes,
682     while C<-M> merges the headers, and merging indexes with different
683     indexing settings (Stemming, WordCharacters) may produce odd results.
684     This is a question for the Swish-e discussion list.
685    
686     How does one create the incremental index?
687    
688     One method is by using the C<-N> switch to pass a file path to
689     Swish-e when indexing. It will only index files that have a last
690     modification date C<newer> than the file supplied with the C<-N> switch.
691    
692     This option has the disadvantage that Swish-e must process every file in
693     every directory as if they were going to be indexed (the test for C<-N>
694     is done last, right before indexing of the file contents begins, and after
695     all other tests on the file have been completed) -- all that just to
696     find a few new files.
697    
698     Also, if you use the Swish-e index file as the file passed to C<-N> there
699     may be files that were added after indexing was started, but before the
700     index file was written. This could result in a file not being added to
701     the index.
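
One way to sidestep that race, sketched here under assumptions, is to
touch a separate timestamp file just before the full indexing run starts
and later pass that file to C<-N> instead of the index file. The names
F<full.conf>, F<full.index>, F<incr.index> and F<full.stamp> are
hypothetical; only the C<-c>, C<-f> and C<-N> switches come from Swish-e.

```shell
#!/bin/sh
# Sketch only: all file names below are hypothetical placeholders.
SWISH=${SWISH:-swish-e}        # allow overriding the binary
STAMP=${STAMP:-full.stamp}     # timestamp file later passed to -N

full_index() {
    # Record the start time *before* indexing, so files added while
    # the run is in progress are newer than the stamp.
    touch "$STAMP.new" &&
    "$SWISH" -c full.conf -f full.index &&
    mv "$STAMP.new" "$STAMP"
}

incremental_index() {
    # Index only files modified after the last full run began.
    "$SWISH" -c full.conf -N "$STAMP" -f incr.index
}
```

A file changed while the full run is in progress may then appear in both
indexes, which is usually easier to handle than a file that is missing.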
702    
703     Another option is to maintain a parallel directory tree that contains
704     symlinks pointing to the main files. When a file is added to (or
705     changed in) the main directory tree you create a symlink to the real file
706     in the parallel directory tree. Then just index the symlink directory
707     to generate the incremental index.
708    
709     This option has the disadvantage that you need to have a central
710     program that creates the new files that can also create the symlinks.
711     But, indexing is quite fast since Swish-e only has to look at the files
712     that need to be indexed. When you run full indexing you simply unlink
713     (delete) all the symlinks.
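
The symlink bookkeeping could look roughly like the following sketch.
The directory names and the two helper functions are hypothetical, and
the indexing configuration would probably also need the
C<FollowSymLinks> directive enabled (see L<SWISH-CONFIG|SWISH-CONFIG>).

```shell
#!/bin/sh
# Sketch only: directory names are hypothetical placeholders.
MAIN=${MAIN:-/data/docs}          # tree holding the real files
LINKS=${LINKS:-/data/docs.new}    # parallel tree holding only symlinks

mark_changed() {
    # $1 is a path relative to $MAIN; call this whenever the central
    # program adds or changes a file.
    mkdir -p "$LINKS/$(dirname "$1")" &&
    ln -sf "$MAIN/$1" "$LINKS/$1"
}

reset_links() {
    # After a full indexing run, remove every symlink so the next
    # incremental index starts empty.
    find "$LINKS" -type l -exec rm -f {} +
}
```

The incremental index is then built by pointing Swish-e at the
F<$LINKS> tree rather than at F<$MAIN>.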
714    
715     Both of these methods have issues where files could end up in both
716     indexes, or files being left out of an index. Use of file locks while
717     indexing, and hash lookups during searches can help prevent these
718     problems.
719    
720     =head3 I run out of memory trying to index my files.
721    
722     It's true that indexing can take up a lot of memory! Swish-e is extremely
723     fast at indexing, but that comes at the cost of memory.
724    
725     The best answer is to install more memory.
726    
727     Another option is to use the C<-e> switch. This will require less memory,
728     but indexing will take longer as not all data will be stored in memory
729     while indexing. How much less memory and how much more time depends on
730     the documents you are indexing, and the hardware that you are using.
731    
732     Here's an example of indexing all .html files in /usr/doc on Linux.
733     This first example is I<without> C<-e> and used about 84M of memory:
734    
735     270279 unique words indexed.
736     23841 files indexed. 177640166 total bytes.
737     Elapsed time: 00:04:45 CPU time: 00:03:19
738    
739     This is I<with> C<-e>, and used about 26M of memory:
740    
741     270279 unique words indexed.
742     23841 files indexed. 177640166 total bytes.
743     Elapsed time: 00:06:43 CPU time: 00:04:12
744    
745     You can also build a number of smaller indexes and then merge them together
746     with C<-M>. This will use more memory. Merging is not a great option.
747    
748     Finally, if you do build a number of smaller indexes, you can specify more
749     than one index when searching by using the C<-f> switch. Sorting large
750     result sets by a property will be slower when specifying multiple index
751     files while searching.
752    
753     =head3 My system admin says Swish-e uses too much of the CPU!
754    
755     That's a good thing! That expensive CPU is supposed to be busy.
756    
757     Indexing takes a lot of work -- to make indexing fast much of the work is
758     done in memory which reduces the amount of time Swish-e is waiting on I/O.
759     But, there are two things you can try:
760    
761     The C<-e> option will run Swish-e in economy mode, which uses the disk
762     to store data while indexing. This makes Swish-e run somewhat slower,
763     but also uses less memory. Since it is writing to disk more often it
764     will be spending more time waiting on I/O and less time in CPU. Maybe.
765    
766     The other thing is to simply lower the priority of the job using the
767     nice(1) command:
768    
769     /bin/nice -15 swish-e -c search.conf
770    
771     If concerned about searching time, make sure you are using the C<-b> and C<-m>
772     switches to only return a page at a time. If you know that your result
773     sets will be large, that you wish to return results one page at a
774     time, and that many pages of the same query will often be requested,
775     it may be smart to request all the documents on the first request and
776     then cache the results to a temporary file. The Perl module File::Cache
777     makes this very simple to accomplish.
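
As a rough sketch of paging with C<-b> and C<-m>, the helper below turns
a page number into a starting result number. It assumes results are
numbered from 1, as described for C<-b> in L<SWISH-RUN|SWISH-RUN>; the
function name and the index file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: the index name and function name are hypothetical.
SWISH=${SWISH:-swish-e}
INDEX=${INDEX:-index.swish-e}

search_page() {
    # $1 = query, $2 = page number (starting at 1), $3 = results per page
    first=$(( ($2 - 1) * $3 + 1 ))
    "$SWISH" -f "$INDEX" -w "$1" -b "$first" -m "$3"
}
```

For instance, C<search_page foo 3 20> asks for 20 results starting at
result number 41.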
778    
779     =head2 Spidering
780    
781     =head3 How can I index documents on a web server?
782    
783     If possible, use the file system method C<-S fs> of indexing to index
784     documents in your web area of the file system. This avoids the overhead
785     of spidering a web server and is much faster. (C<-S fs> is the default
786     method if C<-S> is not specified).
787    
788     If this is impossible (the web server is not local, or documents
789     are dynamically generated), Swish-e provides two methods of spidering.
790     First, it includes the http method of indexing C<-S http>. A number
791     of special configuration directives are available that control spidering
792     (see L<Directives for the HTTP Access Method Only|/"Directives for the
793     HTTP Access Method Only">). A perl helper script (swishspider) is
794     included in the F<src> directory to assist with spidering web servers.
795     There are example configurations for spidering in the F<conf> directory.
796    
797     As of Swish-e 2.2, there's a general purpose "prog" document source where
798     a program can feed documents to it for indexing. A number of example
799     programs can be found in the C<prog-bin> directory, including a program
800     to spider web servers. The provided spider.pl program is full-featured
801     and is easily customized.
802    
803     The advantage of the "prog" document source feature over the "http" method
804     is that the program is only executed one time, where the swishspider
805     program used in the "http" method is executed once for every document
806     read from the web server. The forking of Swish-e and compiling of the
807     perl script can be quite expensive, time-wise.
808    
809     The other advantage of the C<spider.pl> program is that it's simple and
810     efficient to add filtering (such as for PDF or MS Word docs) right into
811     spider.pl's configuration, and it includes features such as MD5 checks
812     to prevent duplicate indexing, options to avoid spidering some files,
813     or index but avoid spidering. And since it's a perl program there's no
814     limit on the features you can add.
815    
816     =head3 Why does swish report "./swishspider: not found"?
817    
818     Does the file F<swishspider> exist where the error message displays? If not, either
819     set the configuration option L<SpiderDirectory|SWISH-CONFIG/"item_SpiderDir">
820     to point to the directory where the F<swishspider> program is found, or place the
821     F<swishspider> program in the current directory when running swish-e.
822    
823     If you are running Windows, make sure "perl" is in your path. Try typing F<perl> from
824     a command prompt.
825    
826     If you are not running Windows, make sure that the shebang line (the first line of the
827     swishspider program that starts with #!) points to the correct location of perl.
828     Typically this will be F</usr/bin/perl> or F</usr/local/bin/perl>. Also, make sure that
829     you have execute and read permissions on F<swishspider>.
830    
831     The F<swishspider> perl script is only used with the C<-S http> method of indexing.
832    
833     =head3 I'm using the spider.pl program to spider my web site, but some
834     large files are not indexed.
835    
836     The C<spider.pl> program has a default limit of 5MB file size. This can
837     be changed with the C<max_size> parameter setting. See C<perldoc
838     spider.pl> for more information.
839    
840     =head3 I still don't think all my web pages are being indexed.
841    
842     The F<spider.pl> program has a number of debugging switches and can be
843     quite verbose in telling you what's happening, and why. See C<perldoc
844     spider.pl> for instructions.
845    
846     =head3 Swish is not spidering Javascript links!
847    
848     Swish cannot follow links generated by Javascript, as they are generated
849     by the browser and are not part of the document.
850    
851     =head3 How do I spider other websites and combine it with my own
852     (filesystem) index?
853    
854     You can either merge C<-M> two indexes into a single index, or use C<-f>
855     to specify more than one index while searching.
856    
857     You will have better results with the C<-f> method.
858    
859    
860     =head2 Searching
861    
862     =head3 How do I limit searches to just parts of the index?
863    
864     If you can identify "parts" of your index by the path name you have
865     two options.
866    
867     The first option is to index the document path. Add this to your
868     configuration:
869    
870     MetaNames swishdocpath
871    
872     Now you can search for words or phrases in the path name:
873    
874     swish-e -w 'foo AND swishdocpath=(sales)'
875    
876     That will only find documents with the word "foo" and where the file's
877     path contains "sales". That might not work as well as you like, though,
878     as both of these paths will match:
879    
880     /web/sales/products/index.html
881     /web/accounting/private/sales_we_messed_up.html
882    
883     This can be solved by searching with a phrase (assuming "/" is not
884     a WordCharacter):
885    
886     swish-e -w 'foo AND swishdocpath=("/web/sales/")'
887     swish-e -w 'foo AND swishdocpath=("web sales")' (same thing)
888    
889    
890     The second option is a bit more powerful. With the C<ExtractPath>
891     directive you can use a regular expression to extract out a sub-set of
892     the path and save it as a separate meta name:
893    
894     MetaNames department
895     ExtractPath department regex !^/web/([^/]+).+$!$1!
896    
897     Which says match a path that starts with "/web/" and extract out
898     everything after that up to, but not including the next "/" and save it in
899     variable $1, and then match everything from the "/" onward. Then replace
900     the entire matched string with $1. And that gets indexed as meta name
901     "department".
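
To see what such a pattern extracts before committing to a full indexing
run, the substitution can be approximated with sed(1). This is only a
sketch: sed's extended regular expressions stand in for Swish-e's regex
handling, C<\1> plays the role of C<$1>, and the function name is
hypothetical.

```shell
#!/bin/sh
# Approximate the ExtractPath rule with sed; \1 corresponds to $1.
extract_department() {
    printf '%s\n' "$1" | sed -E 's!^/web/([^/]+).+$!\1!'
}

extract_department /web/sales/products/index.html
extract_department /web/accounting/private/sales_we_messed_up.html
```

The first path yields "sales" and the second yields "accounting", so the
second document would no longer match a search for the sales department.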
902    
903     Now you can search like:
904    
905     swish-e -w 'foo AND department=sales'
906    
907     and be sure that you will only match the documents in the /web/sales/*
908     path. Note that you can map completely different areas of your file
909     system to the same metaname:
910    
911     # flag the marketing specific pages
912     ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing!
913     ExtractPath department regex !^/internal/marketing/.+$!marketing!
914    
915     # flag the technical departments pages
916     ExtractPath department regex !^/web/(tech|bugs)/.+$!tech!
917    
918    
919     Finally, if you have something more complicated, use C<-S prog> and
920     write a perl program or use a filter to set a meta tag when processing
921     each file.
922    
923     =head3 How can I limit searches to the title, body, or comment?
924    
925     Use the C<-t> switch.
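
A minimal sketch of a title-only search follows. The context letters are
taken from the C<-t> description in L<SWISH-RUN|SWISH-RUN> (C<t> for
title, C<B> for body, C<c> for comments, among others); the function
name and the index file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: the index name and function name are hypothetical.
SWISH=${SWISH:-swish-e}
INDEX=${INDEX:-index.swish-e}

search_titles() {
    # $1 = query; "-t t" limits matching to words found in <title>.
    "$SWISH" -f "$INDEX" -w "$1" -t t
}
```

Passing C<-t B> or C<-t c> instead would limit the search to the body or
to HTML comments.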
926    
927     =head3 I can't limit searches to title/body/comment.
928    
929     Or, I<I can't search with meta names, all the names are indexed as
930     "plain".>
931    
932     Check in the config.h file if #define INDEXTAGS is set to 1. If it is,
933     change it to 0, recompile, and index again. When INDEXTAGS is 1, ALL
934     the tags are indexed as plain text, that is you index "title", "h1", and
935     so on, AND they lose their indexing meaning. If INDEXTAGS is set to 0,
936     you will still index meta tags and comments, unless you have indicated
937     otherwise in the user config file with the IndexComments directive.
938    
939     Also, check for the C<UndefinedMetaTags> setting in your configuration
940     file.
941    
942     =head3 I've tried running the included CGI script and I get an "Internal
943     Server Error"
944    
945     Debugging CGI scripts is beyond the scope of this document.
946     Internal Server Error basically means "check the web server's log for
947     an error message", as it can mean a bad shebang (#!) line, a missing
948     perl module, FTP transfer error, or simply an error in the program.
949     The CGI script F<swish.cgi> in the F<example> directory contains some
950     debugging suggestions. Type C<perldoc swish.cgi> for information.
951    
952     There are also many, many CGI FAQs available on the Internet. A quick web
953     search should offer help. As a last resort you might ask your webadmin
954     for help...
955    
956     =head3 When I try to view the swish.cgi page I see the contents of the
957     Perl program.
958    
959     Your web server is not configured to run the program as a CGI script.
960     This problem is described in C<perldoc swish.cgi>.
961    
962    
963     =head3 How do I make Swish-e highlight words in search results?
964    
965     Short answer:
966    
967     Use the supplied swish.cgi script located in the F<example> directory.
968    
969     Long answer:
970    
971     Swish-e can't because it doesn't have access to the source documents when
972     returning results, of course. But a front-end program of your creation
973     can highlight terms. Your program can open up the source documents and
974     then use regular expressions to replace search terms with highlighted
975     or bolded words.
976    
977     But, that will fail with all but the most simple source documents.
978     For HTML documents, for example, you must parse the document into words
979     and tags (and comments). A word you wish to highlight may span multiple
980     HTML tags, or be a word in a URL and you wish to highlight the entire
981     link text.
982    
983     Perl modules such as HTML::Parser and XML::Parser make word extraction
984     possible. Next, you need to consider that Swish-e uses settings such
985     as WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar,
986     and IgnoreLastChar to define a "word". That is, you can't consider
987     that a string of characters with white space on each side is a word.
988    
989     Then things like TranslateCharacters, and HTML Entities may transform a
990     source word into something else, as far as Swish-e is concerned. Finally,
991     searches can be limited by metanames, so you may need to limit your
992     highlighting to only parts of the source document. Throw phrase searches
993     and stopwords into the equation and you can see that it's not a trivial
994     problem to solve.
995    
996     All hope is not lost, though, as Swish-e does provide some help.
997     Using the C<-H> option it will return in the headers the current index
998     (or indexes) settings for WordCharacters (and others) required to parse
999     your source documents as it parses them during indexing, and will return a
1000     "Parsed Words:" header that will show how it parsed the query internally.
1001     If you use fuzzy indexing (word stemming, soundex, or metaphone)
1002     then you will also need to stem each word in your
1003     document before comparing with the "Parsed Words:" returned by Swish-e.
1004     The Swish-e stemming code is available either by using the Swish-e
1005     Perl module or C library (included with the swish-e distribution),
1006     or by using the SWISH::Stemmer module available on CPAN. Also on CPAN is
1007     the module Text::DoubleMetaphone.
1008    
1009     =head3 Do filters affect the performance during search?
1010    
1011     No. Filters (FileFilter or via "prog" method) are only used for building
1012     the search index database. During search requests there will be no
1013     filter calls.
1014    
1015    
1016     =head2 I have read the FAQ but I still have questions about using Swish-e.
1017    
1018     The Swish-e discussion list is the place to go; see http://swish-e.org/.
1019     Please do not email developers directly. The list is the best place to
1020     ask questions.
1021    
1022     Before you post please read I<QUESTIONS AND TROUBLESHOOTING> located
1023     in the L<INSTALL|INSTALL> page. You should also search the Swish-e
1024     discussion list archive which can be found on the swish-e web site.
1025    
1026     In short, be sure to include the following when asking for help.
1027    
1028     =over 4
1029    
1030     =item * The swish-e version (./swish-e -V)
1031    
1032     =item * What you are indexing (and perhaps a sample), and the number
1033     of files
1034    
1035     =item * Your Swish-e configuration file
1036    
1037     =item * Any error messages that Swish-e is reporting
1038    
1039     =back
1040    
1041     =head1 Document Info
1042    
1043     $Id: SWISH-FAQ.pod,v 1.24 2002/08/20 22:24:08 whmoseley Exp $
1044    
1045     .
