=head1 NAME

The Swish-e FAQ - Answers to Common Questions

=head1 Frequently Asked Questions

=head2 General Questions

=head3 What is Swish-e?

Swish-e is B<S>imple B<W>eb B<I>ndexing B<S>ystem for B<H>umans -
B<E>nhanced. With it, you can quickly and easily index directories of
files or remote web sites and search the generated indexes for words
and phrases.

=head3 So, is Swish-e a search engine?

Well, yes. Probably the most common use of Swish-e is to provide a search
engine for web sites. The Swish-e distribution includes CGI scripts that
can be used to add a I<search engine> to your web site. The CGI
scripts can be found in the F<example> directory of the distribution
package. See the F<README> file for information about the scripts.

But Swish-e can also be used to index all sorts of data, such as email
messages, data stored in a relational database management system,
XML documents, or documents such as Word and PDF files -- or any
combination of those sources at the same time. Searches can be limited
to fields or I<MetaNames> within a document, or limited to areas within
an HTML document (e.g. body, title). Programs other than CGI applications
can use Swish-e as well.

=head3 Should I upgrade if I'm already running a previous version
of Swish-e?

A large number of bug fixes, feature additions, and logic corrections were
made in version 2.2. In addition, indexing speed has been drastically
improved (there are reports of indexing times dropping from four hours to
five minutes), and major parts of the indexing and search parsers have been
rewritten. There are better debugging options, enhanced output formats,
more document meta data (e.g. last modified date, document summary),
options for indexing from external data sources, and faster spidering,
just to name a few changes. (See the F<CHANGES> file for more information.)

Since so much effort has gone into version 2.2, support for previous
versions will probably be limited.

=head3 Are there binary distributions available for Swish-e on platform foo?

Foo? Well, yes there are some binary distributions available. Please see
the Swish-e web site for a list at http://swish-e.org/.

In general, it is recommended that you build Swish-e from source,
if possible.

=head3 Do I need to reindex my site each time I upgrade to a new Swish-e
version?

At times it might not strictly be necessary, but since you don't really
know if anything in the index has changed, it is a good rule to reindex.

=head3 What's the advantage of using the libxml2 library for parsing HTML?

Swish-e may be linked with libxml2, a library for working with HTML and XML
documents, and can then use it to parse both kinds of documents.

The libxml2 parser is a better parser than Swish-e's built-in HTML
parser. It offers more features, and it does a much better job of
extracting the text from a web page. In addition, you can use the
C<ParserWarningLevel> configuration setting to find structural errors
in your documents that could (and would with Swish-e's HTML parser)
cause documents to be indexed incorrectly.

Libxml2 is not required, but is strongly recommended for parsing HTML
documents. It's also recommended for parsing XML, as it offers many
more features than the internal Expat F<xml.c> parser.

The internal HTML parser will have limited support, and does have a
number of bugs. For example, HTML entities may not always be correctly
converted, and entities in properties are not converted at all. The internal
parser tends to get confused by invalid HTML, whereas the libxml2
parser doesn't get confused as often. Document structure is also detected
more reliably by the libxml2 parser.

If you are using the Perl module (the C interface to the Swish-e
library) you may wish to build two versions of Swish-e, one with the
libxml2 library linked into the binary and one without, and build the
Perl module against the library without the libxml2 code. This is to
save space in the library. Hopefully, the library will someday soon be
split into indexing and searching code (volunteers welcome).

=head3 Does Swish-e include a CGI interface?

An example CGI script is included in the F<example> directory.
(Type C<perldoc swish.cgi> in the F<example> directory for instructions.)

Please be careful when picking a CGI script to use with Swish-e. Quite a
few of the scripts that have been available for it are insecure and
should not be used.

The included example CGI script was designed with security in mind.
Regardless, you are encouraged to have your local Perl expert review it
(and all other CGI scripts you use) before placing it into production.
This is just a good policy to follow.

=head3 How secure is Swish-e?

We know of no security issues with using Swish-e. Careful attention
has been paid to common security problems, such as buffer
overruns, when programming Swish-e.

The most likely security issue with Swish-e is when it is run via
a poorly written CGI interface. This is not limited to CGI scripts
written in Perl, as it's just as easy to write an insecure CGI script
in C, Java, PHP, or Python. A good source of information is included
with the Perl distribution. Type C<perldoc perlsec> at your local
prompt for more information. Another must-read document is located at
C<http://www.w3.org/Security/faq/wwwsf4.html>.

Note that there are many I<free> yet insecure and poorly written CGI
scripts available -- even some designed for use with Swish-e. Please
carefully review any CGI script you use. Free is not such a good price
when you get your server hacked...

=head3 Should I run Swish-e as the superuser (root)?

No. Never.

=head3 What files does Swish-e write?

Swish-e writes the index file, of course. This is specified with the
C<IndexFile> configuration directive or by the C<-f> command line switch.

The index file is actually a collection of files, but all start with
the file name specified with the C<IndexFile> directive or the C<-f>
command line switch.

For example, the file ending in F<.prop> contains the document properties.

When creating the index files Swish-e appends the extension F<.temp>
to the index file names. When indexing is complete Swish-e renames the
F<.temp> files to the index files specified by C<IndexFile> or C<-f>.
This is done so that existing indexes remain untouched until
indexing completes.

Swish-e also writes temporary files in some cases during indexing
(e.g. C<-S http>, C<-S prog> with filters, when merging, and when
using C<-e>). Temporary files are created with the mkstemp(3) function
(with 0600 permission on unix-like operating systems).

The temporary files are created in the directory specified by the
environment variables C<TMPDIR> and C<TMP>, in that order. If those
are not set then Swish-e uses the configuration setting
L<TmpDir|SWISH-CONFIG/"item_TmpDir">. Otherwise, the temporary files
will be located in the current directory.

=head3 Can I index PDF and MS-Word documents?

Yes, you can use a I<Filter> to convert documents while indexing, or you
can use a program that "feeds" Swish-e documents that have already
been converted. See L</Indexing> below.

=head3 Can I index documents on a web server?

Yes, Swish-e provides two ways to index (spider) documents on a web
server. See C<Spidering> below.

Swish-e can retrieve documents from a file system or from a remote web
server. It can also execute a program that returns documents back
to it. This program can retrieve documents from a database, filter
compressed document files, convert PDF files, extract data from mail
archives, or spider remote web sites.

=head3 Can I implement keywords in my documents?

Yes, Swish-e can associate words with I<MetaNames> while indexing,
and you can limit your searches to these MetaNames while searching.

In your HTML files you can put keywords in HTML META tags or in XML blocks.

Keywords can take two formats in your source documents. As an HTML META tag:

    <META NAME="DC.subject" CONTENT="digital libraries">

And in XML format (can also be used in HTML documents when using libxml2):

    <meta2>
        Some Content
    </meta2>

Then, to inform Swish-e about the existence of the meta names in your
documents, edit the C<MetaNames> line in your configuration file:

    MetaNames DC.subject meta1 meta2

When searching you can now limit some or all search terms to that
MetaName. For example, you can look for documents that contain the word
C<apple> and also have either C<fruit> or C<cooking> in the C<DC.subject>
meta tag.

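The search just described might be run as follows. This is a minimal
sketch: the index name F<index.swish-e> is an assumption, and the query
reuses the C<metaname=(...)> syntax shown elsewhere in this FAQ. The
script prints the command instead of running it when Swish-e or the
index is not available.

```shell
#!/bin/sh
# Hypothetical index file name; substitute your own.
index="index.swish-e"

# "apple" anywhere in the document, plus "fruit" or "cooking" inside
# the DC.subject MetaName.  Single quotes keep the shell from
# interpreting the parentheses.
query='apple and DC.subject=(fruit or cooking)'

if command -v swish-e >/dev/null 2>&1 && [ -f "$index" ]; then
    swish-e -f "$index" -w "$query"
else
    # Swish-e or the index is not available: show the command instead.
    echo "swish-e -f $index -w '$query'"
fi
```

Quoting the whole query as one argument matters: unquoted parentheses
would be treated as shell syntax before Swish-e ever saw them.
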
=head3 What are document properties?

A document property is typically data that describes the document.
For example, properties might include a document's path name, its last
modified date, its title, or its size. Swish-e stores a document's
properties in the index file, and they can be reported back in search
results.

Swish-e also uses properties for sorting. You may sort your results by
one or more properties, in ascending or descending order.

Properties can also be defined within your documents. HTML and
XML files can specify tags (see previous question) as properties.
The I<contents> of these tags can then be returned with search results.
These user-defined properties can also be used for sorting search results.

For example, if you had the following in your documents

    <meta name="creator" content="accounting department">

and C<creator> is defined as a property (see C<PropertyNames> in
L<SWISH-CONFIG|SWISH-CONFIG>), Swish-e can return C<accounting department>
with the result for that document:

    swish-e -w foo -p creator

Or for sorting:

    swish-e -w foo -s creator

=head3 What's the difference between MetaNames and PropertyNames?

MetaNames allow keyword searches in your documents. That is, you can
use MetaNames to restrict searches to just parts of your documents.

PropertyNames, on the other hand, define text that can be returned with
results, and can be used for sorting.

Both use I<meta tags> found in your documents (as shown in the above two
questions) to define the text you wish to use as a property or meta name.

You may define a tag as B<both> a property and a meta name. For example:

    <meta name="creator" content="accounting department">

placed in your documents and then using configuration settings of:

    PropertyNames creator
    MetaNames creator

will allow you to limit your searches to documents created by accounting:

    swish-e -w 'foo and creator=(accounting)'

That will find all documents with the word C<foo> that also have a creator
meta tag that contains the word C<accounting>. This is using MetaNames.

And you can also say:

    swish-e -w foo -p creator

which will return all documents with the word C<foo>, but the results will
also include the contents of the C<creator> meta tag.
This is using properties.

You can use properties and meta names at the same time, too:

    swish-e -w 'creator=(accounting or marketing)' -p creator -s creator

That searches only in the C<creator> I<meta name> for either of the words
C<accounting> or C<marketing>, prints out the contents
of the C<creator> I<property>, and sorts the results by the C<creator>
I<property name>.

(See also the C<-x> output format switch in L<SWISH-RUN|SWISH-RUN>.)

=head3 Can Swish-e index multi-byte characters?

No. This will require much work to change. But Swish-e works with
eight-bit characters, so many character sets can be used. Note that it
does call the ANSI-C tolower() function, which depends on the current
locale setting. See C<locale(7)> for more information.

=head2 Indexing

=head3 How do I pass Swish-e a list of files to index?

Currently, there is no configuration directive to include a file that
contains a list of files to index. But there is a directive to include
another configuration file:

    IncludeConfigFile /path/to/other/config

And in C</path/to/other/config> you can say:

    IndexDir file1 file2 file3 file4 file5 ...
    IndexDir file20 file21 file22

You may also specify more than one configuration file on the command line:

    ./swish-e -c config_one config_two config_three

Another option is to create a directory with symbolic links to the files
to index, and index just that directory.

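The symbolic link approach might look something like the following
sketch. Every path here is hypothetical and created in a scratch
directory purely for illustration; the list file F<files.txt> is assumed
to hold one path per line. Depending on your configuration, the
C<FollowSymLinks> directive may need to be set to C<yes> before Swish-e
will index through the links.

```shell
#!/bin/sh
set -e
# Scratch area with two sample documents (illustration only).
work=$(mktemp -d)
mkdir "$work/docs" "$work/to_index"
echo "<html><body>one</body></html>" > "$work/docs/one.html"
echo "<html><body>two</body></html>" > "$work/docs/two.html"

# The list of files to index, one path per line.
printf '%s\n' "$work/docs/one.html" "$work/docs/two.html" > "$work/files.txt"

# Create one symbolic link per listed file in the directory to index.
while IFS= read -r path; do
    ln -s "$path" "$work/to_index/$(basename "$path")"
done < "$work/files.txt"

ls "$work/to_index"

# Then index only that directory, for example:
#   swish-e -i "$work/to_index" -f "$work/index.swish-e"
```

Note that links are named after the base name of each file, so two
listed files with the same name in different directories would collide;
a real script would need a naming scheme that avoids that.
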
=head3 How does Swish-e know which parser to use?

Swish-e can parse HTML, XML, and text documents. The parser is selected by
associating a file extension with a parser using the C<IndexContents>
directive. You may set the default parser with the C<DefaultContents>
directive. If a document is not assigned a parser it will default to
the HTML parser (HTML2 if built with libxml2).

You may use Filters or an external program to convert documents to HTML,
XML, or text.

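As a rough illustration, a configuration that assigns parsers by
extension could be written as below. The extension lists and the choice
of C<TXT> as the default are assumptions made for this example, not
recommendations, and the file is written to a temporary location only so
the sketch can be run as-is.

```shell
#!/bin/sh
set -e
conf=$(mktemp)   # hypothetical location for the configuration file

cat > "$conf" <<'EOF'
# Map file extensions to parsers
IndexContents HTML .html .htm
IndexContents XML  .xml
IndexContents TXT  .txt .log
# Parser for any document not matched above
DefaultContents TXT
EOF

cat "$conf"
# Index with, for example: swish-e -c "$conf" -i /path/to/docs
```

When Swish-e is built with libxml2, the parser names C<HTML2> and
C<XML2> mentioned in this FAQ could be used in place of C<HTML> and
C<XML>.
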
=head3 Can I reindex and search at the same time?

Yes. Starting with version 2.2 Swish-e indexes to temporary files, and then
renames the files when indexing is complete. On most systems renames
are atomic. But since Swish-e generates more than one file during
indexing, there will be a very short period of time, while the
various files are being renamed, when the index is out of sync.

Settings in F<config.h> control some options related to temporary files
and their use during indexing.

=head3 Can I index phrases?

Phrases are indexed automatically. To search for a phrase simply place
double quotes around the phrase.

For example:

    swish-e -w 'free and "fast search engine"'

=head3 How can I prevent phrases from matching across sentences?

Use the
L<BumpPositionCounterCharacters|SWISH-CONFIG/"item_BumpPositionCounterCharacters">
configuration directive.

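A configuration line along these lines is one possibility. It is a
sketch that assumes the directive takes a plain string of characters,
each of which bumps the word position counter when seen; check
L<SWISH-CONFIG|SWISH-CONFIG> for the exact syntax before relying on it.
The choice of sentence-ending punctuation is also an assumption.

```shell
#!/bin/sh
set -e
conf=$(mktemp)   # hypothetical location for the configuration file

cat > "$conf" <<'EOF'
# Bump the word position at sentence-ending punctuation so that a
# quoted phrase cannot match across two sentences.
BumpPositionCounterCharacters .!?
EOF

cat "$conf"
```

A period also appears in abbreviations and numbers, so a setting like
this can block some phrase matches that do not actually cross a
sentence boundary.
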
=head3 Swish-e isn't indexing a certain word or phrase.

There are a number of configuration parameters that control what Swish-e
considers a "word", and it has a debugging feature to help pinpoint
any indexing problems.

The configuration file directives
C<WordCharacters>, C<BeginCharacters>, C<EndCharacters>,
C<IgnoreFirstChar>, and C<IgnoreLastChar> are the main settings that
Swish-e uses to define a "word". See L<SWISH-CONFIG|SWISH-CONFIG> and
L<SWISH-RUN|SWISH-RUN> for details.

Swish-e also uses compile-time defaults for many settings. These are
located in the F<src/config.h> file.

The command line arguments C<-k>, C<-v> and C<-T> are useful when
debugging these problems. Using C<-T INDEXED_WORDS> while indexing will
display each word as it is indexed. You should specify one file when
using this feature since it can generate a lot of output.

    ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

You may also wish to index a single file that contains words that are or
are not being indexed as you expect and use C<-T> to output debugging
information about the index. A useful command might be:

    ./swish-e -f index.swish-e -T INDEX_FULL

Once you see how Swish-e is parsing and indexing your words, you can
adjust the configuration settings mentioned above to control what words
are indexed.

Another useful command might be:

    ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS

This will show the whitespace-delimited words parsed from the document
(PARSED_WORDS), and how those words are split up into separate words for
indexing (INDEXED_WORDS).

=head3 How do I keep Swish-e from indexing numbers?

Swish-e indexes words as defined by the C<WordCharacters> setting, as
described above. So to avoid indexing numbers you simply remove digits
from the C<WordCharacters> setting.

There are also some settings in F<config.h> that control what "words"
are indexed. You can configure Swish-e to never index words that are all
digits, vowels, or consonants, or that contain more than some consecutive
number of digits, vowels, or consonants. In general, you won't need to
change these settings.

Also, there's an experimental feature called C<IgnoreNumberChars>
which allows you to define a set of characters that describe a number.
If a word is made up of B<only> those characters it will not be indexed.

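Removing digits from C<WordCharacters>, as described above, could be as
simple as the following sketch. The letter list is an assumption chosen
for plain English text; any accented or other characters you want
indexed would have to be added, and other word-definition directives may
need matching changes depending on your compile-time defaults.

```shell
#!/bin/sh
set -e
conf=$(mktemp)   # hypothetical location for the configuration file

cat > "$conf" <<'EOF'
# Letters only: digits are no longer treated as part of a word.
WordCharacters abcdefghijklmnopqrstuvwxyz
EOF

cat "$conf"
# Check the effect on a single file with the debugging switch
# described in this FAQ, for example:
#   swish-e -c "$conf" -i problem.file -T INDEXED_WORDS
```
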
=head3 Swish-e crashes and burns on a certain file. What can I do?

This shouldn't happen. If it does please post the details to the Swish-e
discussion list so the problem can be reproduced by the developers.

In the meantime, you can use a C<FileRules> directive to exclude the
particular file name, or pathname, or its title. If there are serious
problems in indexing certain types of files, they may not have valid text
in them (they may be binary files, for instance). You can use C<NoContents>
to exclude that type of file.

Swish-e will issue a warning if an embedded null character is found in a
document. This warning will be an indication that you are trying to index
binary data. If you need to index binary files try to find a program
that will extract the text (e.g. strings(1), catdoc(1), pdftotext(1)).

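The exclusions described above might be written as in this sketch. The
file name, directory, and extensions are all hypothetical, and the
comment on C<NoContents> reflects an assumption that it indexes only the
path name of matching files, not their contents.

```shell
#!/bin/sh
set -e
conf=$(mktemp)   # hypothetical location for the configuration file

cat > "$conf" <<'EOF'
# Skip one troublesome file by name (hypothetical file name)
FileRules filename contains broken_report
# Skip everything whose path matches a directory (hypothetical path)
FileRules pathname contains /scratch/
# Do not index the contents of these (likely binary) file types
NoContents .gif .jpg .exe
EOF

cat "$conf"
```
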
=head3 How do I prevent indexing of some documents?

When using the file system to index your files you can use the
C<FileRules> directive. Other than C<FileRules title>, C<FileRules>
only works with the file system (C<-S fs>) indexing method, not with
C<-S prog> or C<-S http>.

If you are spidering, use a F<robots.txt> file in your document root.
This is a standard way to exclude files from search engines, and is
fully supported by Swish-e. See http://www.robotstxt.org/

You can also modify the F<spider.pl> spider Perl program to skip, index
content only, or spider only listed web pages. Type C<perldoc spider.pl>
in the C<prog-bin> directory for details.

If using the libxml2 library for parsing HTML, you may also use the Meta
Robots Exclusion in your documents:

    <meta name="robots" content="noindex">

See the L<obeyRobotsNoIndex|SWISH-CONFIG/"item_obeyRobotsNoIndex"> directive.

=head3 How do I prevent indexing parts of a document?

If you are using libxml2 for parsing HTML, you can prevent Swish-e from
indexing a common header, footer, or navigation bar by placing a fake
HTML tag around the text you wish to ignore and naming that tag in the
C<IgnoreMetaTags> directive. This will generate an error message if
the C<ParserWarningLevel> is set, as it's invalid HTML.

C<IgnoreMetaTags> works with XML documents (and HTML documents when
using libxml2 as the parser), but not with documents parsed by the text
(TXT) parser.

If you are using the libxml2 parser (HTML2 and XML2) then you can use the
following comments in your documents to prevent indexing:

    <!-- SwishCommand noindex -->
    <!-- SwishCommand index -->

These shorter forms may also be used:

    <!-- noindex -->
    <!-- index -->

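The fake tag approach could be set up roughly as follows. The tag name
C<ignore>, the sample page, and the file names are all made up for this
sketch, and it assumes Swish-e was built with libxml2 so that HTML is
parsed by the HTML2 parser.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)   # scratch directory for this illustration

# Sample page: <ignore> is a made-up tag name, not real HTML.
cat > "$work/page.html" <<'EOF'
<html><head><title>Widgets</title></head>
<body>
<ignore>Home | Products | Contact</ignore>
<p>Our widgets are hand made.</p>
</body></html>
EOF

# Configuration: text inside <ignore> is skipped while indexing.
cat > "$work/swish.conf" <<'EOF'
IndexDir page.html
IgnoreMetaTags ignore
EOF

cat "$work/swish.conf"
# Index from inside "$work" with: swish-e -c swish.conf
```
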
=head3 How do I modify the path or URL of the indexed documents?

Use the C<ReplaceRules> configuration directive to rewrite path names
and URLs. If you are using the C<-S prog> input method you may set the path
to any string.

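For instance, turning a file system path into a URL might look like this
sketch. Both the document root and the host name are hypothetical, and
the C<replace> form shown is only one of the rule types; see
L<SWISH-CONFIG|SWISH-CONFIG> for the full syntax.

```shell
#!/bin/sh
set -e
conf=$(mktemp)   # hypothetical location for the configuration file

cat > "$conf" <<'EOF'
# Report /var/www/html/about.html as http://www.example.com/about.html
ReplaceRules replace /var/www/html/ http://www.example.com/
EOF

cat "$conf"
```
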
=head3 How can I index data from a database?

Use the "prog" document source method of indexing. Write a program to
extract the data from your database, and format it as XML, HTML,
or text. See the examples in the C<prog-bin> directory, and the next
question.

=head3 How do I index my PDF, Word, and compressed documents?

Internally, Swish-e can only parse HTML, XML and TXT (text) files by
default, but it can make use of I<filters> that will convert other types
of files such as MS Word documents, PDF, or gzipped files into one of
the file types that Swish-e understands.

The C<FileFilter> config directive is used to define programs to use
as filters, based on file extension. For example, you can use the
program C<catdoc> to convert MS-Word documents to text for indexing.
Please see L<SWISH-CONFIG|SWISH-CONFIG/"Document Filter Directives">
and the examples in the C<filter-bin> directory for more information.

Another option is to use the C<prog> document source input method.
In this case you write a program (such as a Perl script) that will read
and convert your data as needed and then output one of the formats
that Swish-e understands. Examples of using the C<prog> input method
for filtering are included in the C<prog-bin> directory of the Swish-e
distribution.

The disadvantage of using the C<prog> input method is that you must
write a program that reads the documents from the source (e.g. from the
file system or via a spider to read files on a web server), and also
include the code to filter the documents. It's much easier to use the
C<FileFilter> option since the filter can often be implemented with just
a single configuration directive.

On the other hand, the advantage of using the C<prog> input method for
indexing is speed. Filtering within a C<prog> input method program
will be faster if your filtering program is something like a Perl script
(something that has a large start-up cost). This may or may not be an
issue for you, depending on how much time your indexing requires.

You can also use a combination of methods. For example, say you are
indexing a directory that contains PDF files using a C<FileFilter>
directive. Now you want to index a MySQL database that also contains
PDF files. You can write a C<prog> input method program to read your
MySQL database and use the same C<FileFilter> configuration parameter
(and filter program) to convert the PDF files into one of the native
Swish-e formats (TXT, HTML, XML).

Do note that it will be slower to use the C<FileFilter> method instead
of running the filter directly from the C<prog> input method program.
When C<FileFilter> is used with the C<prog> input method Swish-e must
create a temporary file containing the output from your C<prog> method
program, and then execute the filter program.

In general, use the C<FileFilter> method to filter documents. If indexing
speed is an issue, consider writing a C<prog> input method program.
If you are already using the C<prog> method, then filtering will probably
be best accomplished within that program.

Here are two examples of how to run a filter program, one using Swish-e's
C<FileFilter> directive, another using a C<prog> input method program.
Both simply use the program C</bin/cat> as a filter and only
index .html files.

First, using the C<FileFilter> method, here's the entire configuration
file (F<swish.conf>):

    IndexDir .
    IndexOnly .html
    FileFilter .html "/bin/cat" "'%p'"

and index with the command

    swish-e -c swish.conf -v 1

Now, the same thing using the C<prog> document source input method
and a Perl program called F<catfilter.pl>. You can see that it's much
more work than using the C<FileFilter> method above, but it provides a
place to do additional processing. In this example, the C<prog> method
is only slightly faster. But if you needed a Perl script to run as a
FileFilter then C<prog> will be significantly faster.

    #!/usr/local/bin/perl -w
    use strict;
    use File::Find;  # for recursing a directory tree

    $/ = undef;      # slurp mode: read a whole file in one go
    find(
        { wanted => \&wanted, no_chdir => 1, },
        '.',
    );

    sub wanted {
        return if -d;
        return unless /\.html$/;

        my $mtime = (stat)[9];

        # Fork a child whose STDOUT is read through FH in the parent
        my $child = open( FH, '-|' );
        die "Failed to fork $!" unless defined $child;
        exec '/bin/cat', $_ unless $child;

        my $content = <FH>;
        close FH;
        my $size = length $content;

        # Headers for Swish-e, then a blank line, then the document
        print <<EOF;
    Content-Length: $size
    Last-Mtime: $mtime
    Path-Name: $_

    EOF

        print $content;
    }

And index with the command:

    swish-e -S prog -i ./catfilter.pl -v 1

This example will probably not work under Windows due to the '-|' open.
A simple piped open may work just as well. That is, replace:

    my $child = open( FH, '-|' );
    die "Failed to fork $!" unless defined $child;
    exec '/bin/cat', $_ unless $child;

with this:

    open( FH, "/bin/cat $_ |" ) or die $!;

Perl will try to avoid running the command through the shell if meta
characters are not passed to the open. See C<perldoc -f open> for
more information.

=head3 Eh, but I just want to know how to index PDF documents!

See the examples in the F<conf> directory.

=head3 I'm using the prog method to index PDF documents, but the file
contents are not indexed.

Some of the examples in the F<prog-bin> directory use a module to
convert the PDF files into XML. So you must tell Swish-e that you are
indexing XML files for the PDF extension.

    IndexContents XML .pdf

=head3 I'm using Windows and can't get Filters or the prog input method
to work!

Both the C<-S prog> input method and filters use the C<popen()> system
call to run the external program. If your external program is, for
example, a Perl script, you have to tell Swish-e to run perl, instead of
the script. Also, because C<popen()> runs the command via the shell,
the path separator in the program name must be a backslash on Windows.

For example, you would need to specify the path to perl as (assuming
this is where perl is on your system):

    IndexDir e:\\perl\\bin\\perl.exe

Or run a filter like:

    FileFilter .foo e:\\perl\\bin\\perl.exe 'myscript.pl "%p"'

=head3 How do I index non-English words?

Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1
character set, and includes many non-English letters (and symbols).
As long as they are listed in C<WordCharacters> they will be indexed.

Actually, you probably can index any 8-bit character set, as long as
you don't mix character sets in the same index.

The C<TranslateCharacters> directive (L<SWISH-CONFIG|SWISH-CONFIG>)
can translate characters while indexing and searching by
mapping one character to another.

C<TranslateCharacters :ascii7:> is a predefined set of characters that
will translate eight-bit characters to ascii7 characters. Using the
C<:ascii7:> rule will, for example, translate "Ääç" to "aac". This means
that searching for "Çelik", "çelik" or "celik" will all match the same word.

Note: When using libxml2 for parsing, parsed documents are converted
internally (within libxml2) to UTF-8. This is converted to ISO 8859-1
Latin-1 when indexing. In cases where a string cannot be converted
from UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters),
the string will be sent to Swish-e in UTF-8 encoding. This will result
in some words being indexed incorrectly. Setting C<ParserWarningLevel> to 1
or more will display warnings when UTF-8 to 8859-1 conversion fails.

=head3 Can I add/remove files from an index?

Not really. Swish-e currently has no way to add or remove items from
its index.

About the only way to weed deleted files out of search results is to
stat(2) every result to make sure that the file still exists.

Incremental additions can be handled in a couple of ways, depending on
your situation. It's probably easiest to create one main index every
night (or every week), and then create an index of just the new files
between main indexing jobs and use the C<-f> option to pass both indexes
to Swish-e while searching.

You can merge the indexes into one index (instead of using C<-f>), but it's
not clear that this has any advantage over searching multiple indexes.
Using C<-f> gives access to the individual headers of both indexes,
while C<-M> merges the headers, and merging indexes with different
indexing settings (Stemming, WordCharacters) may produce odd results.
This is a question for the Swish-e discussion list.

How does one create the incremental index?

One method is by using the C<-N> switch to pass a file path to
Swish-e when indexing. It will only index files that have a last
modification date C<newer> than the file supplied with the C<-N> switch.

This option has the disadvantage that Swish-e must process every file in
every directory as if it were going to be indexed (the test for C<-N>
is done last, right before indexing of the file contents begins and after
all other tests on the file have been completed) -- all that just to
find a few new files.

Also, if you use the Swish-e index file as the file passed to C<-N>, there
may be files that were added after indexing was started, but before the
index file was written. This could result in a file not being added to
the index.

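One way around that race, sketched here in Python, is to keep a separate
marker file, touch it just I<before> a full indexing run starts, and pass
the marker (rather than the index file) to C<-N>. The C<files_newer_than>
function only imitates the modification-time comparison described above, so
you can preview what an incremental run would pick up; both function names
are made up for this example.

```python
import os

def touch(path):
    """Create the marker file if needed and set its modification time to now."""
    with open(path, "a"):
        pass
    os.utime(path, None)

def files_newer_than(root, marker):
    """Return the files under root modified more recently than the marker file."""
    cutoff = os.path.getmtime(marker)
    newer = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > cutoff:
                newer.append(path)
    return sorted(newer)
```

Files that change while the full run is in progress are then newer than the
marker, so the next incremental run picks them up again. They may end up in
both indexes, which the search front end has to tolerate.
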
Another option is to maintain a parallel directory tree that contains
symlinks pointing to the main files. When a file is added to (or
changed in) the main directory tree, you create a symlink to the real file
in the parallel directory tree. Then just index the symlink directory
to generate the incremental index.

This option has the disadvantage that you need a central program that
creates the new files and can also create the symlinks.
But indexing is quite fast, since Swish-e only has to look at the files
that need to be indexed. When you run full indexing you simply unlink
(delete) all the symlinks.

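The symlink bookkeeping might look like the following Python sketch. The
function names and directory layout are hypothetical, and Swish-e probably
also needs its C<FollowSymLinks> directive enabled to index through the
links, so check the configuration documentation before relying on this.

```python
import os

def queue_for_incremental(real_path, main_root, link_root):
    """Mirror real_path (somewhere under main_root) as a symlink under link_root."""
    relative = os.path.relpath(real_path, main_root)
    link = os.path.join(link_root, relative)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if os.path.lexists(link):          # file changed again: replace the old link
        os.remove(link)
    os.symlink(os.path.abspath(real_path), link)
    return link

def clear_queue(link_root):
    """Remove every symlink under link_root; call this after a full indexing run."""
    for dirpath, _dirnames, filenames in os.walk(link_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                os.remove(path)
```

The program that writes a file into the main tree would call
C<queue_for_incremental> right after saving it, and the full indexing job
would call C<clear_queue> once it finishes.
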
Both of these methods have issues where files could end up in both
indexes, or be left out of an index. Using file locks while
indexing and hash lookups during searches can help prevent these
problems.

=head3 I run out of memory trying to index my files.

It's true that indexing can take up a lot of memory! Swish-e is extremely
fast at indexing, but that comes at the cost of memory.

The best answer is to install more memory.

Another option is to use the C<-e> switch. This will require less memory,
but indexing will take longer as not all data will be stored in memory
while indexing. How much less memory and how much more time depends on
the documents you are indexing, and the hardware that you are using.

Here's an example of indexing all .html files in /usr/doc on Linux.
This first example is I<without> C<-e> and used about 84M of memory:

    270279 unique words indexed.
    23841 files indexed. 177640166 total bytes.
    Elapsed time: 00:04:45 CPU time: 00:03:19

This is I<with> C<-e>, and used about 26M of memory:

    270279 unique words indexed.
    23841 files indexed. 177640166 total bytes.
    Elapsed time: 00:06:43 CPU time: 00:04:12

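To put those two runs side by side, here is the arithmetic as a small Python
snippet. It uses only the figures quoted above; the helper names are made up
for the example.

```python
def seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, secs = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + secs

def percent_change(before, after):
    """Signed change from before to after, rounded to a whole percent."""
    return round(100 * (after - before) / before)

# Figures from the two runs above: without -e, then with -e.
memory_change = percent_change(84, 26)
elapsed_change = percent_change(seconds("00:04:45"), seconds("00:06:43"))
cpu_change = percent_change(seconds("00:03:19"), seconds("00:04:12"))
```

For this particular document set, C<-e> used about 69% less memory in
exchange for about 41% more elapsed time and 27% more CPU time. Your numbers
will differ with other documents and hardware.
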
You can also build a number of smaller indexes and then merge them together
with C<-M>. Merging will use more memory, though, so it is not a great option.

Finally, if you do build a number of smaller indexes, you can specify more
than one index when searching by using the C<-f> switch. Sorting large
result sets by a property will be slower when specifying multiple index
files while searching.

=head3 My system admin says Swish-e uses too much of the CPU!

That's a good thing! That expensive CPU is supposed to be busy.

Indexing takes a lot of work -- to make indexing fast, much of the work is
done in memory, which reduces the amount of time Swish-e is waiting on I/O.
But there are two things you can try:

The C<-e> option will run Swish-e in economy mode, which uses the disk
to store data while indexing. This makes Swish-e run somewhat slower,
but also uses less memory. Since it is writing to disk more often it
will be spending more time waiting on I/O and less time in CPU. Maybe.

The other thing is to simply lower the priority of the job using the
nice(1) command:

    /bin/nice -15 swish-e -c search.conf

If you are concerned about searching time, make sure you are using the
C<-b> and C<-m> switches to return only a page at a time. If you know that
your result sets will be large, that you wish to return results one page
at a time, and that many pages of the same query will often be requested,
it may be smart to request all the documents on the first request and
then cache the results to a temporary file. The perl module File::Cache
makes this very simple to accomplish.

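The caching idea is not tied to Perl. Here is a minimal Python sketch of
it, assuming a C<run_search> callable that you supply and that returns the
complete result list for a query (for example by running swish-e and
parsing its output). The cache location and function name are hypothetical.

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = os.path.join(tempfile.gettempdir(), "search-cache")  # hypothetical location

def cached_page(query, page, per_page, run_search, cache_dir=CACHE_DIR):
    """Return one page of results, running the full search only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha1(query.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".json")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as fh:
            results = json.load(fh)
    else:
        results = run_search(query)        # fetch every hit once
        with open(cache_file, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
    start = (page - 1) * per_page          # pages are numbered from 1
    return results[start:start + per_page]
```

This sketch never expires anything. A real front end would at least throw
the cache away whenever the index is rebuilt.
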
=head2 Spidering

=head3 How can I index documents on a web server?

If possible, use the file system method C<-S fs> of indexing to index
documents in your web area of the file system. This avoids the overhead
of spidering a web server and is much faster. (C<-S fs> is the default
method if C<-S> is not specified.)

If this is impossible (the web server is not local, or documents
are dynamically generated), Swish-e provides two methods of spidering.
First, it includes the http method of indexing C<-S http>. A number
of special configuration directives are available that control spidering
(see L<Directives for the HTTP Access Method Only|/"Directives for the
HTTP Access Method Only">). A perl helper script (swishspider) is
included in the F<src> directory to assist with spidering web servers.
There are example configurations for spidering in the F<conf> directory.

As of Swish-e 2.2, there's a general purpose "prog" document source where
a program can feed documents to Swish-e for indexing. A number of example
programs can be found in the C<prog-bin> directory, including a program
to spider web servers. The provided spider.pl program is full-featured
and is easily customized.

The advantage of the "prog" document source feature over the "http" method
is that the program is only executed one time, whereas the swishspider
program used in the "http" method is executed once for every document
read from the web server. The forking of Swish-e and compiling of the
perl script can be quite expensive, time-wise.

The other advantage of the C<spider.pl> program is that it's simple and
efficient to add filtering (such as for PDF or MS Word docs) right into
spider.pl's configuration, and it includes features such as MD5 checks
to prevent duplicate indexing, and options to skip some files entirely
or to index a file without spidering it. And since it's a perl program
there's no limit on the features you can add.

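A "prog" feeder does not have to be written in Perl; any program that
writes documents to standard output in the expected format will do. The
Python sketch below formats one document. The header names
(C<Path-Name>, C<Content-Length>, C<Document-Type>) are given from memory of
the C<-S prog> documentation and should be treated as assumptions to verify
there; C<prog_record> is a made-up function name.

```python
def prog_record(path, content, doc_type=None):
    """Format one document as headers, a blank line, then the content bytes."""
    body = content.encode("utf-8")
    # Content-Length counts bytes, not characters, so measure after encoding.
    headers = [f"Path-Name: {path}", f"Content-Length: {len(body)}"]
    if doc_type:
        headers.append(f"Document-Type: {doc_type}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body
```

A feeder script would write one such record per document to
C<sys.stdout.buffer>, and Swish-e would then be pointed at the script as
its C<-S prog> input.
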
=head3 Why does swish report "./swishspider: not found"?

Does the file F<swishspider> exist at the location shown in the error
message? If not, either
set the configuration option L<SpiderDirectory|SWISH-CONFIG/"item_SpiderDir">
to point to the directory where the F<swishspider> program is found, or place the
F<swishspider> program in the current directory when running swish-e.

If you are running Windows, make sure "perl" is in your path. Try typing F<perl> from
a command prompt.

If you are not running Windows, make sure that the shebang line (the first line of the
swishspider program that starts with #!) points to the correct location of perl.
Typically this will be F</usr/bin/perl> or F</usr/local/bin/perl>. Also, make sure that
you have execute and read permissions on F<swishspider>.

The F<swishspider> perl script is only used with the C<-S http> method of indexing.

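The non-Windows checks above are easy to automate. This Python sketch
reports the likely causes for a given script path; the function name and
the wording of the messages are made up for the example.

```python
import os

def script_problems(path):
    """Return a list of reasons why a #! script might fail to start."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    problems = []
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append("missing read or execute permission")
    with open(path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", "replace").strip()
    # The interpreter is the first word after "#!".
    parts = first_line[2:].split() if first_line.startswith("#!") else []
    if not parts:
        problems.append("no shebang line")
    elif not os.path.exists(parts[0]):
        problems.append(f"interpreter {parts[0]} not found")
    return problems
```

An empty list means the script exists, is readable and executable, and
names an interpreter that is actually present.
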
=head3 I'm using the spider.pl program to spider my web site, but some
large files are not indexed.

The C<spider.pl> program has a default limit of 5MB file size. This can
be changed with the C<max_size> parameter setting. See C<perldoc
spider.pl> for more information.

=head3 I still don't think all my web pages are being indexed.

The F<spider.pl> program has a number of debugging switches and can be
quite verbose in telling you what's happening, and why. See C<perldoc
spider.pl> for instructions.

=head3 Swish is not spidering Javascript links!

Swish cannot follow links generated by Javascript, as they are generated
by the browser and are not part of the document.

=head3 How do I spider other websites and combine it with my own
(filesystem) index?

You can either merge (C<-M>) two indexes into a single index, or use C<-f>
to specify more than one index while searching.

You will have better results with the C<-f> method.

=head2 Searching

=head3 How do I limit searches to just parts of the index?

If you can identify "parts" of your index by the path name you have
two options.

The first option is to index the document path. Add this to your
configuration:

    MetaNames swishdocpath

Now you can search for words or phrases in the path name:

    swish-e -w 'foo AND swishdocpath=(sales)'

So that will only find documents with the word "foo" and where the file's
path contains "sales". That might not work as well as you like, though,
as both of these paths will match:

    /web/sales/products/index.html
    /web/accounting/private/sales_we_messed_up.html

This can be solved by searching with a phrase (assuming "/" is not
a WordCharacter):

    swish-e -w 'foo AND swishdocpath=("/web/sales/")'
    swish-e -w 'foo AND swishdocpath=("web sales")'     (same thing)

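To see why the phrase helps, here is a small Python model of the matching.
It assumes that only letters and digits are word characters, which is a
simplification of whatever your C<WordCharacters> setting really is; the
function names are made up for the example.

```python
import re

def path_words(path):
    """Split a path into lowercase words; anything but a letter or digit separates words."""
    return re.findall(r"[a-z0-9]+", path.lower())

def phrase_in_path(phrase, path):
    """True if the words of phrase appear consecutively, in order, in the path."""
    wanted, words = path_words(phrase), path_words(path)
    if not wanted:
        return False
    return any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1))
```

Under that assumption "/web/sales/" and "web sales" are the same two-word
phrase, which occurs in the first path above but not in the second, where
"web" is followed by "accounting".
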
The second option is a bit more powerful. With the C<ExtractPath>
directive you can use a regular expression to extract a subset of
the path and save it as a separate meta name:

    MetaNames department
    ExtractPath department regex !^/web/([^/]+).+$!$1/

This says: match a path that starts with "/web/", capture everything
after that up to, but not including, the next "/" in variable $1,
and then match everything from the "/" onward. Then replace
the entire matched string with $1. The result gets indexed as meta name
"department".

Now you can search like:

    swish-e -w 'foo AND department=sales'

and be sure that you will only match the documents in the /web/sales/*
path. Note that you can map completely different areas of your file
system to the same metaname:

    # flag the marketing specific pages
    ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing/
    ExtractPath department regex !^/internal/marketing/.+$!marketing/

    # flag the technical departments pages
    ExtractPath department regex !^/web/(tech|bugs)/.+$!tech/

Finally, if you have something more complicated, use C<-S prog> and
write a perl program or use a filter to set a meta tag when processing
each file.

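It can help to try an C<ExtractPath> pattern outside of Swish-e before
reindexing. This Python sketch applies the same regular expression as the
first C<department> example above; the function name is made up, and
returning C<None> for a non-matching path is a choice made for this
example, not a statement about what Swish-e indexes in that case.

```python
import re

# Same pattern as: ExtractPath department regex !^/web/([^/]+).+$!$1/
DEPARTMENT = re.compile(r"^/web/([^/]+).+$")

def department(path):
    """Return the first path component below /web/, or None if the path does not match."""
    match = DEPARTMENT.match(path)
    return match.group(1) if match else None
```

With this pattern the sales_we_messed_up.html file from the earlier example
lands in "accounting", not "sales".
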
=head3 How can I limit searches to the title, body, or comment?

Use the C<-t> switch.

=head3 I can't limit searches to title/body/comment.

Or, I<I can't search with meta names, all the names are indexed as
"plain".>

Check in the config.h file whether #define INDEXTAGS is set to 1. If it is,
change it to 0, recompile, and index again. When INDEXTAGS is 1, ALL
the tags are indexed as plain text, that is, you index "title", "h1", and
so on, AND they lose their indexing meaning. If INDEXTAGS is set to 0,
you will still index meta tags and comments, unless you have indicated
otherwise in the user config file with the IndexComments directive.

Also, check for the C<UndefinedMetaTags> setting in your configuration
file.

=head3 I've tried running the included CGI script and I get an "Internal
Server Error"

Debugging CGI scripts is beyond the scope of this document.
Internal Server Error basically means "check the web server's log for
an error message", as it can mean a bad shebang (#!) line, a missing
perl module, an FTP transfer error, or simply an error in the program.
The CGI script F<swish.cgi> in the F<example> directory contains some
debugging suggestions. Type C<perldoc swish.cgi> for information.

There are also many, many CGI FAQs available on the Internet. A quick web
search should offer help. As a last resort you might ask your webadmin
for help...

=head3 When I try to view the swish.cgi page I see the contents of the
Perl program.

Your web server is not configured to run the program as a CGI script.
This problem is described in C<perldoc swish.cgi>.

=head3 How do I make Swish-e highlight words in search results?

Short answer:

Use the supplied swish.cgi script located in the F<example> directory.

Long answer:

Swish-e can't, of course, because it doesn't have access to the source
documents when returning results. But a front-end program of your creation
can highlight terms. Your program can open up the source documents and
then use regular expressions to replace search terms with highlighted
or bolded words.

But that will fail with all but the most simple source documents.
For HTML documents, for example, you must parse the document into words
and tags (and comments). A word you wish to highlight may span multiple
HTML tags, or be a word in a URL and you wish to highlight the entire
link text.

Perl modules such as HTML::Parser and XML::Parser make word extraction
possible. Next, you need to consider that Swish-e uses settings such
as WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar,
and IgnoreLastChar to define a "word". That is, you can't assume
that a string of characters with white space on each side is a word.

Then things like TranslateCharacters and HTML entities may transform a
source word into something else, as far as Swish-e is concerned. Finally,
searches can be limited by metanames, so you may need to limit your
highlighting to only parts of the source document. Throw phrase searches
and stopwords into the equation and you can see that it's not a trivial
problem to solve.

All hope is not lost, though, as Swish-e does provide some help.
Using the C<-H> option it will return in the headers the current index
(or indexes) settings for WordCharacters (and others) required to parse
your source documents as it parses them during indexing, and will return a
"Parsed Words:" header that will show how it parsed the query internally.
If you use fuzzy indexing (word stemming, soundex, or metaphone)
then you will also need to stem each word in your
document before comparing with the "Parsed Words:" returned by Swish-e.
The Swish-e stemming code is available either by using the Swish-e
Perl module or C library (included with the swish-e distribution),
or by using the SWISH::Stemmer module available on CPAN. Also on CPAN is
the module Text::DoubleMetaphone.

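As a starting point, here is a Python sketch of the "parse into words and
tags" step using the standard C<html.parser> module. It only highlights
whole words found in text, leaving tags, attributes (such as URLs),
entities and comments alone. It deliberately ignores everything else
discussed above (WordCharacters, stemming, metanames, phrases), and the
class and function names are made up for the example.

```python
import re
from html.parser import HTMLParser

class Highlighter(HTMLParser):
    """Copy HTML through unchanged, wrapping search terms found in text in <b> tags."""

    def __init__(self, terms):
        super().__init__(convert_charrefs=False)
        words = "|".join(re.escape(term) for term in terms)
        self.pattern = re.compile(rf"\b({words})\b", re.IGNORECASE)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())   # raw tag text, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(self.pattern.sub(r"<b>\1</b>", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def highlight(html_text, terms):
    """Return html_text with each term in terms bolded wherever it occurs as a word."""
    if not terms:
        return html_text
    parser = Highlighter(terms)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

In a real front end the list of terms would come from the "Parsed Words:"
header rather than from the raw query string. Text inside script and style
elements is also treated as text here, which you would probably want to
exclude.
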
=head3 Do filters affect the performance during search?

No. Filters (FileFilter or via "prog" method) are only used for building
the search index database. During search requests there will be no
filter calls.

=head2 I have read the FAQ but I still have questions about using Swish-e.

The Swish-e discussion list is the place to go: http://swish-e.org/.
Please do not email developers directly. The list is the best place to
ask questions.

Before you post please read I<QUESTIONS AND TROUBLESHOOTING> located
in the L<INSTALL|INSTALL> page. You should also search the Swish-e
discussion list archive which can be found on the swish-e web site.

In short, be sure to include the following when asking for help.

=over 4

=item * The swish-e version (./swish-e -V)

=item * What you are indexing (and perhaps a sample), and the number
of files

=item * Your Swish-e configuration file

=item * Any error messages that Swish-e is reporting

=back

=head1 Document Info

$Id: SWISH-FAQ.pod,v 1.24 2002/08/20 22:24:08 whmoseley Exp $

.