=head1 NAME

INSTALL - Swish-e Installation Instructions

=head1 OVERVIEW

This document describes how to download, build, and install Swish-e.
It also describes how to build Swish-e with optional, yet recommended, libraries that
extend and enhance Swish-e.

This document also provides instructions on how to get help installing
and using Swish-e (and the important information you should provide when asking for help).

It closes with a basic overview of using Swish-e to index documents, with pointers to
other more advanced examples.

For those in a hurry, see L<"Quick Start for the Impatient">.

=head1 SYSTEM REQUIREMENTS

Swish-e 2.x is written in C and, to date, has been tested on
Solaris 2.6, AIX 4.3.2, OpenVMS 7.2-1 AXP, RedHat Linux 6.2 (and other
Linux distributions), and Win32 platforms.

Unless you are using the Win32 binary distribution, a C compiler is needed.
Pretty much any standard compiler should do, although you will probably
have the best luck with a current version of gcc. If you are using something
else (such as the compilers supplied with HP-UX or AIX) you may see more
warnings during the build process. Please report any problems to the Swish-e
discussion list after searching the list archives.

B<libxml2>

http://www.xmlsoft.org/

Swish-e 2.2 can (and probably should) use the libxml2 library for parsing
HTML and XML files. Instructions for installing and enabling the library
are described below.

Currently, the libxml2 library is not required, but it is a much better
parser than the tired old Swish-e HTML parser (html.c). Please see
the Swish-e FAQ L<SWISH-FAQ|SWISH-FAQ> for more discussion of the use
of libxml2.

Swish-e's old xml.c parser has been rewritten to use James Clark's Expat
library (included with the Swish-e distribution), but Swish-e's old
html.c code is still broken in a number of ways. Libxml2 is comparable to
Expat, but offers a much better HTML parser than Swish-e's html.c parser.
Use libxml2 if possible for parsing HTML and XML.

Currently, setting a content type
(L<IndexContents|SWISH-CONFIG/"item_IndexContents"> or L<DefaultContents|SWISH-CONFIG/"DefaultContents">)
of "HTML" uses Swish-e's html.c parser, whereas a setting of "HTML2" uses libxml2's HTML parser.
Likewise, a setting of "XML" uses the included Expat library, whereas "XML2"
uses libxml2 for parsing XML. All this may change in future releases.

B<zlib compression>

http://www.gzip.org/zlib/

Swish-e can make use of zlib to compress document properties. This is recommended
if you are using L<StoreDescription|SWISH-CONFIG/"item_StoreDescription">.

A Swish-e program built with zlib will read an index from a version of Swish-e that
was not built with zlib. But if you are searching an index that was compressed with
zlib, then you will need to use a version of Swish-e built with zlib. Therefore, it's
recommended to always include zlib support.

B<Memory>

Swish-e needs quite a bit of memory while indexing. How much depends
on what you are indexing. The index is portable between platforms,
so you can index on a machine that has lots of memory available and
move the index files to another machine for searching. Use the C<-e>
switch if you are short on memory.

B<Perl modules>

http://www.cpan.org

http://search.cpan.org

Swish-e uses a Perl script for spidering web sites. The script
requires the LWP bundle of modules (see http://search.cpan.org/search?dist=libwww-perl ).
Depending on your Perl installation, you might need to install additional modules required
by LWP; for requirements and downloads check http://www.cpan.org
or http://search.cpan.org. The Perl helper script was tested with
Perl 5.005, 5.6.0, and 5.6.1, although it should probably work with any version 5 release.
Do note that the LWP, HTTP, and HTML modules are updated often for bug
fixes and such -- do check for upgrades, and don't expect that your system admin
has been keeping up with bug fixes.

=head2 Platform Specific Information

A C<configure> script is used to determine platform-specific details
for building Swish-e. Please contact the Swish-e discussion list if you
notice any platform-specific problems while building Swish-e.

Specific information for various platforms can be found in subdirectories
of the C<src> directory. For example, the Win32 files can be found
in C<src/win32>, and instructions for building under VMS can be found
in C<src/vms>.

The Windows binary is distributed as a separate package from the source
distribution. See http://Swish-e.org for download information.

=head1 INSTALLATION

Instructions below are for installing Swish-e from source.
Installing from source is recommended, but you should also check
the Swish-e web site for binary distributions for your platform.

Windows binary distributions are available from the Swish-e site.

=head2 Brief Instructions

    ./configure
    make
    make test
    su root
    make install

Swish-e uses a F<configure> script to generate a Makefile for your platform.
The F<configure> script should detect and use optional libraries if found on
your system.

=head2 Using libxml2 parser library (optional, but recommended)

Daniel Veillard's libxml2 is a well-supported library for working with
HTML and XML documents. As of version 2.2 Swish-e can use libxml2 to parse HTML and
XML documents.

Installing the libxml2 library is not required at this time, but is
recommended, especially if you are parsing HTML. As mentioned above,
the XML parser that is included with Swish-e uses James Clark's Expat
library and works well. The HTML parser in Swish-e has been in use for
years, but the parser provided by libxml2 is preferred. The libxml2
HTML parser offers more features (and more features for parsing XML), and
is more accurate. If you are running Linux it may already be installed
(look for libxml2.so.2.4.5 or higher).

The library can be downloaded from http://www.xmlsoft.org/. Installation
directions are included in the INSTALL file in the libxml2 package.
Uncompressing, building, and installing libxml2 is very similar to
the way Swish-e is built.

Many Linux distributions provide libxml2 packages directly via RPM or
the Debian package system. Check your distribution's web site for
more information, as this is a very easy way to install this library.

If libxml2 complains during compilation that it cannot find zlib then
you may need to specify the location of zlib. This happens (on Solaris)
when the ./configure script finds the zlib header files, but the compiler
and linker do not know to look in /usr/local/lib for the library.
You may see an error like:

    ld: fatal: library -lz: not found
    ld: fatal: File processing errors. No output written to .libs/libxml2.so.2.4.5
    *** Error code 1

In this case, try specifying where zlib can be found. For example,
if libz is located in /usr/local/lib you would use this when building
B<libxml2>:

    # building libxml2 (not swish)
    ./configure --with-zlib=/usr/local

Swish-e doesn't use libxml2's decompression features, so you *should*
be able to disable zlib when building B<libxml2>:

    ./configure --without-zlib

B<NOTE:> That doesn't seem to work at this time (as of version
libxml2-2.4.5).

If you do not have root access you can specify a prefix when building B<libxml2>:

    ./configure --prefix=$HOME/local

This will install the headers and library files in F<$HOME/local/include>
and F<$HOME/local/lib>. You will need to inform the Swish-e build
process of this non-standard directory location (explained below).

Once you have run the libxml2 F<configure> script, build and install the library
as the libxml2 F<INSTALL> page instructs:

    make
    make install

B<Building Swish-e with libxml2>

Swish-e will try to detect if libxml2 is installed in the standard library locations.

If libxml2 is installed on your system and you do B<not> want to build with libxml2:

    ./configure --without-libxml2

If libxml2 was installed in a non-standard location then specify the
path where libxml2 was installed. For example:

    ./configure --with-libxml2=$HOME/local

If libxml2 is installed in a non-standard location, Swish-e needs to know
where that library is at run time. There are a number of ways
to do this. First, you can set the environment variable C<LD_RUN_PATH>
*before* running make to create Swish-e. This will add the path directly
to the Swish-e executable file.

For example, under Bourne-type shells:

    LD_RUN_PATH=$HOME/local/lib make

Other shells (like csh and tcsh) may require:

    setenv LD_RUN_PATH $HOME/local/lib
    make

Another option is to use the C<LD_LIBRARY_PATH> environment variable.
This is a list of directories to search for libraries when a program
is run. See the ld.so(8) man page on Linux (or your platform's equivalent)
for more info.

Note that libxml2 will be linked as a shared library on many platforms, so once you
compile Swish-e to use the library, the libxml2 library must not be
deleted or moved.

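One way to confirm that the run-time linker can locate the libraries is to
list the shared libraries the finished binary resolves. The following is only
a sketch: it assumes a platform that provides C<ldd> (Linux and Solaris do,
others may not) and that the binary was built as F<src/swish-e>. The
C<SWISH_BIN> variable is a name introduced here for illustration.

```shell
#!/bin/sh
# SWISH_BIN is a name made up for this sketch; point it at your own build.
SWISH_BIN="${SWISH_BIN:-./src/swish-e}"

if [ -x "$SWISH_BIN" ] && command -v ldd >/dev/null 2>&1; then
    # A "not found" entry means the run-time linker cannot locate that library.
    ldd "$SWISH_BIN" | grep -E 'libxml2|libz|not found' \
        || echo "no libxml2 or zlib dependency listed"
    ldd_status=checked
else
    echo "skipped: $SWISH_BIN or ldd is not available"
    ldd_status=skipped
fi
```

If a library is reported as "not found", rebuilding with C<LD_RUN_PATH> set, or
exporting C<LD_LIBRARY_PATH> before running Swish-e, are the two remedies
described in this section.
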
=head2 Building Swish-e with zlib

Building with zlib is similar to the instructions for building Swish-e
with libxml2 above. The F<configure> script will attempt to detect if zlib is
installed on your system and, if found, link Swish-e with the zlib library.

zlib is common on many systems, but may be out of date, and versions prior to 1.1.4
have a known security issue. You should run
at least version 1.1.4. To link with zlib in a non-standard location use,
for example:

    ./configure --with-zlib=$HOME/zlib

Again, as with compiling libxml2, you may need to use the C<LD_RUN_PATH>
or C<LD_LIBRARY_PATH> variables. See above for more details.

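To check whether an installed zlib meets the 1.1.4 minimum, one option is to
read the C<ZLIB_VERSION> macro from F<zlib.h> and compare it. The sketch below
is illustrative only: the C<version_ge> helper is written for this example, and
the two header locations are guesses that may not match your system.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is at least version B.
# (A helper written for this sketch; it is not part of Swish-e or zlib.)
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# The header locations below are guesses; adjust them for your system.
for hdr in /usr/include/zlib.h /usr/local/include/zlib.h; do
    [ -r "$hdr" ] || continue
    ver=$(sed -n 's/^#define ZLIB_VERSION "\([^"]*\)".*/\1/p' "$hdr")
    [ -n "$ver" ] || continue
    if version_ge "$ver" 1.1.4; then
        echo "$hdr: zlib $ver is recent enough"
    else
        echo "$hdr: zlib $ver is older than 1.1.4, please upgrade"
    fi
done
```

The comparison is numeric per component, so 1.2.11 is correctly treated as
newer than 1.2.8, which a plain string comparison would get wrong.
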
=head2 Downloading and unpacking and building Swish-e

If you are reading this INSTALL document, then you probably already have
downloaded and unpacked the distribution. But just in case...

Make sure you are using the current release from
http://Swish-e.org. If you have any questions about which version to use, please
ask on the Swish-e discussion list.

How you download Swish-e is up to you: lynx, lwp-download, and
wget are all common methods.

=over 3

=item 1 Uncompress the distribution file

    gzip -dc swish-e.x.x.tar.gz | tar xof -

or on some versions of tar, simply

    tar -zxof swish-e.x.x.tar.gz

Uncompressing should create the following directories:

    swish-e-x.x/              configure script and top-level Makefile
    swish-e-x.x/pod/          Swish-e documentation
    swish-e-x.x/html/         HTML version of the documentation
    swish-e-x.x/src/          source code
    swish-e-x.x/conf/         example configuration files and stopword files
    swish-e-x.x/example/      working example CGI scripts
    swish-e-x.x/filter-bin/   filter samples
    swish-e-x.x/prog-bin/     a web spider and other -S prog examples
    swish-e-x.x/perl/         perl interface to the Swish-e C library
    swish-e-x.x/src/expat/    James Clark's Expat XML parser
    swish-e-x.x/src/win32/    win32 binary and build files
    swish-e-x.x/src/vms/      files required for building under VMS
    swish-e-x.x/tests/        tests used for running "make test"
    swish-e-x.x/doc/          directory used for building the documentation

=item 2 Make any needed changes in F<src/config.h>

Compile-time configuration settings are adjusted in the file
F<src/config.h>. Most of the settings may also be specified in the
configuration file used during indexing.

You probably will B<not> need to change this file, but it's helpful
to become familiar with the default compiled-in settings.

=item 3 Build Swish-e

Building Swish-e on most systems is a simple procedure. In the
swish-e-x.x/ top-level directory type the following commands:

    ./configure
    make
    make test

You should build Swish-e as a normal user (i.e. not as "root").

Note: If you wish to use libxml2 or zlib please see the previous sections
for the required configure options.

The above will create the Swish-e executable F<src/swish-e> and test
that the executable is working correctly. C<make test> will generate
an index file in the F<tests> directory and run a number of searches
against this index. At this time, the tests really just make sure that swish-e
was compiled correctly and runs.

You may optionally "build" the F<swish-search> executable. This is
a version of Swish-e that cannot write to the index file. This
version may provide somewhat improved security in a CGI environment.
The binaries F<swish-e> and F<swish-search> are the same files -- the
additional security is enabled when the binary is named I<swish-search>.
F<swish-search> is not a substitute for good file system and CGI security.
Please review the many CGI security papers available on-line.

Again, this is an optional step:

    make swish-search

which simply copies the file F<swish-e> to F<swish-search>.

=item 4 Install Swish-e

Move the F<swish-e> (and/or F<swish-search>) executable to its final
location (normally /usr/local/bin). You may simply copy the program
anywhere you see fit, or you may use the C<make install> command to
install it to the location defined by the F<configure> script.

You may need superuser privileges:

    su root
    make install
    exit

B<IMPORTANT:> Do not run swish-e as the superuser (root).

The bin directory may be set when first running F<./configure>. For example:

    ./configure --bindir=$HOME/bin

sets the installation directory to F<$HOME/bin> and C<make install>
will install the program in that location.

=back

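When installing into a private directory such as F<$HOME/bin>, the shell will
only find C<swish-e> by name if that directory is listed in C<PATH>. The
following sketch checks for that; the C<on_path> helper is made up for this
example, and F<$HOME/bin> is assumed to be the value given to C<--bindir>.

```shell
#!/bin/sh
# on_path DIR: succeed when DIR is one of the entries in $PATH.
# (A helper made up for this sketch.)
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

bindir="$HOME/bin"   # must match the --bindir value given to ./configure
if on_path "$bindir"; then
    echo "$bindir is on PATH"
else
    echo "$bindir is not on PATH; run swish-e by its full path or extend PATH"
fi
```

Wrapping C<PATH> in colons before matching avoids false positives from
directories that merely share a prefix, such as F<$HOME/bin-old>.
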
=head2 Join the Swish-e discussion list

The Swish-e discussion list is the place to ask questions about installing
and using Swish-e, to see or post bug fixes or security announcements, and
a place where B<you> can offer help to others.

The list is typically I<very low traffic>, so it won't overload your
inbox. Please take time to subscribe. See http://Swish-e.org.

If you are using Swish-e on a public site, please let the list know so
it can be added to the list of sites that use Swish-e!

Please review L<QUESTIONS AND TROUBLESHOOTING|/"QUESTIONS AND TROUBLESHOOTING"> before posting
a question to the Swish-e list.

=head2 Installing the Swish-e C Library (optional)

Swish-e 2.2 creates the C library F<libswish-e.a> during the build.
Install this library if you wish to embed Swish-e into another
application. For example, the library should be installed
before using the high-level Perl SWISH modules located on
CPAN: http://search.cpan.org/search?mode=module&query=SWISH

This is an *optional* step. Most users will not need to install the library.

To install the library issue the following commands (again, you may need
to su root):

    su root
    make install-lib
    exit

By default this will install the library in /usr/local/lib, but this
directory can be set when running ./configure with the --libdir option.
For example:

    ./configure --bindir=$HOME/bin --libdir=$HOME/lib

So C<make install> will install the F<swish-e> binary in F<$HOME/bin>
and C<make install-lib> will install the F<libswish-e.a> library in
F<$HOME/lib>.

Note: You may wish to run C<make realclean> before running ./configure again.

=head2 Creating PDF and Postscript documentation (optional)

The Swish-e documentation in HTML format was created with Pod::HtmlPsPdf,
a package of Perl modules written and/or modified by Stas Bekman to automate
the conversion of documents in pod format (see perldoc perlpod) to HTML,
Postscript, and PDF. A slightly modified version of this package is
included with the Swish-e distribution and used for building the HTML.

If your system has the B<necessary tools> to build Postscript and the
converter ps2pdf installed, you may be able to build the Postscript
and PDF versions of the documentation. After you have run ./configure,
type from the top-level directory of the distribution:

    make pdf

And with any luck you will end up with these two files in the top-level directory:

    swish-e_documentation.pdf
    swish-e_documentation.ps

Most people find reading the documentation in HTML most convenient.

=head2 Installing the Swish-e documentation as man(1) pages (optional)

Part of the included Swish-e documentation can be installed as system
man(1) pages. Only the reference-related pages are installed (it's
assumed that you don't need to install the README or INSTALL documents as
man pages). You must have the pod2man program installed on your system
(which you probably do if you have Perl).

To build the man pages and install them into your system, type from the
top-level directory (after running ./configure):

    su root
    make install-man
    exit

You will need to C<su root> if you do not have write access to the man directory.

The man pages are installed in the system man directory. This directory
is determined by ./configure and can be overridden by passing the
directory when running ./configure.

For example:

    ./configure --mandir=/usr/local/doc/man

Information on running ./configure can be found by typing:

    ./configure --help

The pod source files used to create the man files were written
under Perl 5.6.1. Older versions of Perl may complain slightly about the
formatting of the pod files. This shouldn't be a problem, but please
let the Swish-e list know if it is. Then upgrade your version of Perl. ;)

=head1 QUESTIONS AND TROUBLESHOOTING

Please search the Swish-e list archive before posting a question, and
check the L<SWISH-FAQ|SWISH-FAQ> to see if your question has already
been asked.

Support for installation, configuration and usage is available via the
Swish-e discussion list. Visit http://swish-e.org for information.
Do not contact developers directly for help -- always post your question
to the list.

Before posting, use the tools available to narrow down the problem.

Swish-e has the -T, -v, and -k switches that may help resolve issues.
If possible find a single document that shows the problem, then index
with -T INDEXED_WORDS and watch the exact words that are indexed.
Use -H 9 when searching and look at C<Parsed Words:> to make sure you
are searching the correct words.

You can also use programs like C<gdb> to help find segfaults and other
run-time errors, and programs like C<truss> or C<strace> can often
provide interesting information, if you are adventurous.

=head2 When posting please provide the following information:

=over 4

=item *

The exact version of Swish-e that you are using. Running Swish-e with the
C<-V> switch will print the version number. Also, supply the output from
C<uname -a> or a similar command that identifies the operating system you
are running on. If you are running an old version of Swish-e be prepared
for a response to your question of "upgrade."

=item *

A summary of the problem. This should include the commands issued
(e.g. for indexing or searching) and their output, and why you don't
think it's working correctly. Please cut-n-paste the exact commands
and their output instead of retyping to avoid errors.

=item *

Include a copy of the configuration file you are using, if any. Swish-e
has reasonable defaults so in many cases you can run it without using
a configuration file. But if you need to use a configuration file,
reduce it down to the absolute minimum number of commands required to
demonstrate your problem. Again, cut-n-paste.

=item *

A small copy of a source document that demonstrates the problem.

If you are having problems spidering a web server, use lwp-download or
wget to copy the file locally to make sure you can index the document
using the file system method.

If you do need help with spidering, don't post fake URLs, as it makes it
impossible to help. If you don't want to expose your web page to the
people on the Swish-e list, find some other site to test spidering on.
If that works, but you still cannot spider your own site, then post your
real URL if you want help.

=item *

If you are having trouble building Swish-e please cut-n-paste the output
from make (or from ./configure if that's where the problem is).

=back

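The version and operating system details requested above can be gathered in
one step. The script below is only a sketch: the C<SWISH_BIN> variable and the
report file name are assumptions made for this example, and the script simply
records that the binary is missing if C<swish-e> is not on your C<PATH>.

```shell
#!/bin/sh
# Collect the details the list asks for into one file.
# SWISH_BIN and the report name are assumptions made for this sketch.
SWISH_BIN="${SWISH_BIN:-swish-e}"
report="${TMPDIR:-/tmp}/swish-e-report.txt"

{
    echo "== Operating system =="
    uname -a
    echo "== Swish-e version =="
    if command -v "$SWISH_BIN" >/dev/null 2>&1; then
        "$SWISH_BIN" -V
    else
        echo "$SWISH_BIN not found on PATH"
    fi
} > "$report" 2>&1

echo "Wrote $report"
```

Paste the contents of the report into your message along with the exact
commands, configuration file, and sample document described in the list above.
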
=head1 BASIC CONFIGURATION AND USAGE

This section should give you a basic overview of indexing and searching
with B<Swish-e>. Other examples can be found in the F<conf> directory, which will
step you through a number of different configurations.
Also, please review the L<SWISH-FAQ|SWISH-FAQ>.

Swish-e reads a configuration file (see L<SWISH-CONFIG|SWISH-CONFIG>)
for directives that control what and how Swish-e indexes files.
Running Swish-e is then controlled by command line arguments (see
L<SWISH-RUN|SWISH-RUN>).

Swish-e does not require a configuration file, but
most people need to change the default behavior by placing settings
in a configuration file.

To try the examples below change to the F<tests> subdirectory of the
distribution. The tests will use the *.html files in this directory when
creating the test index. You may wish to review these *.html files to
get an idea of the various native file formats that Swish-e supports.

=head2 Step 1: Create a Configuration File

The configuration file controls what and how Swish-e indexes. The
configuration file consists of directives, comments, and blank lines.
The configuration file can have any name you like.

This example will work with the documents in the F<tests> directory.
You may wish to review the F<tests/test.config> configuration file used
for the C<make test> tests.

For example, a simple configuration file (F<swish-e.conf>):

    # Example Swish-e Configuration file

    # Define *what* to index
    # IndexDir can point to directories and/or files

    # Here it's pointing to the current directory
    IndexDir .

    # But only index the .html files
    IndexOnly .html

    # Show basic info while indexing
    IndexReport 1

And that's a simple configuration file. It says to index all the
.html files in the current directory, and provide some basic output
while indexing.

The complete list of configuration file directives is described
in L<SWISH-CONFIG|SWISH-CONFIG>.

=head2 Step 2: Index your Files

Now, make sure you are in the F<tests> directory and save the above
example configuration file as F<swish-e.conf>. Then run Swish-e using
the C<-c> switch to specify the name of the configuration file.

    ../src/swish-e -c swish-e.conf

    Indexing Data Source: "File-System"
    Indexing "."
    Removing very common words...
    no words removed.
    Writing main index...
    Sorting words ...
    Sorting 55 words alphabetically
    Writing header ...
    Writing index entries ...
    Writing word text: Complete
    Writing word hash: Complete
    Writing word data: Complete
    55 unique words indexed.
    Writing file list ...
    Property Sorting complete.
    Writing sorted index ...
    5 files indexed. 1252 total bytes.
    Elapsed time: 00:00:00 CPU time: 00:00:00
    Indexing done!

This created the index file F<index.swish-e>. This is the default
index file name unless the B<IndexFile> directive is specified in the
configuration file:

    IndexFile ./website.index

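The indexing step can also be scripted. The following sketch writes a
configuration that uses B<IndexFile>, indexes one small document, and then
searches the named index with the C<-f> switch. The scratch directory, the
sample document, and the C<SWISH_BIN> variable are all assumptions made for
this example; if no C<swish-e> binary is found the script only writes the files.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the distribution is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One tiny document to index (content invented for this sketch).
cat > sample.html <<'EOF'
<html><head><title>Sample page</title></head>
<body>This is a sample document.</body></html>
EOF

# The directives from Step 1, plus IndexFile.
cat > swish-e.conf <<'EOF'
IndexDir .
IndexOnly .html
IndexReport 1
IndexFile ./website.index
EOF

# SWISH_BIN is an assumption: set it to the path of your swish-e binary.
SWISH_BIN="${SWISH_BIN:-swish-e}"
if command -v "$SWISH_BIN" >/dev/null 2>&1; then
    "$SWISH_BIN" -c swish-e.conf
    # -f names the index to search, since it is no longer index.swish-e
    "$SWISH_BIN" -w sample -f ./website.index
else
    echo "swish-e not found; configuration written to $workdir/swish-e.conf"
fi
```

Once the index has a non-default name, every search needs C<-f> to point at it.
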
=head2 Step 3: Search

You specify your search terms with the C<-w> switch. For example, to find
the files that contain the word B<sample> you would issue the command:

    ../src/swish-e -w sample

This example assumes that you are in the F<tests> directory, and the
Swish-e binary is in the F<../src> directory. Swish-e returns the
following in response to that command:

    ../src/swish-e -w sample

    # SWISH format: 2.2
    # Search words: sample
    # Number of hits: 2
    # Search time: 0.000 seconds
    # Run time: 0.005 seconds
    1000 ./test_xml.html "If you are seeing this, the METATAG XML search was successful!" 159
    1000 ./test.html "If you are seeing this, the test was successful!" 437
    .

So the word B<sample> was found in two documents. The first number
shown is the relevance or rank of the search term, followed by the file
containing the search term, the title of the document, and finally the
length of the document.

The period (".") alone at the end marks the end of results.

Much more information may be retrieved while searching by using
the C<-x> and C<-H> switches (see L<SWISH-RUN|SWISH-RUN>)
and by using Document Properties (see L<SWISH-CONFIG|SWISH-CONFIG>).

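Because each hit follows the same rank, path, title, length layout, the
default output is easy to post-process. The sketch below splits the sample
output shown in this step into fields with C<sed>. The C<parse_results>
function is a name made up for this example, and it assumes file paths that
contain no spaces.

```shell
#!/bin/sh
# Sample output copied from the search in this step; in practice you would
# pipe the output of swish-e into parse_results instead.
results='# SWISH format: 2.2
# Search words: sample
# Number of hits: 2
1000 ./test_xml.html "If you are seeing this, the METATAG XML search was successful!" 159
1000 ./test.html "If you are seeing this, the test was successful!" 437
.'

# parse_results: rewrite each hit as rank|path|length|title.
# Header lines (starting with #) and the closing "." do not match the
# pattern and are dropped. Assumes paths without spaces.
parse_results() {
    sed -n 's/^\([0-9][0-9]*\) \([^ ]*\) "\(.*\)" \([0-9][0-9]*\)$/\1|\2|\4|\3/p'
}

parsed=$(printf '%s\n' "$results" | parse_results)
printf '%s\n' "$parsed"
```

The title is placed last so that any "|" characters inside a title do not
disturb the first three fields.
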
=head2 Phrase Searching

To search for a phrase in a document use double-quotes to delimit your
search terms. (The phrase delimiter is set in src/swish.h.)

You must protect the quotes from the shell.

For example, under Unix:

    swish-e -w '"this is a phrase" or (this and that)'
    swish-e -w 'meta1=("this is a phrase") or (this and that)'

Or under the Windows F<command.com> shell:

    swish-e -w \"this is a phrase\" or (this and that)

The phrase delimiter can be set with the C<-P> switch.

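To see why the quotes need protecting, it can help to look at what a Bourne
type shell actually hands to a program. The sketch below uses a small
C<count_args> function, invented for this example, in place of C<swish-e>. The
parentheses are left out of the unprotected call because a bare "(" is a
syntax error in the shell.

```shell
#!/bin/sh
# count_args prints how many arguments it received, then each one in brackets.
count_args() {
    echo "$# argument(s)"
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Single quotes keep the double quotes and parentheses intact: the whole
# query arrives as one argument after -w.
quoted=$(count_args -w '"this is a phrase" or (this and that)')
printf '%s\n' "$quoted"

# Without the single quotes the shell strips the double quotes itself, so
# the phrase delimiter never reaches the program.
unquoted=$(count_args -w "this is a phrase" or this and that)
printf '%s\n' "$unquoted"
```

In the first call the program sees two arguments and the double quotes
survive; in the second it sees six and the quotes are gone.
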
=head2 Boolean Searching

You can use the Boolean operators B<and>, B<or>, or B<not> in searching.
Without these Boolean operators, Swish-e will assume you're B<and>ing the words together.

Here are some examples:

    ../src/swish-e -w 'apples oranges'
    ../src/swish-e -w 'apples and oranges'    # same thing

    ../src/swish-e -w 'apples or oranges'

    ../src/swish-e -w 'apples or oranges not juice' -f myIndex

The last example first retrieves the files that contain either of the words
"apples" or "oranges", then among those the ones that do not contain the word "juice".

A few others to ponder:

    ../src/swish-e -w 'apples and oranges or pears'
    ../src/swish-e -w '(apples and oranges) or pears'    # same thing
    ../src/swish-e -w 'apples and (oranges or pears)'    # not the same thing

See L<SWISH-SEARCH|SWISH-SEARCH> for more information.

=head2 Context Searching

The C<-t> option in the search command line allows you to search for
words that exist only in specific HTML tags. Each character in the
string you specify in the argument to this option represents a different
tag in which the word is searched; that is, you can use any combination
of the following characters:

    H means all <HEAD> tags
    B stands for <BODY> tags
    t is all <TITLE> tags
    h is <H1> to <H6> (header) tags
    e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
    c is HTML comment tags (<!-- ... -->)

For example:

    # Find only documents with the word "linux" in the <TITLE> tags.
    ./swish-e -w linux -t t

    # Find the word "apple" in titles or comments
    ./swish-e -w apple -t tc

=head2 META Tags |
716 |
|
717 |
For the last example we will instruct Swish-e to use META tags to define |
718 |
I<fields> in your documents. |
719 |
|
720 |
META names are a way to define "fields" in your documents. You can |
721 |
use the META names in your queries to limit the search to just the words |
722 |
contained in that META name of your document. For example, you might have |
723 |
a META tagged field in your documents called C<subjects> and then you can |
724 |
search your documents for the word "foo" but only return documents where |
725 |
"foo" is within the C<subjects> META tag. |
726 |
|
727 |
Document I<Properties> are somewhat related to meta tags: Properties |
728 |
allow the contents of a META tag in a source document to be stored within |
729 |
the index, and that text to be returned along with search results. |
730 |
|
731 |
META tags can have two formats in your documents. |
732 |
|
733 |
<META NAME="keyName" CONTENT="some Content"> |
734 |
|
735 |
And in XML format |
736 |
|
737 |
<keyName> |
738 |
Some Content |
739 |
</keyName> |
740 |
|
741 |
If using libxml, you can optionally use a non-html tag as a metaname: |
742 |
|
743 |
<html> |
744 |
<body> |
745 |
Hello swish users! |
746 |
<keyName> |
747 |
this is meta data |
748 |
</keyName>. |
749 |
</body> |
750 |
|
751 |
This, of course, is invalid HTML. |
752 |
|
To continue with our sample F<swish-e.conf> file, add the following lines:

    # Define META tags
    MetaNames meta1 meta2 meta3

Reindex to include the changes:

    ../src/swish-e -c swish-e.conf

Now search, but this time limit your search to META tag "meta1":

    ../src/swish-e -w 'meta1=metatest1'

Again, please see L<SWISH-RUN|SWISH-RUN> and L<SWISH-CONFIG|SWISH-CONFIG>
for complete documentation of the various indexing and searching options.
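
The C<meta1=metatest1> query only returns a document that carries a matching META tag. As a minimal sketch under stated assumptions, the script below builds such a document and a separate small configuration for it. The names F<metademo>, F<meta1.html>, F<metademo.conf> and F<metademo.index> are hypothetical, and C<SWISH> is assumed to point at your swish-e binary.

```shell
#!/bin/sh
# Sketch only: all file and directory names below are made up.
mkdir -p metademo

# A document with a META tag named "meta1" whose content is "metatest1".
# The word "metatest1" appears nowhere else in the file.
cat > metademo/meta1.html <<'EOF'
<html>
<head>
<title>META tag demo</title>
<META NAME="meta1" CONTENT="metatest1">
</head>
<body>
This body text does not contain the META content word.
</body>
</html>
EOF

# A small configuration that indexes only the demo directory and
# declares the same META names as the sample configuration.
cat > metademo.conf <<'EOF'
IndexDir metademo
IndexFile metademo.index
MetaNames meta1 meta2 meta3
EOF

# Path to the swish-e binary (assumption); adjust to your build tree.
SWISH=${SWISH:-../src/swish-e}

if [ -x "$SWISH" ]; then
    "$SWISH" -c metademo.conf
    "$SWISH" -f metademo.index -w 'meta1=metatest1'
else
    echo "swish-e not found at $SWISH; created the demo files only"
fi
```

Keeping the demo in its own index file means the experiment does not overwrite the index built from the sample configuration.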
|
=head2 Additional Examples

The above example indexes local files using the file system access method
C<-S fs>.  You may also index files that are located on a local or remote
web server by using the HTTP access method C<-S http>, or via the prog
input method C<-S prog>.  These methods are described in L<SWISH-RUN|SWISH-RUN>,
and example configuration files for using them can be found in
the F<conf> directory of the Swish-e distribution.

The C<-S prog> access method can be used to index any type of document,
such as documents stored in a database (RDBMS), or documents that need
to be processed before they can be indexed.  Examples for using the
C<-S prog> method are shown in the F<prog-bin> directory.
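
A program used with C<-S prog> writes each document to standard output, preceded by a short header block. The sketch below assumes the header names C<Path-Name> and C<Content-Length> followed by a blank line, as described in L<SWISH-RUN|SWISH-RUN>; check that document for the exact requirements of your version. The function name C<emit_doc> and the F<progdemo> directory are hypothetical.

```shell
#!/bin/sh
# Sketch of a "-S prog" style document source written in shell.

# emit_doc FILE: print a header block, a blank line, then the file's bytes.
emit_doc() {
    file=$1
    # Content-Length must be the exact number of bytes that follow.
    size=$(wc -c < "$file" | tr -d ' ')
    printf 'Path-Name: %s\n' "$file"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'
    cat "$file"
}

# Demo input: two small text files (hypothetical names).
mkdir -p progdemo
printf 'first document\n'  > progdemo/one.txt
printf 'second document\n' > progdemo/two.txt

# Emit every demo file as one continuous stream on standard output.
for f in progdemo/*.txt; do
    emit_doc "$f"
done
```

Saved as an executable script and named in C<IndexDir>, a program like this would be run with C<-S prog>, in the same way that F<spider.pl> is used in the Quick Start section.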
|
Swish-e can also use I<filters> to convert documents as they are
indexed.  For example, MS-Word or PDF documents can be converted to
text by a filter and then indexed by Swish-e.  See the section on
filters in L<SWISH-CONFIG|SWISH-CONFIG>, and the examples shown in the
F<filter-bin> directory.
|
=head1 QUICK START FOR THE IMPATIENT

Here's I<one> example of the steps to install Swish-e, index documents by
spidering, and search using the included CGI script.

These steps are for Linux, and assume that the libxml2 and zlib libraries
are installed on the system, that you have a current version of Perl with
current versions of the LWP, HTML::*, and HTTP::* modules, and that Apache
is installed and running.

If you have any trouble with these instructions, please read the detailed
installation instructions above, and see the documentation included with
the F<swish.cgi> script and the F<spider.pl> program.
Please don't ask for help without reading the "real" documentation first.

Not all output is included below.  You should carefully watch for errors while building Swish-e.
|
=over 4

=item 1 Download and build Swish-e

    ~ $ wget http://swish-e.org/<path to current swish-e version>.tar.gz
    ~ $ tar zxof <path to current swish-e version>.tar.gz
    ~ $ cd swish-e-2.2    (this directory will depend on the version of Swish-e)

    ~/swish-e-2.2 $ ./configure
    ~/swish-e-2.2 $ make
    ~/swish-e-2.2 $ make test
    ...
    ** All tests completed! **

=item 2 Make a working directory and copy files

    ~/swish-e-2.2 $ mkdir ~/swishtest
    ~/swish-e-2.2 $ cd ~/swishtest

    ~/swishtest $ cp ~/swish-e-2.2/src/swish-e .
    ~/swishtest $ cp ~/swish-e-2.2/prog-bin/spider.pl .
    ~/swishtest $ cp ~/swish-e-2.2/example/swish.cgi .
    ~/swishtest $ cp -rp ~/swish-e-2.2/example/modules/ .
    ~/swishtest $ chmod 755 swish.cgi spider.pl
    ~/swishtest $ chmod 644 modules/*
|
=item 3 Create the index

You must create a swish configuration file and a spider configuration
file.

    ~/swishtest $ cat swish.conf

    # Program to read documents
    IndexDir ./spider.pl

    # Define the config file for the spider to use
    SwishProgParameters spider.conf

    # Use libxml2 for parsing documents
    DefaultContents HTML2
    IndexContents TXT2 txt

    # Cache document contents in the index for context display
    StoreDescription HTML2 <body>


    ~/swishtest $ cat spider.conf

    # Example spider configuration file to index the
    # split version of the swish-e documentation

    @servers = (
        {

            base_url    => 'http://swish-e.org/2.2/docs/split/index.html',
            same_hosts  => [ qw/www.swish-e.org/ ],
            email       => 'swish-impatient@domain.invalid',
            delay_min   => .0001,

            # Define call-back functions to fine-tune the spider

            test_url => sub {
                my $uri = shift;

                # Skip requesting files that are probably not text
                return if $uri->path =~ m[\.(?:gif|jpeg|png)$]i;

                # Limit spidering to the /2.2/docs/split/ path
                return unless $uri->path =~ m[/2.2/docs/split/];

                return 1;  # otherwise, ok to search
            },

            # Only index text/html or text/plain
            test_response => sub {
                my ( $uri, $server, $response ) = @_;

                return $response->content_type =~ m[(?:text/html|text/plain)];
            },
        },
    );
    1;
|
Now begin indexing:

    ~/swishtest $ ./swish-e -S prog -c swish.conf -v 2
    Indexing Data Source: "External-Program"
    Indexing "./spider.pl"
    ./spider.pl: Reading parameters from 'spider.conf'
    Processing http://swish-e.org/2.2/docs/split/index.html...
    Processing http://swish-e.org/2.2/docs/split/index_long.html...
    Processing http://swish-e.org/2.2/docs/split/search.cgi..
    ...
    2566 unique words indexed.
    5 properties sorted.
    155 files indexed.  609775 total bytes.  49962 total words.
    Elapsed time: 00:00:33 CPU time: 00:00:01
    Indexing done!
|
=item 4 Test swish-e from the command line

    ~/swishtest $ ./swish-e -w foo -m 1
    # SWISH format: 2.1-dev-25
    # Search words: foo
    # Number of hits: 18
    # Search time: 0.000 seconds
    # Run time: 0.038 seconds
    1000 http://swish-e.org/2.2/docs/split/SWISH-CONFIG/Document_Contents_Directives.html "SWISH-CONFIG/Document Contents Directives" 57466
    .
|
=item 5 Test the CGI script from the command line

    ~/swishtest $ ./swish.cgi | head
    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>
    Search our site
    </title>
    </head>
    <body>

Refer to the F<swish.cgi> documentation if you have any problems running the CGI script.
|
=item 6 Configure Apache

    ~/swishtest $ su -c "ln -s $HOME/swishtest /usr/local/apache/htdocs/swishdocs"
    Password: *********

    ~/swishtest $ cat .htaccess
    # Deny everything by default
    Deny From All

    # But allow just the CGI script
    <files swish.cgi>
        Options ExecCGI
        Allow From All
        SetHandler cgi-script
    </files>
|
=item 7 Test from the command line

    ~/swishtest $ GET 'http://localhost/swishdocs/swish.cgi?query=install' | head
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>
    43 Results for [install]
    </title>
    </head>
    <body>
|
=back

Now you are ready to search.

=head1 Document Info

$Id: INSTALL.pod,v 1.19 2002/05/31 23:37:22 whmoseley Exp $

.