Annotation of /mitgcm.org/devel/buildweb/pkg/swish-e/pod/INSTALL.pod


Revision 1.1.1.1 - (hide annotations) (download) (vendor branch)
Fri Sep 20 19:47:29 2002 UTC (22 years, 10 months ago) by adcroft
Branch: Import, MAIN
CVS Tags: baseline, HEAD
Changes since 1.1: +0 -0 lines
Importing web-site building process.

1 adcroft 1.1 =head1 NAME
2    
3     INSTALL - Swish-e Installation Instructions
4    
5     =head1 OVERVIEW
6    
7     This document describes how to download, build and install Swish-e.
8     It also describes how to build Swish-e with optional, yet recommended, libraries that
9     extend and enhance Swish-e.
10    
11     This document also provides instructions on how to get help installing
12     and using Swish-e (and the important information you should provide when asking for help).
13    
14     Also, below is a basic overview of using Swish-e to index documents, with pointers to
15     other more advanced examples.
16    
17     For those in a hurry, see L<"Brief Instructions">.
18    
19     =head1 SYSTEM REQUIREMENTS
20    
21     Swish-e 2.x is written in C, and, up to this time, it has been tested on
22     Solaris 2.6, AIX 4.3.2, OpenVMS 7.2-1 AXP, RedHat Linux 6.2 (and other
23     Linux distributions) and Win32 platforms.
24    
25     Unless you are using the Win32 binary distribution, a C compiler is needed.
26     Pretty much any standard compiler should do, although you will probably
27     have the best luck with a current version of gcc. If you are using a different
28     compiler (such as the native HP-UX or AIX compilers) you may see more warnings during the build
29     process. Any problems should be sent to the Swish-e discussion list
30     after searching the list archives.
31    
32     B<libxml2>
33    
34     http://www.xmlsoft.org/
35    
36     Swish-e 2.2 can (and probably should) use the libxml2 library for parsing
37     HTML and XML files. Instructions for installing and enabling the library
38     are described below.
39    
40     Currently, the libxml2 library is not required, but is a much better
41     parser than the tired old Swish-e html parser (html.c). Please see
42     the Swish-e FAQ L<SWISH-FAQ|SWISH-FAQ> for more discussion of the use
43     of libxml2.
44    
45     Swish-e's old xml.c parser has been rewritten to use James Clark's Expat
46     library (included with the Swish-e distribution), but Swish-e's old
47     html.c code is still broken in a number of ways. Libxml2 is comparable to
48     Expat, but offers a much better HTML parser than Swish-e's html.c parser.
49     Use libxml2 if possible for parsing HTML and XML.
50    
51     Currently, setting a content type
52     (L<IndexContents|SWISH-CONFIG/"item_IndexContents"> or L<DefaultContents|SWISH-CONFIG/"DefaultContents">)
53     of "HTML" uses Swish-e's html.c parser, whereas a setting of "HTML2" uses libxml2's HTML parser.
54     Likewise, a setting of "XML" uses the included Expat library, whereas "XML2"
55     uses libxml2 for parsing XML. All this may change in future releases.
56    
57     B<zlib compression>
58    
59     http://www.gzip.org/zlib/
60    
61     Swish-e can make use of zlib to compress document properties. This is recommended
62     if you are using L<StoreDescription|SWISH-CONFIG/"item_StoreDescription">.
63    
64     A Swish-e program built with zlib will read an index from a version of Swish-e that
65     was not built with zlib. But, if you are searching an index that was compressed with
66     zlib then you will need to use a version of Swish-e built with zlib. Therefore, it's
67     recommended to always include zlib support.
68    
69    
70     B<Memory>
71    
72     Swish needs quite a bit of memory while indexing. How much depends
73     on what you are indexing. The index is portable between platforms,
74     so you can index on a machine that has lots of memory available and
75     move the index files to another machine for searching. Use the C<-e>
76     switch if you are short on memory.
77    
78     B<Perl modules>
79    
80     http://www.cpan.org
81    
82     http://search.cpan.org
83    
84     Swish-e uses a perl script for spidering web sites. The script
85     requires the LWP bundle of modules (see http://search.cpan.org/search?dist=libwww-perl ).
86     (Note: depending on your perl installation, you might need to install additional modules required
87     by LWP; for requirements and downloads check http://www.cpan.org
88     or http://search.cpan.org). The Perl helper script was tested with
89     perl 5.005, 5.6.0, and 5.6.1 although it should probably work with any version 5 release.
90     Do note that the LWP, HTTP, and HTML modules are updated often for bug
91     fixes and such -- do check for upgrades, and don't expect that your system admin
92     has been keeping up with bug fixes.
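
Before running the spider it can help to confirm that Perl can actually load the modules mentioned above. The following is a minimal sketch, not part of the Swish-e distribution: the helper name C<have_perl_module> is made up for illustration, and the three module names are only representative members of the LWP, HTTP, and HTML families.

```shell
# have_perl_module NAME: succeed only if perl exists and can load the module.
# (Hypothetical helper, not shipped with Swish-e.)
have_perl_module() {
    command -v perl >/dev/null 2>&1 || return 1
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Representative modules from the LWP, HTTP, and HTML families.
for module in LWP HTTP::Request HTML::Parser; do
    if have_perl_module "$module"; then
        echo "$module: installed"
    else
        echo "$module: missing (install it from CPAN)"
    fi
done
```

A module reported as missing can usually be installed from CPAN; which additional modules your spider needs depends on your Perl installation.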
93    
94    
95     =head2 Platform Specific Information
96    
97     A C<configure> script is used to determine platform specific details
98     for building swish. Please contact the Swish-e discussion list if you
99     notice any platform specific problems while building Swish-e.
100    
101     Specific information for various platforms can be found in subdirectories
102     of the C<src> directory. For example, the Win32 files can be found
103     in C<src/win32>, and instructions for building under VMS can be found
104     in C<src/vms>.
105    
106     The Windows binary is distributed as a separate package from the source
107     distribution. See http://Swish-e.org for download information.
108    
109     =head1 INSTALLATION
110    
111     Instructions below are for installing Swish-e from source.
112     Installing from source is recommended, but you should also check
113     the Swish-e web site for binary distributions for your platform.
114    
115     Windows binary distributions are available from the Swish-e site.
116    
117     =head2 Brief Instructions
118    
119     ./configure
120     make
121     make test
122     su root
123     make install
124    
125     Swish uses a F<configure> script to generate a Makefile for your platform.
126     The F<configure> script should detect and use optional libraries if found on
127     your system.
128    
129     =head2 Using libxml2 parser library (optional, but recommended)
130    
131     Daniel Veillard's libxml2 is a well supported library for working with
132     HTML and XML documents. As of version 2.2 Swish-e can use libxml2 to parse HTML and
133     XML documents.
134    
135     Installing the libxml2 library is not required at this time, but is
136     recommended, especially if you are parsing HTML. As mentioned above,
137     the XML parser that is included with swish uses James Clark's Expat
138     library and works well. The HTML parser in Swish-e has been in use for
139     years, but the parser provided by libxml2 is preferred. The libxml2
140     HTML parser offers more features (and more features for parsing XML), and
141     is more accurate. If you are running Linux it may already be installed
142     (look for libxml2.so.2.4.5 or higher).
143    
144     The library can be downloaded from http://www.xmlsoft.org/. Installation
145     directions are included in the INSTALL file in the libxml2 package.
146     Uncompressing, building, and installation of libxml2 is very similar to
147     the way Swish-e is built.
148    
149     Many Linux distributions provide libxml2 packages directly via RPM or
150     the Debian package system. Check your distribution's web site for
151     more information, as this is a very easy way to install this library.
152    
153     If libxml2 complains during compilation that it can not find zlib then
154     you may need to specify the location of zlib. This happens (on Solaris)
155     when the ./configure script finds the zlib header files, but the compiler
156     and linker do not know to look in /usr/local/lib for the library.
157     You may see an error like:
158    
159     ld: fatal: library -lz: not found
160     ld: fatal: File processing errors. No output written to .libs/libxml2.so.2.4.5
161     *** Error code 1
162    
163     In this case, try specifying where zlib can be found. For example,
164     if libz was located in /usr/local/lib you would use this when building
165     B<libxml2>:
166    
167     # building libxml2 (not swish)
168     ./configure --with-zlib=/usr/local
169    
170     Swish-e doesn't use libxml2's decompression features, so you *should*
171     be able to disable zlib when building B<libxml2>:
172    
173     ./configure --without-zlib
174    
175     B<NOTE:> But, that doesn't seem to work at this time (as of version
176     libxml2-2.4.5).
177    
178     If you do not have root access you can specify a prefix when building B<libxml2>:
179    
180     ./configure --prefix=$HOME/local
181    
182     This will install the headers and library files in F<$HOME/local/include>
183     and F<$HOME/local/lib>. You will need to inform the Swish-e build
184     process of this non-standard directory location (explained below).
185    
186     Once you run the libxml2 F<configure> script you build and install the library
187     as the libxml2 F<INSTALL> page instructs:
188    
189     make
190     make install
191    
192    
193     B<Building Swish-e with libxml2>
194    
195     Swish will try to detect if libxml2 is installed in the standard library locations.
196    
197     If libxml2 is installed in your system and you do B<not> want to build with libxml2:
198    
199     ./configure --without-libxml2
200    
201     If libxml2 was installed in a non-standard location then specify the
202     path where libxml2 was installed. For example,
203    
204     ./configure --with-libxml2=$HOME/local
205    
206     If libxml2 is installed in a non-standard location, Swish-e needs to know
207     where that library is at run time. There seems to be a number of ways
208     to do this. First, you can set the environment variable C<LD_RUN_PATH>
209     *before* running make to create Swish-e. This will add the path directly
210     to the Swish-e executable file.
211    
212     For example, under Bourne type shells:
213    
214     LD_RUN_PATH=$HOME/local/lib make
215    
216     Other shells (like csh and tcsh) may require:
217    
218     setenv LD_RUN_PATH $HOME/local/lib
219     make
220    
221     Another option is to use the C<LD_LIBRARY_PATH> environment variable.
222     This is a list of directories to search for libraries when a program
223     is run. See the ld.so(8) man page (or your platform's run-time linker documentation) for more info.
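
If you take the C<LD_LIBRARY_PATH> route, a small helper avoids adding the same directory twice. This is only a sketch for Bourne-type shells: the function name C<prepend_lib_path> is invented here, and F<$HOME/local/lib> is the example prefix used earlier in this section.

```shell
# prepend_lib_path DIR: put DIR at the front of LD_LIBRARY_PATH unless it is
# already listed. (Hypothetical helper for illustration.)
prepend_lib_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH=$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} ;;
    esac
    export LD_LIBRARY_PATH
}

prepend_lib_path "$HOME/local/lib"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Unlike C<LD_RUN_PATH>, this setting is not stored in the executable, so it must be in effect in every environment that runs Swish-e (including a CGI environment).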
224    
225     Note that libxml2 will be linked as a shared library on many platforms, so once you
226     compile Swish-e to use the library, the libxml2 library must not be
227     deleted or moved.
228    
229     =head2 Building Swish-e with zlib
230    
231     Building with zlib is similar to the instructions for building Swish-e
232     with libxml2 above. The F<configure> script will attempt to detect if zlib is
233     installed in your system and if found link Swish-e with the zlib library.
234    
235     zlib is common on many systems, but may be out of date, and versions prior to 1.1.4
236     have a known security issue. You should run
237     at least version 1.1.4. To link with zlib in a non-standard location use,
238     for example:
239    
240     ./configure --with-zlib=$HOME/zlib
241    
242     Again, as with compiling libxml2, you may need to use the C<LD_RUN_PATH>
243     or C<LD_LIBRARY_PATH> variables. See above for more details.
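
The 1.1.4 minimum can be checked with a small version comparison. The sketch below assumes purely numeric, dot-separated version strings; the function name C<version_ge> and the default value C<1.1.3> are illustrative only, so pass the version actually installed on your system as the first argument.

```shell
# version_ge A B: succeed if dotted numeric version A is >= B.
# Assumes numeric fields only (e.g. 1.1.4); suffixes such as "-beta" are not handled.
version_ge() {
    va=$1
    vb=$2
    while [ -n "$va" ] || [ -n "$vb" ]; do
        fa=${va%%.*}          # leading field of each version
        fb=${vb%%.*}
        case $va in *.*) va=${va#*.} ;; *) va= ;; esac
        case $vb in *.*) vb=${vb#*.} ;; *) vb= ;; esac
        [ "${fa:-0}" -gt "${fb:-0}" ] && return 0
        [ "${fa:-0}" -lt "${fb:-0}" ] && return 1
    done
    return 0                  # all fields equal
}

installed=${1:-1.1.3}   # hypothetical value; pass the real version as $1
if version_ge "$installed" 1.1.4; then
    echo "zlib $installed is new enough"
else
    echo "zlib $installed predates 1.1.4; upgrade before building"
fi
```

The installed version is normally recorded in the C<ZLIB_VERSION> macro in F<zlib.h>; where that header lives (for example F</usr/include> or F</usr/local/include>) depends on your system.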
244    
245    
246     =head2 Downloading and unpacking and building Swish-e
247    
248     If you are reading this INSTALL document, then you probably already have
249     downloaded and unpacked the distribution. But just in case...
250    
251     Make sure you are using the current release from
252     http://Swish-e.org. If you have any questions about which version to use, please
253     ask on the Swish-e discussion list.
254    
255     How you download Swish-e is up to you: lynx, lwp-download,
256     wget are all common methods.
257    
258     =over 3
259    
260     =item 1 Uncompress the distribution file
261    
262    
263     gzip -dc swish-e.x.x.tar.gz | tar xof -
264    
265     or on some versions of tar, simply
266    
267     tar -zxof swish-e.x.x.tar.gz
268    
269     Uncompressing should create the following directories:
270    
271     swish-e-x.x/              configure script and top-level Makefile
272     swish-e-x.x/pod/          Swish-e documentation
273     swish-e-x.x/html/         HTML version of the documentation
274     swish-e-x.x/src/          source code
275     swish-e-x.x/conf/         example configuration files and stopword files
276     swish-e-x.x/example/      working example CGI scripts
277     swish-e-x.x/filter-bin/   filter samples
278     swish-e-x.x/prog-bin/     -S prog examples: a web spider and others
279     swish-e-x.x/perl/         perl interface to the Swish-e C library
280     swish-e-x.x/src/expat/    James Clark's Expat XML parser
281     swish-e-x.x/src/win32/    win32 binary and build files
282     swish-e-x.x/src/vms/      files required for building under VMS
283     swish-e-x.x/tests/        tests used for running "make test"
284     swish-e-x.x/doc/          directory used for building the documentation
285    
286    
287     =item 2 Make any needed changes in F<src/config.h>
288    
289     Compile-time configuration settings are adjusted in the file
290     F<src/config.h>. Most of the settings may also be specified in the
291     configuration file used during indexing.
292    
293     You probably will B<not> need to change this file, but it's helpful
294     to become familiar with the default compiled-in settings.
295    
296     =item 3 Build Swish-e
297    
298     Building Swish-e on most systems is a simple procedure. In the
299     swish-e-x.x/ top-level directory, type the following commands:
300    
301     ./configure
302     make
303     make test
304    
305     You should build swish as a normal user (i.e. not as "root").
306    
307     Note: If you wish to use libxml2 or zlib please see the previous section
308     for the required configure options.
309    
310     The above will create the Swish-e executable F<src/swish-e> and test
311     that the executable is working correctly. C<make test> will generate
312     an index file in the F<tests> directory and run a number of searches
313     against this index. At this time, the tests really just make sure that swish-e
314     was compiled correctly and runs.
315    
316     You may optionally "build" the F<swish-search> executable. This is
317     a version of Swish-e that cannot write to the index file. This
318     version may provide somewhat improved security in a CGI environment.
319     The binaries F<swish-e> and F<swish-search> are the same files -- the
320     additional security is enabled when the binary is named I<swish-search>.
321     F<swish-search> is not a substitute for good file system and CGI security.
322     Please review the many CGI security papers available on-line.
323    
324     Again, this is an optional step:
325    
326     make swish-search
327    
328     which simply copies the file F<swish-e> to F<swish-search>.
329    
330     =item 4 Install Swish-e
331    
332     Move the F<swish-e> (and/or F<swish-search>) executable to its final
333     location (normally /usr/local/bin). You may simply copy the program
334     anywhere you see fit, or you may use the C<make install> command to
335     install it to the location defined by the F<configure> script:
336    
337     You may need superuser privileges:
338    
339     su root
340     make install
341     exit
342    
343     B<IMPORTANT:> Do not run swish-e as the superuser (root).
344    
345     The bin directory may be set when first running F<./configure>. For example:
346    
347     ./configure --bindir=$HOME/bin
348    
349     sets the installation directory to F<$HOME/bin> and C<make install>
350     will install the program in that location.
351    
352     =back
353    
354     =head2 Join the Swish-e discussion list
355    
356     The Swish-e discussion list is the place to ask questions about installing
357     and using Swish-e, see or post bug fixes or security announcements, and
358     a place where B<you> can offer help to others.
359    
360     The list is typically I<very low traffic>, so it won't overload your
361     inbox. Please take time to subscribe. See http://Swish-e.org.
362    
363     If you are using Swish-e on a public site, please let the list know so
364     it can be added to the list of sites that use Swish-e!
365    
366     Please review L<QUESTIONS AND TROUBLESHOOTING|QUESTIONS AND TROUBLESHOOTING> before posting
367     a question to the Swish-e list.
368    
369     =head2 Installing the Swish-e C Library (optional)
370    
371     Swish 2.2 creates the C library F<libswish-e.a> during the build.
372     Install this library if you wish to embed Swish-e into another
373     application. For example, the library should be installed
374     before using the high level Perl SWISH modules located on
375     CPAN. http://search.cpan.org/search?mode=module&query=SWISH
376    
377     This is an *optional* step. Most users will not need to install the library.
378    
379     To install the library issue the following commands (again, you may need
380     to su root)
381    
382     su root
383     make install-lib
384     exit
385    
386     By default this will install the library in /usr/local/lib, but this
387     directory can be set when running ./configure with the --libdir option.
388     For example:
389    
390     ./configure --bindir=$HOME/bin --libdir=$HOME/lib
391    
392     So C<make install> will install the F<swish-e> binary in F<$HOME/bin>
393     and C<make install-lib> will install the F<libswish-e.a> library in
394     F<$HOME/lib>.
395    
396     Note: You may wish to run C<make realclean> before running ./configure again.
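
For an install without root access, the options described so far can be combined in a single F<configure> run. The following sketch only prints the command so you can inspect it first; the function name C<configure_cmd> is made up, and the F<local> and F<zlib> subdirectories simply follow the example paths used earlier in this document.

```shell
# configure_cmd PREFIX: print (do not run) a ./configure command line that
# keeps the binary, the library, libxml2 and zlib under PREFIX.
# (Hypothetical helper; the option names are the ones described above.)
configure_cmd() {
    printf './configure --bindir=%s/bin --libdir=%s/lib --with-libxml2=%s/local --with-zlib=%s/zlib\n' \
        "$1" "$1" "$1" "$1"
}

configure_cmd "$HOME"
```

Review the printed line, adjust the paths to match where libxml2 and zlib were really installed, then run it from the top-level directory of the distribution.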
397    
398     =head2 Creating PDF and Postscript documentation (optional)
399    
400     The Swish-e documentation in HTML format was created with Pod::HtmlPsPdf,
401     a package of Perl modules written and/or modified by Stas Bekman to automate
402     the conversion of documents in pod format (see perldoc perlpod) to HTML,
403     Postscript, and PDF. A slightly modified version of this package is
404     included with the Swish-e distribution and used for building the HTML.
405    
406     If your system has the B<necessary tools> to build Postscript and the
407     converter ps2pdf installed, you may be able to build the Postscript
408     and PDF versions of the documentation. After you have run ./configure,
409     type from the top-level directory of the distribution:
410    
411     make pdf
412    
413     And with any luck you will end up with these two files in the top-level directory:
414    
415     swish-e_documentation.pdf
416     swish-e_documentation.ps
417    
418     Most people find reading the documentation in HTML most convenient.
419    
420     =head2 Installing the Swish-e documentation as man(1) pages (optional)
421    
422     Part of the included Swish-e documentation can be installed as system
423     man(1) pages. Only the reference related pages are installed (it's
424     assumed that you don't need to install the README or INSTALL documents as
425     man pages). You must have the pod2man program installed on your system
426     (which you probably do if you have Perl).
427    
428     To build the man pages and install them into your system, type from the
429     top-level directory (after running ./configure):
430    
431     su root
432     make install-man
433     exit
434    
435     You will need to C<su root> if you do not have write access to the man directory.
436    
437     The man pages are installed in the system man directory. This directory
438     is determined by running ./configure and can be set by passing the
439     directory when running ./configure.
440    
441     For example,
442    
443     ./configure --mandir=/usr/local/doc/man
444    
445     Information on running ./configure can be found by typing:
446    
447     ./configure --help
448    
449     The pod source files used to create the man files were written running
450     under perl 5.6.1. Older versions of Perl may complain slightly about the
451     formatting of the pod files. This shouldn't be a problem, but please
452     let the Swish-e list know if otherwise. Then upgrade your version of perl. ;)
453    
454     =head1 QUESTIONS AND TROUBLESHOOTING
455    
456     Please search the Swish-e list archive before posting a question, and
457     check the L<SWISH-FAQ|SWISH-FAQ> to see if your question has already
458     been answered.
459    
460     Support for installation, configuration and usage is available via the
461     Swish-e discussion list. Visit http://swish-e.org for information.
462     Do not contact developers directly for help -- always post your question
463     to the list.
464    
465     Before posting use tools available to narrow down the problem.
466    
467     Swish-e has the -T, -v, and -k switches that may help resolve issues.
468     If possible find a single document that shows the problem, then index
469     with -T INDEXED_WORDS and watch the exact words that are indexed.
470     Use -H 9 when searching and look at C<Parsed Words:> to make sure you
471     are searching the correct words.
472    
473     You can also use programs like C<gdb> to help find segfaults and other
474     run-time errors, and programs like C<truss> or C<strace> can often
475     provide interesting information, if you are adventurous.
476    
477     =head2 When posting please provide the following information:
478    
479     =over 4
480    
481     =item *
482    
483     The exact version of Swish-e that you are using. Running Swish-e with the
484     C<-V> switch will print the version number. Also, supply the output from
485     C<uname -a> or similar command that identifies the operating system you
486     are running on. If you are running an old version of swish be prepared
487     for a response to your question of "upgrade."
488    
489     =item *
490    
491     A summary of the problem. This should include the commands issued
492     (e.g. for indexing or searching) and their output, and why you don't
493     think it's working correctly. Please cut-n-paste the exact commands
494     and their output instead of retyping to avoid errors.
495    
496     =item *
497    
498     Include a copy of the configuration file you are using, if any. Swish-e
499     has reasonable defaults so in many cases you can run it without using
500     a configuration file. But, if you need to use a configuration file,
501     reduce it down to the absolute minimum number of commands required to
502     demonstrate your problem. Again, cut-n-paste.
503    
504     =item *
505    
506     A small copy of a source document that demonstrates the problem.
507    
508     If you are having problems spidering a web server, use lwp-download or
509     wget to copy the file locally to make sure you can index the document
510     using the file system method.
511    
512     If you do need help with spidering, don't post fake URLs, as it makes it
513     impossible to help. If you don't want to expose your web page to the
514     people on the Swish-e list, find some other site to test spidering on.
515     If that works, but you still cannot spider your own site then post your
516     real URL if you want help.
517    
518     =item *
519    
520     If you are having trouble building Swish-e please cut-n-paste the output
521     from make (or from ./configure if that's where the problem is).
522    
523    
524     =back
525    
526     =head1 BASIC CONFIGURATION AND USAGE
527    
528     This section should give you a basic overview of indexing and searching
529     with B<Swish-e>. Other examples can be found in the F<conf> directory, which will
530     step you through a number of different configurations.
531     Also, please review the L<SWISH-FAQ|SWISH-FAQ>.
532    
533     Swish-e reads a configuration file (see L<SWISH-CONFIG|SWISH-CONFIG>)
534     for directives that control what and how Swish-e indexes files.
535     Then running Swish-e is controlled by command line arguments (see
536     L<SWISH-RUN|SWISH-RUN>).
537    
538     Swish-e does not require a configuration file, but
539     most people need to change the default behavior by placing settings
540     in a configuration file.
541    
542     To try the examples below change to the F<tests> subdirectory of the
543     distribution. The tests will use the *.html files in this directory when
544     creating the test index. You may wish to review these *.html files to
545     get an idea of the various native file formats that Swish-e supports.
546    
547     =head2 Step 1: Create a Configuration File
548    
549     The configuration file controls what and how Swish-e indexes. The
550     configuration file consists of directives, comments, and blank lines.
551     The configuration file can be any name you like.
552    
553     This example will work with the documents in the F<tests> directory.
554     You may wish to review the F<tests/test.config> configuration file used
555     for the C<make test> tests.
556    
557     For example, a simple configuration file (F<swish-e.conf>):
558    
559     # Example Swish-e Configuration file
560    
561     # Define *what* to index
562     # IndexDir can point to directories and/or files
563    
564     # Here it's pointing to the current directory
565     IndexDir .
566    
567     # But only index the .html files
568     IndexOnly .html
569    
570     # Show basic info while indexing
571     IndexReport 1
572    
573     And that's a simple configuration file. It says to index all the
574     .html files in the current directory, and provide some basic output
575     while indexing.
576    
577     The complete list of configuration file directives is described
578     in L<SWISH-CONFIG|SWISH-CONFIG>.
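
If you prefer to create the file from the command line, a here-document does the job. This is a sketch that assumes you are in the F<tests> directory; it writes the three directives shown above to F<swish-e.conf> and runs the indexer only if the freshly built binary is present at F<../src/swish-e>.

```shell
conf=swish-e.conf   # any file name works; this one matches the -c example

# Write the example configuration shown above.
cat > "$conf" <<'EOF'
# Example Swish-e Configuration file
IndexDir .
IndexOnly .html
IndexReport 1
EOF
echo "wrote $conf"

# Index only when the binary has been built; otherwise just leave the file.
if [ -x ../src/swish-e ]; then
    ../src/swish-e -c "$conf"
fi
```

The quoted C<'EOF'> delimiter keeps the shell from expanding anything inside the configuration text.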
579    
580     =head2 Step 2: Index your Files
581    
582     Now, make sure you are in the F<tests> directory and save the above
583     example configuration file as F<swish-e.conf>. Then run Swish-e using
584     the C<-c> switch to specify the name of the configuration file.
585    
586     ../src/swish-e -c swish-e.conf
587    
588     Indexing Data Source: "File-System"
589     Indexing "."
590     Removing very common words...
591     no words removed.
592     Writing main index...
593     Sorting words ...
594     Sorting 55 words alphabetically
595     Writing header ...
596     Writing index entries ...
597     Writing word text: Complete
598     Writing word hash: Complete
599     Writing word data: Complete
600     55 unique words indexed.
601     Writing file list ...
602     Property Sorting complete.
603     Writing sorted index ...
604     5 files indexed. 1252 total bytes.
605     Elapsed time: 00:00:00 CPU time: 00:00:00
606     Indexing done!
607    
608     This created the index file F<index.swish-e>. This is the default
609     index file name unless the B<IndexFile> directive is specified in the
610     configuration file:
611    
612     IndexFile ./website.index
613    
614     =head2 Step 3: Search
615    
616     You specify your search terms with the C<-w> switch. For example, to find
617     the files that contain the word B<sample> you would issue the command:
618    
619     ../src/swish-e -w sample
620    
621     This example assumes that you are in the F<tests> directory, and the
622     Swish-e binary is in the F<../src> directory. Swish-e returns in response
623     to that command the following:
624    
625     ../src/swish-e -w sample
626    
627     # SWISH format: 2.2
628     # Search words: sample
629     # Number of hits: 2
630     # Search time: 0.000 seconds
631     # Run time: 0.005 seconds
632     1000 ./test_xml.html "If you are seeing this, the METATAG XML search was successful!" 159
633     1000 ./test.html "If you are seeing this, the test was successful!" 437
634     .
635    
636     So the word B<sample> was found in two documents. The first number
637     shown is the relevance or rank of the search term, followed by the file
638     containing the search term, the title of the document, and finally the
639     length of the document.
640    
641     The period (".") alone at the end marks the end of results.
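
Because each hit follows the same layout (rank, file, quoted title, length), the default output is easy to post-process. The sketch below is one possible approach, not a Swish-e feature: C<parse_hits> and C<sample_output> are made-up names, the sample text is copied from the search above, and the pattern assumes that file names contain no space followed by a double quote.

```shell
# Sample output copied from the search above; in practice, pipe the output of
# "swish-e -w ..." into parse_hits instead.
sample_output() {
    cat <<'EOF'
# SWISH format: 2.2
# Search words: sample
# Number of hits: 2
1000 ./test_xml.html "If you are seeing this, the METATAG XML search was successful!" 159
1000 ./test.html "If you are seeing this, the test was successful!" 437
.
EOF
}

# parse_hits: read search output on stdin and print rank|file|length|title for
# each hit. Header lines (starting with "#") and the final "." do not match
# the pattern and are skipped.
parse_hits() {
    sed -n 's/^\([0-9][0-9]*\) \(.*\) "\(.*\)" \([0-9][0-9]*\)$/\1|\2|\4|\3/p'
}

sample_output | parse_hits
```

For anything beyond a quick script, the C<-x> and C<-H> switches mentioned below give finer control over what is printed.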
642    
643     Much more information may be retrieved while searching by using
644     the C<-x> and C<-H> switches (see L<SWISH-RUN|SWISH-RUN>)
645     and by using Document Properties (see L<SWISH-CONFIG|SWISH-CONFIG>).
646    
647     =head2 Phrase Searching
648    
649     To search for a phrase in a document use double-quotes to delimit your
650     search terms. (The phrase delimiter is set in src/swish.h.)
651    
652     You must protect the quotes from the shell.
653    
654     For example, under Unix:
655    
656     swish-e -w '"this is a phrase" or (this and that)'
657     swish-e -w 'meta1=("this is a phrase") or (this and that)'
658    
659     Or under the Windows F<command.com> shell:
660    
661     swish-e -w \"this is a phrase\" or (this and that)
662    
663     The phrase delimiter can be set with the C<-P> switch.
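
To see what "protecting the quotes" means under a Bourne-type shell, it helps to print the arguments exactly as a program would receive them. The helper C<show_args> below is purely illustrative and does not call Swish-e.

```shell
# show_args: print each argument on its own line, numbered, exactly as a
# program would receive it. (Hypothetical helper for illustration.)
show_args() {
    i=1
    for arg in "$@"; do
        printf '[%d] %s\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# Single quotes keep the whole query, double quotes included, as one argument.
show_args -w '"this is a phrase" or (this and that)'
```

The second argument printed still contains the double quotes, which is what Swish-e needs in order to recognize the phrase.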
664    
665     =head2 Boolean Searching
666    
667     You can use the Boolean operators B<and>, B<or>, or B<not> in searching.
668     Without these Boolean operators, Swish-e will assume you're B<and>ing the words together.
669    
670     Here are some examples:
671    
672     ../src/swish-e -w 'apples oranges'
673     ../src/swish-e -w 'apples and oranges' ( Same thing )
674    
675     ../src/swish-e -w 'apples or oranges'
676    
677     ../src/swish-e -w 'apples or oranges not juice' -f myIndex
678    
679     retrieves first the files that contain either of the words "apples" or "oranges";
680     then among those the ones that do not contain the word "juice".
681    
682     A few others to ponder:
683    
684     ../src/swish-e -w 'apples and oranges or pears'
685     ../src/swish-e -w '(apples and oranges) or pears' ( Same thing )
686     ../src/swish-e -w 'apples and (oranges or pears)' ( Not the same thing )
687    
688     See L<SWISH-SEARCH|SWISH-SEARCH> for more information.
689    
690    
691     =head2 Context Searching
692    
693     The C<-t> option in the search command line allows you to search for
694     words that exist only in specific HTML tags. Each character in the
695     string you specify in the argument to this option represents a different
696     tag in which the word is searched; that is you can use any combinations
697     of the following characters:
698    
699     H means all <HEAD> tags
700     B stands for <BODY> tags
701     t is all <TITLE> tags
702     h is <H1> to <H6> (header) tags
703     e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
704     c is HTML comment tags (<!-- ... -->)
705    
706     For example:
707    
708     # Find only documents with the word "linux" in the <TITLE> tags.
709     ./swish-e -w linux -t t
710    
711     # Find the word "apple" in titles or comments
712     ./swish-e -w apple -t tc
713    
714    
715     =head2 META Tags
716    
717     For the last example we will instruct Swish-e to use META tags to define
718     I<fields> in your documents.
719    
720     META names are a way to define "fields" in your documents. You can
721     use the META names in your queries to limit the search to just the words
722     contained in that META name of your document. For example, you might have
723     a META tagged field in your documents called C<subjects> and then you can
724     search your documents for the word "foo" but only return documents where
725     "foo" is within the C<subjects> META tag.
726    
727     Document I<Properties> are somewhat related to meta tags: Properties
728     allow the contents of a META tag in a source document to be stored within
729     the index, and that text to be returned along with search results.
730    
731     META tags can have two formats in your documents.
732    
733     <META NAME="keyName" CONTENT="some Content">
734    
735     And in XML format
736    
737     <keyName>
738     Some Content
739     </keyName>
740    
741     If using libxml2, you can optionally use a non-HTML tag as a metaname:
742    
743     <html>
744     <body>
745     Hello swish users!
746     <keyName>
747     this is meta data
748     </keyName>.
749     </body>
750    
751     This, of course, is invalid HTML.
752    
753     To continue with our sample F<swish-e.conf> file, add the following lines:
754    
755     # Define META tags
756     MetaNames meta1 meta2 meta3
757    
758     Reindex to include the changes:
759    
760     ../src/swish-e -c swish-e.conf
761    
762     Now search, but this time limit your search to META tag "meta1":
763    
764     ../src/swish-e -w 'meta1=metatest1'
765    
766     Again, please see L<SWISH-RUN|SWISH-RUN> and L<SWISH-CONFIG|SWISH-CONFIG>
767     for complete documentation of the various indexing and searching options.
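
The search above only returns hits if an indexed document actually carries a C<meta1> field. A minimal sketch of such a document follows. The file name F<metatest.html> and its text are assumptions for illustration, and the file needs to sit in a directory covered by your C<IndexDir> setting.

```shell
# Hypothetical sample document; the file name and wording are illustrative only.
cat > metatest.html <<'EOF'
<html>
<head>
<title>META name test</title>
<META NAME="meta1" CONTENT="metatest1">
</head>
<body>
This page body does not mention the test word.
</body>
</html>
EOF
```

After reindexing, the query C<-w 'meta1=metatest1'> should be able to match this file, because the word appears only in the content of the C<meta1> tag.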
768    
769     =head2 Additional Examples
770    
771     The above example indexes local files using the file system access method
772     C<-S fs>. You may also index files that are located on a local or remote
773     web server by using the HTTP access method C<-S http>, or via the prog
774     input method C<-S prog>. These are described in L<SWISH-RUN|SWISH-RUN>
775     and example configuration files for using these methods can be found in
776     the F<conf> directory of the Swish-e distribution.
777    
778     The C<-S prog> access method can be used to index any type of document,
779     such as documents stored in a database (RDBMS), or documents that need
780     to be processed before they can be indexed. Examples for using the
781     C<-S prog> method are shown in the F<prog-bin> directory.
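
As a rough sketch of what a C<-S prog> program does: it writes each document to standard output, preceded by a few header lines and a blank line. The header names used here (C<Path-Name> and C<Content-Length>) follow the format described in L<SWISH-RUN|SWISH-RUN>, which remains the authoritative reference. The function name C<emit_doc> and the F<docs> directory are assumptions made for this example.

```shell
#!/bin/sh
# Sketch of a -S prog input program: print every *.txt file under ./docs
# (a hypothetical directory) as headers, a blank line, then the document body.

emit_doc() {
    path=$1
    # Content-Length must be the exact number of bytes that follow the headers.
    size=$(wc -c < "$path" | tr -d ' ')
    printf 'Path-Name: %s\n' "$path"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'            # a blank line ends the headers
    cat "$path"
}

if [ -d docs ]; then
    for f in docs/*.txt; do
        # Skip the unexpanded pattern when no files match.
        if [ -f "$f" ]; then
            emit_doc "$f"
        fi
    done
fi
```

To try it, save the script under a name of your choosing, make it executable, point C<IndexDir> at it in the configuration file, and run swish-e with C<-S prog>.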
782    
783     Swish-e can also use I<filters> to convert documents as they are
784     processed by Swish-e. For example, MS-Word or PDF documents can be
785     converted and indexed by Swish-e by using filters. See the section on
786     filters in L<SWISH-CONFIG|SWISH-CONFIG>, and the examples shown in the
787     C<filter-bin> directory.
788    
789     =head1 QUICK START FOR THE IMPATIENT
790    
791     Here's I<one> example of the steps to install Swish-e, index documents by spidering, and
792     search using the included CGI script.
793    
794     These steps are on Linux, and assume
795     that you have the libraries libxml2 and zlib installed in the system, you have a current version of Perl
796     and current versions of the LWP, HTML::*, and HTTP::* modules installed, and Apache is installed and operating.
797    
798     If you have any trouble with these instructions please read the detailed installation instructions above,
799     and see the documentation included with the F<swish.cgi> script and the F<spider.pl> programs.
800     Please don't ask for help without reading the "real" documentation first.
801    
802     Not all output is included below. You should carefully watch for errors while building Swish-e.
803    
804     =over 4
805    
806     =item 1 Download and build Swish-e
807    
808     ~ $ wget http://swish-e.org/<path to current swish-e version>.tar.gz
809     ~ $ tar zxof <path to current swish-e version>.tar.gz
810     ~ $ cd swish-e-2.2 (this directory will depend on the version of Swish-e)
811    
812     ~/swish-e-2.2 $ ./configure
813     ~/swish-e-2.2 $ make
814     ~/swish-e-2.2 $ make test
815     ...
816     ** All tests completed! **
817    
818     =item 2 Make a working directory and copy files
819    
820     ~/swish-e-2.2 $ mkdir ~/swishtest
821     ~/swish-e-2.2 $ cd ~/swishtest
822    
823     ~/swishtest $ cp ~/swish-e-2.2/src/swish-e .
824     ~/swishtest $ cp ~/swish-e-2.2/prog-bin/spider.pl .
825     ~/swishtest $ cp ~/swish-e-2.2/example/swish.cgi .
826     ~/swishtest $ cp -rp ~/swish-e-2.2/example/modules/ .
827     ~/swishtest $ chmod 755 swish.cgi spider.pl
828     ~/swishtest $ chmod 644 modules/*
829    
830     =item 3 Create the index
831    
832     You must create a swish configuration file and a spider configuration
833     file.
834    
835     ~/swishtest $ cat swish.conf
836    
837     # Program to read documents
838     IndexDir ./spider.pl
839    
840     # Define the config file for the spider to use
841     SwishProgParameters spider.conf
842    
843     # Use libxml2 for parsing documents
844     DefaultContents HTML2
845     IndexContents TXT2 txt
846    
847     # Cache document contents in the index for context display
848     StoreDescription HTML2 <body>
849    
850    
851     ~/swishtest $ cat spider.conf
852    
853     # Example spider configuration file to index the
854     # split version of the swish-e documentation
855    
856     @servers = (
857     {
858    
859     base_url => 'http://swish-e.org/2.2/docs/split/index.html',
860     same_hosts => [ qw/www.swish-e.org/ ],
861     email => 'swish-impatient@domain.invalid',
862     delay_min => .0001,
863    
864     # Define call-back functions to fine-tune the spider
865    
866     test_url => sub {
867     my $uri = shift;
868    
869     # Skip requesting files that are probably not text
870     return if $uri->path =~ m[\.(?:gif|jpeg|png)$]i;
871    
872    
873     # Limit spidering to the /2.2/docs/split/ path
874     return unless $uri->path =~ m[/2\.2/docs/split/];
875    
876     return 1; # otherwise, ok to search
877     },
878    
879    
880     # Only index text/html or text/plain
881     test_response => sub {
882     my ( $uri, $server, $response ) = @_;
883    
884     return $response->content_type =~ m[(?:text/html|text/plain)];
885     },
886     },
887     );
888     1;
889    
890     Now begin indexing:
891    
892     ~/swishtest $ ./swish-e -S prog -c swish.conf -v 2
893     Indexing Data Source: "External-Program"
894     Indexing "./spider.pl"
895     ./spider.pl: Reading parameters from 'spider.conf'
896     Processing http://swish-e.org/2.2/docs/split/index.html...
897     Processing http://swish-e.org/2.2/docs/split/index_long.html...
898     Processing http://swish-e.org/2.2/docs/split/search.cgi..
899     ...
900     2566 unique words indexed.
901     5 properties sorted.
902     155 files indexed. 609775 total bytes. 49962 total words.
903     Elapsed time: 00:00:33 CPU time: 00:00:01
904     Indexing done!
905    
906     =item 4 Test swish-e from the command line
907    
908     ~/swishtest $ ./swish-e -w foo -m 1
909     # SWISH format: 2.1-dev-25
910     # Search words: foo
911     # Number of hits: 18
912     # Search time: 0.000 seconds
913     # Run time: 0.038 seconds
914     1000 http://swish-e.org/2.2/docs/split/SWISH-CONFIG/Document_Contents_Directives.html "SWISH-CONFIG/Document Contents Directives" 57466
915     .
916    
917    
918     =item 5 Test the CGI script from the command line
919    
920     ~/swishtest $ ./swish.cgi | head
921     Content-Type: text/html; charset=ISO-8859-1
922    
923     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
924     <html>
925     <head>
926     <title>
927     Search our site
928     </title>
929     </head>
930     <body>
931    
932     Refer to the swish.cgi documentation if you have any problems with running the CGI script.
933    
934     =item 6 Configure Apache
935    
936     ~/swishtest $ su -c "ln -s $HOME/swishtest /usr/local/apache/htdocs/swishdocs"
937     Password: *********
938    
939     ~/swishtest $ cat .htaccess
940     # Deny everything by default
941     Deny From All
942    
943     # But allow just the CGI script
944     <files swish.cgi>
945     Options ExecCGI
946     Allow From All
947     SetHandler cgi-script
948     </files>
949    
950     =item 7 Test through the web server from the command line
951    
952     ~/swishtest $ GET http://localhost/swishdocs/swish.cgi?query=install | head
953     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
954     <html>
955     <head>
956     <title>
957     43 Results for [install]
958     </title>
959     </head>
960     <body>
961    
962     =back
963    
964     Now you are ready to search.
965    
966     =head1 Document Info
967    
968     $Id: INSTALL.pod,v 1.19 2002/05/31 23:37:22 whmoseley Exp $
969    
970     .
