Annotation of /mitgcm.org/devel/buildweb/pkg/swish-e/pod/SWISH-SEARCH.pod

Revision 1.1.1.1 (vendor branch)
Fri Sep 20 19:47:29 2002 UTC, by adcroft
Branch: Import, MAIN
CVS Tags: baseline, HEAD
Importing web-site building process.

=head1 NAME

SWISH-SEARCH - Swish-e Searching Instructions

=head1 OVERVIEW

This page describes the process of searching with Swish-e. Please see
the L<SWISH-CONFIG|SWISH-CONFIG> page for information on the Swish-e
configuration file directives, and L<SWISH-RUN|SWISH-RUN> for a complete
list of command line arguments.

Searching a Swish-e index involves passing L<command line
arguments|SWISH-RUN> to Swish-e that specify the index file to use, and
the L<query|/"Searching Syntax and Operations"> (or search words) to
locate in the index. Swish-e returns a list of file names (or URLs)
that contain the matched search words. L<Perl|/"Searching with Perl">
is often used as a front-end to Swish-e such as in CGI applications,
and L<perl modules|/"Perl Modules"> exist for interfacing with Swish-e.

=head1 Searching Syntax and Operations

The C<-w> command line argument (switch) is used to specify the search
query to Swish-e. When running Swish-e from a shell prompt, be careful
to protect your query from shell metacharacters and shell expansions.
This often means placing single or double quotes around your query.
See L</"Searching with Perl"> if you plan to use Perl as a front end
to Swish-e.

The following section describes various aspects of searching with Swish-e.

=head2 Boolean Operators

You can use the Boolean operators B<and>, B<or>, or B<not> in searching.
Without these Boolean operators, Swish-e will assume you're B<and>ing
the words together. The operators are not case sensitive.

[Note: you can change the default to B<or>ing by changing the variable
DEFAULT_RULE in the config.h file and recompiling Swish-e.]

Evaluation takes place from B<left to right> only, although you can use
parentheses to force the order of evaluation.

Examples:

    swish-e -w "smilla or snow" -f myIndex

retrieves files containing either of the words "smilla" or "snow".

    swish-e -w "smilla and snow not sense" -f myIndex
    swish-e -w "(smilla and snow) and not sense" -f myIndex    # same thing

retrieves first the files that contain both the words "smilla" and
"snow"; then among those the ones that do not contain the word "sense".

=head2 Truncation

The wildcard (*) is available, but it can only be used at the end
of a word; elsewhere it is considered a normal character (i.e. it can be
searched for if included in the WordCharacters directive).

    swish-e -w "librarian" -f myIndex

This query retrieves only files that contain the given word.

On the other hand:

    swish-e -w "librarian*" -f myIndex

retrieves "librarians", "librarianship", etc. along with "librarian".

Note that wildcard searches combined with word stemming can lead
to unexpected results. If stemming is enabled, a search term with a
wildcard will be stemmed internally before searching. Searching for
C<running*> will therefore actually be a search for C<run*>, so
C<running*> would find C<runway>. Also, searching for C<runn*> will not
find C<running> as you might expect, since C<running> stems to C<run>
in the index, and thus C<runn*> will not find C<run>.

=head2 Order of Evaluation

Expressions are always evaluated left to right:

    swish-e -w "juliet not ophelia and pac" -f myIndex

retrieves files which contain "juliet" and "pac" but not "ophelia".

However, it is always possible to force the order of evaluation by using
parentheses. For example:

    swish-e -w "juliet not (ophelia and pac)" -f myIndex

retrieves files that contain "juliet" but do not contain both "ophelia"
and "pac".

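The left-to-right rule can be illustrated with a toy model. The following
is a minimal sketch in Perl and not Swish-e's actual parser: it assumes a
hypothetical in-memory set of four documents, treats a missing operator as
B<and>, and applies operators strictly left to right with no precedence.
All names in it (C<%docs>, C<tokenize>, C<parse_expr>, C<parse_term>,
C<matches>) are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical documents: file name => set of words it contains.
my %docs = (
    'a.txt' => { map { $_ => 1 } qw(juliet pac) },
    'b.txt' => { map { $_ => 1 } qw(juliet ophelia pac) },
    'c.txt' => { map { $_ => 1 } qw(juliet ophelia) },
    'd.txt' => { map { $_ => 1 } qw(juliet) },
);

# Split a query into lower-cased words and parentheses.
sub tokenize {
    my ($query) = @_;
    my @tokens = lc($query) =~ /[()]|[^\s()]+/g;
    return @tokens;
}

# expr := term { [and|or] term }, folded strictly left to right.
# A missing operator between two terms is treated as "and".
sub parse_expr {
    my ($tokens, $words) = @_;
    my $value = parse_term($tokens, $words);
    while (@$tokens && $tokens->[0] ne ')') {
        my $op = 'and';
        $op = shift @$tokens if $tokens->[0] =~ /^(?:and|or)$/;
        my $rhs = parse_term($tokens, $words);
        $value = $op eq 'or' ? ($value || $rhs) : ($value && $rhs);
    }
    return $value ? 1 : 0;
}

# term := "not" term | "(" expr ")" | word
sub parse_term {
    my ($tokens, $words) = @_;
    my $token = shift @$tokens;
    die "Unexpected end of query\n" unless defined $token;
    if ($token eq 'not') {
        return parse_term($tokens, $words) ? 0 : 1;
    }
    if ($token eq '(') {
        my $value = parse_expr($tokens, $words);
        shift @$tokens;    # discard the closing ")"
        return $value;
    }
    return $words->{$token} ? 1 : 0;
}

# Return the sorted names of the documents that satisfy the query.
sub matches {
    my ($query) = @_;
    my @hits = grep { parse_expr([ tokenize($query) ], $docs{$_}) } keys %docs;
    @hits = sort @hits;
    return @hits;
}

for my $query ('juliet not ophelia and pac', 'juliet not (ophelia and pac)') {
    print "$query: ", join(' ', matches($query)), "\n";
}
```

In this model the first query matches only F<a.txt>. The second also
matches F<c.txt> and F<d.txt>, because C<not (ophelia and pac)> excludes
only documents that contain both words.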
=head2 Meta Tags

MetaNames are used to represent I<fields> (called I<columns> in a
database) and provide a way to search in only parts of a document.
See L<SWISH-CONFIG|SWISH-CONFIG/"Document Contents Directives"> for
a description of MetaNames, and how they are specified in the source
document.

To limit a search to words found in a meta tag, prefix the keywords
with the name of the meta tag, followed by the equal sign:

    metaname = word
    metaname = (this or that)
    metaname = ( (this or that) or "this phrase" )

It is not necessary to have spaces on either side of the "=", so
the following are equivalent:

    swish-e -w "metaName=word"
    swish-e -w "metaName = word"
    swish-e -w "metaName= word"

To search on a word that contains a "=", precede the "=" with a "\"
(backslash).

    swish-e -w "test\=3 = x\=4 or y\=5" -f <index.file>

This query returns the files where the word "x=4" is associated with
the metaName "test=3" or that contain the word "y=5" not associated
with any metaName.

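When a query is assembled by a program, the backslash rule can be applied
mechanically. The following is a small sketch in Perl; the helper names
C<escape_equals> and C<meta_term> are hypothetical and not part of
Swish-e, and the sketch assumes the input words contain no backslashes of
their own.

```perl
use strict;
use warnings;

# Put a backslash in front of every "=" so it is read as part of the word.
sub escape_equals {
    my ($word) = @_;
    $word =~ s/=/\\=/g;
    return $word;
}

# Build "metaname = word", escaping any "=" inside either part.
sub meta_term {
    my ($meta, $word) = @_;
    return escape_equals($meta) . ' = ' . escape_equals($word);
}

my $query = meta_term('test=3', 'x=4') . ' or ' . escape_equals('y=5');
print "$query\n";    # prints: test\=3 = x\=4 or y\=5
```

The resulting string is the same query that is passed to C<-w> in the
example above.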
Queries can also be constructed using any of the usual search features,
and metaName and plain searches can be mixed in a single query.

    swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f index.swish-e

This query will retrieve all the files in which "a1" or "a4" is found
in the META tag "metaName1" and that do not contain both of the words
"a3" and "a7", where "a3" and "a7" are not associated with any meta name.

=head2 Phrase Searching

To search for a phrase in a document use double-quotes to delimit your
search terms. (The phrase delimiter is set in src/swish.h.)

You must protect the quotes from the shell.

For example, under Unix:

    swish-e -w '"this is a phrase" or (this and that)'
    swish-e -w 'meta1=("this is a phrase") or (this and that)'

Or under Windows:

    swish-e -w \"this is a phrase\" or (this and that)

You can use the C<-P> switch to set the phrase delimiter character.
See L<SWISH-RUN|SWISH-RUN> for examples.

=head2 Context

At times you might not want to search for a word in every part of
your files since you know that the word(s) are present in a particular
tag. The ability to search according to context greatly increases the
chances that your hits will be relevant, and Swish-e provides a mechanism
to do just that.

The C<-t> option in the search command line allows you to search for words
that exist only in specific HTML tags. Each character in the string
you specify in the argument to this option represents a different tag
in which the word is searched; that is, you can use any combination of
the following characters:

    H means all <HEAD> tags
    B stands for <BODY> tags
    t is all <TITLE> tags
    h is <H1> to <H6> (header) tags
    e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
    c is HTML comment tags (<!-- ... -->)

    # This search will look for files with these two words in their titles only.
    swish-e -w "apples oranges" -t t -f myIndex

    # This search will look for files with these words in comments only.
    swish-e -w "keywords draft release" -t c -f myIndex

    # This search will look for words in titles, headers, and emphasized tags.
    swish-e -w "world wide web" -t the -f myIndex

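A front end that accepts a context string from a user may want to check
it before passing it to C<-t>. The following Perl sketch does that using
only the six characters listed above. The table C<%context_tags> and the
function C<describe_context> are hypothetical names for this
illustration.

```perl
use strict;
use warnings;

# The context characters listed above and the tags they stand for.
my %context_tags = (
    H => '<HEAD>',
    B => '<BODY>',
    t => '<TITLE>',
    h => '<H1> to <H6>',
    e => 'emphasized tags',
    c => 'HTML comments',
);

# Expand a -t argument such as "the" into the tags it selects.
# Dies on any character that is not in the table.
sub describe_context {
    my ($string) = @_;
    my @tags;
    for my $char (split //, $string) {
        die "Unknown context character '$char'\n"
            unless exists $context_tags{$char};
        push @tags, $context_tags{$char};
    }
    return @tags;
}

print join(', ', describe_context('the')), "\n";
```

Note that the characters are case sensitive: C<H> selects the HEAD
section while C<h> selects header tags.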
=head1 Searching with Perl

Perl ( http://www.perl.com/ ) is probably the most common programming
language used with Swish-e, especially in CGI interfaces. Perl makes
searching and parsing results with Swish-e easy, but if not done properly
can leave your server vulnerable to attacks.

When designing your CGI scripts you should carefully screen user input,
and include features such as paged results and a timer to limit the time
required for a search to complete. These protect your web site
against a denial of service (DoS) attack.

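One way to sketch the timer and the paging in Perl is shown below. It
assumes a platform where C<alarm> is available (typically Unix), and the
function names C<with_timeout> and C<page_of> are hypothetical. A real
script that runs Swish-e as a child process would also need to kill that
child when the timer fires, which this sketch does not do.

```perl
use strict;
use warnings;

# Run $code, but give up after $seconds. Returns whatever $code returns,
# or dies with "timeout\n" if the alarm fires first.
sub with_timeout {
    my ($seconds, $code) = @_;
    my @result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        @result = $code->();
        alarm 0;
        1;
    };
    unless ($ok) {
        my $error = $@;
        alarm 0;        # make sure no alarm is left pending
        die $error;     # re-throw "timeout\n" or the original error
    }
    return @result;
}

# Return one page of results; pages are numbered from 1.
sub page_of {
    my ($page, $per_page, @hits) = @_;
    my $first = ($page - 1) * $per_page;
    return () if $first < 0 || $first > $#hits;
    my $last = $first + $per_page - 1;
    $last = $#hits if $last > $#hits;
    return @hits[$first .. $last];
}

# Stand-in for a real search: 25 fake hits, show page 2 of 10 per page.
my @page = with_timeout(5, sub { page_of(2, 10, map { "hit$_" } 1 .. 25) });
print "@page\n";
```

Showing only one page per request keeps the response small, and the
timer bounds how long a single request can occupy the server.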
Included with every distribution of Perl is a document called perlsec --
Perl Security. I<Please> take time to read and understand that document
before writing CGI scripts in perl.

Type at your shell/command prompt:

    perldoc perlsec

If nothing else, start every CGI program in perl as such:

    #!/usr/local/bin/perl -wT
    use strict;

That alone won't make your script secure, but may help you find insecure
code.

=head2 CGI Danger!

There are many examples of CGI scripts on the Internet. Many are poorly
written and insecure. A commonly seen way to execute Swish-e from a
perl CGI script is with a I<piped open>. For example, it is common to
see this type of C<open()>:

    open(SWISH, "$swish -w $query -f $index|");

This C<open()> gives shell access to the entire Internet! Often an
attempt is made to strip C<$query> of I<bad> characters. But this
often fails since it's hard to guess what every I<bad> character is.
Would you have thought about a null? A better approach is to allow
I<in> only known safe characters.

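Allowing in only known characters might be sketched as follows. The
function name C<clean_query>, the particular set of allowed characters
and the 200-character limit are all assumptions made for this example;
choose a set that matches your own WordCharacters and query needs.

```perl
use strict;
use warnings;

# Return the query if it consists only of allowed characters, else undef.
# Allowed here (an assumption): word characters, space, parentheses,
# "=", double quote, "*" and "-", up to 200 characters in total.
sub clean_query {
    my ($raw) = @_;
    return undef unless defined $raw && length($raw) <= 200;
    return $raw =~ /\A([\w ()="*-]+)\z/ ? $1 : undef;
}

for my $input ('smilla and snow', 'title=("snow sense")',
               'snow; rm -rf /', "snow\0") {
    my $query = clean_query($input);
    print defined $query ? "accepted: $query\n" : "rejected\n";
}
```

The first two inputs are accepted and the last two are rejected. When
running under C<-T>, the value captured in C<$1> is also what perlsec
describes as untainted data.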
Even if you can be sure that any user supplied data is safe, this
I<piped open> still passes the command parameters through the shell.
If nothing else, it's just an extra unnecessary step to running Swish-e.

Therefore, the recommended approach is to fork and exec C<swish-e> directly
without passing through the shell. This process is described in the
perl man page C<perlipc> under the appropriate heading B<Safe Pipe Opens>.

Type:

    perldoc perlipc

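A minimal sketch of the fork and exec approach follows, in the spirit of
the perlipc section mentioned above. The function name
C<run_without_shell> is hypothetical. So that the sketch runs without
Swish-e installed, it uses perl itself (C<$^X>) as the external program.

```perl
use strict;
use warnings;

# Run a command without a shell and return its output lines.
# The arguments go to exec() as a list, so characters such as ";", "|"
# or "*" in a query reach the program unchanged.
sub run_without_shell {
    my @command = @_;

    # Implicit fork: in the child, STDOUT is connected to $fh in the parent.
    my $pid = open(my $fh, '-|');
    die "Cannot fork: $!\n" unless defined $pid;

    if ($pid == 0) {    # child
        exec { $command[0] } @command
            or die "Cannot exec $command[0]: $!\n";
    }

    my @lines = <$fh>;  # parent reads the child's output
    close $fh;          # also waits for the child to exit
    chomp @lines;
    return @lines;
}

# Demonstration: the child simply prints the arguments it received.
my @echoed = run_without_shell($^X, '-e', 'print "$_\n" for @ARGV',
                               'smilla; ls *');
print "child received: $echoed[0]\n";

# With Swish-e the call might look like this (program path and index
# name are taken from the examples in this document):
# my @results = run_without_shell('/usr/local/bin/swish-e',
#                                 '-w', $query, '-f', 'index.swish-e');
```

The demonstration prints C<child received: smilla; ls *>, which shows
that the semicolon and the wildcard were not interpreted by a shell.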
If all this sounds complicated you may wish to use a Perl module that
does all the hard work for you.

=head2 Perl Modules

There are a couple of Perl modules for accessing Swish-e. One of the
modules is included with the distribution, and the other module (or set
of modules) is located on CPAN. The included module provides a way to
embed Swish-e into your perl program, while the modules on CPAN provide an
abstracted interface to it. Hopefully, they make using Swish-e easier.

B<The Included SWISHE Perl Module>

When compiling Swish-e from source the build process creates a C library
(see the L<Swish-e INSTALL|INSTALL/"Installing_the_SWISH_E_C_Library">
documentation). The Swish-e distribution includes a F<perl> directory
with files required to create the F<SWISHE.pm> module. This module
will I<embed> Swish-e into your perl program so that searching does not
require running an external program. Embedding the Swish-e program into
your perl program results in faster Swish-e searches since it avoids the
cost of forking and exec'ing a separate program and opening the index
file for each request.

You will probably B<not> want to embed Swish-e into perl if running under
mod_perl as you will end up with very large Apache processes.

Building and usage instructions for the F<SWISHE.pm> module can be found
in the L<SWISH-PERL|SWISH-PERL> man page.

Here's an edited snip from that man page:

    my $handle = SwishOpen( $indexfiles )
        or die "Failed to open '$indexfiles'";

    my $num_results = SwishSearch($handle, $query, 1, $props, $sort);

    unless ( $num_results ) {
        print "No Results\n";

        my $error = SwishError( $handle );
        print "Error number: $error\n" if $error;

        return;  # or next.
    }

    while( my($rank,$index,$file,$title,$start,$size,@props)
        = SwishNext( $handle ))
    {
        print join( ' ',
            $rank,
            $index,
            $file,
            qq["$title"],
            $start,
            $size,
            map{ qq["$_"] } @props,
        ),"\n";
    }

B<SWISH Modules on CPAN>

The Comprehensive Perl Archive Network, or CPAN, is a collection of
modules for use with Perl. Always search CPAN (http://search.cpan.org/)
before starting any new program. Chances are someone has written just
what you need.

Modules for searching with Swish-e are also on CPAN. They can be found
at http://search.cpan.org/search?mode=module&query=SWISH

The main SWISH module (different from the SWISHE module included with
the Swish-e distribution) provides a high-level object-oriented interface
to Swish-e. The same interface can be used either to fork and exec the
Swish-e binary or to use the Swish-e C Library (if installed), by
changing just one line of code. A server interface will be written
when a Swish-e server is written.

The main idea is that you can write a program to search with Swish-e,
but not have to change your code (much) when you wish to change to a
new way of accessing Swish-e.

Here's an example of SWISH module usage from the synopsis:

    use SWISH;

    $sh = SWISH->connect('Fork',
        prog    => '/usr/local/bin/swish-e',
        indexes => 'index.swish-e',
        results => sub { print $_[1]->as_string,"\n" },
    ) or die $SWISH::errstr;

    $hits = $sh->query('metaname=(foo or bar)');

This takes care of running Swish-e in a secure way, parsing the output
from it, and providing OO methods of accessing the resulting data.

=head1 Document Info

$Id: SWISH-SEARCH.pod,v 1.4 2002/04/15 02:34:43 whmoseley Exp $

