Revision 1.1.1.1 (vendor branch), Fri Sep 20 19:47:29 2002 UTC, by adcroft
Branch: Import, MAIN
CVS Tags: baseline, HEAD
Log message: Importing web-site building process.

=head1 NAME

SWISH-SEARCH - Swish-e Searching Instructions

=head1 OVERVIEW

This page describes the process of searching with Swish-e. Please see
the L<SWISH-CONFIG|SWISH-CONFIG> page for information on the Swish-e
configuration file directives, and L<SWISH-RUN|SWISH-RUN> for a complete
list of command line arguments.

Searching a Swish-e index involves passing L<command line
arguments|SWISH-RUN> to it that specify the index file to use, and
the L<query|/"Searching Syntax and Operations"> (or search words) to
locate in the index. Swish-e returns a list of file names (or URLs)
that contain the matched search words. L<Perl|/"Searching with Perl">
is often used as a front end to Swish-e, such as in CGI applications,
and L<Perl modules|/"Perl Modules"> exist for interfacing with Swish-e.

=head1 Searching Syntax and Operations

The C<-w> command line argument (switch) is used to specify the search
query to Swish-e. When running Swish-e from a shell prompt, be careful
to protect your query from shell metacharacters and shell expansions.
This often means placing single or double quotes around your query.
See L<Searching with Perl> if you plan to use Perl as a front end
to Swish-e.

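As a minimal sketch of why the quoting matters, the POSIX shell snippet
below uses a hypothetical helper, C<show_args> (a stand-in, not part of
Swish-e), to print each argument a program would receive. It only
illustrates shell word splitting and does not run a search.

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for swish-e: it prints each
# argument it receives on its own line, wrapped in brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Unquoted: the shell splits the query into separate arguments
# (a "*" would also be expanded against file names in the current directory).
unquoted=$(show_args -w smilla or snow)

# Quoted: the whole query arrives as a single argument after -w.
quoted=$(show_args -w 'smilla or snow')

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

The unquoted call delivers four arguments, while the quoted call delivers
two: C<-w> and the complete query string.
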
The following sections describe various aspects of searching with Swish-e.

=head2 Boolean Operators

You can use the Boolean operators B<and>, B<or>, or B<not> in searching.
Without these Boolean operators, Swish-e will assume you're B<and>ing
the words together. The operators are not case sensitive.

[Note: you can change the default to B<or>ing by changing the variable
DEFAULT_RULE in the config.h file and recompiling Swish-e.]

Evaluation takes place from B<left to right> only, although you can use
parentheses to force the order of evaluation.

Examples:

    swish-e -w "smilla or snow" -f myIndex

retrieves files containing either of the words "smilla" or "snow".

    swish-e -w "smilla and snow not sense" -f myIndex
    swish-e -w "(smilla and snow) and not sense" -f myIndex    (same thing)

retrieves first the files that contain both the words "smilla" and
"snow"; then among those the ones that do not contain the word "sense".

=head2 Truncation

The wildcard (*) is available, but it can only be used at the end
of a word; otherwise it is considered a normal character (i.e. it can be
searched for if included in the WordCharacters directive).

    swish-e -w "librarian" -f myIndex

This query retrieves only files that contain the given word.

On the other hand:

    swish-e -w "librarian*" -f myIndex

retrieves "librarians", "librarianship", etc. along with "librarian".

Note that wildcard searches combined with word stemming can lead
to unexpected results. If stemming is enabled, a search term with a
wildcard will be stemmed internally before searching. So searching for
C<running*> will actually be a search for C<run*>, so C<running*> would
find C<runway>. Also, searching for C<runn*> will not find C<running>
as you might expect, since C<running> stems to C<run> in the index,
and thus C<runn*> will not find C<run>.

=head2 Order of Evaluation

Expressions are always evaluated left to right:

    swish-e -w "juliet not ophelia and pac" -f myIndex

retrieves files which contain "juliet" and "pac" but not "ophelia".

However, it is always possible to force the order of evaluation by using
parentheses. For example:

    swish-e -w "juliet not (ophelia and pac)" -f myIndex

retrieves files that contain "juliet" but do not contain both "ophelia"
and "pac".

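The difference between the two groupings can be checked with a small
emulation. The sketch below does not call Swish-e; it assumes plain
substring matching with C<grep> over four throwaway files (the file names
and the C<has> helper are hypothetical) and applies ordinary Boolean logic
to the two queries. Under that logic the parenthesized form excludes only
files that contain both "ophelia" and "pac".

```shell
#!/bin/sh
# Emulation only: plain grep over four tiny files, not swish-e itself.
dir=$(mktemp -d)
printf 'juliet pac\n'         > "$dir/a.txt"
printf 'juliet ophelia\n'     > "$dir/b.txt"
printf 'juliet ophelia pac\n' > "$dir/c.txt"
printf 'juliet\n'             > "$dir/d.txt"

# has WORD FILE: succeeds if FILE contains the string WORD.
has() { grep -q "$1" "$2"; }

left_to_right=""
grouped=""
for f in "$dir"/*.txt; do
    name=${f##*/}
    # "juliet not ophelia and pac" read as ((juliet not ophelia) and pac)
    if has juliet "$f" && ! has ophelia "$f" && has pac "$f"; then
        left_to_right="$left_to_right $name"
    fi
    # "juliet not (ophelia and pac)"
    if has juliet "$f" && ! { has ophelia "$f" && has pac "$f"; }; then
        grouped="$grouped $name"
    fi
done
rm -rf "$dir"

echo "left to right:$left_to_right"   # a.txt
echo "grouped:$grouped"               # a.txt b.txt d.txt
```

Note that F<b.txt>, which contains "ophelia" but not "pac", is still
returned by the grouped query.
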
=head2 Meta Tags

MetaNames are used to represent I<fields> (called I<columns> in a
database) and provide a way to search in only parts of a document.
See L<SWISH-CONFIG|SWISH-CONFIG/"Document Contents Directives"> for
a description of MetaNames, and how they are specified in the source
document.

To limit a search to words found in a meta tag, prefix the keywords
with the name of the meta tag, followed by the equal sign:

    metaname = word
    metaname = (this or that)
    metaname = ( (this or that) or "this phrase" )

It is not necessary to have spaces on either side of the "=", so
the following are equivalent:

    swish-e -w "metaName=word"
    swish-e -w "metaName = word"
    swish-e -w "metaName= word"

To search on a word that contains a "=", precede the "=" with a "\"
(backslash).

    swish-e -w "test\=3 = x\=4 or y\=5" -f <index.file>

This query returns the files where the word "x=4" is associated with
the metaName "test=3", or that contain the word "y=5" not associated
with any metaName.

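When the metaName or the word comes from a variable, the backslash rule
above can be applied mechanically. The following is a minimal sketch;
C<escape_equals> is a hypothetical helper name, and the snippet only builds
and prints the query string rather than running a search.

```shell
#!/bin/sh
# escape_equals is a hypothetical helper: it puts a backslash in front
# of every "=" in its argument.
escape_equals() {
    printf '%s\n' "$1" | sed 's/=/\\=/g'
}

meta=$(escape_equals 'test=3')
word=$(escape_equals 'x=4')

# The unescaped "=" in the middle is the metaName operator.
query="$meta = $word"
printf '%s\n' "$query"
```

The resulting string could then be passed, quoted, as the argument to
C<-w>.
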
Queries can also be constructed using any of the usual search features,
and metaName and plain searches can be mixed in a single query.

    swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f index.swish-e

This query will retrieve all the files in which "a1" or "a4" is found
in the META tag "metaName1" and that do not contain both "a3" and
"a7", where "a3" and "a7" are not associated with any meta name.

=head2 Phrase Searching

To search for a phrase in a document, use double quotes to delimit your
search terms. (The phrase delimiter is set in src/swish.h.)

You must protect the quotes from the shell.

For example, under Unix:

    swish-e -w '"this is a phrase" or (this and that)'
    swish-e -w 'meta1=("this is a phrase") or (this and that)'

Or under Windows:

    swish-e -w \"this is a phrase\" or (this and that)

You can use the C<-P> switch to set the phrase delimiter character.
See L<SWISH-RUN|SWISH-RUN> for examples.

=head2 Context

At times you might not want to search for a word in every part of
your files since you know that the word(s) are present in a particular
tag. The ability to search according to context greatly increases the
chances that your hits will be relevant, and Swish-e provides a mechanism
to do just that.

The C<-t> option on the search command line allows you to search for words
that exist only in specific HTML tags. Each character in the string
you specify in the argument to this option represents a different tag
in which the word is searched; that is, you can use any combination of
the following characters:

    H means all <HEAD> tags
    B stands for <BODY> tags
    t is all <TITLE> tags
    h is <H1> to <H6> (header) tags
    e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
    c is HTML comment tags (<!-- ... -->)

    # This search will look for files with these two words in their titles only.
    swish-e -w "apples oranges" -t t -f myIndex

    # This search will look for files with these words in comments only.
    swish-e -w "keywords draft release" -t c -f myIndex

    # This search will look for words in titles, headers, and emphasized tags.
    swish-e -w "world wide web" -t the -f myIndex

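If the context string comes from user input, it may be worth checking it
against the six letters listed above before passing it to C<-t>. The
sketch below is one way to do that in POSIX shell; C<valid_context> is a
hypothetical helper and is not part of Swish-e.

```shell
#!/bin/sh
# valid_context is a hypothetical helper: it succeeds only if its argument
# is non-empty and every character is one of the letters H B t h e c.
valid_context() {
    case $1 in
        ''|*[!HBthec]*) return 1 ;;
        *)              return 0 ;;
    esac
}

for ctx in the t Hx; do
    if valid_context "$ctx"; then
        echo "$ctx: ok"
    else
        echo "$ctx: rejected"
    fi
done
```

Here "the" and "t" are accepted, while "Hx" is rejected because "x" is not
a context letter.
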
=head1 Searching with Perl

Perl ( http://www.perl.com/ ) is probably the most common programming
language used with Swish-e, especially in CGI interfaces. Perl makes
searching and parsing results with Swish-e easy, but if not done properly
can leave your server vulnerable to attacks.

When designing your CGI scripts you should carefully screen user input,
and include features such as paged results and a timer to limit the time
required for a search to complete. These are to protect your web site
against a denial of service (DoS) attack.

Included with every distribution of Perl is a document called perlsec --
Perl Security. I<Please> take time to read and understand that document
before writing CGI scripts in Perl.

Type at your shell/command prompt:

    perldoc perlsec

If nothing else, start every CGI program in Perl as such:

    #!/usr/local/bin/perl -wT
    use strict;

That alone won't make your script secure, but may help you find insecure
code.

=head2 CGI Danger!

There are many examples of CGI scripts on the Internet. Many are poorly
written and insecure. A commonly seen way to execute Swish-e from a
Perl CGI script is with a I<piped open>. For example, it is common to
see this type of C<open()>:

    open(SWISH, "$swish -w $query -f $index|");

This C<open()> gives shell access to the entire Internet! Often an
attempt is made to strip C<$query> of I<bad> characters. But this
often fails since it's hard to guess what every I<bad> character is.
Would you have thought about a null? A better approach is to allow
I<in> only known safe characters.

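The allow-list idea can be sketched in a few lines of shell. The helper
name C<sanitize_query> and the particular set of permitted characters
(letters, digits, space, C<*>, C<=>, parentheses and the double quote) are
assumptions made for illustration; a real script would choose the set to
match its own index and would apply the same idea in Perl.

```shell
#!/bin/sh
# sanitize_query is a hypothetical helper: it keeps only characters on an
# allow-list and silently drops everything else. The list used here is an
# assumption and should be adapted to the characters your index accepts.
sanitize_query() {
    printf '%s' "$1" | LC_ALL=C tr -cd 'A-Za-z0-9 *=()"'
}

# Shell metacharacters such as ; | ` / and - are removed; the remaining
# letters ("rm", "rf", "id") are left behind as harmless search words.
clean=$(sanitize_query 'smilla; rm -rf / | `id` or snow*')
printf '%s\n' "$clean"
```

Dropping everything that is not explicitly permitted avoids having to
enumerate every dangerous character in advance.
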
Even if you can be sure that any user-supplied data is safe, this
I<piped open> still passes the command parameters through the shell.
If nothing else, it's just an extra unnecessary step in running Swish-e.

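The effect of passing a query through the shell can be reproduced without
Swish-e at all. In this sketch C<printf> stands in for the search program
(an assumption made so the snippet is self-contained), and the query value
is a made-up example of hostile input.

```shell
#!/bin/sh
# A made-up hostile query: the part after ";" is a second shell command.
query='snow; echo INJECTED'

# Unsafe: the query is pasted into a command string that a shell then
# parses, much as a piped open with an interpolated string does.
unsafe=$(sh -c "printf '%s\n' $query")

# Safer: the query is handed over as a single argument and is never
# re-parsed by a shell.
safe=$(printf '%s\n' "$query")

printf 'unsafe output:\n%s\n' "$unsafe"
printf 'safe output:\n%s\n' "$safe"
```

In the unsafe case the injected C<echo> runs as its own command and
produces a second line of output; in the safe case the whole string is
treated as data.
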
Therefore, the recommended approach is to fork and exec C<swish-e> directly
without passing through the shell. This process is described in the
Perl man page C<perlipc> under the heading B<Safe Pipe Opens>.

Type:

    perldoc perlipc

If all this sounds complicated you may wish to use a Perl module that
does all the hard work for you.

=head2 Perl Modules

There are a couple of Perl modules for accessing Swish-e. One of the
modules is included with the distribution, and the other module (or set
of modules) is located on CPAN. The included module provides a way to
embed Swish-e into your Perl program, while the modules on CPAN provide an
abstracted interface to it. Hopefully, they make using Swish-e easier.

B<The Included SWISHE Perl Module>

When compiling Swish-e from source the build process creates a C library
(see the L<Swish-e INSTALL|INSTALL/"Installing_the_SWISH_E_C_Library">
documentation). The Swish-e distribution includes a F<perl> directory
with files required to create the F<SWISHE.pm> module. This module
will I<embed> Swish-e into your Perl program so that searching does not
require running an external program. Embedding the Swish-e program into
your Perl program results in faster Swish-e searches since it avoids the
cost of forking and exec'ing a separate program and opening the index
file for each request.

You will probably B<not> want to embed Swish-e into Perl if running under
mod_perl as you will end up with very large Apache processes.

Building and usage instructions for the F<SWISHE.pm> module can be found
in the L<SWISH-PERL|SWISH-PERL> man page.

Here's an edited snippet from that man page:

    my $handle = SwishOpen( $indexfiles )
        or die "Failed to open '$indexfiles'";

    my $num_results = SwishSearch($handle, $query, 1, $props, $sort);

    unless ( $num_results ) {
        print "No Results\n";

        my $error = SwishError( $handle );
        print "Error number: $error\n" if $error;

        return;  # or next.
    }

    while( my($rank,$index,$file,$title,$start,$size,@props)
        = SwishNext( $handle ))
    {
        print join( ' ',
              $rank,
              $index,
              $file,
              qq["$title"],
              $start,
              $size,
              map{ qq["$_"] } @props,
            ),"\n";
    }

B<SWISH Modules on CPAN>

The Comprehensive Perl Archive Network, or CPAN, is a collection of
modules for use with Perl. Always search CPAN (http://search.cpan.org/)
before starting any new program. Chances are someone has written just
what you need.

CPAN also has modules for searching with Swish-e
(http://search.cpan.org/search?mode=module&query=SWISH). The main
SWISH module (different from the SWISHE module included with the
Swish-e distribution) provides a high-level object-oriented interface
to Swish-e, and the same interface can be used to either fork
and exec the Swish-e binary, or use the Swish-e C library if installed,
by just changing one line of code. A server interface will be written
when a Swish-e server is written.

The main idea is that you can write a program to search with Swish-e,
but not have to change your code (much) when you wish to change to a
new way of accessing Swish-e.

Here's an example of SWISH module usage from the synopsis:

    use SWISH;

    $sh = SWISH->connect('Fork',
        prog     => '/usr/local/bin/swish-e',
        indexes  => 'index.swish-e',
        results  => sub { print $_[1]->as_string,"\n" },
    ) or die $SWISH::errstr;

    $hits = $sh->query('metaname=(foo or bar)');

This takes care of running Swish-e in a secure way, parsing the output
from it, and providing OO methods of accessing the resulting data.

=head1 Document Info

$Id: SWISH-SEARCH.pod,v 1.4 2002/04/15 02:34:43 whmoseley Exp $

.
