=head1 NAME

SWISH-SEARCH - Swish-e Searching Instructions

=head1 OVERVIEW

This page describes the process of searching with Swish-e.  Please see
the L<SWISH-CONFIG|SWISH-CONFIG> page for information on the Swish-e
configuration file directives, and L<SWISH-RUN|SWISH-RUN> for a complete
list of command line arguments.

Searching a Swish-e index involves passing L<command line
arguments|SWISH-RUN> to Swish-e that specify the index file to use and
the L<query|/"Searching Syntax and Operations"> (or search words) to
locate in the index.  Swish-e returns a list of file names (or URLs)
that contain the matched search words.  L<Perl|/"Searching with Perl">
is often used as a front end to Swish-e, for example in CGI applications,
and L<Perl modules|/"Perl Modules"> exist for interfacing with Swish-e.

=head1 Searching Syntax and Operations

The C<-w> command line argument (switch) is used to specify the search
query to Swish-e.  When running Swish-e from a shell prompt, be careful
to protect your query from shell metacharacters and shell expansions.
This often means placing single or double quotes around your query.
See L</"Searching with Perl"> if you plan to use Perl as a front end
to Swish-e.

The following sections describe various aspects of searching with Swish-e.

=head2 Boolean Operators

You can use the Boolean operators B<and>, B<or>, and B<not> in searching.
Without these Boolean operators, Swish-e will assume you're B<and>ing
the words together.  The operators are not case sensitive.

[Note: you can change the default to B<or>ing by changing the variable
DEFAULT_RULE in the config.h file and recompiling Swish-e.]

Evaluation takes place from B<left to right> only, although you can use
parentheses to force the order of evaluation.

Examples:

    swish-e -w "smilla or snow" -f myIndex

retrieves files containing either the word "smilla" or the word "snow".

    swish-e -w "smilla and snow not sense" -f myIndex
    swish-e -w "(smilla and snow) and not sense" -f myIndex  (same thing)

first retrieves the files that contain both the words "smilla" and
"snow", then keeps only those that do not contain the word "sense".

=head2 Truncation

The wildcard (*) is available, but it can only be used at the end
of a word.  Anywhere else it is considered a normal character (that is,
it can be searched for if it is included in the WordCharacters directive).

    swish-e -w "librarian" -f myIndex

retrieves only files that contain the given word.

On the other hand:

    swish-e -w "librarian*" -f myIndex

retrieves "librarians", "librarianship", etc. along with "librarian".

Note that wildcard searches combined with word stemming can lead
to unexpected results.  If stemming is enabled, a search term with a
wildcard is stemmed internally before searching.  A search for
C<running*> therefore becomes a search for C<run*>, which also
finds C<runway>.  Conversely, a search for C<runn*> will not find
C<running> as you might expect: C<running> is stored in the index as
its stem C<run>, and C<runn*> does not match C<run>.

=head2 Order of Evaluation

Expressions are always evaluated left to right:

    swish-e -w "juliet not ophelia and pac" -f myIndex

retrieves files which contain "juliet" and "pac" but not "ophelia".

However, it is always possible to force the order of evaluation by using
parentheses.  For example:

    swish-e -w "juliet not (ophelia and pac)" -f myIndex

retrieves files that contain "juliet" but do not contain both "ophelia"
and "pac".

=head2 Meta Tags

MetaNames are used to represent I<fields> (called I<columns> in a
database) and provide a way to search in only parts of a document.
See L<SWISH-CONFIG|SWISH-CONFIG/"Document Contents Directives"> for
a description of MetaNames, and how they are specified in the source
document.

To limit a search to words found in a meta tag, prefix the keywords
with the name of the meta tag, followed by the equal sign:

    metaname = word
    metaname = (this or that)
    metaname = ( (this or that) or "this phrase" )

It is not necessary to have spaces on either side of the "=", so
the following are equivalent:

    swish-e -w "metaName=word"
    swish-e -w "metaName = word"
    swish-e -w "metaName= word"

To search on a word that contains a "=", precede the "=" with a "\"
(backslash).

    swish-e -w "test\=3 = x\=4 or y\=5" -f <index.file>

This query returns the files where the word "x=4" is associated with
the metaName "test=3", or that contain the word "y=5" not associated
with any metaName.

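
When a query is assembled in a program, the backslash rule above can be
applied mechanically.  The following is a minimal sketch in Perl; the
helper names C<escape_equals> and C<meta_query> are made up for this
example and are not part of Swish-e.

```perl
use strict;
use warnings;

# Hypothetical helper: put a backslash before every "=" so that
# Swish-e treats it as part of the word, not as the metaname separator.
sub escape_equals {
    my ($text) = @_;
    $text =~ s/=/\\=/g;
    return $text;
}

# Hypothetical helper: build "metaname = word" with both sides escaped.
sub meta_query {
    my ($metaname, $word) = @_;
    return escape_equals($metaname) . ' = ' . escape_equals($word);
}

# Rebuilds the query used in the example above.
my $query = meta_query('test=3', 'x=4') . ' or ' . escape_equals('y=5');
print "$query\n";    # prints: test\=3 = x\=4 or y\=5
```

The resulting string still has to be protected from the shell if it is
typed at a prompt, since the backslash is also a shell metacharacter.
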
Queries can also be constructed using any of the usual search features,
and metaName and plain searches can be mixed in a single query.

    swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f index.swish-e

This query retrieves all the files in which "a1" or "a4" is found
in the META tag "metaName1" and that do not contain both of the words
"a3" and "a7", where "a3" and "a7" are not associated with any meta name.

=head2 Phrase Searching

To search for a phrase in a document, use double quotes to delimit your
search terms.  (The phrase delimiter is set in src/swish.h.)

You must protect the quotes from the shell.

For example, under Unix:

    swish-e -w '"this is a phrase" or (this and that)'
    swish-e -w 'meta1=("this is a phrase") or (this and that)'

Or under Windows:

    swish-e -w \"this is a phrase\" or (this and that)

You can use the C<-P> switch to set the phrase delimiter character.
See L<SWISH-RUN|SWISH-RUN> for examples.

=head2 Context

At times you might not want to search for a word in every part of
your files, since you know that the word(s) are present in a particular
tag.  The ability to search according to context greatly increases the
chances that your hits will be relevant, and Swish-e provides a mechanism
to do just that.

The C<-t> option on the search command line allows you to search for words
that exist only in specific HTML tags.  Each character in the string
you pass to this option represents a different tag in which the word
is searched; you can use any combination of the following characters:

    H means all <HEAD> tags
    B stands for <BODY> tags
    t is all <TITLE> tags
    h is <H1> to <H6> (header) tags
    e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
    c is HTML comment tags (<!-- ... -->)

    # This search will look for files with these two words in their titles only.
    swish-e -w "apples oranges" -t t -f myIndex

    # This search will look for files with these words in comments only.
    swish-e -w "keywords draft release" -t c -f myIndex

    # This search will look for words in titles, headers, and emphasized tags.
    swish-e -w "world wide web" -t the -f myIndex

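
A front end may prefer readable names to single letters.  The sketch
below maps names onto the C<-t> letters listed above.  The long names and
the helper C<context_flags> are assumptions made for this example; only
the letters themselves come from Swish-e.

```perl
use strict;
use warnings;

# Letters accepted by -t, as listed in the table above.
# The long names on the left are made up for this example.
my %CONTEXT_FLAG = (
    head     => 'H',
    body     => 'B',
    title    => 't',
    header   => 'h',
    emphasis => 'e',
    comment  => 'c',
);

# Hypothetical helper: turn a list of names into a -t argument,
# skipping letters that were already added.
sub context_flags {
    my @names = @_;
    my $flags = '';
    for my $name (@names) {
        my $letter = $CONTEXT_FLAG{$name}
            or die "Unknown context '$name'\n";
        $flags .= $letter unless index($flags, $letter) >= 0;
    }
    return $flags;
}

# Titles, headers, and emphasized text: the same as "-t the" above.
my @command = ('swish-e', '-w', 'world wide web',
               '-t', context_flags(qw(title header emphasis)),
               '-f', 'myIndex');
print "-t argument: $command[4]\n";    # prints: -t argument: the
```

Keeping the command as a list rather than one string means it can later
be handed to C<exec> without going through the shell.
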
=head1 Searching with Perl

Perl ( http://www.perl.com/ ) is probably the most common programming
language used with Swish-e, especially in CGI interfaces.  Perl makes
searching and parsing results with Swish-e easy, but if not done properly
it can leave your server vulnerable to attacks.

When designing your CGI scripts you should carefully screen user input,
and include features such as paged results and a timer to limit the time
a search may take to complete.  These protect your web site
against a denial of service (DoS) attack.

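
The timer and paging ideas can be sketched with Perl's built-in C<alarm>.
This is only one possible approach, and it assumes a platform where
C<alarm> and signals work (a Unix-like system).  The helper names
C<run_with_timeout> and C<page_of> are invented for this example, and the
list of hits is a stand-in for real search results.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds.
# Returns the code's result, or undef if the time limit was hit.
sub run_with_timeout {
    my ($seconds, $code) = @_;
    my $result = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        my $value = $code->();
        alarm 0;
        $value;
    };
    alarm 0;    # make sure no alarm is left pending
    if ( !defined $result && $@ ) {
        die $@ unless $@ eq "timeout\n";    # re-throw other errors
        return;
    }
    return $result;
}

# Hypothetical helper: return one page of an already fetched result list.
sub page_of {
    my ($results, $page, $per_page) = @_;
    my $first = ($page - 1) * $per_page;
    return () if $page < 1 || $first > $#$results;
    my $last = $first + $per_page - 1;
    $last = $#$results if $last > $#$results;
    return @{$results}[ $first .. $last ];
}

my @hits = map { "file$_.html" } 1 .. 25;    # stand-in for search results

my $count = run_with_timeout( 5, sub { scalar @hits } );
print defined $count ? "Got $count hits\n" : "Search timed out\n";

my @page = page_of( \@hits, 3, 10 );
print "Page 3: @page\n";
```

If the timed code starts a child process, the child should also be
stopped when the timer fires; that step is left out of this sketch.
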
Included with every distribution of Perl is a document called perlsec --
Perl Security.  I<Please> take time to read and understand that document
before writing CGI scripts in Perl.

Type at your shell/command prompt:

    perldoc perlsec

If nothing else, start every CGI program in Perl as such:

    #!/usr/local/bin/perl -wT
    use strict;

That alone won't make your script secure, but may help you find insecure
code.

=head2 CGI Danger!

There are many examples of CGI scripts on the Internet.  Many are poorly
written and insecure.  A commonly seen way to execute Swish-e from a
Perl CGI script is with a I<piped open>.  For example, it is common to
see this type of C<open()>:

    open(SWISH, "$swish -w $query -f $index|");

This C<open()> gives shell access to the entire Internet!  Often an
attempt is made to strip C<$query> of I<bad> characters.  But this
often fails, since it's hard to guess what every I<bad> character is.
Would you have thought about a null?  A better approach is to allow
I<in> only characters known to be safe.

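
As a sketch of the "allow in" approach, the check below accepts a query
only when every character is on an explicit list.  The helper name
C<clean_query>, the particular character list, and the 200 character
limit are assumptions for illustration; choose values that suit your
own index.

```perl
use strict;
use warnings;

# Hypothetical helper: accept a query only if every character is on
# the allow list; return the captured query, or undef otherwise.
# The list (ASCII letters, digits, underscore, space, and a few query
# characters) is an assumption; adjust it to your own index.
sub clean_query {
    my ($raw) = @_;
    return unless defined $raw && length($raw) <= 200;
    return unless $raw =~ /\A([A-Za-z0-9_ ()"*=]+)\z/;
    return $1;    # a regex capture is untainted under -T
}

for my $input ( 'smilla and (snow or ice*)', 'snow; rm -rf /', "snow\0" ) {
    my $query = clean_query($input);
    print defined $query ? "accepted: $query\n" : "rejected\n";
}
```

The pattern is anchored with C<\A> and C<\z> so that a trailing newline or
a null cannot slip through, and anything not listed is rejected rather
than stripped.
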
Even if you can be sure that any user-supplied data is safe, this
I<piped open> still passes the command parameters through the shell.
If nothing else, it's an unnecessary extra step in running Swish-e.

Therefore, the recommended approach is to fork and exec C<swish-e> directly
without passing through the shell.  This process is described in the
Perl man page C<perlipc> under the heading B<Safe Pipe Opens>.

Type:

    perldoc perlipc

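
A rough sketch of the fork and exec approach follows, modeled on the
pattern in C<perlipc>.  It assumes a Unix-like system where the forking
form of C<open> is available.  The helper name C<run_without_shell> is
invented here, and Perl itself (C<$^X>) stands in for C<swish-e> so that
the example runs without an index.

```perl
use strict;
use warnings;

# Hypothetical helper: run a program with an argument list and return
# its output lines, without passing anything through the shell.
sub run_without_shell {
    my ($program, @args) = @_;

    # Implicit fork: the parent reads from $child, and the child's
    # STDOUT is connected to that handle.
    my $pid = open( my $child, '-|' );
    die "Cannot fork: $!\n" unless defined $pid;

    if ( $pid == 0 ) {
        # Child: the list form of exec never invokes a shell.
        exec { $program } $program, @args
            or die "Cannot exec $program: $!\n";
    }

    # Parent: collect the child's output.
    my @lines = <$child>;
    close $child;    # waits for the child; its exit status is in $?
    return @lines;
}

# Demonstration with perl ($^X) standing in for swish-e.  The child
# prints each argument it received on its own line.  The "query"
# contains shell metacharacters, which arrive in the child unchanged.
my $query = 'snow; echo $HOME `date`';
my @out = run_without_shell( $^X, '-e', 'print "$_\n" for @ARGV',
                             '--', '-w', $query, '-f', 'myIndex' );
print @out;
```

To search for real, the program and arguments would be replaced with the
path to your Swish-e binary and its switches, for example
C<run_without_shell('/usr/local/bin/swish-e', '-w', $query, '-f', $index)>.
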
If all this sounds complicated you may wish to use a Perl module that
does all the hard work for you.

=head2 Perl Modules

There are a couple of Perl modules for accessing Swish-e.  One of the
modules is included with the distribution, and the other module (or set
of modules) is located on CPAN.  The included module provides a way to
embed Swish-e into your Perl program, while the modules on CPAN provide an
abstracted interface to it.  Hopefully, they make using Swish-e easier.

B<The Included SWISHE Perl Module>

When compiling Swish-e from source, the build process creates a C library
(see the L<Swish-e INSTALL|INSTALL/"Installing_the_SWISH_E_C_Library">
documentation).  The Swish-e distribution includes a F<perl> directory
with files required to create the F<SWISHE.pm> module.  This module
will I<embed> Swish-e into your Perl program so that searching does not
require running an external program.  Embedding Swish-e into
your Perl program results in faster searches, since it avoids the
cost of forking and exec'ing a separate program and opening the index
file for each request.

You will probably B<not> want to embed Swish-e into Perl if running under
mod_perl, as you will end up with very large Apache processes.

Building and usage instructions for the F<SWISHE.pm> module can be found
in the L<SWISH-PERL|SWISH-PERL> man page.

Here's an edited snippet from that man page:

    my $handle = SwishOpen( $indexfiles )
        or die "Failed to open '$indexfiles'";

    my $num_results = SwishSearch($handle, $query, 1, $props, $sort);

    unless ( $num_results ) {
        print "No Results\n";

        my $error = SwishError( $handle );
        print "Error number: $error\n" if $error;

        return;  # or next.
    }

    while( my($rank,$index,$file,$title,$start,$size,@props)
        = SwishNext( $handle ))
    {
        print join( ' ',
              $rank,
              $index,
              $file,
              qq["$title"],
              $start,
              $size,
              map{ qq["$_"] } @props,
              ),"\n";
    }

B<SWISH Modules on CPAN>

The Comprehensive Perl Archive Network, or CPAN, is a collection of
modules for use with Perl.  Always search CPAN (http://search.cpan.org/)
before starting any new program.  Chances are someone has written just
what you need.

CPAN also has modules for searching with Swish-e.  They can be found at:

    http://search.cpan.org/search?mode=module&query=SWISH

The main SWISH module (different from the SWISHE module included with the
Swish-e distribution) provides a high-level object-oriented interface
to Swish-e.  The same interface can be used either to fork
and exec the Swish-e binary or to use the Swish-e C library (if installed),
by changing just one line of code.  A server interface will be written
when a Swish-e server is written.

The main idea is that you can write a program to search with Swish-e,
but not have to change your code (much) when you wish to change to a
new way of accessing Swish-e.

Here's an example of SWISH module usage, adapted from its synopsis:

    use SWISH;

    my $sh = SWISH->connect('Fork',
        prog     => '/usr/local/bin/swish-e',
        indexes  => 'index.swish-e',
        results  => sub { print $_[1]->as_string,"\n" },
    ) or die $SWISH::errstr;

    my $hits = $sh->query('metaname=(foo or bar)');

This takes care of running Swish-e in a secure way, parsing the output
from it, and providing OO methods of accessing the resulting data.

=head1 Document Info

$Id: SWISH-SEARCH.pod,v 1.4 2002/04/15 02:34:43 whmoseley Exp $

.