=head1 NAME

SWISH-RUN - Running Swish-e and Command Line Switches

=head1 OVERVIEW

The Swish-e program is controlled by command line arguments (called
I<switches>). Often, it is run manually from a shell (command
prompt), or from a program such as a CGI script that passes the command
line arguments to swish.

Note: A number of the command line switches may be specified in the
Swish-e configuration file specified with the C<-c> command line argument.
Please see L<SWISH-CONFIG|SWISH-CONFIG> for a complete description of
available configuration file directives.

There are two basic operating modes of Swish-e: indexing and searching.
There are command line arguments that are unique to each mode, and
others that apply to both (yet may have different meaning depending on
the operating mode). These command line arguments are listed below,
grouped by:

L<INDEXING|/"INDEXING"> -- describes the command line arguments used
while indexing.

L<SEARCHING|/"SEARCHING"> -- lists the command line arguments used while
searching.

L<OTHER SWITCHES|/"OTHER SWITCHES"> -- lists switches that don't apply
to searching or indexing.

Beginning with Swish-e version 2.1, you may embed its search engine into
your applications. Please see L<SWISH-LIBRARY|SWISH-LIBRARY>.

=head1 INDEXING

Swish-e indexing is initiated by passing I<command line arguments> to
swish. The command line arguments used for I<searching> are described
in L<SEARCHING|/"SEARCHING">. Also, see L<SWISH-SEARCH|SWISH-SEARCH>
for examples of searching with Swish-e.

Swish-e usage:

    swish-e [-i dir file ... ] [-c file] [-f file] [-l] \
            [-v (num)] [-S method(fs|http|prog)] [-N path]

The C<-h> switch (help) will list the available Swish-e command line
arguments:

    swish-e -h

Typically, most if not all indexing settings are placed in a configuration
file (specified with the C<-c> switch). Once the configuration file is
set up, indexing is initiated as:

    swish-e -c /path/to/config/file

See L<SWISH-CONFIG|SWISH-CONFIG> for information on the configuration
file.

Security Note: If the swish binary is named F<swish-search> then swish
will not allow any operation that would cause swish to write to the
index file.

When indexing it may be advisable to index to a temporary file, and
then after indexing has successfully completed rename the file to the
final location. This is especially important when replacing an index
that is currently in use.

    swish-e -c swish.config -f index.tmp
    [check return code from swish or look for err: output]
    mv index.tmp index.swish-e

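The check-then-rename step could be scripted along the following lines.
This is a minimal sketch, not part of the Swish-e distribution: the function
name C<safe_reindex> is hypothetical, and it assumes that a failed run shows up
either as a non-zero exit status or as output containing the text C<err:>.

```shell
#!/bin/sh
# safe_reindex TMP FINAL COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND, and moves TMP to FINAL only if COMMAND exits with status 0
# and none of its output contains "err:".
safe_reindex() {
    tmp=$1
    final=$2
    shift 2

    # Capture stdout and stderr; the "if" sees the command's exit status.
    if ! out=$("$@" 2>&1); then
        echo "indexing failed: non-zero exit status" >&2
        return 1
    fi

    # Assumption: an error message contains the text "err:".
    case $out in
        *err:*)
            echo "indexing failed: error reported in output" >&2
            return 1
            ;;
    esac

    mv "$tmp" "$final"
}
```

It might be called as
C<safe_reindex index.tmp index.swish-e swish-e -c swish.config -f index.tmp>.
The index in use is left untouched unless the indexing run succeeded.
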
=head2 Indexing Command Line Arguments

=over 4

=item -i *directories and/or files* (input file)

This specifies the directories and/or files to index. Directories will be
indexed recursively. This is typically specified in the L<configuration
file|SWISH-CONFIG> with the B<IndexDir> directive instead of on the
command line. Use of this switch overrides the configuration file
settings.

=item -S [fs|http|prog] (document source/access mode)

This specifies the method to use for accessing documents to index.
It can be C<fs> for local indexing via the file system (the default),
C<http> for spidering, or C<prog> for reading documents from an external program.

Example configuration files that demonstrate indexing with the different
document source methods are located in the C<conf> directory.

See the L<SWISH-FAQ|SWISH-FAQ> for a discussion of the different indexing methods, and of the difference
between spidering with the http method and using the file system method.

=over 4

=item fs - file system

The C<fs> method simply reads files from a local (or networked) drive. This is the default
method if the C<-S> switch is not specified.
See L<SWISH-CONFIG|SWISH-CONFIG> for configuration
directives specific to the C<fs> method.

=item http - spider a web server

The C<http> method is used to spider web servers. It uses an included helper
program called F<swishspider> located in the F<src> directory. Swish needs to be able to locate
this program when using the C<http> method. See L<SWISH-CONFIG|SWISH-CONFIG> for configuration
directives specific to the C<http> method.

By default, swish looks in the current directory for the F<swishspider> program, or in the directory
specified by the C<SwishSpiderDir> directive. The first line of the F<swishspider> program
(the "shebang" line) must point to the location of the Perl program (if your operating system uses it).

Security Note: Under Windows, swish passes the URLs fetched from remote documents through the shell (swish
uses the system() command for running F<swishspider> under Windows), and this may be considered
an additional security risk.

The C<http> method is deprecated (or at least not very well appreciated). Consider using
the C<prog> method described below for spidering. There's a spider program available in the
F<prog-bin> directory for use with the C<prog> method.

=item prog - general purpose access method

The C<prog> method is new to Swish-e version 2.2. It's designed as a general
purpose method to feed documents to swish from an external program.

For example, the external program can read a database (e.g. MySQL), spider a web
server, or convert documents from one format to another (e.g. pdf to html). Or
you can simply use it to read files from the file system (like C<-S fs>), yet retain
full control over which files are indexed.

The external program name to run is passed to swish either by the L<IndexDir|SWISH-CONFIG/"item_IndexDir"> directive,
or via the C<-i> option. Additional parameters may be passed to the external program
via the L<SwishProgParameters|SWISH-CONFIG/"item_SwishProgParameters"> directive.

A special name "stdin" may be used with C<-i> or L<IndexDir|SWISH-CONFIG/"item_IndexDir">
which tells swish to read from standard input instead of from an external program. See example below.

The external program prints to standard output (which swish captures)
a set of headers followed by the content of the file to index. The output looks similar to
an email message or an HTTP document returned by a web server in that it includes name/value pairs
of headers, a blank line, and the content.

The content length is determined by a content-length header
supplied to swish by the program; there is no "end of record" character or flag sent between documents.
Therefore, it is critical that the content-length header is correct. This is a common source of errors.

One advantage of this method (over using filters, for example) is that the external program is run only once
for the entire indexing job, instead of once for every document. This avoids forking and creating
a new process for every document, and makes a huge difference when your external program is something like
perl that has a large startup cost.

Here's a simple example written in Perl:

    #!/usr/local/bin/perl -w
    use strict;

    # Build a document
    my $doc = <<EOF;
    <html>
    <head>
        <title>Document Title</title>
    </head>
        <body>
            This is the text.
        </body>
    </html>
    EOF


    # Prepare the headers for swish
    my $path = 'Example.file';
    my $size = length $doc;
    my $mtime = time;

    # Output the document (to swish)
    print <<EOF;
    Path-Name: $path
    Content-Length: $size
    Last-Mtime: $mtime
    Document-Type: HTML

    EOF

    print $doc;

The external program must pass to swish the C<Path-Name:> and C<Content-Length:> headers.
The optional C<Last-Mtime:> parameter is the last modification time of the file, and must
be a time stamp (seconds since the Epoch on your platform). You may override swish's
determination of document type (C<IndexContents>) by using the C<Document-Type:> header.

The above program only returns one document and exits, which is not very useful. Normally,
your program would read data from some source, such as files or a database, format it as
XML, HTML, or text, and pass the documents to swish, one after another. The C<Content-Length:> header
tells swish where each document ends -- there is not any special "end of record" character or
marker.

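Because swish relies entirely on C<Content-Length:>, it is safest to compute
the value from the bytes that are actually sent. The following shell sketch shows
one way to do that. The function name C<emit_doc> is hypothetical, and the sketch
assumes each document already exists as a file so that C<wc -c> can count its bytes.
Only the two required headers are written.

```shell
#!/bin/sh
# emit_doc PATH_NAME FILE   (hypothetical helper)
# Prints FILE in the -S prog format: headers, a blank line, then the content.
emit_doc() {
    path_name=$1
    file=$2

    # wc -c counts bytes; the arithmetic expansion drops any padding
    # that some wc implementations print around the number.
    size=$(( $(wc -c < "$file") ))

    printf 'Path-Name: %s\n' "$path_name"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'
    cat "$file"
}
```

Several documents can be sent one after another, for example
C<for f in *.html; do emit_doc "$f" "$f"; done | ./swish-e -S prog -i stdin>.
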
To index with the above example you need to make sure that the program is executable
(and that the path to perl is correct), and then call swish, telling it to run in C<prog>
mode and giving the name of the program to use for input.

    % chmod 755 example.pl
    % ./swish-e -S prog -i ./example.pl

Programs can and should be tested prior to running swish. For example:

    % ./example.pl > test.out

A few more useful example programs are provided in the swish-e distribution
located in the F<prog-bin> directory. Some include documentation:

    % cd prog-bin
    % perldoc spider.pl

Others are small examples that include comments:

    % cd prog-bin
    % less DirTree.pl

The F<spider.pl> program can be used as a replacement for the C<-S http> method.

If you use the special program name "stdin" with C<-i> or L<IndexDir|SWISH-CONFIG/"item_IndexDir">
then swish-e will read from standard input instead of from a program. For example:

    % ./example.pl /path/to/data --count=1000 | ./swish-e -S prog -i stdin

This is basically the same as placing the following in a swish-e configuration file:

    SwishProgParameters /path/to/data --count=1000
    IndexDir ./example.pl

and running

    % ./swish-e -S prog -c swish.conf

This gives an easy way to run swish without a configuration file
with a C<-S prog> program that requires parameters.

Using "stdin" might also be useful for programs that call swish (instead of swish calling the
program).

(The reason "stdin" is used instead of the more common "-" dash is due to the rotten way
swish parses the command line. This should be fixed in the future.)

The C<prog> method bypasses some of the configuration parameters available
to the file system method -- settings such as
C<IndexOnly>, C<FileRules>, C<FileMatch> and C<FollowSymLinks>
are ignored when using the C<prog> method. It's expected that these operations
are better accomplished in the external program before passing the document on to swish. In
other words, when using the C<prog> method, only send the documents to swish
that you want indexed.

You may use swish's filter feature with the C<prog> method, but performance will be better if you
run filtering programs from within your external program.

B<Notes when using -S prog on MS Windows>

Windows does not use the shebang (#!) line of a program to determine the program to run. So, when running,
for example, a perl program you will need to specify the perl.exe binary as the program, and use the
C<SwishProgParameters> to name the file.

    IndexDir e:/perl/bin/perl.exe
    SwishProgParameters read_database.pl

Swish will replace the forward slashes with backslashes before running the command specified with
C<IndexDir>. Swish uses the popen(3) command which passes the command through the shell.

=back

=item -f *indexfile* (index file)

If you are indexing, this specifies the file to save the generated index in,
and you can only specify one file. See also B<IndexFile> in the L<configuration file|SWISH-CONFIG>.

If you are searching, this specifies the index
files (one or more) to search from. The default index file is index.swish-e in the current directory.

=item -c *file ...* (configuration files)

Specify the configuration file(s) to use for indexing. This file contains many directives that
control how Swish-e proceeds.
See L<SWISH-CONFIG|SWISH-CONFIG> for a complete listing of configuration file directives.

Example:

    swish-e -c docs.conf

If you specify a directory to index, an index file, or the verbose option on the command-line,
these values will override any specified in the configuration file.

You can specify multiple configuration files. For example, you may have one configuration file
that has common site-wide settings, and another for a specific index.

Examples:

    1) swish-e -c swish-e.conf
    2) swish-e -i /usr/local/www -f index.swish-e -v -c swish-e.conf
    3) swish-e -c swish-e.conf stopwords.conf

=over 3

=item 1

The settings in the configuration file will be used to index a site.

=item 2

These command-line options will override anything in the configuration file.

=item 3

The variables in swish-e.conf will be read, then the variables in stopwords.conf will be read.
Note that if the same variable occurs in both files, the earlier value may be overwritten.

=back

=item -e (economy mode)

For large sites indexing may require more RAM than is available. The C<-e> switch tells swish to use
disk space to store data structures while indexing, saving memory. This option is recommended if
swish uses so much RAM that the computer begins to swap excessively, and you cannot increase available
memory. The trade-off is longer indexing times, and a busy disk drive.

=item -l (symbolic links)

Specifying this option tells swish to follow symbolic links when indexing.
The configuration file value B<FollowSymLinks> will override the command-line value.

The default is not to follow symlinks. A small improvement in indexing time may result
from enabling FollowSymLinks since swish does not need to stat every directory and file
processed to determine if it is a symbolic link.

=item -N path (index only newer files)

The C<-N> option takes a path to a file, and only files I<newer> than the specified
file will be indexed. This is helpful for creating incremental indexes -- that is,
indexes that contain just the files added since the last full index of all files was created.

Example (bad example)

    swish-e -c config.file -N index.swish-e -f index.new

This will index as normal, but only files with a modified date newer
than F<index.swish-e> will be indexed.

This is a bad example because it uses F<index.swish-e> which one might assume
was the date of last indexing. The problem is that files might have been added
between the time indexing read the directory and when the F<index.swish-e> file
was created -- which can be quite a bit of time for very large indexing jobs.

The only solution is to prevent any new file additions while full indexing is running.
If this is impossible then it will be slightly better to do this:

Full indexing:

    touch indexing_time.file
    swish-e -c config.file -f index.tmp
    mv index.tmp index.full

Incremental indexing:

    swish-e -c config.file -N indexing_time.file -f index.tmp
    mv index.tmp index.incremental

Then search with

    swish-e -w foo -f index.full index.incremental

or merge the indexes

    swish-e -M index.full index.incremental index.tmp
    mv index.tmp index.swish-e
    swish-e -w foo

=item -v [0|1|2|3] (verbosity level)

The C<-v> option can take a numerical value from 0 to 3.
Specify 0 for completely silent operation and 3 for detailed reports.

If no value is given then 1 is assumed.
See also B<IndexReport> in the L<configuration file|SWISH-CONFIG>.

Warnings and errors are reported regardless of the verbosity level. In addition,
all errors and warnings are written to standard out. This is for historical reasons (many
scripts exist that parse standard out for error messages).

=back

=head1 SEARCHING

The following command line arguments are available when searching with Swish-e. These switches are used
to select the index to search, what fields to search, and how and what to print as results.

This section just lists the available command line arguments and their usage.
Please see L<SWISH-SEARCH|SWISH-SEARCH> for detailed searching instructions.

B<Warning>: If using Swish-e via a CGI interface, please see L<CGI Danger!|SWISH-SEARCH/"CGI Danger!">

Security Note: If the swish binary is named F<swish-search> then swish will not allow any operation that
would cause swish to write to the index file.

=head2 Searching Command Line Arguments

=over 4

=item -w *word1 word2 ...* (query words)

This performs a case-insensitive search using a number of keywords.
If no index file to search is specified (via the C<-f> switch), swish-e will try to search a file called
index.swish-e in the current directory.

    swish-e -w word

Phrase searching is accomplished by placing the quote delimiter (a double-quote by default) around
the search phrase.

    swish-e -w 'word or "this phrase"'

Search words should be protected from the shell by quotes. Typically, this is single quotes when
running under Unix.

Under Windows F<command.com> you may not need to use quotes, but you will need to
backslash the quotes used to delimit phrases:

    swish-e -w \"a phrase\"

The phrase delimiter can be set with the C<-P> switch.

The search may be limited to a I<MetaName>.
For example:

    swish-e -w 'meta1=(foo or baz)'

will only search within the B<meta1> tag.

Please see L<SWISH-SEARCH|SWISH-SEARCH> for a description of MetaNames.

=item -f *file1 file2 ...* (index files)

Specifies the index file(s) used while searching. More than one file may be listed, and each
file will be searched. If no C<-f> switch is specified then the file F<index.swish-e> in the current
directory will be used as the index file.

=item -m *number* (max results)

While searching, this specifies the maximum number of results to return.
The default is to return all results.

This switch is often used in conjunction with the C<-b> switch to return results one
page at a time (strongly recommended for large indexes).

=item -b *number* (beginning result)

Sets the I<beginning> search result to return (records are numbered from 1). This switch can be used
with the C<-m> switch to return results in groups or pages.

Example:

    swish-e -w 'word' -b 1 -m 20    # first 'page'
    swish-e -w 'word' -b 21 -m 20   # second 'page'

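Since records are numbered from 1, the C<-b> value for a given page follows from
the page number and the page size. The sketch below illustrates the arithmetic;
the function name C<page_args> is hypothetical and is not part of Swish-e.

```shell
#!/bin/sh
# page_args PAGE PAGE_SIZE   (hypothetical helper)
# Prints the -b and -m arguments for a 1-based page number.
page_args() {
    page=$1
    size=$2
    # Records are numbered from 1, so page N starts at (N - 1) * size + 1.
    begin=$(( ($page - 1) * $size + 1 ))
    echo "-b $begin -m $size"
}
```

For instance, C<swish-e -w 'word' $(page_args 3 20)> would request results 41 to 60.
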
=item -t HBthec (context searching)

The C<-t> option allows you to search for words that exist only
in specific HTML tags. Each character in the string you
specify in the argument to this option represents a
different tag in which to search for the word. H means all HEAD
tags, B stands for BODY tags, t is all TITLE tags, h is H1
to H6 (header) tags, e is emphasized tags (this may be B, I,
EM, or STRONG), and c is HTML comment tags.

For example, to search only in header (<H*>) tags:

    swish-e -w word -t h

=item -d *string* (delimiter)

Set the delimiter used when printing results. By default, Swish-e separates the output fields by a
space, and places double-quotes around the document title. This output may be hard to parse, so it
is recommended to use C<-d> to specify a character or string used as a separator between fields.

The string C<dq> means "double-quotes".

    swish-e -w word -d ,    # single char
    swish-e -w word -d ::   # string
    swish-e -w word -d '"'  # double quotes under Unix
    swish-e -w word -d \"   # double quotes under Windows
    swish-e -w word -d dq   # double quotes

The following control characters may also be specified: C<\t \r \n \f>.

=item -P *character*

Sets the delimiter used for phrase searches. The default is double quotes C<">.

Some examples under bash (be careful about your shell's metacharacters):

    swish-e -P ^ -w 'title=^words in a phrase^'
    swish-e -P \' -w "title='words in a phrase'"

=item -p *property1 property2 ...* (display properties)

This causes swish to print the listed properties in the search results. The properties
are returned in the order they are listed in the C<-p> argument.

Properties are defined by the B<PropertyNames> directive in the configuration file (see L<SWISH-CONFIG|SWISH-CONFIG>)
and properties must also be defined in B<MetaNames>. Swish stores the text of the meta name as a I<property>, and
then will return this text while searching if this option is used.

Properties are very useful for returning data included in a source document without having to re-read
the source document while searching. For example, this could be used to return a short document description.
See also B<Document Summaries> and L<PropertyNames|SWISH-CONFIG/"item_PropertyNames"> in L<SWISH-CONFIG|SWISH-CONFIG>.

To return the subject and category properties while searching:

    swish-e -w word -p subject category

Properties are returned in double quotes. If a property contains a double quote it is HTML escaped (&quot;).
See the C<-x> switch for a more advanced method of returning a list of properties.

NOTE: it is necessary to have indexed with the proper
PropertyNames directive in the user config file in order to
use this option.

=item -s *property [asc|desc] ...* (sort)

Normally, search results are printed out in order of relevancy, with the most relevant listed first.
The C<-s> sort switch allows you to sort results in order of a specified I<property>, where a I<property>
was defined using the B<MetaNames> and B<PropertyNames> directives during indexing
(see L<SWISH-CONFIG|SWISH-CONFIG>).

The string passed can include the strings C<asc> and C<desc> to specify the sort order, and more than
one property may be specified to sort on more than one key.

Examples:

sort by title property ascending order

    -s title

sort descending by title, ascending by name

    -s title desc name asc

=item -L limit to a range of property values (Limit)

B<This is an experimental feature!>

The C<-L> switch can be used to limit search results to a range of property values.

Example:

    swish-e -w foo -L swishtitle a m

finds all documents that contain the word C<foo>, and where the
document's title is in the range of C<a> to C<m>, inclusive.
By default, the case of the property is ignored, but this can be
changed by using the L<PropertyNamesCompareCase|SWISH-CONFIG/"item_PropertyNamesCompareCase">
configuration directive.

Limiting may be done with user-defined properties, as well.

For example, if you indexed documents that contain a created timestamp in a meta tag:

    <meta name="created_on" content="982648324">

Then you tell Swish that you have a property called C<created_on>, and that
it's a timestamp.

    PropertyNamesDate created_on

After indexing you will be able to limit documents to a range of timestamps:

    -w foo -L created_on 946684800 949363199

will find documents containing the word foo and that have a created_on
date from the start of Jan 1, 2000 to the end of Jan 31, 2000.

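The two timestamps in that example are 2000-01-01 00:00:00 UTC and one second
before 2000-02-01 00:00:00 UTC. Given a starting timestamp, the end of an
inclusive range can be derived with plain arithmetic, since a day is 86400
seconds in Unix time. The sketch below illustrates this; the function name
C<limit_range> is hypothetical and is not part of Swish-e.

```shell
#!/bin/sh
# limit_range PROPERTY START DAYS   (hypothetical helper)
# Prints -L arguments covering DAYS whole days beginning at START,
# where START is a Unix timestamp.
limit_range() {
    prop=$1
    start=$2
    days=$3
    # The range is inclusive, so stop one second before the next day starts.
    end=$(( $start + $days * 86400 - 1 ))
    echo "-L $prop $start $end"
}
```

Calling C<limit_range created_on 946684800 31> reproduces the arguments used
for January 2000.
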
Note: swish currently does not parse dates; Unix timestamps must be used.

Two special formats can be used:

    -L swishtitle <= m
    -L swishtitle >= m

Finds titles less than or equal to, or greater than or equal to, the letter C<m>.

This feature will not work with C<swishrank> or C<swishdbfile> properties.

This feature takes advantage of the pre-sorted tables built by swish during indexing to
make this feature fast while searching.
You should see in the indexing output a line such as:

    6 properties sorted.

That indicates that six pre-sorted tables were built during indexing.
By default, all properties are presorted while indexing.
What properties are pre-sorted can be controlled by the configuration parameter C<PreSortedIndex>.

Using the C<-L> switch on a property that was not pre-sorted will still work, but may be I<much>
slower during searching.

This is an experimental feature, and its use and interface are subject to change.

=item -x formatstring (extended output format)

The C<-x> switch defines the output format string.
The format string can contain plain text and property names (including swish-defined internal property names)
and is used to generate the output for every result.
In addition, the output format of the property name can be controlled with C-like printf format strings.
This feature overrides the command line switches C<-d> and C<-p>,
and a warning will be generated if C<-d> or C<-p> are used with C<-x>.

For example, to return just the title, one per line, in the search results:

    swish-e -w ... -x '<swishtitle>\n' ...

Note: the C<\n> may need to be protected from your shell.

See also L<ResultExtFormatName|SWISH-CONFIG/"item_ResultExtFormatName"> for a way to define I<named>
format strings in the swish configuration file.

B<Format of "formatstring":>

    "text<propertyname>text<propertyname fmt=propfmtstr>text..."

Where B<propertyname> is:

=over 4

=item *

the name of a user property as specified with the config file
directive "PropertyNames"

=item *

the name of a swish Auto property (see below). These properties are
defined automatically by swish -- you do not need to specify them
with the PropertyNames directive. (This may change in the future.)

=back

Property names must be placed within "E<lt>" and "E<gt>".

663 |
|
|
B<User properties:> |
664 |
|
|
|
665 |
|
|
Swish-e allows you to specify certain META tags within your documents that can be used as B<document properties>. |
666 |
|
|
The contents of any META tag that has been identified as a document property can be returned as |
667 |
|
|
part of the search results. Doucment properties must be defined while indexing using the B<PropertyNames> |
668 |
|
|
configuration directive (see L<SWISH-CONFIG|SWISH-CONFIG/"item_PropertyNames">). |
669 |
|
|
|
670 |
|
|
Examples of user-defined PropertyNames: |
671 |
|
|
|
672 |
|
|
<keywords> |
673 |
|
|
<author> |
674 |
|
|
<deliveredby> |
675 |
|
|
<reference> |
676 |
|
|
<id> |
B<Auto properties:>

Swish defines a number of "Auto" properties for each document indexed.
These are available for output when using the C<-x> format.

    Name               Type     Contents
    -----------------  -------  ----------------------------------------------
    swishreccount      Integer  Result record counter
    swishtitle         String   Document title
    swishrank          Integer  Result rank for this hit
    swishdocpath       String   URL or filepath to document
    swishdocsize       Integer  Document size in bytes
    swishlastmodified  Date     Last modified date of document
    swishdescription   String   Description of document (see: StoreDescription)
    swishdbfile        String   Path of swish database indexfile

The Auto properties can also be specified using shortcuts:

    Shortcut  Property Name
    --------  -----------------
      %c      swishreccount
      %d      swishdescription
      %D      swishlastmodified
      %I      swishdbfile
      %p      swishdocpath
      %r      swishrank
      %l      swishdocsize
      %t      swishtitle

For example, these are equivalent:

    -x '<swishrank>:<swishdocpath>:<swishtitle>\n'
    -x '%r:%p:%t\n'

Use a double percent sign "%%" to enter a literal percent sign in the output.

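The correspondence between shortcuts and full property names can be sketched in a few lines of Python. This is only an illustration built from the shortcut table above, not swish-e's own parser: the function name C<expand_shortcuts> is hypothetical, and the choices to leave "%%", unknown letters, and anything inside an existing property tag untouched are assumptions of the sketch.

```python
import re

# Shortcut table copied from the documentation above.
SHORTCUTS = {
    "c": "swishreccount",
    "d": "swishdescription",
    "D": "swishlastmodified",
    "I": "swishdbfile",
    "p": "swishdocpath",
    "r": "swishrank",
    "l": "swishdocsize",
    "t": "swishtitle",
}


def expand_shortcuts(fmt):
    """Rewrite %x shortcuts in a -x format string as <propertyname> tags.

    Existing <...> tags are copied through unchanged so that a property
    format such as fmt="%d.%m.%Y" is not mistaken for shortcuts.
    "%%" and unknown letters are left as they are (assumption of this sketch).
    """
    def replace(match):
        letter = match.group(1)
        if letter is None:          # matched a whole <...> tag
            return match.group(0)
        name = SHORTCUTS.get(letter)
        return "<%s>" % name if name else match.group(0)

    return re.sub(r"<[^>]*>|%(.)", replace, fmt)


if __name__ == "__main__":
    # Raw string: the \n stays as backslash-n for swish-e to interpret.
    print(expand_shortcuts(r"%r:%p:%t\n"))
```

A property format that itself contains a "E<gt>" character (for example C<fmt='E<lt>titleE<gt>%sE<lt>/titleE<gt>'>) is beyond what this simple pattern handles.
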
B<Format strings of properties:>

Properties listed in an C<-x> format string can include format control strings.
These "property formats" are used to control how the contents of the associated property are printed.
Property formats are used like C-language printf formats.
The property format is specified by including the attribute "fmt" within the property tag.

Format strings cannot be used with the "%" shortcuts described above.

General syntax:

    -x '<propertyname fmt="propfmtstr">'

where C<propfmtstr> controls the output format of C<propertyname>.

Examples of property format strings:

    date type:    <swishlastmodified fmt="%d.%m.%Y">
    string type:  <swishtitle fmt="%-40.35s">
    integer type: <swishreccount fmt=/%8.8d/>

Please see the manual pages for strftime(3) and sprintf(3) for an explanation of
format strings.  Note: some versions of strftime do not offer the %s format string
(number of seconds since the Epoch), so swish provides a special format string "%ld"
to display the number of seconds since the Epoch.

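To get a feel for what the three example formats produce, the same conversions can be tried outside swish-e. The sketch below uses Python's printf-style C<%> operator and C<time.strftime>, which follow the C conventions for these particular conversions; the sample title, record count, and timestamp are made up for illustration, and swish-e's own output may differ in detail.

```python
import time

# Made-up sample values standing in for document properties.
title = "Running Swish-e and Command Line Switches"
reccount = 42
lastmodified = time.gmtime(0)   # the Epoch, in UTC

# String type: left-justified in a 40-column field, cut to 35 characters.
padded_title = "%-40.35s" % title

# Integer type: at least 8 digits, zero-filled by the precision.
padded_count = "%8.8d" % reccount

# Date type: day.month.year, then the default layout listed in this manual.
short_date = time.strftime("%d.%m.%Y", lastmodified)
default_date = time.strftime("%Y-%m-%d %H:%M:%S", lastmodified)

if __name__ == "__main__":
    print("[" + padded_title + "]")
    print(padded_count)
    print(short_date)
    print(default_date)
```

The brackets around the first line make the trailing padding visible.
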
The first character of a property format string defines the delimiter for the format string.
For example,

    -x "<author fmt=[%20s]> ...\n"
    -x "<author fmt='%20s'> ...\n"
    -x "<author fmt=/%20s/> ...\n"

B<Standard predefined formats:>

If you omit the property format, the following formats are used:

    String type:   "%s"  (like printf char *)
    Integer type:  "%d"  (like printf int)
    Float type:    "%f"  (like printf double)
    Date type:     "%Y-%m-%d %H:%M:%S" (like strftime)

B<Text in "formatstring" or "propfmtstr":>

Text in format strings (and property format strings) is output as-is.
Special characters can be escaped with a backslash.
To get a new line for each result hit, you have to include
the newline character "\n" at the end of "formatstring".

    -x "<swishreccount>|<swishrank>|<swishdocpath>\n"
    -x "Count=<swishreccount>, Rank=<swishrank>\n"
    -x "Title=\<b\><swishtitle>\</b\>"
    -x 'Date: <swishlastmodified fmt="%m/%d/%Y">\n'
    -x 'Date in seconds: <swishlastmodified fmt=/%ld/>\n'

B<Control/Escape characters:>

You can use C-like control escapes in the format string:

    known controls:     \a, \b, \f, \n, \r, \t, \v,
    digit escapes:      \xhexdigits   \0octaldigits
    character escapes:  \anychar

For example:

    swish-e -x "%c\t%r\t%p\t\"<swishtitle fmt=/%40s/>\"\n"

B<Examples of -x format strings:>

    -x "%c|%r|%p|%t|%D|%d\n"
    -x "%c|%r|%p|%t|<swishlastmodified fmt=/%A, %d. %B %Y/>|%d\n"
    -x "<swishrank>\t<swishdocpath>\t<swishtitle>\t<keywords>\n"
    -x "xml_out: \<title\><swishtitle>\</title\>\n"
    -x "xml_out: <swishtitle fmt='<title>%s</title>'>\n"

=item -H [0|1|2|3|<n>]  (header output verbosity)

The C<-H n> switch generates extended I<header> output.  This is most useful when searching more than one
index file at a time, either by specifying more than one index file with the C<-f> switch, or when searching
a merged index file.  In these cases, C<-H 2> will generate a set of headers specific to each index file.
This gives access to the settings used to generate each index file.

Even when searching a single index file, C<-H n> will provide additional information about the index file,
how it was indexed, and how swish is interpreting the query.

    -H 0 : print no header information, output only search result entries.
    -H 1 : print standard result header (default).
    -H 2 : print additional header information for each searched index file.
    -H 3 : enhanced header output (e.g. print stopwords).
    -H 9 : print diagnostic information in the header of the results (changed from: -v 4)

=back

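The C<-x> and C<-H 0> switches described above fit together well when another program reads the search results. The following Python sketch shows one way a script might call swish-e and split tab-separated output into fields. It is a sketch under assumptions, not a supported interface: the function names, the three-field layout, and the choice to skip any line that does not have exactly three fields with a numeric rank (such as an error message or an end-of-results marker) are all choices made for this example.

```python
import subprocess

# Rank, path and title separated by tabs. The raw string keeps \t and \n
# as backslash escapes for swish-e to interpret; no shell is involved,
# so they need no extra protection.
FORMAT = r"<swishrank>\t<swishdocpath>\t<swishtitle>\n"


def parse_results(output):
    """Turn swish-e output produced with FORMAT into a list of dicts."""
    results = []
    for line in output.splitlines():
        fields = line.split("\t")
        # Skip anything that is not a three-field result line.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        rank, path, title = fields
        results.append({"rank": int(rank), "path": path, "title": title})
    return results


def search(query, index_file, swish="swish-e"):
    """Run one search and return the parsed results.

    `swish` is the name or path of the swish-e binary (assumed to be
    on PATH by default).
    """
    cmd = [swish, "-w", query, "-f", index_file, "-H", "0", "-x", FORMAT]
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return parse_results(completed.stdout)
```

A title that itself contains a tab would break this layout; a delimiter that cannot occur in the indexed data is a safer choice in that case.
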
=head1 OTHER SWITCHES

=over 4

=item -V (version)

Print the current version.

=item -k *letter* (print out keywords)

The C<-k> switch is used for testing and will cause swish to print out all keywords
in the index beginning with that letter.  You may enter C<-k '*'> to generate a list of all words indexed
by swish.

=item -D *index file*  (debug index)

The C<-D> option is no longer supported in version 2.2.

=item -T *options* (trace/debug swish)

The C<-T> option is used to print out information that may be helpful when debugging swish-e's
operation.  This option replaced the C<-D> option of previous versions.

Running C<-T help> will print out a list of available *options*.

=back

=head1 Merging Index Files

At times it can be useful to merge different index files into one file for searching.
This could be because you want to keep separate site indexes and a common one for a global search, or
because your site is very large and Swish-e runs out of memory if you try to index it directly.

You can merge only indexes that were created with common settings
(e.g. don't mix stemming and non-stemming indexes, or indexes with different WordCharacter settings, etc.).

    usage: swish-e [-v (num)] [-c file] -M index1 index2 ... outputfile

Due to the structure of the swish-e index, merging may or may not require less memory than indexing
all files at one time.

=over 4

=item -M *file file ...* (merge)

This allows you to merge two or more index files - the last file you specify on the
list will be the output file.

Merging removes all redundant file and word data.  To estimate how much memory the operation will need,
sum up the sizes of the files to be merged and divide by two.
That's about the maximum amount of memory that will be used.

You can use the C<-v> option to produce feedback while merging and the C<-c> option with a
configuration file to include new administrative information in the new index file.

=item -c *configuration file*

Specify a configuration file while indexing to add administrative information to the output index file.

=back

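The memory rule of thumb given under C<-M> (total size of the input index files, divided by two) is simple to compute before starting a merge. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical, and the result is only the rough upper bound described above, not a measurement.

```python
import os


def estimate_merge_memory(index_paths):
    """Rough upper bound, in bytes, on memory needed to merge the given
    index files: the sum of their sizes divided by two."""
    total = sum(os.path.getsize(path) for path in index_paths)
    return total // 2
```

For example, two index files of 30 MB and 50 MB give an estimate of about 40 MB.
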
=head1 Document Info

$Id: SWISH-RUN.pod,v 1.23 2002/08/22 22:58:39 whmoseley Exp $

.