=head1 NAME

CHANGES - List of revisions

=head1 Revision History

This document contains a list of bug fixes and feature additions to Swish-e.

=head2 Version 2.2rc1

Release Date: August 29, 2002

Many large changes were made internally in the code, some for performance
reasons, some for feature changes and additions, and some to prepare
for new features in later versions of Swish-e.

=over 4

=item * Documentation!

Documentation is now included in the source distribution as .pod
(perldoc) files, and as HTML files.  In addition, the distribution can now
generate PDF, PostScript, and Unix man pages from the source .pod files.
See L<README|README> for more information.

=item * Indexing and searching speed

The indexing process has been improved.  Depending on a number of
factors, you may see a significant improvement in indexing speed,
especially if upgrading from version 1.x.

Searching speed has also been improved.  Properties are not loaded until
results are displayed, and properties are pre-sorted during indexing to
speed up sorting results by properties while searching.

=item * Properties are written to a separate file

Swish-e now stores document properties in a separate file.  This means
there are now two files that make up a Swish-e index.  The default files
are C<index.swish-e> and C<index.swish-e.prop>.

This change frees memory while indexing, allowing larger collections to
be indexed in memory.

=item * Internal data stored as Properties

Before 2.2, some internal data was stored in fixed locations within the
index, namely the file name, file size, and title.  Version 2.2 introduced new
internal data such as the last modified date and document summaries.
This data is considered I<meta data> since it is data about a document.

Instead of adding new data to the internal structure of the index file,
it was decided to use the MetaNames and PropertyNames feature of Swish-e
to store this meta information.  This allows for new meta data to be added
at a later time (e.g. Content-type), and provides an easy and customizable
way to print results with the C<-p> switch and the new C<-x> switch.
In addition, search results can now be sorted and limited by properties.

For example, to sort by the rank and title:

    swish-e -w foo -s swishrank desc swishtitle asc

=item * The header display has been slightly reorganized.

If you are parsing output headers in a program then you may need to
adjust your code.  There's a new switch C<-H> to control the level of
header output when searching.

=item * Results are now combined when searching more than one index.

Swish-e now merges (and sorts) the results from multiple indexes when
using C<-f> to specify more than one index.  This change affects the way
maxhits (C<-m>) works.  Here's a summary of the way it works for the
different versions.

    1.3.2 - MaxHits returns first N results starting from the first index.
            e.g. maxhits=20; 15 hits Index1, 40 hits Index2
            All 15 from Index1 plus first five from Index2 = 20 hits.

    2.0.0 - MaxHits returns first N results from each index.
            e.g. maxhits=20; 15 hits Index1, 40 hits Index2
            All 15 from Index1 plus first 20 from Index2 = 35 hits.

    2.2.0 - Results are merged and first N results are returned.
            e.g. maxhits=20; 15 hits Index1, 40 hits Index2
            Results are merged from each index and sorted
            (rank is the default sort) and only the first
            20 are returned.

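The three C<-m> behaviors summarized above can be sketched as follows.
This is a minimal illustration in Python, not Swish-e's own code: the
function names and the representation of a hit as a C<(rank, name)> tuple
are assumptions made for the example.

```python
def maxhits_1_3_2(indexes, n):
    """1.3.2 style: fill up to n results, starting from the first index."""
    results = []
    for hits in indexes:
        results.extend(hits[: n - len(results)])
    return results


def maxhits_2_0_0(indexes, n):
    """2.0.0 style: take up to n results from each index separately."""
    results = []
    for hits in indexes:
        results.extend(hits[:n])
    return results


def maxhits_2_2_0(indexes, n):
    """2.2.0 style: merge every hit, sort by rank (highest first), keep n."""
    merged = [hit for hits in indexes for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:n]


# Hypothetical data: 15 hits in the first index, 40 in the second.
index1 = [(1000 - i, "index1-doc%d" % i) for i in range(15)]
index2 = [(995 - i, "index2-doc%d" % i) for i in range(40)]

for fn in (maxhits_1_3_2, maxhits_2_0_0, maxhits_2_2_0):
    print(fn.__name__, len(fn([index1, index2], 20)))
```

Only the last variant interleaves hits from both indexes by rank; the
first two keep each index's hits grouped together.
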
=item * New B<prog> document source indexing method

You can now use C<-S prog> to use an external program to supply documents
to Swish-e.  This external program can be used to spider web servers,
index databases, or to convert any type of document into HTML, XML,
or text, so it can be indexed by Swish-e.  Examples are given in the
C<prog-bin> directory.

=item * The indexing parser was rewritten to be more logical.

TranslateCharacters is now done before WordCharacters is checked.  For example,

    WordCharacters abcdefghijklmnopqrstuvwxyz
    TranslateCharacters ñ n

Now C<El Niño> will be indexed as El Nino (el and nino), even though C<ñ>
is not listed in WordCharacters.

Previously, stopwords were checked after stemming and soundex conversions,
as well as most of the other word checks (WordCharacters, min/max length
and so on).  This meant that the stopword list probably didn't work as
expected when using stemming.

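The order of operations in the C<El Niño> example can be sketched as
follows.  This is a minimal Python illustration, not Swish-e's parser:
C<index_words>, the lowercasing step, and the two tables are assumptions
standing in for the WordCharacters and TranslateCharacters settings shown
in the example.

```python
# Stand-ins for the two directives in the example.
WORD_CHARACTERS = set("abcdefghijklmnopqrstuvwxyz")
TRANSLATE = str.maketrans({"\u00f1": "n"})  # n-with-tilde becomes plain n


def index_words(text):
    """Translate characters first, then split on non-word characters."""
    translated = text.lower().translate(TRANSLATE)
    words, current = [], []
    for ch in translated:
        if ch in WORD_CHARACTERS:
            current.append(ch)
        elif current:
            # A character outside WordCharacters ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(index_words("El Ni\u00f1o"))
```

Because translation runs first, the translated character is already a
plain C<n> by the time the WordCharacters check sees it.
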
=item * The search parser was rewritten to be more logical

The search parser was rewritten to correct a number of logic errors.
Swish-e did not differentiate between meta names, Swish-e operators
and search words when parsing the query.  This meant, for example,
that metanames might be broken up by the WordCharacters setting, and
that they could be stemmed.

Swish-e operator characters C<"*()=> can now be searched by escaping
with a backslash.  For example:

    ./swish-e -w 'this\=odd\)word'

will end up searching for the word C<this=odd)word>.  To search for a
backslash character, precede it with a backslash.

Currently, searching for:

    ./swish-e -w 'this\*'

is the same as a wildcard search.  This may be fixed in the future.

Searching for buzzwords with those characters will still require
backslashing.  This also may change to allow some un-escaped operator
characters, but some will always need to be escaped (e.g. the double-quote
phrase character).

=item * Quotes and Backslash escapes in strings

A bug was fixed in the C<parse_line()> function (in F<string.c>) where
backslashes were not escaping the next character.  C<parse_line()> is used
to parse a string of text into tokens (words).  Normally splitting is done
at whitespace.  You may use quotes (single or double) to define a string
(that might include whitespace) as a single parameter.  The backslash
can also be used to escape the following character when *within* quotes
(e.g. to escape an embedded quote character).

    ReplaceRules append "foo bar"   <- define "foo bar" as a single word
    ReplaceRules append "foo\"bar"  <- escape the quotes
    ReplaceRules append 'foo"bar'   <- same thing

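The quoting rules described for C<parse_line()> can be sketched as a small
tokenizer.  This is an illustrative Python version written for this
document, not the code from F<string.c>; the name C<split_params> is
hypothetical, and treating a backslash outside quotes as an ordinary
character is an assumption.

```python
def split_params(line):
    """Split a line at whitespace, honoring quotes and in-quote backslashes."""
    tokens, current = [], []
    quote = None        # the quote character currently open, if any
    in_token = False    # True once the current token has started
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                # Inside quotes a backslash makes the next character literal.
                current.append(line[i + 1])
                i += 1
            elif ch == quote:
                quote = None
            else:
                current.append(ch)
        elif ch in "\"'":
            quote = ch
            in_token = True
        elif ch.isspace():
            if in_token:
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if in_token:
        tokens.append("".join(current))
    return tokens


print(split_params('ReplaceRules append "foo bar"'))
print(split_params(r'ReplaceRules append "foo\"bar"'))
print(split_params("ReplaceRules append 'foo\"bar'"))
```

The second and third calls yield the same final token, which matches the
"same thing" note in the configuration example.
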
=item * Example C<user.config> file removed.

Previous versions of Swish-e included a configuration file called
C<user.config> which contained examples of all directives.  This has
been replaced by a series of example configuration files located in the
C<conf> directory.  The configuration directives are now described in
L<SWISH-CONFIG|SWISH-CONFIG>.

=item * Ports to Win32 and VMS

David Norris has included the files required to build Swish-e under
Windows.  See C<src/win32>.  A self-extracting Windows version is
available from the Download page of the swish-e.org web site.

Jean-François Piéronne has provided the files required to build Swish-e
under OpenVMS.  See C<src/vms> for more information.

=item * String properties are concatenated

Multiple I<string> properties of the same name in a document are now
concatenated into one property.  A space character is added between
the strings if needed.  A warning will be generated if multiple numeric
or date properties are found in the same document, and the additional
properties will be ignored.

Previously, properties of the same name were added to the index, but
could not be retrieved.

To do: remove the C<next> pointer, and allow a user-defined character to
be placed between properties.

=item * regex type added to ReplaceRules

This provides a more general-purpose pattern replacement syntax.

=item * New Parsers

Swish-e's XML parser was replaced with James Clark's expat XML parser
library.

Swish-e can now use Daniel Veillard's libxml2 library for parsing HTML and
XML.  This requires installation of the library before building Swish-e.
See the L<INSTALL|INSTALL> document for information.  libxml2 is not
required, but is strongly recommended for parsing HTML over Swish-e's
internal HTML parser, and provides more features for both HTML and
XML parsing.

=item * Support for zlib

Swish-e can be compiled with zlib.  This is useful for compressing large
properties.  Building Swish-e with zlib is strongly recommended if you
use its C<StoreDescription> feature.

=item * LST type of document no longer supported

LST allowed indexing of files that contained multiple documents.

=item * Temporary files

To improve security, Swish-e now uses the C<mkstemp(3)> function to
create temporary files.  Temporary files are used while indexing only.
This may result in some portability issues, but the security issues
were overriding.

(Currently this does not apply to the C<-S http> indexing method.)

C<mkstemp> opens the temporary file with O_EXCL|O_CREAT flags.  This prevents
overwriting existing files.  In addition, the name of the file created
is much harder for attackers to guess.  The temporary file is created
with only owner permissions.

Please report any portability issues on the Swish-e discussion list.

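The properties of C<mkstemp> described above can be observed with a short
sketch.  It uses Python's C<tempfile.mkstemp>, which follows the same
approach, rather than Swish-e's C code; the helper names are hypothetical,
and the C<swtmp> prefix is borrowed from the temporary file naming
described in this document.

```python
import os
import stat
import tempfile


def create_temp_file(directory=None, prefix="swtmp"):
    """Create a uniquely named temporary file and return (fd, path)."""
    # mkstemp creates the file exclusively, so an existing file is never
    # reused or overwritten, and it returns an already-open descriptor.
    return tempfile.mkstemp(prefix=prefix, dir=directory)


def describe_temp_file():
    """Create a temporary file, report its name and mode, then remove it."""
    fd, path = create_temp_file()
    try:
        mode = stat.S_IMODE(os.fstat(fd).st_mode)
        return os.path.basename(path), mode
    finally:
        os.close(fd)
        os.remove(path)


name, mode = describe_temp_file()
print(name, oct(mode))
```

On a POSIX system the reported mode should grant no group or other
permissions, and the name should differ on every run.
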
=item * Temporary file locations

Swish-e now uses the environment variables C<TMPDIR>, C<TMP>, and
C<TEMP> (in that order) to decide where to write temporary files.
The configuration setting of L<TmpDir|SWISH-CONFIG/"item_TmpDir"> will
be used if none of the environment variables are set.  Swish-e uses the
current directory otherwise; there is no default temporary directory.

Since the environment variables override the configuration settings,
a warning will be issued if you set L<TmpDir|SWISH-CONFIG/"item_TmpDir">
in the configuration file and there's also an environment variable set.

Temporary files begin with the letters "swtmp" (which can be changed in
F<config.h>), followed by two or more letters that indicate the type of
temporary file, and some random characters to complete the file name.
If indexing is aborted for some reason you may find these temporary
files left behind.

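The lookup order described above can be sketched as a small function.
This is an illustration in Python, not Swish-e's implementation:
C<choose_tmpdir> is a hypothetical name, the environment is passed in as a
plain dictionary, and treating an empty variable as unset is an assumption.

```python
def choose_tmpdir(environ, config_tmpdir=None):
    """Return (directory, warn) following the documented precedence."""
    for name in ("TMPDIR", "TMP", "TEMP"):
        value = environ.get(name)
        if value:
            # An environment variable wins; warn if TmpDir was also set.
            return value, config_tmpdir is not None
    if config_tmpdir:
        return config_tmpdir, False
    # No default temporary directory: fall back to the current directory.
    return ".", False


print(choose_tmpdir({"TMP": "/var/tmp", "TEMP": "/scratch"}))
print(choose_tmpdir({"TMPDIR": "/fast"}, config_tmpdir="/index/tmp"))
print(choose_tmpdir({}, config_tmpdir="/index/tmp"))
print(choose_tmpdir({}))
```

Passing C<os.environ> as the first argument would apply the same logic to
the real environment.
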
=item * New Fuzzy indexing method Double Metaphone

Based on Lawrence Philips' Metaphone algorithm, this adds two
new methods of creating a fuzzy index (in addition to Stemming and Soundex).

=back

Changes to Configuration File Directives.  Please see
L<SWISH-CONFIG|SWISH-CONFIG> for more info.

=over 4

=item * New directives: IndexContents and DefaultContents

The IndexContents directive assigns internal Swish-e document parsers
to files based on their file type.  The DefaultContents directive
assigns a parser to be used on files that are not assigned a parser with
IndexContents.

=item * New directive: UndefinedMetaTags [error|ignore|index|auto]

This describes what to do when a meta tag is found in a document that
is not listed in the MetaNames directive.

=item * New directive: IgnoreTags

Ignores text enclosed in the listed tags.

=item * New directive: SwishProgParameters *list of words*

Passes the listed words to the external program when running with the
C<-S prog> document source method.

=item * New directive: ConvertHTMLEntities [yes|no]

Controls parsing and conversion of HTML entities.

=item * New directive: DontBumpPositionOnMetaTags

The word position is now bumped when a new metatag is found -- this is
to prevent phrases from matching across meta tags.  This directive will
disable this behavior for the listed tags.

This directive works for HTML and XML documents.

=item * Changed directive: IndexComments

This has been changed such that comments are not indexed by default.

=item * Changed directive: IgnoreWords

The built-in list of stopwords has been removed.  Use of the SwishDefault
word will generate a warning, and no stop words will be used.  You must
now specify a list of stopwords, or specify a file of stopwords.

A sample file C<stopwords.txt> has been included in the F<conf/stopwords>
directory of the distribution, and can be used by the directive:

    IgnoreWords File: /path/to/stopwords.txt

=item * Change of the default for IgnoreTotalWordCountWhenRanking

The default is now "yes".

=item * New directive: Buzzwords

Buzzwords are words that should be indexed as-is, without checking
for stopwords, word length, WordCharacters, or any other of the word
limiting features.  This allows indexing of things like C<C++> when "+"
is not listed in WordCharacters.

Currently, IgnoreFirstChar and IgnoreLastChar will be stripped before
processing Buzzwords.

In the future we may use separate IgnoreFirst/Last settings for buzzwords
since, for example, you may wish to index all C<+> within Swish-e words,
but strip C<+> from the start/end of Swish-e words, but not from the
buzzword C<C++>.

=item * New directives: PropertyNamesNumeric PropertyNamesDate

Before Swish-e 2.2 all user-defined document properties were stored in
the index as strings.  PropertyNamesNumeric and PropertyNamesDate tell
Swish-e that a property should be stored in binary format.  This allows
for correct sorting of numeric properties.

Currently, only integers can be stored, such as a Unix timestamp.  (Swish-e
uses C<strtoul> to convert the number to an unsigned long internally.)

PropertyNamesDate only indicates to Swish-e that a number is a Unix
timestamp, and to display the property as a formatted time when printing
results.  Swish-e does not currently parse date strings; you must provide
a Unix timestamp.

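The difference between string and numeric storage can be seen in a short
sketch.  This is a Python illustration only: C<int()> stands in for the
C<strtoul> conversion, and the date format produced by the hypothetical
C<format_timestamp> helper is an assumption, not the format Swish-e prints.

```python
import time

raw_values = ["9", "10", "100", "25"]

# Stored as strings, values compare character by character.
sorted_as_strings = sorted(raw_values)

# Converted to integers (standing in for strtoul), they compare by magnitude.
sorted_as_numbers = sorted(raw_values, key=int)


def format_timestamp(seconds):
    """Render a Unix timestamp as a UTC date string (illustrative format)."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(seconds))


print(sorted_as_strings)
print(sorted_as_numbers)
print(format_timestamp(1030579200))
```

The string sort places C<100> before C<25>, which is the ordering problem
that numeric properties avoid.
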
=item * New directive: MetaNameAlias

You may now create alias names for MetaNames.  This allows you to map or
group multiple names to the same MetaName.

=item * New directive: PropertyNameAlias

Creates aliases for a PropertyName.

=item * New directive: PropertyNamesMaxLength

Sets the max length of a text property.

=item * New directive: HTMLLinksMetaName

Defines a metaname to use for indexing href links in HTML documents.
Available only with libxml2 parser.

=item * New directive: ImageLinksMetaName

Defines a metaname to use for indexing src links in <img> tags.
Allows you to search image pathnames within HTML pages.  Available only
with libxml2 parser.

=item * New directive: IndexAltTagMetaName

Allows indexing of image ALT tags.  Only available when using the libxml2 parser.

=item * New directive: AbsoluteLinks

Attempts to convert relative links indexed with HTMLLinksMetaName and
ImageLinksMetaName to absolute links.  Available only with libxml2 parser.

=item * New directive: ExtractPath

Allows you to use a regular expression to extract part of the path
of each file and index it with a meta name.  For example, this allows
searches to be limited to parts of your file tree.

=item * New directive: FileMatch

FileMatch is similar to FileRules.  Where FileRules is used to exclude
files and directories, FileMatch is used to I<include> files.

=item * New directive: PreSortedIndex

Controls which properties are pre-sorted while indexing.  All properties
are sorted by default.

=item * New directive: ParserWarnLevel

Sets the level of warning printed when using libxml2.

=item * New directive: obeyRobotsNoIndex [yes|NO]

When using libxml2 to parse HTML, Swish-e will skip files marked as
NOINDEX.

    <meta name="robots" content="noindex">

Also, comments may be used within HTML and XML source docs to block sections of
content from indexing:

    <!-- SwishCommand noindex -->
    <!-- SwishCommand index -->

These shorter forms may also be used:

    <!-- noindex -->
    <!-- index -->

=item * New directive: UndefinedXMLAttributes

This describes how the content of XML attributes should be indexed,
if at all.  This is similar to UndefinedMetaTags, but is only for XML
attributes and when parsed by libxml2.  The default is to not index
XML attributes.

=item * New directive: XMLClassAttributes

XMLClassAttributes can specify a list of attribute names whose content
is combined with the element name to form metanames.

=item * New directive: PropCompressionLevel [0-9]

If compiled with zlib, Swish-e uses this setting to control the level
of compression applied to properties.  Properties must be long enough
(defined in config.h) to be compressed.  Useful for StoreDescription.

=item * Experimental directive: IgnoreNumberChars

Defines a set of characters.  If a word is made up of *only* those
characters the word will not be indexed.

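The IgnoreNumberChars rule can be sketched as a one-line check.  This is a
Python illustration, not Swish-e's code; C<is_indexable> and the sample
character set are hypothetical.

```python
def is_indexable(word, ignore_number_chars):
    """Return True if the word has at least one character outside the set."""
    return any(ch not in ignore_number_chars for ch in word)


# Hypothetical setting: digits plus a few separators.
chars = set("0123456789-.,")

for word in ("2002", "123-456", "mp3", "v2.2"):
    print(word, is_indexable(word, chars))
```

A word such as C<mp3> is still indexed because it contains characters
outside the set, while C<2002> is not.
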
=item * New directive: FuzzyIndexingMode

This configuration directive is used to define the type of "fuzzy" index to create.
Currently the options are:

    None
    Stemming
    Soundex
    Metaphone
    DoubleMetaphone

=back

Changes to command line arguments.  See L<SWISH-RUN|SWISH-RUN> for
documentation on these switches.

=over 4

=item * New command line argument C<-H>

Controls the level (verbosity) of header information printed with
search results.

=item * New command line argument C<-x>

Provides additional header output and allows for a I<format string>
to describe what data to print.

=item * New command line argument C<-k>

Prints words stored in the Swish-e index.

=item * New command line argument C<-N>

Provides a way to do incremental indexing by comparing last modification
dates.  You pass C<-N> a path to a file and only files newer than the
last modified date of that file will be indexed.

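The comparison that C<-N> performs can be sketched as follows.  This is a
Python illustration of the idea, not Swish-e's code: C<files_newer_than>
and the file names are hypothetical, and excluding files whose modification
time equals the reference file's is an assumption.

```python
import os
import tempfile


def files_newer_than(reference_path, candidate_paths):
    """Return the candidates modified after the reference file was."""
    cutoff = os.path.getmtime(reference_path)
    return [p for p in candidate_paths if os.path.getmtime(p) > cutoff]


def demo():
    """Build three files with fixed modification times and filter them."""
    with tempfile.TemporaryDirectory() as directory:
        paths = {}
        for name, mtime in (("last-run", 1000), ("old.html", 500), ("new.html", 2000)):
            paths[name] = os.path.join(directory, name)
            with open(paths[name], "w") as handle:
                handle.write(name)
            os.utime(paths[name], (mtime, mtime))
        newer = files_newer_than(
            paths["last-run"], [paths["old.html"], paths["new.html"]]
        )
        return [os.path.basename(p) for p in newer]


print(demo())
```

Only the file modified after the reference file is selected for indexing.
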
=item * Removed command line argument C<-D>

C<-D> no longer dumps the index file data.  Use C<-T> instead.

=item * New command line argument C<-T>

C<-T> is used for debugging indexing and searching.

=item * Enhanced command line argument C<-d>

Now C<-d> can accept some back-slashed characters to be used as output
separators.

=item * Enhanced command line argument C<-P>

Now C<-P> sets the phrase delimiter character in searches.

=item * New command line argument C<-L>

Swish-e 2.2 contains an B<experimental> feature to limit results by a
range of property values.  The behavior of this feature may change in
the future.

=item * Modified command line argument C<-v>

Now the argument C<-v 0> results in *no* output unless there is an error.
This is a bit more handy when indexing with cron.

=back