
Annotation of /mitgcm.org/devel/buildweb/pkg/swish-e/pod/CHANGES.pod

Revision 1.1
Fri Sep 20 19:47:29 2002 UTC by adcroft
Branch point for: Import, MAIN
Initial revision

=head1 NAME

CHANGES - List of revisions

=head1 Revision History

This document contains a list of bug fixes and feature additions to Swish-e.

=head2 Version 2.2rc1

Release Date: August 29, 2002

Many large changes were made internally in the code, some for performance
reasons, some for feature changes and additions, and some to prepare
for new features in later versions of Swish-e.

=over 4

=item * Documentation!

Documentation is now included in the source distribution as .pod
(perldoc) files, and as HTML files. In addition, the distribution can now
generate PDF, PostScript, and Unix man pages from the source .pod files.
See L<README|README> for more information.

=item * Indexing and searching speed

The indexing process has been improved. Depending on a number of
factors, you may see a significant improvement in indexing speed,
especially if upgrading from version 1.x.

Searching speed has also been improved. Properties are not loaded until
results are displayed, and properties are pre-sorted during indexing to
speed up sorting results by properties while searching.

=item * Properties are written to a separate file

Swish-e now stores document properties in a separate file. This means
there are now two files that make up a Swish-e index. The default files
are C<index.swish-e> and C<index.swish-e.prop>.

This change frees memory while indexing, allowing larger collections to
be indexed in memory.

=item * Internal data stored as Properties

Before version 2.2, some internal data was stored in fixed locations within
the index, namely the file name, file size, and title. Version 2.2 introduces
new internal data such as the last modified date and document summaries.
This data is considered I<meta data> since it is data about a document.

Instead of adding new data to the internal structure of the index file,
it was decided to use the MetaNames and PropertyNames feature of Swish-e
to store this meta information. This allows for new meta data to be added
at a later time (e.g. Content-type), and provides an easy and customizable
way to print results with the C<-p> switch and the new C<-x> switch.
In addition, search results can now be sorted and limited by properties.

For example, to sort by the rank and title:

    swish-e -w foo -s swishrank desc swishtitle asc

=item * The header display has been slightly reorganized.

If you are parsing output headers in a program then you may need to
adjust your code. There's a new switch C<-H> to control the level of
header output when searching.

=item * Results are now combined when searching more than one index.

Swish-e now merges (and sorts) the results from multiple indexes when
using C<-f> to specify more than one index. This change affects the way
maxhits (C<-m>) works. Here's a summary of the way it works for the
different versions.

    1.3.2 - MaxHits returns first N results starting from the first index.
            e.g. MaxHits=20; 15 hits Index1, 40 hits Index2
            All 15 from Index1 plus first five from Index2 = 20 hits.

    2.0.0 - MaxHits returns first N results from each index.
            e.g. MaxHits=20; 15 hits Index1, 40 hits Index2
            All 15 from Index1 plus first 20 from Index2 = 35 hits.

    2.2.0 - Results are merged and first N results are returned.
            e.g. MaxHits=20; 15 hits Index1, 40 hits Index2
            Results are merged from each index and sorted
            (rank is the default sort) and only the first
            20 are returned.
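
The three MaxHits behaviors can be compared with a small sketch. This is illustrative Python, not Swish-e code: the function names, the (rank, name) tuples, and the sample ranks are assumptions made up for the example.

```python
def maxhits_1_3_2(indexes, n):
    """1.3.2 rule: first n results overall, walking the indexes in order."""
    combined = [hit for hits in indexes for hit in hits]
    return combined[:n]


def maxhits_2_0_0(indexes, n):
    """2.0.0 rule: first n results from each index."""
    return [hit for hits in indexes for hit in hits[:n]]


def maxhits_2_2_0(indexes, n):
    """2.2.0 rule: merge everything, sort by rank (highest first), keep n."""
    merged = [hit for hits in indexes for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:n]


# Hypothetical (rank, document) pairs: 15 hits in Index1, 40 hits in Index2.
index1 = [(1000 - 10 * i, f"index1/doc{i}") for i in range(15)]
index2 = [(995 - 10 * i, f"index2/doc{i}") for i in range(40)]

print(len(maxhits_1_3_2([index1, index2], 20)))  # 20
print(len(maxhits_2_0_0([index1, index2], 20)))  # 35
print(len(maxhits_2_2_0([index1, index2], 20)))  # 20
```

With the merged rule the 20 returned hits come from both indexes, whereas the 1.3.2 rule returns every Index1 hit before any Index2 hit regardless of rank.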


=item * New B<prog> document source indexing method

You can now use C<-S prog> to use an external program to supply documents
to Swish-e. This external program can be used to spider web servers,
index databases, or to convert any type of document into HTML, XML,
or text, so it can be indexed by Swish-e. Examples are given in the
C<prog-bin> directory.

=item * The indexing parser was rewritten to be more logical.

TranslateCharacters is now applied before WordCharacters is checked. For example,

    WordCharacters abcdefghijklmnopqrstuvwxyz
    TranslateCharacters ñ n

Now C<El Niño> will be indexed as El Nino (el and nino), even though C<ñ>
is not listed in WordCharacters.

Previously, stopwords were checked after stemming and soundex conversions,
as well as most of the other word checks (WordCharacters, min/max length
and so on). This meant that the stopword list probably didn't work as
expected when using stemming.
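
The effect of translating before checking WordCharacters can be sketched as follows. This is a simplified Python model, not the Swish-e parser: the function name is hypothetical, and lowercasing the input is an assumption based on the "(el and nino)" result quoted above.

```python
# Settings mirroring the example configuration above.
WORD_CHARACTERS = set("abcdefghijklmnopqrstuvwxyz")
TRANSLATE_CHARACTERS = str.maketrans("ñ", "n")


def index_words(text):
    """Translate characters first, then split on anything not in WordCharacters."""
    text = text.lower().translate(TRANSLATE_CHARACTERS)
    words, current = [], []
    for ch in text:
        if ch in WORD_CHARACTERS:
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(index_words("El Niño"))  # ['el', 'nino']
```

If the translation step were skipped in this model, C<ñ> would act as a word separator and the second word would be split in two.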

=item * The search parser was rewritten to be more logical

The search parser was rewritten to correct a number of logic errors.
Swish-e did not differentiate between meta names, Swish-e operators
and search words when parsing the query. This meant, for example,
that metanames might be broken up by the WordCharacters setting, and
that they could be stemmed.

Swish-e operator characters C<"*()=> can now be searched by escaping
with a backslash. For example:

    ./swish-e -w 'this\=odd\)word'

will end up searching for the word C<this=odd)word>. To search for a
backslash character, precede it with a backslash.

Currently, searching for:

    ./swish-e -w 'this\*'

is the same as a wildcard search. This may be fixed in the future.

Searching for buzzwords with those characters will still require
backslashing. This also may change to allow some un-escaped operator
characters, but some will always need to be escaped (e.g. the double-quote
phrase character).
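
A program that builds queries could escape words before passing them to C<-w>. The helper below is a hypothetical Python sketch (it is not part of Swish-e); it only applies the escaping rule described above and does not handle shell quoting.

```python
OPERATOR_CHARS = '"*()='  # Swish-e operator characters listed above


def escape_search_word(word):
    """Backslash-escape operator characters and backslashes in a search word."""
    escaped = []
    for ch in word:
        if ch == "\\" or ch in OPERATOR_CHARS:
            escaped.append("\\")
        escaped.append(ch)
    return "".join(escaped)


print(escape_search_word("this=odd)word"))  # this\=odd\)word
```

Note that, as stated above, an escaped C<*> is currently still treated as a wildcard, so escaping does not yet make every operator character searchable.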

=item * Quotes and Backslash escapes in strings

A bug was fixed in the C<parse_line()> function (in F<string.c>) where
backslashes were not escaping the next character. C<parse_line()> is used
to parse a string of text into tokens (words). Normally splitting is done
at whitespace. You may use quotes (single or double) to define a string
(that might include whitespace) as a single parameter. The backslash
can also be used to escape the following character when I<within> quotes
(e.g. to escape an embedded quote character).

    ReplaceRules append "foo bar"    <- define "foo bar" as a single word
    ReplaceRules append "foo\"bar"   <- escape the quotes
    ReplaceRules append 'foo"bar'    <- same thing
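
The tokenizing rules described above can be modeled in a few lines. This is a Python sketch of the described behavior only, not a port of the C function in F<string.c>; details such as how an unterminated quote is handled are assumptions.

```python
def parse_line(line):
    """Split a line into tokens at whitespace, honoring quotes and backslashes.

    Inside single or double quotes, whitespace is kept and a backslash makes
    the following character literal.
    """
    tokens, current = [], []
    quote = None       # the active quote character, or None outside quotes
    in_token = False
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])  # escaped character, taken literally
                i += 1
            elif ch == quote:
                quote = None                 # closing quote
            else:
                current.append(ch)
        elif ch in "'\"":
            quote = ch                       # opening quote starts or extends a token
            in_token = True
        elif ch.isspace():
            if in_token:
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if in_token:
        tokens.append("".join(current))
    return tokens


print(parse_line('ReplaceRules append "foo bar"'))
print(parse_line('ReplaceRules append "foo\\"bar"'))
```

The second call shows the escaped quote surviving as part of the token, which is the case the bug fix addresses.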


=item * Example C<user.config> file removed.

Previous versions of Swish-e included a configuration file called
C<user.config> which contained examples of all directives. This has
been replaced by a series of example configuration files located in the
C<conf> directory. The configuration directives are now described in
L<SWISH-CONFIG|SWISH-CONFIG>.

=item * Ports to Win32 and VMS

David Norris has included the files required to build Swish-e under
Windows. See C<src/win32>. A self-extracting Windows version is
available from the Download page of the swish-e.org web site.

Jean-François Piéronne has provided the files required to build Swish-e
under OpenVMS. See C<src/vms> for more information.

=item * String properties are concatenated

Multiple I<string> properties of the same name in a document are now
concatenated into one property. A space character is added between
the strings if needed. A warning will be generated if multiple numeric
or date properties are found in the same document, and the additional
properties will be ignored.

Previously, properties of the same name were added to the index, but
could not be retrieved.

To do: remove the C<next> pointer, and allow a user-defined character to
be placed between properties.

=item * regex type added to ReplaceRules

A more general purpose pattern replacement syntax.

=item * New Parsers

Swish-e's XML parser was replaced with James Clark's expat XML parser
library.

Swish-e can now use Daniel Veillard's libxml2 library for parsing HTML and
XML. This requires installation of the library before building Swish-e.
See the L<INSTALL|INSTALL> document for information. libxml2 is not
required, but is strongly recommended for parsing HTML over Swish-e's
internal HTML parser, and provides more features for both HTML and
XML parsing.

=item * Support for zlib

Swish-e can be compiled with zlib. This is useful for compressing large
properties. Building Swish-e with zlib is strongly recommended if you
use its C<StoreDescription> feature.

=item * LST type of document no longer supported

LST allowed indexing of files that contained multiple documents.

=item * Temporary files

To improve security Swish-e now uses the C<mkstemp(3)> function to
create temporary files. Temporary files are used while indexing only.
This may result in some portability issues, but the security issues
were overriding.

(Currently this does not apply to the C<-S http> indexing method.)

C<mkstemp> opens the temporary file with C<O_EXCL|O_CREAT> flags. This
prevents overwriting existing files. In addition, the name of the file
created is a lot harder for attackers to guess. The temporary file is
created with only owner permissions.

Please report any portability issues on the Swish-e discussion list.
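
The same guarantees can be observed from Python, whose C<tempfile.mkstemp> offers comparable behavior (exclusive creation, owner-only permissions on POSIX). This is an analogy in Python rather than the C call Swish-e makes; the helper name is hypothetical, and the "swtmp" prefix is borrowed from the temporary file naming described in this document.

```python
import os
import stat
import tempfile


def create_swish_temp(directory=None):
    """Create a private temporary file whose name starts with 'swtmp'."""
    # tempfile.mkstemp creates the file exclusively, so an existing file is
    # never overwritten, and on POSIX it is readable/writable by the owner only.
    fd, path = tempfile.mkstemp(prefix="swtmp", dir=directory)
    return fd, path


fd, path = create_swish_temp()
try:
    mode = stat.S_IMODE(os.fstat(fd).st_mode)
    print(os.path.basename(path), oct(mode))
    os.write(fd, b"scratch data used while indexing\n")
finally:
    os.close(fd)
    os.remove(path)  # clean up, as an aborted run would otherwise leave the file
```

On a POSIX system the printed mode is expected to be owner read/write only; other platforms may report different permission bits.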

=item * Temporary file locations

Swish-e now uses the environment variables C<TMPDIR>, C<TMP>, and
C<TEMP> (in that order) to decide where to write temporary files.
The configuration setting of L<TmpDir|SWISH-CONFIG/"item_TmpDir"> will
be used if none of the environment variables are set. Swish-e uses the
current directory otherwise; there is no default temporary directory.

Since the environment variables override the configuration settings,
a warning will be issued if you set L<TmpDir|SWISH-CONFIG/"item_TmpDir">
in the configuration file and there's also an environment variable set.

Temporary files begin with the letters "swtmp" (which can be changed in
F<config.h>), followed by two or more letters that indicate the type of
temporary file, and some random characters to complete the file name.
If indexing is aborted for some reason you may find these temporary
files left behind.
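
The lookup order described above can be written out as a short sketch. The function and its return shape are hypothetical, and treating an empty environment variable as unset is an assumption; only the precedence (C<TMPDIR>, C<TMP>, C<TEMP>, then TmpDir, then the current directory) comes from the text.

```python
import os


def choose_tmp_dir(environ, config_tmpdir=None):
    """Return (directory, warn) following the precedence described above.

    warn is True when an environment variable overrides a TmpDir setting.
    """
    for name in ("TMPDIR", "TMP", "TEMP"):
        value = environ.get(name)
        if value:
            return value, config_tmpdir is not None
    if config_tmpdir:
        return config_tmpdir, False
    # No environment variable and no TmpDir: fall back to the current directory.
    return os.curdir, False


print(choose_tmp_dir({"TMP": "/var/tmp", "TEMP": "/scratch"}))        # ('/var/tmp', False)
print(choose_tmp_dir({"TMPDIR": "/fast"}, config_tmpdir="/cfg/tmp"))  # ('/fast', True)
print(choose_tmp_dir({}, config_tmpdir="/cfg/tmp"))                   # ('/cfg/tmp', False)
```

Passing C<os.environ> as the first argument applies the same rule to the real environment.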

=item * New Fuzzy indexing method Double Metaphone

Two new methods of creating a fuzzy index, Metaphone and Double Metaphone,
have been added (in addition to Stemming and Soundex). Both are based on
Lawrence Philips' Metaphone algorithm.

=back

Changes to Configuration File Directives. Please see
L<SWISH-CONFIG|SWISH-CONFIG> for more info.

=over 4

=item * New directives: IndexContents and DefaultContents

The IndexContents directive assigns internal Swish-e document parsers
to files based on their file type. The DefaultContents directive
assigns a parser to be used on files that are not assigned a parser with
IndexContents.

=item * New directive: UndefinedMetaTags [error|ignore|index|auto]

This describes what to do when a meta tag is found in a document that
is not listed in the MetaNames directive.

=item * New directive: IgnoreTags

Ignores text within the listed tags.

=item * New directive: SwishProgParameters I<list of words>

Passes the listed words to the external program when running with the
C<-S prog> document source method.

=item * New directive: ConvertHTMLEntities [yes|no]

Controls parsing and conversion of HTML entities.

=item * New directive: DontBumpPositionOnMetaTags

The word position is now bumped when a new metatag is found. This is
to prevent phrases from matching across meta tags. This directive will
disable this behavior for the listed tags.

This directive works for HTML and XML documents.

=item * Changed directive: IndexComments

This has been changed such that comments are not indexed by default.

=item * Changed directive: IgnoreWords

The builtin list of stopwords has been removed. Use of the SwishDefault
word will generate a warning, and no stop words will be used. You must
now specify a list of stopwords, or specify a file of stopwords.

A sample file C<stopwords.txt> has been included in the F<conf/stopwords>
directory of the distribution, and can be used by the directive:

    IgnoreWords File: /path/to/stopwords.txt

=item * Change of the default for IgnoreTotalWordCountWhenRanking

The default is now "yes".

=item * New directive: Buzzwords

Buzzwords are words that should be indexed as-is, without checking
for stopwords, word length, WordCharacters, or any other of the word
limiting features. This allows indexing of things like C<C++> when "+"
is not listed in WordCharacters.

Currently, IgnoreFirstChar and IgnoreLastChar will be stripped before
processing Buzzwords.

In the future we may use separate IgnoreFirst/Last settings for buzzwords
since, for example, you may wish to index all C<+> within Swish-e words,
but strip C<+> from the start/end of Swish-e words, but not from the
buzzword C<C++>.

=item * New directives: PropertyNamesNumeric PropertyNamesDate

Before Swish-e 2.2 all user-defined document properties were stored in
the index as strings. PropertyNamesNumeric and PropertyNamesDate tell
it that a property should be stored in binary format. This allows
for correct sorting of numeric properties.

Currently, only integers can be stored, such as a unix timestamp. (Swish-e
uses C<strtoul> to convert the number to an unsigned long internally.)

PropertyNamesDate only indicates to Swish-e that a number is a unix
timestamp, and to display the property as a formatted time when printing
results. Swish does not currently parse date strings; you must provide
a unix timestamp.
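
The difference between sorting numbers as strings and as integers is easy to demonstrate. The snippet below is a generic Python illustration with made-up values, not Swish-e code, and the date format shown is only an example (the format Swish-e prints is not specified here).

```python
import time

values = ["900", "1000", "25"]        # hypothetical property values

as_strings = sorted(values)           # character-by-character comparison
as_numbers = sorted(values, key=int)  # numeric comparison

print(as_strings)  # ['1000', '25', '900']
print(as_numbers)  # ['25', '900', '1000']

# A date property is just an integer unix timestamp that gets formatted for display.
timestamp = 1030579200
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp)))  # 2002-08-29 00:00:00
```

Storing the value as an integer is what makes the second ordering possible without reparsing strings at search time.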

=item * New directive: MetaNameAlias

You may now create alias names for MetaNames. This allows you to map or
group multiple names to the same MetaName.

=item * New directive: PropertyNameAlias

Creates aliases for a PropertyName.

=item * New directive: PropertyNamesMaxLength

Sets the maximum length of a text property.

=item * New directive: HTMLLinksMetaName

Defines a metaname to use for indexing href links in HTML documents.
Available only with the libxml2 parser.

=item * New directive: ImageLinksMetaName

Defines a metaname to use for indexing src links in <img> tags.
Allows you to search image pathnames within HTML pages. Available only
with the libxml2 parser.

=item * New directive: IndexAltTagMetaName

Allows indexing of image ALT tags. Available only with the libxml2 parser.

=item * New directive: AbsoluteLinks

Attempts to convert relative links indexed with HTMLLinksMetaName and
ImageLinksMetaName to absolute links. Available only with the libxml2 parser.

=item * New directive: ExtractPath

Allows you to use a regular expression to extract part of the path
of each file and index it with a meta name. For example, this allows
searches to be limited to parts of your file tree.

=item * New directive: FileMatch

FileMatch is similar to FileRules. Where FileRules is used to exclude
files and directories, FileMatch is used to I<include> files.

=item * New directive: PreSortedIndex

Controls which properties are pre-sorted while indexing. All properties
are sorted by default.

=item * New directive: ParserWarnLevel

Sets the level of warning printed when using libxml2.

=item * New directive: obeyRobotsNoIndex [yes|NO]

When using libxml2 to parse HTML, Swish-e will skip files marked as
NOINDEX.

    <meta name="robots" content="noindex">

Also, comments may be used within HTML and XML source documents to block
sections of content from indexing:

    <!-- SwishCommand noindex -->
    <!-- SwishCommand index -->

These shorter forms may also be used:

    <!-- noindex -->
    <!-- index -->
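
The effect of the comment markers can be approximated with a regular expression. This Python sketch is only a rough model of the behavior described above: Swish-e does this inside its libxml2-based parser, and details such as case handling or an unclosed noindex section (dropped to the end of the document here) are assumptions.

```python
import re

# Matches from a noindex marker up to the next index marker (or end of text).
_NOINDEX_SECTION = re.compile(
    r"<!--\s*(?:SwishCommand\s+)?noindex\s*-->"
    r".*?"
    r"(?:<!--\s*(?:SwishCommand\s+)?index\s*-->|\Z)",
    re.DOTALL,
)


def strip_noindex_sections(markup):
    """Remove content between noindex and index comment markers."""
    return _NOINDEX_SECTION.sub("", markup)


page = "keep <!-- noindex -->navigation menu<!-- index --> this"
print(strip_noindex_sections(page))  # keep  this
```

Only the text outside the marked sections would then be handed on for indexing.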


=item * New directive: UndefinedXMLAttributes

This describes how the content of XML attributes should be indexed,
if at all. This is similar to UndefinedMetaTags, but is only for XML
attributes and when parsed by libxml2. The default is to not index
XML attributes.

=item * New directive: XMLClassAttributes

XMLClassAttributes can specify a list of attribute names whose content
is combined with the element name to form metanames.

=item * New directive: PropCompressionLevel [0-9]

If compiled with zlib, Swish-e uses this setting to control the level
of compression applied to properties. Properties must be long enough
(defined in F<config.h>) to be compressed. Useful for StoreDescription.

=item * Experimental directive: IgnoreNumberChars

Defines a set of characters. If a word is made up of I<only> those
characters the word will not be indexed.
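
The rule amounts to a single check per word. The function name and the character set below are hypothetical; the sketch only restates the rule given above in Python.

```python
def should_index(word, ignore_number_chars):
    """Index a word only if it contains at least one character outside the set."""
    return any(ch not in ignore_number_chars for ch in word)


chars = "0123456789$.,"                 # example character set, chosen for illustration
print(should_index("1,234.50", chars))  # False: made up only of listed characters
print(should_index("mp3", chars))       # True: contains 'm' and 'p'
```

A word that mixes listed and unlisted characters is therefore still indexed.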

=item * New directive: FuzzyIndexingMode

This configuration directive is used to define the type of "fuzzy" index
to create. Currently the options are:

    None
    Stemming
    Soundex
    Metaphone
    DoubleMetaphone

=back

Changes to command line arguments. See L<SWISH-RUN|SWISH-RUN> for
documentation on these switches.

=over 4

=item * New command line argument C<-H>

Controls the level (verbosity) of header information printed with
search results.

=item * New command line argument C<-x>

Provides additional header output and allows for a I<format string>
to describe what data to print.

=item * New command line argument C<-k>

Prints words stored in the Swish-e index.

=item * New command line argument C<-N>

Provides a way to do incremental indexing by comparing last modification
dates. You pass C<-N> a path to a file and only files newer than the
last modified date of that file will be indexed.
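
The selection rule behind C<-N> can be sketched as a modification-time comparison. This is a Python illustration of the idea, not how Swish-e implements it; the function name is hypothetical.

```python
import os


def files_newer_than(reference_path, candidates):
    """Return the candidate paths modified after the reference file."""
    cutoff = os.path.getmtime(reference_path)
    return [path for path in candidates if os.path.getmtime(path) > cutoff]
```

A common pattern with this kind of check is to use a file written by the previous indexing run as the reference, so that only documents changed since that run are selected.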

=item * Removed command line argument C<-D>

C<-D> no longer dumps the index file data. Use C<-T> instead.

=item * New command line argument C<-T>

C<-T> is used for debugging indexing and searching.

=item * Enhanced command line argument C<-d>

Now C<-d> can accept some back-slashed characters to be used as output
separators.

=item * Enhanced command line argument C<-P>

Now C<-P> sets the phrase delimiter character in searches.

=item * New command line argument C<-L>

Swish-e 2.2 contains an B<experimental> feature to limit results by a
range of property values. The behavior of this feature may change in
the future.

=item * Modified command line argument C<-v>

Now the argument C<-v 0> results in I<no> output unless there is an error.
This is a bit more handy when indexing with cron.

=back
