=head1 NAME

Proposed changes for Swish-e 3.0

=head1 Overview

This page is intended to give users of Swish-e an idea of the changes
to come, to foster discussion of the direction of Swish-e, and to provide
a place where developers can map out new ideas.

None of this is written in stone. Any of the developers can write their
ideas in this document, but that doesn't mean they will actually happen ;).

=head1 Support Incremental Indexing

The Swish-e index structure currently makes incremental indexing and
range limiting difficult, and its memory requirements limit how much
can be indexed. A database may solve some of these issues, possibly at
a cost in performance.

Swish-e has been linked with Berkeley DB. Although indexing is much
slower with it, this may allow incremental indexing. Currently, the idea
is to offer both database backends.

=head1 Split code into Search and Indexing code

There may be a small benefit from creating a smaller search-only program.
CGI scripts may be faster, and the code would be smaller for those who
want to embed Swish-e into other applications.

Currently, linking libswish-e into a program adds about 720K. That is
not very significant, but it could be if a number of processes are
running with Swish-e. Another option is to build libswish-e as a shared
library.

=head1 Swish Server

Someone needs to write a threaded Swish-e server. Or maybe just a
pre-forking server in Perl, to see how it works...

=head1 Switch to Content-Types

Moseley: Dec 28, 2000

I'm wondering if it might be smart to switch from the current "Document
Types" to Content-Types. Currently, Swish-e knows how to parse three
types of documents: TXT, HTML, and XML. There are currently two new
configuration directives, DefaultContents and IndexContents, that map
file extensions to one of the three types. This doesn't really work
when spidering, since it's the content-type that describes the document
and not the file extension.

It's an issue that can wait, but I'm concerned about backward
compatibility if people start using the IndexContents and DefaultContents
config directives and then we change to content-type in the future.
There are probably not that many people using those, but it might be
worth noting in the documentation that it will change, if we agree.

The main reason to use content-type instead is for HTTP processing, where
you can't depend on the file extension to determine the document type,
so with HTTP we have to use content-type to determine how to deal with
the file. This is somewhat moot, as mapping can now be done with -S prog.

I'd propose that Swish-e use a mime.types file to map from extension
to content-type. You could add or override mappings in the config file:

    AddType text/plain .doc .log

    DefaultType text/html   # like DefaultContents currently

The file source "plug-in" (whatever that ends up being) would return a
content-type, but if none is returned then Swish-e would map the type
from the file name using the mime.types file or any AddType directives.
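
To make the lookup order concrete, here is a minimal Python sketch of
how the resolution might work. It is only an illustration, not Swish-e
code: the function name, its parameters, and the ADD_TYPES table are
hypothetical, and Python's standard mimetypes module stands in for a
mime.types file.

```python
import mimetypes


def resolve_content_type(filename, source_type=None, add_types=None,
                         default_type="text/html"):
    """Pick a content-type for a document (hypothetical helper).

    Order tried: the type reported by the file source, any AddType
    overrides, the mime.types table, then the DefaultType fallback.
    """
    if source_type:
        # Drop parameters such as "; charset=utf-8".
        return source_type.split(";", 1)[0].strip().lower()

    lowered = filename.lower()
    for ext, ctype in (add_types or {}).items():
        if lowered.endswith(ext):
            return ctype

    # The stdlib table plays the role of the mime.types file here.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default_type


# Mirrors the proposed "AddType text/plain .doc .log" line.
ADD_TYPES = {".doc": "text/plain", ".log": "text/plain"}
```

With this sketch, resolve_content_type("access.log", add_types=ADD_TYPES)
uses the AddType override, while a file with no known extension falls
through to the DefaultType value.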

Again, internally Swish-e only knows about text/[TXT|HTML|XML], so there
should be a way to map other types, otherwise Swish-e might ignore
the file. We could continue to use the three type names or switch
completely to content-types.

For example, if we continued to use [TXT|HTML|XML]:

    MapType TXT  text/directory text/logfile
    MapType HTML text/html

Or maybe just extend the current directives:

    IndexContents HTML .htm .html text/html

Here the content-type would take precedence over the file extensions.

This would tell Swish-e that those types are handled by those internal
handlers.
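
The precedence rule can be sketched in a few lines of Python. This is
an assumption-laden illustration rather than a design: the two tables
and the select_parser name are hypothetical, the text/plain and text/xml
entries are guesses at sensible defaults, and falling back to the
extension when the content-type is unmapped is one possible choice.

```python
import os

# Hypothetical tables mirroring the directives discussed above:
#   MapType TXT text/directory text/logfile
#   IndexContents HTML .htm .html text/html
TYPE_TO_PARSER = {
    "text/plain": "TXT",       # assumed default, not from the proposal
    "text/directory": "TXT",
    "text/logfile": "TXT",
    "text/html": "HTML",
    "text/xml": "XML",         # assumed default, not from the proposal
}
EXT_TO_PARSER = {".htm": "HTML", ".html": "HTML",
                 ".txt": "TXT", ".xml": "XML"}


def select_parser(filename, content_type=None):
    """Return "TXT", "HTML", "XML", or None if nothing handles the file."""
    if content_type is not None:
        parser = TYPE_TO_PARSER.get(content_type.lower())
        if parser is not None:
            # A mapped content-type wins over the file extension.
            return parser
    ext = os.path.splitext(filename)[1].lower()
    return EXT_TO_PARSER.get(ext)
```

A return value of None corresponds to the case where Swish-e might
ignore the file because no internal handler is mapped to it.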

Then, as I've mentioned before, you might specify filters as such:

    FilterDocument application/msword /path/to/word-to-text

And word-to-text would convert to text and return one of the three
content-types that Swish-e knows how to parse, or a different
content-type if we were to chain filters.
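
Chaining could work roughly as in the Python sketch below. Everything
in it is hypothetical: real filters would be external programs such as
the word-to-text example, whereas here plain callables stand in for
them, and the PARSEABLE set and the max_steps guard are assumptions
made for the illustration.

```python
# Content-types assumed to correspond to the TXT, HTML, and XML parsers.
PARSEABLE = {"text/plain", "text/html", "text/xml"}


def run_filters(data, content_type, filters, max_steps=5):
    """Apply filters until the content-type is one that can be parsed.

    `filters` maps a content-type to a callable that takes the document
    data and returns (new_data, new_content_type). Returns None when no
    filter is registered or the chain does not finish in max_steps.
    """
    for _ in range(max_steps):
        if content_type in PARSEABLE:
            return data, content_type
        step = filters.get(content_type)
        if step is None:
            return None
        data, content_type = step(data)
    # Guard against filters that keep handing the document around.
    return (data, content_type) if content_type in PARSEABLE else None
```

The step limit is there only so that a misconfigured pair of filters
that convert back and forth cannot loop forever.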

=head1 Enhance the PropertyNames directive

Moseley: Updated Jan 13, 2001

If the PropertyNames directive were enhanced to limit the number of
characters stored, optionally extract text from HTML, and define what
type of docs (text, XML, HTML) it applies to, then the existing
PropertyNames feature would work like the new StoreDescription feature
but be useful for more than just one use.

I'm not clear how to enhance the syntax of Properties and/or Metanames,
but here are some ideas. Rainer suggested that an XML-type format
might be best and commonly understood. That's a good idea. Below are
some older ideas that I had, but you will get the idea...

The metaname structure could have flags for properties:

    1 - limiting to a length
    2 - stripping HTML
    3 - encoding HTML entities on output

Oct 9, 2001 - The code is now in Swish-e to limit a string property to
a length. The stripping of HTML is an issue for discussion. And encoding
entities on output should be a result_output.c issue.
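
As a rough illustration of the three flags, the Python sketch below
splits the work the way the note suggests: stripping and truncation
happen when the property is stored, and entity encoding happens on
output. The function names and parameters are hypothetical, and the
tag stripping uses Python's standard HTML parser rather than anything
in Swish-e.

```python
import html
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def store_property(value, max_len=None, strip=False):
    """Index time: optionally strip markup, then limit the length."""
    if strip:
        value = strip_html(value)
    if max_len is not None:
        value = value[:max_len]
    return value


def output_property(value, encode_entities=False):
    """Search time: optionally encode HTML entities for display."""
    return html.escape(value) if encode_entities else value
```

Stripping before truncating means the length limit counts visible
characters rather than markup, which is probably what a description
property wants.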

=head1 Apache/XML style configuration

This would allow some directives to be set per directory, or per
file extension (or content-type).

=head1 Document Info

$Id: SWISH-3.0.pod,v 1.6 2002/04/15 02:34:43 whmoseley Exp $

.