=head1 NAME

Proposed changes for Swish-e 3.0

=head1 Overview

This page is intended to give users of Swish-e an idea of the changes
to come, to foster discussion of the direction of Swish-e, and to give
developers a place to map out new ideas.

None of this is written in stone. Any of the developers can write their
ideas in this document, but that doesn't mean they will actually happen ;).


=head1 Support Incremental Indexing

The Swish-e index structure currently makes incremental indexing and
range limiting difficult, and its memory requirements limit how much
can be indexed. A database may solve some of these issues, possibly at
a cost in performance.

Swish-e has been linked with Berkeley DB. Although indexing is much
slower, this may allow incremental indexing. Currently, the idea is
to offer both backends: the existing native index and Berkeley DB.

=head1 Split code into Search and Indexing code

There may be a small benefit from creating a smaller search-only program.
CGI scripts may be faster, and the code would be smaller for those who
want to embed Swish-e into other applications.

Currently, linking libswish-e into a program adds about 720K. That is
not very significant, but it could be if a number of processes are
running with Swish-e. Another option is to build libswish-e as a shared
library.

=head1 Swish Server

Someone needs to write a threaded Swish-e server. Or maybe just a
pre-forking server in Perl, to see how it works...

=head1 Switch to Content-Types

Moseley: Dec 28, 2000

I'm wondering if it might be smart to switch from the current "Document
Types" to Content-Types. Currently, Swish-e knows how to parse three
types of documents: TXT, HTML, and XML. There are currently two new
configuration directives, DefaultContents and IndexContents, that map
file extensions to one of the three types. This doesn't really work
when spidering, since it's the content-type that describes the document
and not the file extension.

It's an issue that can wait, but I'm concerned about backward
compatibility if people start using the IndexContents and DefaultContents
config directives and we then change to content-types in the future.
There are probably not that many people using those, but it might be
worth noting in the documentation that it will change, if we agree.

The main reason to use content-type instead is for HTTP processing, where
you can't depend on the file extension to determine the document type,
so with HTTP we have to use the content-type to determine how to deal
with the file. This is somewhat moot, as mapping can now be done with
-S prog.

I'd propose that Swish-e use a mime.types file to map from extension
to content-type. You could add or override mappings in the config file:

    AddType text/plain .doc .log

    DefaultType text/html   # like DefaultContents currently

The file source "plug-in" (whatever that ends up being) would return a
content-type; if it did not, Swish-e would map the type from the file
name using the mime.types file or any AddType directives.
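
The lookup order described above can be sketched in a few lines. This is
only an illustration in Python, not Swish-e code: the function names, the
in-memory tables, and the sample mime.types text are all assumptions made
for the example. It prefers a content-type reported by the file source,
then AddType overrides, then mime.types, then DefaultType.

```python
import os


def parse_mime_types(text):
    """Parse mime.types-style text ("type ext ext ...") into {".ext": "type"}."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        content_type, *extensions = line.split()
        for ext in extensions:
            mapping["." + ext.lower().lstrip(".")] = content_type
    return mapping


def resolve_content_type(path, reported=None, add_type=None, mime_types=None,
                         default_type="text/html"):
    """Pick a content-type: reported, then AddType, then mime.types, then default."""
    if reported:
        return reported
    ext = os.path.splitext(path)[1].lower()
    for table in (add_type or {}, mime_types or {}):
        if ext in table:
            return table[ext]
    return default_type


# Hypothetical data standing in for a mime.types file and AddType directives.
MIME_TYPES = parse_mime_types("text/html html htm\napplication/msword doc\n")
ADD_TYPE = {".doc": "text/plain", ".log": "text/plain"}
```

With these tables, C<a.doc> resolves to text/plain because the AddType
override wins over mime.types, while a file with no extension falls back
to the default type.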

Again, internally Swish-e only knows about text/[TXT|HTML|XML], so there
should be a way to map other types to them; otherwise Swish-e might
ignore the file. We could continue to use the three type names or switch
completely to content-types.

For example, if we continued to use [TXT|HTML|XML]:

    MapType TXT text/directory text/logfile
    MapType HTML text/html

Or maybe just extend the current directives:

    IndexContents HTML .htm .html text/html

Where the content-type would have precedence over the file extensions.

This would tell Swish-e that those types are handled by those internal
handlers.
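
To make the precedence rule concrete, here is a rough Python sketch of
how an extended IndexContents directive might be resolved. Nothing here
is existing Swish-e behavior: the table layout, the function names, and
the TXT line are assumptions for illustration, and falling back to the
extension when the content-type is unmapped is one possible reading of
"precedence".

```python
import os

# Hypothetical parsed form of:
#   IndexContents HTML .htm .html text/html
#   IndexContents TXT  .txt text/directory text/logfile   (assumed example)
INDEX_CONTENTS = {
    "HTML": [".htm", ".html", "text/html"],
    "TXT": [".txt", "text/directory", "text/logfile"],
}


def build_tables(index_contents):
    """Split directive arguments into content-type and extension lookup tables."""
    by_type, by_ext = {}, {}
    for parser, args in index_contents.items():
        for arg in args:
            if arg.startswith("."):
                by_ext[arg.lower()] = parser   # argument is a file extension
            else:
                by_type[arg.lower()] = parser  # argument is a content-type
    return by_type, by_ext


def pick_parser(path, content_type, by_type, by_ext):
    """Return the internal parser name, or None if the document would be ignored."""
    if content_type and content_type.lower() in by_type:
        return by_type[content_type.lower()]  # content-type wins when mapped
    ext = os.path.splitext(path)[1].lower()
    return by_ext.get(ext)


BY_TYPE, BY_EXT = build_tables(INDEX_CONTENTS)
```

A document named C<page.txt> that arrives with content-type text/html
would be handed to the HTML parser here, and a document whose type and
extension are both unmapped yields None, matching the "Swish-e might
ignore the file" case.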

Then, as I've mentioned before, you might specify filters as such:

    FilterDocument application/msword /path/to/word-to-text

And word-to-text would convert to text and return one of the three
content-types that Swish-e knows how to parse, or a different
content-type if we were to chain filters.
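
Filter chaining could work roughly as follows. This Python sketch is an
assumption-laden illustration, not a design: the filters are plain
functions standing in for external programs such as word-to-text, the
set of parseable types (text/plain, text/html, text/xml) is a guess at
how TXT, HTML, and XML would be named, and the step limit is an added
safeguard against filters that loop.

```python
import gzip

# Assumed content-type names for the three internal parsers.
PARSEABLE = {"text/plain", "text/html", "text/xml"}


def word_to_text(data):
    """Stand-in for /path/to/word-to-text: pretend to extract plain text."""
    return "text/plain", data.decode("latin-1").strip()


def gunzip(data):
    """Hypothetical first-stage filter that hands its output to another filter."""
    return "application/msword", gzip.decompress(data)


# Hypothetical parsed form of a set of FilterDocument directives.
FILTERS = {"application/msword": word_to_text, "application/x-gzip": gunzip}


def run_filters(content_type, data, filters, parseable=PARSEABLE, max_steps=5):
    """Apply filters until the content-type is one the indexer can parse."""
    steps = 0
    while content_type not in parseable:
        filt = filters.get(content_type)
        if filt is None:
            return None, data  # no filter registered: the document is skipped
        if steps == max_steps:
            raise RuntimeError("filter chain too long (possible loop)")
        content_type, data = filt(data)
        steps += 1
    return content_type, data
```

Each filter returns a new content-type along with the converted data, so
the loop needs no knowledge of which filters exist; it simply keeps
looking up the current type until a parseable one comes back.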


=head1 Enhance the PropertyNames directive

Moseley: Updated Jan 13, 2001

If the PropertyNames directive were enhanced to limit the number of
characters stored, optionally extract text from HTML, and define which
types of documents (text, XML, HTML) it applies to, then the existing
PropertyNames feature would work like the new StoreDescription feature
but be useful for more than just one purpose.

I'm not clear how to enhance the syntax of Properties and/or Metanames,
but here are some ideas. Rainer suggested that an XML-type format
might be best and commonly understood. That's a good idea. Below are
some older ideas that I had. But you will get the idea...

The metaname structure could have flags for properties:

    1 - limiting to a length
    2 - stripping HTML
    3 - encoding HTML entities on output

Oct 9, 2001 - The code is now in Swish-e to limit a string property to
a length. The stripping of HTML is an issue for discussion. And encoding
entities on output should be a result_output.c issue.
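
As a rough illustration of the three flags, the Python sketch below
separates the index-time steps (stripping HTML, limiting length) from
the output-time step (encoding entities), following the note that
entity encoding belongs on the output side. The flag names, function
names, and the order in which the steps run are assumptions for the
example and do not reflect the actual metaname structure.

```python
import enum
import html
from html.parser import HTMLParser


class PropFlag(enum.Flag):
    """Hypothetical per-property flags mirroring the list above."""
    LIMIT_LENGTH = 1
    STRIP_HTML = 2
    ENCODE_ENTITIES = 4


class _TextExtractor(HTMLParser):
    """Collect text content and discard tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def store_property(value, flags, max_length=None):
    """Apply index-time flags before storing a property value."""
    if PropFlag.STRIP_HTML in flags:
        value = strip_html(value)
    if PropFlag.LIMIT_LENGTH in flags and max_length is not None:
        value = value[:max_length]  # truncate after stripping, not before
    return value


def output_property(value, flags):
    """Apply output-time flags when printing a stored property."""
    if PropFlag.ENCODE_ENTITIES in flags:
        value = html.escape(value)
    return value
```

Stripping before truncating means the length limit counts visible
characters rather than markup, which seems closer to what a stored
description needs; whether Swish-e should do it that way is part of the
open discussion.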

=head1 Apache/XML style configuration

This would allow some directives to be set per directory, or per
file extension (or content-type).


=head1 Document Info

$Id: SWISH-3.0.pod,v 1.6 2002/04/15 02:34:43 whmoseley Exp $