I'm developing a standalone application in Java that will enable users to import/add/remove dictionaries. There is a JList (scrollable list of items, entry headers in our case) in gui that contains a portion of the index (since loading all index can be time and memory consuming).
Now, it would be quite simple if I wanted to create 'yet-another-dictionary-compilation-software'. But I don't. Here is a list of requirements that I want to be met in the program:
- there can be dictionaries of any language (source language) and of any size (I'm planning to import up dictionaries with up to 1,000,000 headers);
- the index sort is not just some Locale's Comparator, it can be custom or based on some rule (eg. headers are sorted by semma or word root, part of speech, etc.), so we need to create a column that contains some bigint values sorted and mapped to headers according to some sort rule;
- resizability - we should be able to add/remove more headers, add/remove another dictionary to the index;
- quick index browsing, we should be able to retrieve some amount of headers from an index based on the following parameters: - scrollbar thumb position (can be percent from total), - some header (so even if such header is not in the index, we should return closest match of headers limited to the size of cache limit - comes as a setting), - dictionary set (the set of dictionaries that are activated on the shelf in gui, users can click on a dictionary icon (toggle button) and activate/deactivate the dictionary, thus changing the dictionary set);
- db storage size shouldn't be too large.
I'm using eXist as XML DB server for dictionaries content and HyperSonic as relational db for index. (MySQL won't do since you should download and install it separately, but I want it to work with no such tricks).
eXist works perfect, since it indexates all xml documents and finds the required entries in milliseconds. But I'm having problems with Hypersonic, it retrieves headers too slowly.
After some db redesign on paper I would like to share it with you and ask for advice. I'm not a db guru and might need some know-how about JOINS, FOREIGN KEYS etc.
MY VARIANT 1:
create table main_headers_index (ID bigint unique autoincrement, header varchar(255) unique); - contains list of all headers in all dictionaries, headers do not repeat
create table en_index (header_ID bigint unique, sort_ind bigint unique, uni_dic_ER boolean, <another dictionary indexated with that index>); - contains sort index with boolean meaning whether this header exists in the certain dictionary
Shortcomings: selecting headers will look like "select header_ID from en_index where sort_ind < somevalue and uni_dic_ER='true';"
Too many 'where' especially if we have 20 dictionary on one shelf.
MY VARIANT 2:
create table main_headers_index (ID bigint unique autoincrement, header varchar(255) unique);
create table en_index (header_ID bigint unique, sort_ind bigint unique);
create table uni_ER(header_ID bigint unique, sort_ind_1 bigint, sort_ind_2 bigint, <some other index this dictionary is indexated with>);
The select sql will contain JOIN but I'm still investigating this.
bigint for sort ind is used to indexate headers with some step (eg. 100), so we can easily add new headers to the index without reindexation. In case we will add new dictionaries to the index, total reindexation will be required.
I think I'm one step closer to the solution of this problem.
First off I'm going to use binary trees, avl approach to constuct index (so that I don't have to worry about insertion,deletion, sort,find in the index).
And I'm going to read Knuth.
Also the problem with selecting headers only from the active dictionaries on the current dictionaries shelf seems like it can be solved on the b-trees level. Any help welcome.