Dear folks,

I'm dealing with a pattern-matching problem and I have to store 700,000 key/data pairs ("rows") in a db ("table").

The key is a logical number (0 to 699,999), and the data is an array of 65 integers.

I'm new to Berkeley DB, but after reading the manual and successfully testing the C API, I would like to share my results and ask for advice on tuning my config for this specific need.

As the data field has a fixed length (65 * sizeof(int)), I've chosen to use DB_QUEUE. (From what I've read, this is the best choice for my case, CORRECT?)
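In case it helps, this is more or less how I open each database (the filename is just a placeholder, and error handling is simplified):

#include <stdio.h>
#include <db.h>

#define RECORD_LEN (65 * sizeof(int))

/* Open (or create) one of the fixed-length queue databases. */
static DB *open_queue(const char *file)
{
    DB *dbp;
    int ret;

    if ((ret = db_create(&dbp, NULL, 0)) != 0) {
        fprintf(stderr, "db_create: %s\n", db_strerror(ret));
        return NULL;
    }
    /* DB_QUEUE stores fixed-length records; the length must be
     * declared before the database is opened. */
    dbp->set_re_len(dbp, RECORD_LEN);

    if ((ret = dbp->open(dbp, NULL, file, NULL,
                         DB_QUEUE, DB_CREATE, 0664)) != 0) {
        dbp->err(dbp, ret, "%s", file);
        dbp->close(dbp, 0);
        return NULL;
    }
    return dbp;
}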

The database will be used for READ ONLY operations and will not be queried "concurrently" by several "clients". Just one C program is going to read one db at a time.

ITERATION SPEED
---------------------------------
After filling the DB with 700,000 "rows", the total time needed to iterate over all the data using a cursor is ~5 seconds the first time and ~2 seconds the second time the C program is called.
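The iteration itself is nothing special, roughly this:

#include <string.h>
#include <db.h>

/* Walk every record in the database with a cursor. */
static int iterate_all(DB *dbp)
{
    DBC *dbcp;
    DBT key, data;
    int ret;

    if ((ret = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0)
        return ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    while ((ret = dbcp->c_get(dbcp, &key, &data, DB_NEXT)) == 0) {
        db_recno_t recno = *(db_recno_t *)key.data; /* record number */
        int *values = (int *)data.data;             /* the 65 ints   */
        (void)recno; (void)values;                  /* ... use them ... */
    }
    /* DB_NOTFOUND just means we reached the end. */
    if (ret == DB_NOTFOUND)
        ret = 0;
    dbcp->c_close(dbcp);
    return ret;
}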

As I'm a beginner with Berkeley DB, I believe this difference is due to the cache system. (Am I right?)

QUERY SPEED
----------------------------

"Queryng" the DB with (DB->get) for 140.000 random keys takes around 0.8 seconds.

MY SCENARIO
------------------------

I'm going to use this DB for pattern matching, so all "queries" to the db will effectively be random accesses.
Each time I need to do pattern matching, I have to query the database 140,000 times (at random keys).

(The best solution would be to have the whole database loaded in memory, ~183 MB.)
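Since the keys are dense and the records fixed-length, I could in principle bulk-load each table into a flat array once and index it directly; a rough sketch of what I mean (NROWS and NINTS are just the figures above):

#include <stdlib.h>
#include <string.h>
#include <db.h>

#define NROWS 700000
#define NINTS 65

/* Bulk-load the whole table into a flat in-memory array (~183 MB);
 * afterwards row i is simply table + (size_t)i * NINTS. */
static int *load_all(DB *dbp)
{
    DBC *dbcp;
    DBT key, data;
    int *table, ret;

    if ((table = malloc((size_t)NROWS * NINTS * sizeof(int))) == NULL)
        return NULL;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    if (dbp->cursor(dbp, NULL, &dbcp, 0) != 0) {
        free(table);
        return NULL;
    }
    while ((ret = dbcp->c_get(dbcp, &key, &data, DB_NEXT)) == 0) {
        db_recno_t recno = *(db_recno_t *)key.data;  /* 1-based */
        memcpy(table + (size_t)(recno - 1) * NINTS,
               data.data, NINTS * sizeof(int));
    }
    dbcp->c_close(dbcp);
    return table;
}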

Berkeley DB seemed to be very fast for ONE DB with 700,000 "rows" once it has been accessed before (as noted above, the iteration time falls from 5 to 2 seconds on the second run).

But the problem is that I need to create several (20+) DBs like that, each with 700,000 rows.

The first 5 DBs show the same speed as the experiment with the first one, described above.
The problem is that, after the 5th, each new DB takes a greater amount of time to iterate over its full content: the 6th DB takes 6 seconds, the 7th takes 8 seconds, and the 20th takes almost 12 seconds.
When it comes to querying the DBs randomly, the same slowdown happens, but much more steeply. The query time becomes very slow: more than 60 seconds on the 20th DB.

Would you suggest any tuning of the config (page size, cache and mmap size for my case), or even another DB / access method or structure for indexing this data?
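For example, would something along these lines be the right direction? (The sizes below are pure guesses on my part; I understand both calls have to be made after db_create() and before DB->open().)

#include <db.h>

/* Called after db_create() and before DB->open().
 * Sizes are guesses, not recommendations. */
static int configure_db(DB *dbp)
{
    int ret;

    /* One 256 MB cache region (gbytes = 0, bytes = 256 MB, ncache = 1),
     * large enough to hold a whole ~183 MB database. */
    if ((ret = dbp->set_cachesize(dbp, 0, 256 * 1024 * 1024, 1)) != 0)
        return ret;

    /* Larger pages mean fewer pages for the same fixed-size records;
     * only takes effect when the database file is first created. */
    return dbp->set_pagesize(dbp, 8192);
}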


I do appreciate your help,
Thank you,
Maurice