06-10-11, 06:03 #1 Registered User
- Join Date
- Jun 2011
Which database for hierarchical file and folder structure?
I am new here, so I hope I am posting this in the correct section; otherwise please move it. Sorry for my bad English.
My problem: I would like to store a complete file / folder structure with additional information (inode, mtime, ctime, maybe checksum) in a database.
The purpose is to replicate file / folder structures between Linux / Unix servers. While the service is running, all changes that occur can be caught (inotify / fsnotify) and written to the database. But if the service is not running and files or folders on the file system have changed, a complete scan is necessary to bring the database back to a consistent state.
In this "complete check", each file has to be checked: is it already in the database, and does its ctime / mtime on the file system differ from the value in the database?
Btw: The programming language will be Perl ...
What I have already tried:
- MySQL: Build the hierarchy with foreign keys, storing folder name, id and parent id (foreign key) in the database -> really slow and inefficient!
- MySQL: Build the hierarchy with nested sets -> 20 - 40 times faster than test 1, but still not that fast!
- MySQL / Perl: Save the complete path in the database and build a "hash tree" with Perl -> fast, but uses much more space (if you have 1000 subfolders, the parent folder's path is stored 999 times more than necessary)!
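To make the trade-off in test 1 concrete, here is a minimal sketch of the parent-id layout. It is written in Python with the built-in sqlite3 module as a stand-in for Perl and MySQL, and the table name `folder` and the helpers `add_folder` / `resolve` are made up for illustration. It shows one likely reason this layout feels slow: finding a folder by its path needs one query per path component.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE folder (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES folder(id),  -- NULL for a top-level folder
        name      TEXT NOT NULL,
        UNIQUE (parent_id, name)
    )
    """
)


def add_folder(parent_id, name):
    """Insert one folder below parent_id and return its new id."""
    cur = conn.execute(
        "INSERT INTO folder (parent_id, name) VALUES (?, ?)", (parent_id, name)
    )
    return cur.lastrowid


def resolve(path):
    """Return the id of the folder at path, or None if it is not stored.

    Each path component costs one SELECT, so deep paths cost many round trips.
    """
    current = None
    for part in path.strip("/").split("/"):
        # "IS ?" is SQLite's NULL-safe comparison, so it also matches the root level.
        row = conn.execute(
            "SELECT id FROM folder WHERE parent_id IS ? AND name = ?",
            (current, part),
        ).fetchone()
        if row is None:
            return None
        current = row[0]
    return current
```

With the full-path layout from test 3 the same lookup is a single indexed query, which is the speed-for-space trade described in the list above.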
The database should also hold other data, e.g. configuration options of the servers.
By now, I am wondering if MySQL (relational) is the correct choice for storing a file / folder structure (hierarchical) ...
With Perl, the handling of XML files is quite easy and efficient ... so might an XML-like database or just a simple XML file be the right choice?
I could also take a simple XML file and read the whole file into a "hash tree" in Perl ... so the "database" is directly in memory and "part" of the program ... efficient?
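As a rough picture of what I mean by a "hash tree", here is a small sketch using nested dictionaries in Python (in Perl it would be nested hashes). The function names `insert` and `lookup` and the `children` / `meta` keys are only illustrative; loading from and saving to XML is left out.

```python
def insert(tree, path, meta):
    """Store meta for path, creating intermediate folder nodes as needed."""
    parts = path.strip("/").split("/")
    node = tree
    for part in parts[:-1]:
        # Descend into the folder, creating an empty node if it is missing.
        node = node.setdefault(part, {"children": {}})["children"]
    entry = node.setdefault(parts[-1], {"children": {}})
    entry["meta"] = meta


def lookup(tree, path):
    """Return the stored meta for path, or None if nothing is stored for it."""
    node = tree
    entry = None
    for part in path.strip("/").split("/"):
        entry = node.get(part)
        if entry is None:
            return None
        node = entry["children"]
    return entry.get("meta")
```

Each folder name is stored only once, so the duplication of test 3 goes away, but the whole tree has to fit in memory and be rebuilt or reloaded on every start.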
The whole solution should be capable of storing information about 100'000 folders and 1'000'000 files; even better would be 1'000'000 folders and 10'000'000 files.
06-10-11, 07:01 #2 Jaded Developer
- Join Date
- Nov 2004
- out on a limb
It depends on what you are using the data for.
You don't always need a relational DB for every task, although most people here would recommend one.
The real advantage of a relational DB is data integrity: knowing the data 'hangs together'. In your case I'm not too certain whether you need a relational DB.
So it comes down to the reasons why you are storing the information and the method of accessing the data.
Bear in mind the real 'cost' of a given type of data storage may not be in the insert process, but in the retrieval and manipulation process. The insertion happens once, the retrieval... who knows.
Unless you are running this on an embedded system, in the days of cheap storage I don't think the amount of space used is 'that' significant.
06-10-11, 09:42 #3 Registered User
- Join Date
- Jun 2011
Hello healdem, thank you for the answer!
There are several "situations":
### The "initial scan" ###
All information about the files / folders on the file system has to be imported into the database: name, ctime / mtime, time of importing, maybe checksum.
-> This has to be done only once.
### The "complete scan" ###
The database has to be updated with the information from the file system to ensure consistency: name, ctime / mtime, time of importing / updating.
- All files which exist only on the file system have to be imported into the DB
- All files which exist only in the DB have to be removed from the DB
- All files which exist on the file system AND in the database have to be checked for mtime / ctime
-> This has to be done from time to time (e.g. once a week) or if files have changed while the service was not running (inconsistency)
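The three cases above boil down to a set comparison. Here is a minimal sketch in Python, assuming both sides can be loaded as a mapping from path to (mtime, ctime); `scan` and `diff` are illustrative names, not part of any existing tool.

```python
import os


def scan(root):
    """Map every file and folder below root to (mtime, ctime) in whole seconds."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat does not follow symlinks; an entry that vanishes between
            # the walk and this call would raise OSError (not handled here).
            st = os.lstat(path)
            state[path] = (int(st.st_mtime), int(st.st_ctime))
    return state


def diff(fs_state, db_state):
    """Return (to_insert, to_delete, to_update) as sets of paths."""
    fs_paths, db_paths = set(fs_state), set(db_state)
    to_insert = fs_paths - db_paths   # only on the file system
    to_delete = db_paths - fs_paths   # only in the DB
    to_update = {                     # on both sides, but the times differ
        p for p in fs_paths & db_paths if fs_state[p] != db_state[p]
    }
    return to_insert, to_delete, to_update
```

Loading the DB side into memory once and comparing there avoids one query per file. Whether that fits in RAM for 10'000'000 files depends on path lengths, so treat that as an open question.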
### The "change request" ###
A change was detected on the file system (inotify / fsnotify). Only the affected record has to be updated in, inserted into, or deleted from the DB.
-> Very often!
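For the per-event case, a sketch of what a single update could look like, again in Python with sqlite3 standing in for MySQL. The table `entry`, the function `apply_change` and the event labels 'created' / 'modified' / 'deleted' are assumptions for illustration; real inotify events would first have to be mapped onto them.

```python
import sqlite3
import time

# Assumption: one row per path, keyed by the full path.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entry (
        path       TEXT PRIMARY KEY,
        mtime      INTEGER,
        ctime      INTEGER,
        updated_at INTEGER
    )
    """
)


def apply_change(kind, path, mtime=None, ctime=None):
    """Apply one change event to the table; kind is a hypothetical label."""
    if kind == "deleted":
        conn.execute("DELETE FROM entry WHERE path = ?", (path,))
    else:
        # INSERT OR REPLACE covers both new and already known paths.
        conn.execute(
            "INSERT OR REPLACE INTO entry (path, mtime, ctime, updated_at) "
            "VALUES (?, ?, ?, ?)",
            (path, mtime, ctime, int(time.time())),
        )
    conn.commit()
```

Since this runs very often, the key point is that each event touches exactly one row through the primary key, regardless of how large the table is.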
### Reports ###
Run reports, e.g. which files have changed since ...
-> Custom ... not that important
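A "changed since" report is where a relational table is convenient. A small sketch, again with sqlite3 as a stand-in and made-up names and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (path TEXT PRIMARY KEY, mtime INTEGER NOT NULL)")
# An index on mtime lets the report avoid scanning every row.
conn.execute("CREATE INDEX entry_mtime ON entry (mtime)")
conn.executemany(
    "INSERT INTO entry (path, mtime) VALUES (?, ?)",
    [("/srv/a", 100), ("/srv/b", 200), ("/srv/c", 300)],  # sample data only
)


def changed_since(timestamp):
    """Return all paths whose stored mtime is newer than timestamp, sorted."""
    rows = conn.execute(
        "SELECT path FROM entry WHERE mtime > ? ORDER BY path", (timestamp,)
    )
    return [path for (path,) in rows]
```

With the in-memory hash tree the same report would mean walking the whole tree, which may still be acceptable given that reports are not that important here.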