I want to build a search engine for my school network, to search across Microsoft shares. I already know how to scan the network, and the scan yields over 600,000 entries. Every entry has 4 fields:
- the IP of the machine where the entry is hosted
- the name of the file corresponding to the entry
- the path of the file
- the size of the file
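To make the layout concrete, here is a minimal Python sketch of one entry. The names `Entry` and `parse_line`, and the tab-separated line format, are only assumptions for illustration; they are not my actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One scanned file (hypothetical record layout)."""
    ip: str    # IP of the machine hosting the file
    name: str  # file name
    path: str  # path of the file on the share
    size: int  # file size in bytes


def parse_line(line: str) -> Entry:
    """Parse one line of an assumed 'ip<TAB>name<TAB>path<TAB>size' dump."""
    ip, name, path, size = line.rstrip("\n").split("\t")
    return Entry(ip, name, path, int(size))
```

A flat dump with one entry per line like this is the kind of file a line-oriented tool such as egrep can scan.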
I have implemented the search engine with PHP/MySQL. The search query is quite flexible (results can be sorted in any order, etc.), but unfortunately even a simple search takes about 8 seconds.
To make the search faster, I am trying the egrep command instead, with some success: a query now takes only about 1 second. BUT the match is accent-sensitive, and since I live in France, many files on the network have accents in their names (for example 'J'ai demandé à la lune'). I was wondering how Google and other web search engines manage to treat 'e' and 'é' as identical letters.
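To show the behaviour I am after, here is a minimal Python sketch of one possible way to fold accents: decompose each character with Unicode normalization and drop the combining marks, on both the query and the file name. The helper names `fold_accents` and `matches` are made up for this example, and I am not claiming this is what Google actually does.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text case-folded, with combining accents removed."""
    # NFD splits 'é' into 'e' followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()


def matches(query: str, filename: str) -> bool:
    """Accent- and case-insensitive substring match (hypothetical helper)."""
    return fold_accents(query) in fold_accents(filename)
```

With this, `matches("demande", "J'ai demandé à la lune")` is true. The same folding could be applied once to every file name at scan time, so that the search itself only compares already-folded strings.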
Please feel free to propose other database designs that would be more appropriate for my needs. Anyway, thank you very much for your help!