Unanswered: NLP Database Optimization (set up and installation)
Greetings all, this is my first post.
Although I'm on MySQL, this is not specifically a MySQL question, but I didn't know where else to post this. Feel free to move it if it belongs elsewhere...
I've got a couple of years' college programming and some database knowledge, but I'm not super knowledgeable and need some advice before embarking on the wrong trail. I'm going to include 3 problems here because once my db structure is explained, it'd be better to kill 3 birds with 1 stone... I'm working on an NLP project until end of June 2012. It's in its initial stages, so I could even change from MySQL to something else, but it has to be free to use.
Problem 1 (perhaps not so difficult to resolve)
I have two main tables, the "word" table (id INT, word_name VARCHAR(128), weight INT) and the "semantic_relation" table (id INT, wordID1 INT, wordID2 INT, weight INT, type INT). Between two words there can be several different types of semantic relations (there is also a "relation_type" table). There are sometimes redundant entries in the "word" table due to misspellings, variants, and typographical and encoding issues (tyre, id=187343; tire, id=90094; etc.).
I want to merge these misspellings (choose a single version), but I also have to update the relation table! Before deleting the unwanted word from the "word" table, I want to first merge the relations using the variant. That is, if I keep "tire" and get rid of "tyre", I want to merge the relations that use "tyre" in the "semantic_relation" table, i.e. create new ones if they do not exist for "tire", but add their weights to the equivalent "tire" relations if they already exist.
I can do this in php, no problem; however, there are over a million entries in my "semantic_relation" table and over 250,000 in my "word" table. What's more, whenever I delete a word or relation, I want it to go into a bin (word_bin, relation_bin) along with metadata recording the reason for the move, including the possibility of a simple "undo". Because I want to include this metadata, I don't think there's much I can do to improve on doing it all via php. Triggers would not be possible, but how about stored procedures? (I'm using bash and sed to correct incoherences.)
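In pseudo-SQL, the merge I have in mind would look something like this. It's only a sketch to test the logic: I'm using an in-memory SQLite database for illustration (the statements should translate to MySQL), the word_bin columns "reason" and "merged_into" are just names I'm making up for the undo metadata, and only the outgoing (wordID1) side is shown; the ingoing (wordID2) side would be symmetric.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, word_name TEXT, weight INTEGER);
CREATE TABLE semantic_relation (id INTEGER PRIMARY KEY, wordID1 INTEGER,
    wordID2 INTEGER, weight INTEGER, type INTEGER);
-- hypothetical bin table: the binned word plus undo metadata
CREATE TABLE word_bin (id INTEGER, word_name TEXT, weight INTEGER,
    reason TEXT, merged_into INTEGER);
""")
cur.execute("INSERT INTO word VALUES (1, 'tire', 0), (2, 'tyre', 0)")
cur.executemany("INSERT INTO semantic_relation VALUES (?, ?, ?, ?, ?)",
    [(10, 1, 3, 5, 1),    # tire -> 3, weight 5
     (11, 2, 3, 2, 1),    # tyre -> 3, weight 2: duplicate of the above
     (12, 2, 4, 7, 1)])   # tyre -> 4, weight 7: no tire counterpart

def merge_words(keep_id, drop_id, reason):
    """Fold drop_id's outgoing relations into keep_id, then bin the word."""
    # 1. where the kept word already has the same relation, add the weights
    cur.execute("""
        UPDATE semantic_relation SET weight = weight + (
            SELECT d.weight FROM semantic_relation d
            WHERE d.wordID1 = ? AND d.wordID2 = semantic_relation.wordID2
              AND d.type = semantic_relation.type)
        WHERE wordID1 = ? AND EXISTS (
            SELECT 1 FROM semantic_relation d
            WHERE d.wordID1 = ? AND d.wordID2 = semantic_relation.wordID2
              AND d.type = semantic_relation.type)""",
        (drop_id, keep_id, drop_id))
    # 2. repoint relations that have no counterpart on the kept word
    cur.execute("""
        UPDATE semantic_relation SET wordID1 = ?
        WHERE wordID1 = ? AND NOT EXISTS (
            SELECT 1 FROM semantic_relation k
            WHERE k.wordID1 = ? AND k.wordID2 = semantic_relation.wordID2
              AND k.type = semantic_relation.type)""",
        (keep_id, drop_id, keep_id))
    # 3. the remaining rows are merged duplicates: delete them, bin the word
    cur.execute("DELETE FROM semantic_relation WHERE wordID1 = ?", (drop_id,))
    cur.execute("""INSERT INTO word_bin
        SELECT id, word_name, weight, ?, ? FROM word WHERE id = ?""",
        (reason, keep_id, drop_id))
    cur.execute("DELETE FROM word WHERE id = ?", (drop_id,))

merge_words(1, 2, "spelling variant of 'tire'")
```

Each step is one set-based statement, so the engine does the work instead of a php loop; the question remains whether something like this belongs in a stored procedure.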
Problem 2 (more difficult)
Now that you're familiar with the db (see Problem 1), I'd like to count the occurrences of each word and store the result in the "word" table (as an "occurrences" attribute) for quick and easy access. More importantly, I'd also like to count, for each word, the number of ingoing and outgoing relations in the "semantic_relation" table (attributes "ingoing_rel" and "outgoing_rel"). I can obviously do this via a COUNT(*), but I'd like to store the result as an attribute in the "word" table for quick and easy access. I've written a php script but have calculated that it would take 2 weeks to execute on my desktop (reminder: over 1 million entries in the relation table). Is there a solution to this? I thought of doing it via a text file, a script, a prefix tree and Java. I also thought of using a separate table for each relation type, but then if I need to do a cross-type search, I'd have to join the tables.
I would like to include other metadata (part-of-speech information, semantic traits) in my tables, including whether the word is a single word or a multiword expression (composed of several words). I can obviously determine this via REGEXP, etc., but again I would like to store this information as an attribute for quick access (I'll be executing scripts based on this and other properties). Any ideas on how to do this fast (when loading and installing the data into the database)?
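For the multiword flag, for example, what I'd like to avoid is a per-row REGEXP from php. Something like one set-based UPDATE at load time is what I have in mind (sketched here with in-memory SQLite; "is_mwe" is just my working name for the column):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE word (id INTEGER PRIMARY KEY,
    word_name TEXT, is_mwe INTEGER)""")
cur.executemany("INSERT INTO word (word_name) VALUES (?)",
                [("tire",), ("spare tire",), ("traffic light",)])
# flag multiword expressions in one pass: any entry containing a space
# (the LIKE comparison yields 0/1, which lands directly in is_mwe)
cur.execute("UPDATE word SET is_mwe = (word_name LIKE '% %')")
```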
Problem 3 (difficult)
(see Problem 1 for db structure) Last but not least, the data entered in the "word" table is not always accurate; it is user-entered data. I'd like to compare each entry with a machine-tractable dictionary and add yet another attribute to the "word" table, "is_in_dict" BOOLEAN. What's the most efficient way of doing this?
If anybody is working on similar projects, I'd love to know about what you're doing, you can also send me an email.
Thanks in advance, and sorry for this rather long post, but I really thought it was the best way to go.
Awesome, I haven't even been lurking much lately and here's an NLP question. My company does a fair amount of this, but we've got a lot of custom built tools that are a little quirky.
For small datasets like this, I'd rather use something like GATE as it gives you a lot of modules and such to play with that are already developed to target NLP work. I'd be surprised if your department doesn't have a GATE guru somewhere.
MySQL is not a bad choice if you're going the DBMS route. I'm not sure how newer versions are configured out of the box, but it doesn't sound as though you're writing an app to serve many users, so you won't gain much from referential integrity checks, and you can speed things up by disabling them.
I want to merge these misspellings (choose a single version), but I also have to update the relation table! Before deleting the unwanted word from the "word" table, I want to first merge the relations using the variant.
If I get what you're saying, you write a function that translates words into some sort of phonetic representation, so tire and tyre both map to tIr, or whatever. (IANALinguist, btw...) You've got:
word_graphic | word_phonetic | weight
tire         | tIr           | 2
tyre         | tIr           | 2.5
Don't "merge" anything or delete it or otherwise fix what isn't broken: you've got your answer right there. Just pull it out with a SELECT statement. More importantly, this way you maintain provenance, the ability to show how you arrived at your results from the raw data.
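A minimal sketch of that, using in-memory SQLite and a deliberately dumb stand-in for the phonetic function (a real one would be Soundex, Metaphone, or whatever your linguists prefer):

```python
import sqlite3

def phonetic_key(w):
    # stand-in for a real phonetic algorithm: collapse the y/i spelling
    # variation so that "tire" and "tyre" share a key
    return w.lower().replace("y", "i")

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE word (word_graphic TEXT, word_phonetic TEXT, weight REAL)")
for w, wt in [("tire", 2.0), ("tyre", 2.5)]:
    cur.execute("INSERT INTO word VALUES (?, ?, ?)", (w, phonetic_key(w), wt))

# nothing merged or deleted: the variants stay, a SELECT pulls the answer
row = cur.execute("""
    SELECT word_phonetic, SUM(weight) FROM word GROUP BY word_phonetic
""").fetchone()
# row == ('tire', 4.5)
```

The raw rows are untouched, so you can always show how the combined weight was derived.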
I'd like to count the occurrences of each word and put them in the "word" table (as attribute "occurrences") for quick and easy access, but more important
create view wordcounts as select word, count(*) as wordcount from wordtable group by word
Quick and easy. If it really takes a long time to run, you could change 'view' to 'table,' but you'll have to rerun if the original data changes. It shouldn't be slow, though, if you have an index on word.
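To make that concrete, here are both forms side by side in a throwaway SQLite session (for illustration; same idea in MySQL). Note the materialized version is the same SELECT with VIEW swapped for TABLE, plus an index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE wordtable (word TEXT)")
cur.executemany("INSERT INTO wordtable VALUES (?)",
                [("red",), ("red",), ("blue",)])

# start with a view: always current, costs a scan per query
cur.execute("""CREATE VIEW wordcounts AS
    SELECT word, COUNT(*) AS wordcount FROM wordtable GROUP BY word""")

# if it proves too slow, materialize: same SELECT, VIEW -> TABLE, plus index
# (but now it must be rerun whenever the original data changes)
cur.execute("""CREATE TABLE wordcounts_t AS
    SELECT word, COUNT(*) AS wordcount FROM wordtable GROUP BY word""")
cur.execute("CREATE INDEX idx_wc ON wordcounts_t (word)")
```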
I'd like to compare each entry with a machine-tractable dictionary to add yet another attribute to the "word" table "is_in_dict" BOOLEAN.
There's really no point in using booleans in a relational database. Just load the machine dictionary into a separate table, and you can use joins to answer your question.
For instance, if you have UserEntered and MachineDict, and you want to know if a word in UserEntered is in MachineDict, you can simply:
select ue.word as word,
       case when md.word is null then 0 else 1 end as is_in_dict
from UserEntered ue
left join MachineDict md
    on ue.word = md.word
The hard part is learning to think in terms of sets, but you'll find it's very powerful if you can.
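Here's the same pattern as a runnable snippet (in-memory SQLite for illustration; the tables are the hypothetical UserEntered and MachineDict from above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE UserEntered (word TEXT)")
cur.execute("CREATE TABLE MachineDict (word TEXT)")
cur.executemany("INSERT INTO UserEntered VALUES (?)", [("tire",), ("colorr",)])
cur.execute("INSERT INTO MachineDict VALUES ('tire')")

# left join keeps every user word; md.word is NULL when there's no match
rows = cur.execute("""
    SELECT ue.word,
           CASE WHEN md.word IS NULL THEN 0 ELSE 1 END AS is_in_dict
    FROM UserEntered ue
    LEFT JOIN MachineDict md ON ue.word = md.word
""").fetchall()
# dict(rows) == {'tire': 1, 'colorr': 0}
```

One set-based query answers the question for all 250,000 words at once.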
I would like to include other metadata (part of speech information, semantic traits) in my tables, including if the word is a single word or a multiword expression (composed of many words).
Any ideas on how to do this fast (when loading and installing the data into the database)?
My experience is that loading is possibly the most painful part. If that's the case, then the fastest way to add that information is not to add it at all. That is, you use a view to compute those fields, and only when you actually need them.
Now, the downside is that if you're trying to search for those fields, it becomes painfully slow. But, remember: Anytime I write CREATE VIEW ... AS SELECT ..., I can go right back again and just change VIEW to TABLE, add the appropriate indexes, and blam.
So I recommend you create a view with the necessary regexes to extract the metadata, and replace it with a table only if you need to index it.
PostgreSQL can build indexes on expressions, which might make it worth looking into. That way, the DBMS caches the results of the regex or function without you having to do the maintenance. Any time you can get the computer to do all the tedious BS, it's a good thing.
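Recent versions of SQLite can do this too, as it happens. A quick sketch of an expression index (here on lower(word_name), but the same idea works for other computed properties):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE word (word_name TEXT)")
cur.executemany("INSERT INTO word VALUES (?)", [("Tire",), ("spare tire",)])
# index on an expression: the engine maintains the computed value, so
# queries filtering on the same expression can use the index
cur.execute("CREATE INDEX idx_lower ON word (lower(word_name))")
rows = cur.execute(
    "SELECT word_name FROM word WHERE lower(word_name) = 'tire'").fetchall()
# rows == [('Tire',)]
```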
I'm going to check out GATE (I've already had a look on their site, but will really look into it this coming week). Thanks for the suggestions.
I now realize I didn't explain clearly the objectives of the project.
I'm working with a large semantic, lexical network with two main tables, the word and the semantic_relation table. There are some 250,000 words and well over 1,000,000 relations. There are variants, but errors too, because users enter data via the keyboard.
First of all, I do need to make it so that equivalent spellings (errors, variants) are merged into one (I'll explain this later). And I do need to make the parameters (count(*), is_mwe, etc.) attributes because, once all of this is set up, I'm going to be writing algorithms that try to get the system to automatically extract information from the semantic network based on statistical data as well as the linguistic properties of the different types of relations.
The algorithms are already going to take quite a bit of time to execute; that's why I want to include metadata and statistical information directly in tables as attributes (which, I know, one generally doesn't do in db's). It doesn't bother me if it takes 14 hours to install it all, if it means my algorithms can be executed relatively efficiently. So all of this is just to get things set up for much more difficult tasks, and not simply to consult data. Sorry, I hadn't explained this very well in my first post.
For the count thing, let me give a clear example.
Let's say "color" has id 29834 (word table). We might have "color" in the semantic_relation table as "red--is a-->color" (start=38744, end=29834), "blue--is a-->color" (start=87432, end=29834), etc. So in the relation table you have 29834 as an "end" node. You might have it as a "start" node as well. And there are many different types of relations ("is a", "antonym", "synonym", "trait", etc.). I need to be able to quickly know how many start and end entries there are for "color" (of all relation types, not each relation type individually), so, in pseudo code that would mean:
FOR all words w (and there are 250,000 of them):
-go into the relation table (1,000,000 entries), compute this, and store it in the start_quantity attribute of the word table: SELECT COUNT(*) FROM semantic_relation WHERE startID = w;
FOR all words w: same thing, but for the endID
(and then I woke up! I don't think this is possible, I don't think there's a quick way to do this. I don't know!) ... I need to know this because if a word doesn't have many relations, I'm not going to consider its data as reliable as one that has many, many relations.
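What I'd want, in other words, is something like the following: one aggregate UPDATE per direction instead of 250,000 queries from php (sketched with in-memory SQLite; start_quantity/end_quantity are my working names, and whether this is fast enough on a million rows with indexes on startID/endID is exactly my question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, word_name TEXT,
    start_quantity INTEGER DEFAULT 0, end_quantity INTEGER DEFAULT 0);
CREATE TABLE semantic_relation (startID INTEGER, endID INTEGER,
    weight INTEGER, type INTEGER);
CREATE INDEX idx_start ON semantic_relation (startID);
CREATE INDEX idx_end ON semantic_relation (endID);
""")
cur.execute("""INSERT INTO word (id, word_name)
    VALUES (29834, 'color'), (38744, 'red'), (87432, 'blue')""")
cur.executemany("INSERT INTO semantic_relation VALUES (?, ?, ?, ?)",
    [(38744, 29834, 83, 1),   # red  --is a--> color
     (87432, 29834, 40, 1),   # blue --is a--> color
     (29834, 38744, 5, 2)])   # color as a start node, any relation type

# one correlated-count UPDATE per direction: the loop over 250,000 words
# runs inside the engine against the indexes, not as php round-trips
cur.execute("""UPDATE word SET end_quantity = (
    SELECT COUNT(*) FROM semantic_relation r WHERE r.endID = word.id)""")
cur.execute("""UPDATE word SET start_quantity = (
    SELECT COUNT(*) FROM semantic_relation r WHERE r.startID = word.id)""")
```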
Now, I do need to correct errors (misspellings), because I'm interested in the weight of all versions of "color" as linked to another word. If we have red--isa-->color weight=83, red--isa-->colour weight=181 and red--isa-->colorr weight=12, to me that means red--isa-->[any spelling of] color = 276. If it's simply a spelling mistake (users enter words, that's how they get into the network), I do want to correct it; if it's a variant, your idea of somehow conserving both versions but having a simple way to merge the weights is a good one, with some pointer to the one chosen as the reference version for the sake of our db and app. I'm interested in nothing but semantics, not spelling, only semantics.
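To illustrate the pointer idea with the numbers above (an SQLite sketch; "canonical_id" is a hypothetical column pointing at the chosen reference spelling):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, word_name TEXT,
    canonical_id INTEGER);   -- hypothetical pointer to the reference spelling
CREATE TABLE semantic_relation (startID INTEGER, endID INTEGER,
    weight INTEGER, type INTEGER);
""")
# "color" is the reference version; "colour" and "colorr" point at it
cur.execute("""INSERT INTO word VALUES
    (1, 'color', 1), (2, 'colour', 1), (3, 'colorr', 1), (9, 'red', 9)""")
cur.executemany("INSERT INTO semantic_relation VALUES (?, ?, ?, ?)",
    [(9, 1, 83, 1), (9, 2, 181, 1), (9, 3, 12, 1)])

# weight of red --isa--> [any spelling of] color
total = cur.execute("""
    SELECT SUM(r.weight)
    FROM semantic_relation r JOIN word w ON r.endID = w.id
    WHERE r.startID = 9 AND w.canonical_id = 1
""").fetchone()[0]
# total == 276
```

All the raw spellings survive, and true mistakes like "colorr" could later be merged outright while variants keep their pointer.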
The same goes for other traits (one-word or multiword expression and many more). I need to access this data very fast because my algorithms will need to use them, all of them, and, as I said earlier, the algorithms are already going to take some time to execute.
So my problem is choosing the right method. I don't want to start off in the wrong direction; the project is still in its initial stages, and I don't have all that much experience, so I'm not sure how to get started. I don't know if my problems can be solved or from what angle to tackle them.
Many thanks sco08y, I don't know if there's a solution to all of this... I will check out GATE though, that's for sure!