We have about 2 GB of research data per simulation, structured as pairs of (long int vox, long int lor). My task is to write a utility that finds identical pairs and counts them. The end result should be a sorted table of unique entries (long int vox, long int lor, short int weight).
Because of the huge amount of data, I was thinking of using a database such as DB2. So I read the data in with INSERT commands (the original data is binary), do the aggregation (a view with GROUP BY), and finally read the results back out.
But it still takes about six days to process one 2 GB file.
Is there a better way to get the desired table? What could I optimize?
I'm quite new to this area, so I'd appreciate any help.
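For scale, the core counting step is simple enough to sketch outside a database. Here is a minimal Python sketch; the record layout (two 8-byte little-endian signed integers per pair) and the weight cap at the short int maximum are assumptions, not something the question specifies:

```python
import struct
from collections import Counter

# Assumed layout: two 8-byte little-endian signed ints per record.
RECORD = struct.Struct("<qq")


def count_pairs(path):
    """Count identical (vox, lor) pairs in a binary file of fixed-size records."""
    counts = Counter()
    with open(path, "rb") as f:
        # Read in blocks that are an exact multiple of the record size.
        while chunk := f.read(RECORD.size * 65536):
            counts.update(RECORD.iter_unpack(chunk))
    return counts


def write_sorted(counts, out_path):
    """Write the sorted (vox, lor, weight) table; weight is a short int here."""
    out = struct.Struct("<qqh")
    with open(out_path, "wb") as f:
        for (vox, lor), n in sorted(counts.items()):
            # Cap at the short int maximum (assumed from the stated output type).
            f.write(out.pack(vox, lor, min(n, 32767)))
```

If the number of distinct pairs fits in RAM, a hash-based count like this processes a 2 GB file in one pass; if not, the same idea works with an external sort followed by a run-length count.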
I am not exactly sure whether you are trying to create a separate table of unique pairs or just run a query that counts them. Depending on what you want, you might consider COUNT(DISTINCT ...) instead of GROUP BY. I'm also not sure why you need the view.
For processing like this, make sure you tune the DB2 configuration parameters, particularly the buffer pools and the temporary tablespace. I would create a temporary tablespace at least as large as your largest table. You also need to make sure your sort heap size is sufficient. The Performance Wizard can help you set some of these parameters.
It would also help to get an experienced DBA involved, because it is difficult to convey everything you need for an efficient system via this forum if you are a novice with DB2.
The result should be a separate table. I'm using a view because it stays current after new simulation data is added to the underlying table. Its definition is:
"SELECT vox, lor, COUNT(*) FROM table1 GROUP BY vox, lor".
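For what it's worth, the aggregation itself behaves the same in any SQL engine. Here is a minimal, self-contained illustration using Python's sqlite3 as a stand-in for DB2 (the table and column names are taken from the query above; the sample rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (vox INTEGER, lor INTEGER)")
con.executemany("INSERT INTO table1 VALUES (?, ?)",
                [(1, 2), (1, 2), (1, 2), (4, 5)])

# Materialize the aggregation once into a table, instead of re-running
# the view's GROUP BY on every read.
con.execute("""CREATE TABLE result AS
               SELECT vox, lor, COUNT(*) AS weight
               FROM table1
               GROUP BY vox, lor""")

for row in con.execute("SELECT vox, lor, weight FROM result ORDER BY vox, lor"):
    print(row)  # (1, 2, 3) then (4, 5, 1)
```

One thing to keep in mind: a plain view re-computes the GROUP BY every time it is read. If the result is needed repeatedly, materializing it (in DB2, via INSERT INTO ... SELECT or a materialized query table) trades freshness for speed.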