We will monthly receive a CSV file with about 3 million records of finacial transactions generated by a third party using a bookkeeping program we can not control.
This is how the records in the file will look like:
Code:
invoice_nr amount booking/annulation ....
12 1000 B ....
13 500 B ....
13 500 A ....
14 2000 B ....
14 2000 A ....
14 2500 B ....
14 2500 A ....
14 3000 B ....
Explanation:
B: means the amount was booked
A: means the amount was annulated, cancelled, all the data of the original booking is copied, only B becomes A
- the amount due of invoice nr 12 is $1000.
- invoice nr 13 is cancelled totally, it is useless to import it into the database.
- the amount due of invoice nr 14 is $3000, once it was $2000, but that was an error and it was cancelled, then it was changed to $2500, but that was cancelled too, the last thing we know is that it is now $3000.
This should be imported into our database:
Code:
invoice_nr amount ....
12 1000 ....
14 3000 ....
The records are stored ordered by the invoice_nr and the time the transaction was performed by the bookkeeping program.
The logic is: import only the last record of a group of records (grouped by invoice_nr), if its booking/annulation flag is set to 'B'.
Can we put some logic into the LOAD command to make it perform the filtering for us ? From the test data we received, a relatively large portion consists of booking/annulation/booking and booking/annulation.