Retrieving high volumes of data in pages/partitions?
I have a table on a central server with 30 million records.
I'd like a few dozen worker systems to be able to read 20K chunks of data at a time and process them. Once a worker has finished one 20K chunk, it must proceed to a separate chunk until it has been through all records, at which point it starts over. All workers should eventually process all data, so there will clearly be overlap where separate workers are working on the same data in parallel. Workers must be able to pick up where they left off after an application restart or a network error (which requires a connection bounce). This process is entirely read-only.
Obviously performance is a large concern with this volume of data.
Here are the two avenues that I've been considering:
ADO recordset "paging". I'm not sure if this does what I want, but here is an idealized version of how I'd like it to work:
Set the page size to 20K
Open the recordset. This action is trivial. No record data is transferred and the action is instantaneous.
Jump to page X (depending on the progress position of the worker)
Suck in the entire 20K page. The entire block of data is transferred in a bulk-friendly fashion from the server.
Close the recordset
Process the 20K chunk
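To make the idealized steps above concrete, here is a minimal sketch of what "jump to page X and suck in the whole page" amounts to in SQL terms. This is not ADO code; it uses Python and an in-memory SQLite database as a stand-in, and the `records` table with an `id`/`payload` column pair is hypothetical. Page numbering starts at 1, mirroring ADO's `AbsolutePage` property.

```python
import sqlite3

PAGE_SIZE = 20000  # desired chunk size from the question

def fetch_page(conn, page_number, page_size=PAGE_SIZE):
    """Jump directly to page X and pull the whole page in one query.

    An OFFSET-based page jump like this makes the server skip
    (page_number - 1) * page_size rows before returning anything,
    so later pages get progressively more expensive to reach --
    one reason the IDENTITY-keyed SELECT described next can scale
    better on a 30-million-row table.
    """
    offset = (page_number - 1) * page_size
    cur = conn.execute(
        "SELECT id, payload FROM records ORDER BY id ASC LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()
```

A worker would persist its current page number, call `fetch_page`, process the rows, and advance; on restart it resumes from the saved page number.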
All records have an IDENTITY key
Workers do the following SELECT (pseudo code) wrapped in a bcp command or DTS package.
SELECT TOP 20000 ... WHERE ID > @lastReadID ... ORDER BY ID ASC
The results of the bcp/DTS are exported to a flat file which the worker reads and processes.
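The `WHERE ID > @lastReadID ... ORDER BY ID` pattern above is keyset pagination, and it can be sketched end-to-end as a resumable loop. Again this is a Python/SQLite illustration rather than a bcp/DTS command: the `records` table and `payload` column are hypothetical, and `LIMIT` stands in for T-SQL's `TOP`. The checkpointed `last_read_id` is what lets a worker pick up where it left off after a restart.

```python
import sqlite3

PAGE_SIZE = 20000  # chunk size from the question

def read_chunk(conn, last_read_id, page_size=PAGE_SIZE):
    """Fetch the next chunk, keyed on the IDENTITY column.

    Equivalent to the question's
    SELECT TOP 20000 ... WHERE ID > @lastReadID ORDER BY ID ASC.
    With an index on id, the server seeks straight to the start of
    the chunk instead of sorting or skipping rows.
    """
    cur = conn.execute(
        "SELECT id, payload FROM records WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_read_id, page_size),
    )
    return cur.fetchall()

def process_all(conn, last_read_id=0):
    """Walk the whole table in chunks, resuming from last_read_id.

    Persisting last_read_id (to disk, or back to the server) between
    chunks is the restart/network-error recovery point.
    """
    while True:
        rows = read_chunk(conn, last_read_id)
        if not rows:
            break  # all records seen; caller can start over at 0
        # ... process the read-only chunk here ...
        last_read_id = rows[-1][0]  # checkpoint: highest ID processed
    return last_read_id
```

One caveat with this sketch: it assumes IDENTITY values are monotonically increasing, which holds here because the process is read-only while workers scan.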
The problems are that I suspect the ADO recordset approach doesn't work the way I'd like it to (I'm having trouble with the sample code), and that the bcp/DTS approach requires lots of resource-intensive sorting on a large table, so it will be tricky to get it to perform acceptably.
Can anyone recommend changes to one of these approaches, or a new approach?