  1. #1
    Join Date: Jan 2004 | Location: Austin | Posts: 62

    Unanswered: Process blocks of records in parallel with Java?

    Does anyone know if it's possible to divide up a table into blocks of some number of rows for parallel processing with Java?

    Essentially, if I had a table of, let's say, 10,000 rows and I wanted to process it in 500-record blocks, is there a fast way to tell Java to take records 1-500, 501-1000, 1001-1500, and so on?

    I was thinking maybe this could be done using rowid. However, I know a rowid isn't a normal number. Is there a way to determine an offset or get a range of rowid values?

    If someone has other ideas on how this can be done quickly, or without requiring extra database reads, your help would be much appreciated.

    Regards,

    Mike

  2. #2
    Join Date: Jun 2004 | Location: Liverpool, NY USA | Posts: 2,509
    The following select would return rows 500 through 1000.

    Code:
    select y.col1,y.col2,y.col3
    from
    (select x.col1,x.col2,x.col3,rownum rnum
     from
     (select col1,col2,col3
      from my_table
      order by col1) x
     where rownum <= 1000) y  -- inner ROWNUM cutoff: stop fetching past the top of the block
    where y.rnum >= 500;      -- outer filter: discard rows before the block starts
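
    For what it's worth, here is a minimal JDBC sketch of how a Java worker might run that query with bind variables instead of hard-coded bounds. The table and column names come from the example above; the class and method names are purely illustrative.

    Code:
    import java.sql.*;

    public class BlockFetch {
        // Fetch rows [lowRow, highRow] of my_table, ordered by col1.
        static void fetchBlock(Connection con, int lowRow, int highRow) throws SQLException {
            String sql =
                "select y.col1, y.col2, y.col3 from" +
                " (select x.col1, x.col2, x.col3, rownum rnum from" +
                "   (select col1, col2, col3 from my_table order by col1) x" +
                "  where rownum <= ?) y" +
                " where y.rnum >= ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, highRow);  // inner ROWNUM cutoff (top of the block)
                ps.setInt(2, lowRow);   // outer filter (start of the block)
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one row of the block here
                    }
                }
            }
        }
    }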
    Bill
    You do not need a parachute to skydive. You only need a parachute to skydive twice.

  3. #3
    Join Date: Jan 2004 | Location: Austin | Posts: 62

    Thanks for the thought

    This particular method would work for getting the records, but I'm trying to identify a way of pulling the correct set of records without having to do the additional reads. In the case you're describing, it would require reading all 1000 records and then keeping only the 500 I'm interested in. I'd like to avoid reading the larger block of records.

    In my case, I'll have to pull larger blocks, more along the lines of 10k rows, from a set of around 1M. As a result, I'd like to keep each record set as small as possible.

    Regards,

    Mike

  4. #4
    Join Date: Mar 2002 | Location: Reading, UK | Posts: 1,137
    1. What kind of processing do you do with the data? If it's simple, it is much better to do it with SQL or PL/SQL, possibly using parallel DML. This is usually an order of magnitude faster than any other means, as it greatly reduces network traffic.

    2. If it isn't simple, don't assume Bill's method is inefficient. If you have n processes asking for, say, 500 rows each, don't forget that the first query will probably load most of the data into the database cache, so subsequent processes will read from the cache rather than from disk.

    3. The other possibility is selecting the records which fall within a range of PK values, i.e. if your PK is a single numeric column, select records where pk_col is between 1 and 500, between 501 and 1000, and so on (see the sketch below). This may not guarantee that each process gets exactly the same number of rows, but that usually isn't an issue if you're simply after improving performance. This only works if your PK is backed by an index; if you don't have such a PK, you could still do it with some other column which is indexed and has evenly distributed data.
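
    A minimal sketch of that PK-range idea in Java, assuming a hypothetical MY_TABLE with an indexed numeric ID column. The connection URL, credentials, table and column names, block size, and pool size are all illustrative, not from the thread.

    Code:
    import java.sql.*;
    import java.util.concurrent.*;

    public class PkRangeWorkers {
        // Hypothetical connection settings -- adjust for your environment.
        static final String URL = "jdbc:oracle:thin:@//dbhost:1521/ORCL";

        public static void main(String[] args) throws Exception {
            long lo, hi;
            try (Connection c = DriverManager.getConnection(URL, "scott", "tiger");
                 Statement s = c.createStatement();
                 ResultSet r = s.executeQuery("select min(id), max(id) from my_table")) {
                r.next();
                lo = r.getLong(1);
                hi = r.getLong(2);
            }

            long blockSize = 10000;  // at most this many IDs per worker
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (long start = lo; start <= hi; start += blockSize) {
                final long from = start, to = Math.min(start + blockSize - 1, hi);
                pool.submit(() -> processRange(from, to));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Each worker opens its own connection and reads only its PK range;
        // the BETWEEN predicate can use the PK index instead of a full scan.
        static void processRange(long from, long to) {
            try (Connection c = DriverManager.getConnection(URL, "scott", "tiger");
                 PreparedStatement ps = c.prepareStatement(
                     "select id, col1 from my_table where id between ? and ?")) {
                ps.setLong(1, from);
                ps.setLong(2, to);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* application-specific processing */ }
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

    Because the IDs can have gaps, the blocks come out uneven, as noted above, but each worker's range predicate stays cheap.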

    Alan

  5. #5
    Join Date: Jan 2004 | Location: Austin | Posts: 62

    Just looking for other options.

    1: The application is supposed to be database agnostic; I'm not in control of the implementation, and they plan on going with Java for that reason. If it can't be done in Java, we may look at a more platform-specific solution.

    2: I'm not discounting Bill's method; we are doing something similar today. However, I'm concerned about flooding the cache, since this is going to be just one of many processes running on the server. Caching tens of thousands of records will eat up memory quickly.

    3: Unfortunately, selecting based on the PK won't work in this case because the PKs won't be sequential.

    I was mostly curious to know whether ROWID or some other kind of hashing function could be used to grab just the records I need.

    Mike

  6. #6
    Join Date: May 2004 | Location: Dominican Republic | Posts: 721
    Check this out over at AskTom.

    I would definitely rely on Oracle's own parallelism features (for example, marking the objects in question as PARALLEL and letting Oracle do its job) unless that feature isn't available to you, in which case I would think about DIY parallelism.
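
    For context, the DIY approach Tom Kyte describes there answers the original rowid question directly: you carve the table into rowid ranges along extent boundaries, so each worker scans a disjoint set of blocks with no ordering step and no duplicate reads. Below is a rough sketch of the range-generating query driven from Java. The owner and table names are illustrative, you need select access on DBA_EXTENTS and DBA_OBJECTS (or adapt it to the USER_ views), and 32767 simply stands in for a row number higher than any real slot in a block.

    Code:
    import java.sql.*;

    public class RowidSplit {
        // One (lo, hi) rowid pair per extent of the table.
        static final String RANGE_SQL =
            "select dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno," +
            "                               e.block_id, 0) lo_rid," +
            "       dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno," +
            "                               e.block_id + e.blocks - 1, 32767) hi_rid" +
            "  from dba_extents e, dba_objects o" +
            " where e.owner = 'SCOTT' and e.segment_name = 'MY_TABLE'" +
            "   and o.owner = e.owner and o.object_name = e.segment_name" +
            "   and o.object_type = 'TABLE'";

        // Print the rowid ranges; hand each pair to a worker running
        //   select ... from my_table
        //   where rowid between chartorowid(?) and chartorowid(?)
        // using the same executor pattern as the PK-range sketch above.
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
                 Statement s = c.createStatement();
                 ResultSet r = s.executeQuery(RANGE_SQL)) {
                while (r.next()) {
                    System.out.println(r.getString(1) + " .. " + r.getString(2));
                }
            }
        }
    }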

  7. #7
    Join Date: Mar 2002 | Location: Reading, UK | Posts: 1,137
    1. Database-agnostic solutions usually mean the worst performance across all databases. Where performance is critical (which I assume it is here, otherwise you wouldn't bother with the parallelism), I would suggest calling a stored procedure which you then implement for each database vendor (see the sketch after this list). After all, if you're doing data manipulation, the best place for it is in the database. Also, if a customer does have a performance issue, they can modify the procedure to suit their environment without having to upgrade the application.

    2/3. If you use a function, then Oracle won't use any indexes (unless you have created a special function-based index), so it will likely do a full table scan, which is what you objected to originally. It will also be much slower than a database-specific solution.
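
    A minimal sketch of the per-vendor stored procedure idea from point 1. PROCESS_BLOCK is hypothetical: the Java application only knows its name and signature, and each database vendor gets its own implementation (PL/SQL on Oracle, and so on), so the data manipulation stays inside the database.

    Code:
    import java.sql.*;

    public class BlockProcessor {
        // Process one block of rows entirely inside the database.
        static void processBlock(Connection con, int lo, int hi) throws SQLException {
            try (CallableStatement cs = con.prepareCall("{call process_block(?, ?)}")) {
                cs.setInt(1, lo);
                cs.setInt(2, hi);
                cs.execute();  // only the bounds cross the network, not the rows
            }
        }
    }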

    Alan

  8. #8
    Join Date: Jun 2003 | Location: West Palm Beach, FL | Posts: 2,713

    In Oracle you can use the parallel option:

    ALTER SESSION ENABLE PARALLEL {DML|QUERY};

    -- or -- use the "parallel" hint in the query:

    SELECT /*+ PARALLEL(<table_name>,<degree>) */ ... FROM ...;
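
    If the session is driven from Java anyway, the same statements can be issued over JDBC. A small sketch; the table name and the degree of 4 are illustrative:

    Code:
    import java.sql.*;

    public class ParallelScan {
        static void scan(Connection con) throws SQLException {
            try (Statement s = con.createStatement()) {
                // Only needed for parallel insert/update/delete;
                // parallel query works without it.
                s.execute("ALTER SESSION ENABLE PARALLEL DML");
                try (ResultSet rs = s.executeQuery(
                         "SELECT /*+ PARALLEL(t, 4) */ col1, col2 FROM my_table t")) {
                    while (rs.next()) { /* Oracle parallelizes the scan itself */ }
                }
            }
        }
    }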




    The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
