  1. #1
    Join Date
    Jan 2003
    Location
    Sydney, Australia
    Posts
    12

    Unanswered: Solution Design for high volume of data

    Hi,

    I have been asked to design a solution for a client of mine who requires the daily analysis and reconciliation of the differences between two extremely large text files.

    The files are not in an identical format but are both in some form of delimited format (one is CSV, the other is a little more complex). For the sake of this question, let's assume that I can effectively import each file into an MS SQL table.

    Each file will contain in excess of 100,000 rows, with new data arriving each day.

    Whilst I know that MS SQL easily has the capacity to store the data, is there a recommended way to tackle the potential problems? (I imagine that performance is important... they will be running the report every day.)

    Or is building the solution as simple as importing the data into 2 tables, querying the differences, and outputting the results as a report using Crystal?
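
    For the import step itself, I am assuming something like a BULK INSERT would do the job for the CSV side. A minimal sketch (the table name, file path, and terminator settings here are made up and would depend on the actual file):

        -- Load the daily CSV extract into a staging table
        BULK INSERT OurTransactions
        FROM 'C:\feeds\our_transactions.csv'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR = '\n',
            FIRSTROW = 2  -- skip the header row, if the file has one
        );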

    Any suggestions appreciated.

    Thanks

    Rael

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Without understanding what constitutes a difference, or what you need to report on, the best I can answer is "Yes".

    -PatP

  3. #3
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Without understanding what constitutes a difference, or what you need to report on, the best I can answer is "No".

    So there!
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    It's good to get opinions from all sides in a discussion.

    Maybe Rael can provide us with some details so we can build on these.

    -PatP

  5. #5
    Join Date
    Jan 2003
    Location
    Sydney, Australia
    Posts
    12

    Additional Info

    OK... I didn't provide extra information because I figured that the reports and queries themselves were superfluous to the problem, but here is some more information.

    Basically, the two text files contain bank account transactions from 2 different points of view, i.e. the first file is from our point of view and the second file is from the bank's point of view.

    The data in each file is as follows:

    TransactionDate, TransactionID (PKey), Type, Description, Amount

    The reports will then output:
    - List of transactions which appear on our records but not on the bank's
    - List of transactions which appear on the bank's records but not ours
    - List of transactions which appear on both but with a different amount

    As mentioned above, there are often in excess of 100,000 transactions per day.
    We would want to be able to report on any historical date range.
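
    In rough terms, I imagine the three reports boil down to queries like these (a sketch only; OurTransactions and BankTransactions are hypothetical names for the tables the two files get imported into):

        -- 1. On our records but not the bank's
        SELECT o.*
        FROM OurTransactions o
        WHERE NOT EXISTS (SELECT 1 FROM BankTransactions b
                          WHERE b.TransactionID = o.TransactionID);

        -- 2. On the bank's records but not ours
        SELECT b.*
        FROM BankTransactions b
        WHERE NOT EXISTS (SELECT 1 FROM OurTransactions o
                          WHERE o.TransactionID = b.TransactionID);

        -- 3. On both, but with a different amount
        SELECT o.TransactionID, o.Amount AS OurAmount, b.Amount AS BankAmount
        FROM OurTransactions o
        JOIN BankTransactions b ON b.TransactionID = o.TransactionID
        WHERE o.Amount <> b.Amount;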

    Does this eliminate one of the Yes/No answers?

  6. #6
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6
    I will throw in a maybe.

    If you are going to be mainly reporting on one day's worth of data, I might keep one set of tables for the current day and a set of archive tables that stores all of the previous days.

    If you are going to be doing a lot of historical analysis and you only have 5 fields, one set of 2 tables is fine.

    Since it sounds like you are going to be doing mostly reporting on these tables, you can index the data thoroughly.
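
    For the first option, the end-of-day rollover could be as simple as this (a rough sketch; the table names are made up, and it assumes the current and archive tables have identical columns):

        -- End of day: move today's rows into the archive, then empty the current table
        BEGIN TRANSACTION;

        INSERT INTO OurTransactions_Archive
        SELECT * FROM OurTransactions_Current;

        TRUNCATE TABLE OurTransactions_Current;

        COMMIT;

        -- repeat for the bank-side pair of tables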
    If one brings so much courage to this world the world has to kill them or break them, so of course it kills them. The world breaks every one and afterward many are strong at the broken places. But those that will not break it kills. It kills the very good and the very gentle and the very brave impartially. If you are none of these you can be sure it will kill you too but there will be no special hurry. Ernest Hemingway, A Farewell To Arms.

  7. #7
    Join Date
    Jan 2003
    Location
    Sydney, Australia
    Posts
    12

    Additional Info

    I like that idea (i.e. archiving data). Having never worked with databases where the main complexity is the volume of data, I am not experienced at all with the sorts of issues one can be faced with. Apart from indexing and keeping table size to a minimum, I was just wondering if data-warehouse type applications use any other techniques, or even MS SQL tools, to improve performance and reliability.

  8. #8
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    If all you want to do is compare one day's bank transactions against one day of your transactions, then what you are developing is a database utility, and there is no need to archive the data. If you need to determine whether a certain bank transaction occurred on any day, present or past, then archiving is required.
    In the first case, where you are comparing 100,000 records from the bank with 100,000 of your own records each day, well, SQL Server should be able to handle that pretty quickly if your records are as short as you describe, and if the data is indexed.
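    For example, indexing along these lines (just a sketch against the column list above; adjust to your actual schema):

        -- Cluster both tables on the join key so the daily comparison
        -- becomes a pair of ordered scans
        CREATE CLUSTERED INDEX IX_Our_TranID  ON OurTransactions  (TransactionID);
        CREATE CLUSTERED INDEX IX_Bank_TranID ON BankTransactions (TransactionID);

        -- Nonclustered indexes on date support the historical range reports
        CREATE NONCLUSTERED INDEX IX_Our_Date  ON OurTransactions  (TransactionDate);
        CREATE NONCLUSTERED INDEX IX_Bank_Date ON BankTransactions (TransactionDate);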
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  9. #9
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Yeah, I'm thinking that you've got a pretty straightforward problem to solve. The biggest issue that I see is figuring out how to deal with "near-miss" cases where one COB (Close-Of-Business) and the other don't quite match, or somebody fumbles a Transaction ID, etc. Assuming that you can match Transaction ID values most of the time, you've got a walk-in-the-park kind of problem.
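
    One way to surface candidates for those near-misses (only a sketch, and only one heuristic among many): for our rows with no TransactionID match at all, look for a bank row with the same amount within a day either side:

        -- Candidate near-miss matches: no ID match, but same amount
        -- within one day of the transaction date
        SELECT o.TransactionID AS OurID, b.TransactionID AS BankID,
               o.TransactionDate AS OurDate, b.TransactionDate AS BankDate,
               o.Amount
        FROM OurTransactions o
        JOIN BankTransactions b
          ON b.Amount = o.Amount
         AND b.TransactionDate BETWEEN DATEADD(day, -1, o.TransactionDate)
                                   AND DATEADD(day,  1, o.TransactionDate)
        WHERE NOT EXISTS (SELECT 1 FROM BankTransactions b2
                          WHERE b2.TransactionID = o.TransactionID);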

    -PatP

  10. #10
    Join Date
    Jan 2003
    Location
    Sydney, Australia
    Posts
    12

    Thanks

    Thanks for all the advice guys.

    I guess that, in summary:

    - there are no real technical issues with the size of the data, provided that:
    a) it is well indexed
    b) it is archived whenever necessary
    c) the queries are not overly complex

  11. #11
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Correct on all counts. Even if the queries become complex, as long as the relationships used in those queries don't get too bad, it should still be quite straightforward. An example of how those relationships could get complex would be if you needed to start doing batch totals against groups that were not fully represented (or were unequally distributed) on both sides.
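
    For instance, a batch-total comparison along these lines (a hypothetical sketch, grouping by date and type) already needs a full outer join so that groups present on only one side still show up:

        -- Compare per-day, per-type totals across both sides;
        -- FULL OUTER JOIN keeps groups that exist on only one side
        SELECT COALESCE(o.TransactionDate, b.TransactionDate) AS TranDate,
               COALESCE(o.Type, b.Type) AS Type,
               o.Total AS OurTotal, b.Total AS BankTotal
        FROM (SELECT TransactionDate, Type, SUM(Amount) AS Total
              FROM OurTransactions GROUP BY TransactionDate, Type) o
        FULL OUTER JOIN
             (SELECT TransactionDate, Type, SUM(Amount) AS Total
              FROM BankTransactions GROUP BY TransactionDate, Type) b
          ON b.TransactionDate = o.TransactionDate AND b.Type = o.Type
        WHERE o.Total IS NULL OR b.Total IS NULL OR o.Total <> b.Total;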

    -PatP
