
Thread: "Freshest" data

  1. #1
    Join Date
    Mar 2014
    Posts
    2

    Question Unanswered: "Freshest" data

    The setup:

    A SQL Server database consisting of 252 tables, each of which contains several million records and will continue to grow at a rate of about 1 million per month. This data is captured for a particular year and month, and for a specific entity. So a (really, really) simplified table structure is:

    id (primary, auto-increment)
    code (unique per entity)
    importmonth (numeric)
    importyear (numeric)

    There are obviously many other fields but those are the ones that are used to sort or find information.

    What I need to do is automate (as much as possible) a process of taking the last data captured per entity and putting it into a completely empty database with the same structure.

    Entities may have last been captured in January 2012, so it's not as simple as just selecting the last rows by date and dumping them into a DB.

    I'm happy to do this manually via SQL or a stored procedure etc., and equally happy to have some awesome tool that does it. Maybe something that I can get to synchronise the latest data with the criteria of code, year and month?

    Thanks for any help, much appreciated

    Notes: I cannot change the structure of the database, I cannot change indexes, and I really do need to have a separate database - not my choice, but there it is.

  2. #2
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6
    Something like this maybe...


    ;WITH cte AS (
        SELECT id,
               code,
               importmonth,
               importyear,
               ROW_NUMBER() OVER (PARTITION BY code ORDER BY importyear DESC, importmonth DESC) AS row_num
        FROM YourTable
    )
    SELECT id,
           code,
           importmonth,
           importyear
    FROM cte
    WHERE row_num = 1
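
    To land that result in the separate database, the same ranking CTE can drive an INSERT...SELECT. A minimal sketch, assuming the empty target database is named FreshData, that the table keeps the same name and columns, and that id is the IDENTITY column you want to preserve (all of those names are placeholders):

    SET IDENTITY_INSERT FreshData.dbo.YourTable ON;   -- only needed if id is an IDENTITY column you want to copy as-is

    ;WITH cte AS (
        SELECT id,
               code,
               importmonth,
               importyear,
               ROW_NUMBER() OVER (PARTITION BY code ORDER BY importyear DESC, importmonth DESC) AS row_num
        FROM YourTable
    )
    INSERT INTO FreshData.dbo.YourTable (id, code, importmonth, importyear)   -- extend with the remaining columns
    SELECT id,
           code,
           importmonth,
           importyear
    FROM cte
    WHERE row_num = 1;

    SET IDENTITY_INSERT FreshData.dbo.YourTable OFF;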
    “If one brings so much courage to this world the world has to kill them or break them, so of course it kills them. The world breaks every one and afterward many are strong at the broken places. But those that will not break it kills. It kills the very good and the very gentle and the very brave impartially. If you are none of these you can be sure it will kill you too but there will be no special hurry.” Ernest Hemingway, A Farewell To Arms.

  3. #3
    Join Date
    Jan 2013
    Posts
    354
    Provided Answers: 1
    [quote] A SQL server database consisting of 252 tables. Each of which contains several million records [sic] and will continue to grow at a rate of about 1 mil per month. This data is captured for a particular year and month and is captured for a specific entity. So a (really, really) simplified table structure is: [/quote]

    I know it is a skeleton, but rows are not records, columns are not fields, auto-increment table properties cannot ever be a key, there are no generic “id” or generic “code”, and temporal values are not split. Nothing is right here. This narrative does not even have an amen! Let me guess that the GTIN is the “<generic, nothing in particular>_code”, so we can get an international standard, and that we can use the MySQL month-within-year temporal duration notation.

    CREATE TABLE Imports
    (product_gtin CHAR(15) NOT NULL PRIMARY KEY,
     import_month CHAR(10) NOT NULL
        CHECK (import_month
               LIKE '[12][0-9][0-9][0-9]-[0-1][0-9]-00')
    );
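
    For what it's worth, under that scheme February 2014 is stored as the single value '2014-02-00'. A sample row would look like this (the GTIN value is made up):

    INSERT INTO Imports (product_gtin, import_month)
    VALUES ('000000000012345', '2014-02-00');   -- made-up GTIN; '2014-02-00' satisfies the LIKE check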
    >> There are obviously many other fields [sic] but those are the ones that are used to sort or find information. What I need to do is automate (as much as possible) a process of taking the last data captured per entity and putting it into a completely empty database with the same structure. <<

    Why? This mimics how we moved punch cards and mag tape records to new physical storage. A database is supposed to model a “universe of discourse”, so you want to create an entire, separate parallel new universe.

    >> Entities may have last been captured in '2012-01-00' so it's not as simple as just selecting the last rows by date and dumping them into a DB. <<

    What date?

    >> Notes: I cannot change the structure of the database, I cannot change indexes, I really do need to have a separate database - not my choice but there it is <<
    Can you explain why you must use a physically separate database? That kind of thing is usually a legal requirement, since it makes no sense from an RDBMS viewpoint.

  4. #4
    Join Date
    Mar 2014
    Posts
    2


    [quote]
    A SQL server database consisting of 252 tables. Each of which contains several million records [sic] and will continue to grow at a rate of about 1 mil per month. This data is captured for a particular year and month and is captured for a specific entity. So a (really, really) simplified table structure is:
    There are obviously many other fields [sic] but those are the ones that are used to sort or find information. What I need to do is automate (as much as possible) a process of taking the last data captured per entity and putting it into a completely empty database with the same structure.
    Why? This mimics how we moved punch cards and mag tape records to new physical storage. A database is supposed to model a “universe of discourse”, so you want to create an entire, separate parallel new universe.
    Notes: I cannot change the structure of the database, I cannot change indexes, I really do need to have a separate database - not my choice but there it is
    Can you explain why you must use a physically separate database? That kind of thing is usually a legal requirement, since it makes no sense from an RDBMS viewpoint.
    [/quote]
    Thanks for the assist so far - OK, so here's the complete overview; perhaps it'll help you understand.

    The people in question run a data warehouse which contains all the information for every school in the country. When I say everything, I really do mean it - up to and including when someone was off sick and why, as well as the stock movement for every piece of stock ever bought. So the warehouse is fairly large.

    The underlying application at every school in the country basically dumps unfiltered, unsanitized data into an Access database with some pretty wacky structures (hence the simplified table structure). There are now 401 tables, and they all have anywhere between 10 and 100 columns. There are no lookup tables, foreign keys, etc.; everything is stored as text.

    Each month, every school (roughly 5,700 of them) sends its Access database to a central point, where it's dumped into the warehouse, adding roughly 1 million rows to each table.

    Because the warehouse is badly designed (it's just an exact duplicate of the Access database, but with fewer constraints), indexing doesn't work and queries are slow - on the order of a couple of hours in some cases. Unfortunately, this is not something I can change, or even contemplate changing, until the political whackjobs in charge of it realise that *something* is drastically wrong here.

    So that brings us to the solution required: a catalogue of "freshest data" per school. A school may have last submitted its data in March 2012, March 2014 or even October 2008; whatever its last submission is, that is considered its freshest data. The logic, for the company, is that this new database will be properly indexed and will result in much faster queries, since the dataset will be much, much smaller.

    The columns I have that I can use to identify each relevant row are:

    SchoolCode (ex. 2000000000), ImportYear (ex. 2014), ImportMonth (ex. 02)

    Obviously, I have defined useful views that give me data such as the last submission period per SchoolCode (a rough sketch of one follows the pseudocode below), but where it all falls over is that running something akin to the following pseudo-SQL is insane for ~5,700 schools, given a list of schools like this:

    SchoolCode,ImportYear,ImportMonth
    2000000001,2014,02
    2000000002,2013,09
    2000000003,2008,10
    2000000004,2009,12
    Code:
    Foreach (schools as {school}) {
        Foreach (tables as {table}) {
            INSERT INTO freshdata.{table}
                SELECT * FROM alldata.{table}
                WHERE SchoolCode = {school['code']}
                  AND ImportYear = {school['ImportYear']}
                  AND ImportMonth = {school['ImportMonth']}
        }
    }
    Current projections indicate it would take about 44 days to run.
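
    For reference, the "last submission period per SchoolCode" view mentioned above has roughly this shape (the view and source table names here are placeholders, not the real ones; CAST the two columns first if they are actually stored as text):

    CREATE VIEW dbo.LastSubmissionPerSchool AS
    SELECT SchoolCode,
           MAX(ImportYear * 100 + ImportMonth) AS LastPeriod   -- e.g. 201402 for February 2014
    FROM alldata.dbo.SubmissionHeader                          -- placeholder for whichever table records every submission
    GROUP BY SchoolCode;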

    So, suggestions? There are some bits of software out there that seem to allow synchronization, but they're usually only schema-based, or they don't allow constraints such as the ones I must use. I'm open to anything at this point.

  5. #5
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Provided Answers: 6
    Yes.

    1) Run. Don't walk.

    2) Barring that, look at my query.

    3) Read about set-based processing when you get a moment.
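
    As a rough illustration of the set-based idea: each of the 401 tables can be loaded with a single INSERT that joins the table to the latest period per school, instead of looping 5,700 times. A sketch, assuming databases named alldata and freshdata, and with SomeTable and SubmissionHeader as placeholder table names:

    SET IDENTITY_INSERT freshdata.dbo.SomeTable ON;   -- only if the table has an IDENTITY column you want to copy as-is

    ;WITH latest AS (
        SELECT SchoolCode, ImportYear, ImportMonth
        FROM (SELECT SchoolCode, ImportYear, ImportMonth,
                     ROW_NUMBER() OVER (PARTITION BY SchoolCode
                                        ORDER BY ImportYear DESC, ImportMonth DESC) AS rn
              FROM alldata.dbo.SubmissionHeader   -- placeholder: any table (or your existing view) holding every submission period
             ) AS x
        WHERE rn = 1
    )
    INSERT INTO freshdata.dbo.SomeTable (id, SchoolCode, ImportYear, ImportMonth)   -- extend with the remaining columns
    SELECT t.id, t.SchoolCode, t.ImportYear, t.ImportMonth
    FROM alldata.dbo.SomeTable AS t
    JOIN latest AS l
      ON  l.SchoolCode  = t.SchoolCode
      AND l.ImportYear  = t.ImportYear
      AND l.ImportMonth = t.ImportMonth;

    SET IDENTITY_INSERT freshdata.dbo.SomeTable OFF;

    Generate one such statement per table (dynamic SQL over the table list works fine) and you have 401 set-based inserts instead of roughly 2.3 million row-by-row loops.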
