  1. #1
    Join Date
    Nov 2013
    Posts
    2

    Unanswered: Schema for analytics table

    Hi

    We use Postgres for analytics (star schema).
    Every few seconds we get reports on ~500 metric types.
    The simplest schema would be:

    (timestamp, metric_type, value)
    78930890, FOO, 80.9
    78930890, ZOO, 20
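    Just to make it concrete, a rough sketch of that narrow table in Postgres (the table and column names here are only for illustration, not our real schema):

    CREATE TABLE metric_report (                       -- hypothetical name
        reported_at  timestamptz       NOT NULL,       -- report timestamp
        metric_type  text              NOT NULL,       -- e.g. 'FOO', 'ZOO'
        value        double precision  NOT NULL
    );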

    Our DBA came up with a suggestion to flatten all reports from the same 5-second window into:

    (timestamp, metric1, metric2, ..., metric500)
    78930890, 80.9, 20, ..., ...
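    A rough sketch of the flattened version (again, names are only illustrative):

    CREATE TABLE metric_report_wide (                  -- hypothetical name
        reported_at  timestamptz PRIMARY KEY,          -- one row per 5-second window
        metric1      double precision,
        metric2      double precision,
        -- ... one column per metric ...
        metric500    double precision
    );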

    Some developers push back on this, saying it adds significant complexity to development (data has to be batched so it can be written in one shot) and to maintainability (just looking at the table, or adding fields, becomes more complex).

    Is the DBA's model standard practice in such systems, or only a last resort once the original model clearly doesn't scale well enough?

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    The original layout with timestamp, metric_type, and value is preferred over the (timestamp, metric1, metric2, ..., metric500) structure if at all possible. Except in a few fringe cases with exceptionally high transmission costs, the simpler structure will be faster too.

    What makes you and/or your DBA think that the simpler structure is not scalable enough? What makes you think that a more complex schema that needs additional processing will perform at least as well as the simpler one?

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Join Date
    Nov 2013
    Posts
    2
    Thanks for your comment.

    The end goal is to draw line charts for the users, so queries will mostly select a few metrics, group them by hour/minute, and compute min/max/avg per hour (or any other time period).
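    For example, a typical chart query against the narrow layout would look roughly like this (using the illustrative names from my first post):

    SELECT date_trunc('hour', reported_at) AS hour,
           metric_type,
           min(value) AS min_value,
           max(value) AS max_value,
           avg(value) AS avg_value
    FROM   metric_report
    WHERE  metric_type IN ('FOO', 'ZOO')
      AND  reported_at >= now() - interval '7 days'
    GROUP  BY 1, 2
    ORDER  BY 1, 2;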

    The DBA's main argument is that reducing the number of rows by a factor of ~500 will allow more efficient indexes and memory use (the table will contain hundreds of millions of rows before this optimization). Also, when selecting multiple metrics, the suggested schema allows one pass over the data instead of a separate index search for each metric.
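    On the narrow layout the index would presumably be something like this (illustrative), so one metric's rows can be walked in time order without touching the other ~499 types:

    CREATE INDEX metric_report_type_time_idx
        ON metric_report (metric_type, reported_at);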

    When dealing with hundreds of millions of rows in Postgres, is there really such drastic value in reducing them by a factor of ~500 (at the expense of maintainability)?

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    If you need to build indexes over multiple metrics, then it is possible that you can gain some indexing performance by combining rows on whichever attribute becomes the primary key (in your case the timestamp), because this can reduce the number of physical I/O operations and it also allows indexes to be built over more than one metric.
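    For instance (sticking with the illustrative names used earlier in the thread), the timestamp primary key on the combined layout already gives you one index that serves every metric, and you can also index more than one metric at a time:

    CREATE INDEX metric_report_wide_m1_m2_idx
        ON metric_report_wide (metric1, metric2);   -- an index spanning more than one metric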

    Disk and memory are cheap compared to people. You can buy loads of disk and RAM for what you would spend on weeks of someone's time figuring out how to aggregate data into complex structures and then "crack" those structures back into usable form. Worse still, every new person who comes along needs to re-learn the cracking skills, so you've artificially created an ongoing drain on your personnel costs. If a bigger, faster box is needed for your database, then buy a Godzilla-grade knuckle-dragger if that means keeping the data conceptually simpler and more accessible! More hardware IS ALWAYS cheaper than any complex work-around in the long run.

    As a side note, when dealing with hundreds of millions of rows, Big Data springs to mind because it has analytic tools that would allow you to discover relationships you didn't know existed. It is a radically different way to approach data and requires a different skill set than a conventional database does, but since you are using your data to mine new information from it, I suspect that Big Data may be a better fit.

    I'm not going to second-guess your DBA. They know your data and your organization much better than anyone else does. I would still encourage them to look at both Big Data and bigger hardware.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.
