Results 1 to 13 of 13
  1. #1
    Join Date
    Oct 2007
    Posts
    8

    Unanswered: Responsibility - Query vs Reporting Tool

    I've been asked to start capturing metrics at my new job and I'm having some difficulty with a time series report and was hoping for some help. I'm quite new to metrics and reporting so please go easy on me if this is a simple question. I've tried googling several different terms and I just can't seem to find an answer.

    We want to capture the number of page views for a particular project over time so we can do some trend analysis. However, who's responsibility is the filling in of possible missing time data?

    For example...

    We would like to do a time series of the last year of a particular project.

    Then here are the made up values for that year:

    Jan - 0
    Feb - 0
    Mar - 2
    Apr - 4
    May - 8
    Jun - 16
    Jul - 4
    Aug - 0
    Sept - 0
    Oct - 18
    Nov - 0
    Dec - 12

    Now with the data structure that is below when I do my Time Series query we will only get back the months that actually have data. Then my Time Series line graph doesn't show the months that have zero data, it just kind of skips them and goes to the next month.

    So my question is the following... Is it the responsibility of the query to request all the months even if they don't exist in the database or is it the reporting tools responsibility to realize that if I don't have any data then show zero?


    Table, query, and report output follow below (PostgreSQL) (All the data was created randomly):

    Code:
    dataset_view_fact
        id integer NOT NULL PRIMARY KEY,
        program_id uuid NOT NULL,
        time_id uuid NOT NULL,
        user_id uuid NOT NULL
        // Has foreign keys constraints to the tables below.
    
    users_dimension
        id uuid NOT NULL PRIMARY KEY,
        firstname character varying(100) NOT NULL,
        lastname character varying(100) NOT NULL,
        username character varying(100) NOT NULL
    
    time_dimension
        id uuid NOT NULL PRIMARY KEY,
        date timestamp NOT NULL,
        day_of_week smallint NOT NULL,
        day_of_month smallint NOT NULL,
        day_of_year smallint NOT NULL,
        month smallint NOT NULL,
        quarter smallint NOT NULL,
        year smallint NOT NULL,
        holiday boolean NOT NULL,
        weekend boolean NOT NULL,
        hour smallint NOT NULL,
        minute smallint NOT NULL,
        second smallint NOT NULL
    
    programs_dimension
        id uuid NOT NULL PRIMARY KEY,
        name character varying(255) NOT NULL
    Query:
    Code:
    SELECT COUNT(dvf.id) AS dvf_count, pd.name, td.year, td.month, td.day_of_month, td.hour, td.minute, td.second
    FROM metrics.dataset_view_fact AS dvf
    JOIN metrics.time_dimension AS td ON dvf.time_id = td.id
    JOIN metrics.programs_dimension AS pd ON dvf.program_id = pd.id
    WHERE
        td.year=2008 AND
        td.month=3 AND
        td.day_of_month=28 AND
        td.hour=6 AND
        td.minute=10 AND
        td.second > 50 AND
        (
          pd.name = 'program0' OR
          pd.name = 'program1'
        )
    GROUP BY pd.name, td.year, td.month, td.day_of_month, td.hour, td.minute, td.second
    ORDER BY td.hour DESC, td.minute DESC, td.second DESC;
    Date:
    Code:
    32;"program0";2008;3;28;6;10;59
    42;"program1";2008;3;28;6;10;59
    34;"program0";2008;3;28;6;10;58
    33;"program0";2008;3;28;6;10;57
    41;"program0";2008;3;28;6;10;55
    45;"program1";2008;3;28;6;10;55
    34;"program1";2008;3;28;6;10;54
    33;"program0";2008;3;28;6;10;54
    37;"program0";2008;3;28;6;10;53
    29;"program1";2008;3;28;6;10;53
    31;"program0";2008;3;28;6;10;52
    38;"program1";2008;3;28;6;10;52
    39;"program0";2008;3;28;6;10;51
    33;"program1";2008;3;28;6;10;51
    I've attached a pdf of the chart that was created from the data. Notice to the right of the chart there are missing data points and should actually show zero. (Also, just noticed that there is an entire date missing at the bottom.)

    Now, is it the reporting tools responsibility to zero out the missing data points?

    Should I change my query so that each individual time slice I'm interested in is queried and therefore a 0 will be returned if an item isn't there? (Using an outer join of some sort.)

    Thank you for your time.
    Attached Files Attached Files

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    If you are pivoting this information for a chart, you can use a scattergraph format where the months are plotted on the x-axis. Then, it is unnecessary to fill in the data gaps.
    Otherwise, you can use a table of sequential values to fill in the gaps.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  3. #3
    Join Date
    Oct 2007
    Posts
    8
    I wouldn't mind going with a scatter chart/graph, but the requirement I've been given is to produce a time series.

    So, a table of sequential values...

    Are you suggesting that we consider pre-populating our time dimension with data and then re-populate as needed? If this is what you're suggesting, I think its a good idea. If not, could you please be a little more specific about what you mean?

    Thank you for your time.

  4. #4
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Add a table of sequential values to your database.
    By querying this using a formula such as dateadd(month, sequentialvalue, [startingmonth]) you can create any time series you want on the fly. No need to prepopulate.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  5. #5
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    I would go a step further and pop it in model too. Tally tables rock!
    http://www.sqljunkies.com/WebLog/ama...bersTable.aspx

    There's some scripts at SQLTeam that they really like that produce the result set on the fly.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  6. #6
    Join Date
    Oct 2007
    Posts
    8
    First off, thank you for all the replies.

    I've found examples of both "Numbers Tables" and "Tally Tables", but unfortunately I have minimal skill at SQL and I'm not really able to understand them.

    I was successful in adding a numbers table itself...

    Code:
    CREATE TABLE numbers 
    ( 
        number INTEGER PRIMARY KEY 
    );
    
    CREATE FUNCTION increment_numbers (BIGINT) RETURNS INTEGER AS '
      
      BEGIN
        
        FOR i IN 1..$1 LOOP
    
          INSERT INTO numbers values (i);
        
        END LOOP;
        
        RETURN (SELECT COUNT(*) from numbers);
        
      END;
    ' LANGUAGE 'plpgsql';
    
    SELECT increment_numbers(20000);
    How would I use the new numbers table to complete the following two use cases using the tables from my post above...?

    "Show in a time series chart all the total monthly views for the program named "Experiment #1" in the year 2007."

    and then to confuse it a little bit more...

    "Show in a time series chart all the total monthly views for the program named "Experiment #1" for the last two years."

    I really do not have an idea on how to implement a solution using a tally table.

    Any friendly advice/direction is appreciated.

  7. #7
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Quote Originally Posted by hooknc
    How would I use the new numbers table to complete the following two use cases using the tables from my post above...?

    "Show in a time series chart all the total monthly views for the program named "Experiment #1" in the year 2007."

    and then to confuse it a little bit more...

    "Show in a time series chart all the total monthly views for the program named "Experiment #1" for the last two years."
    Sounds more like a homework problem than a legitimate business requirement to me.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  8. #8
    Join Date
    Oct 2007
    Posts
    8
    Quote Originally Posted by blindman
    Sounds more like a homework problem than a legitimate business requirement to me.
    It does sound like homework thats for sure. I kind of wish it was, because then I could go talk to a professor instead of asking people on the internet.

    What it comes down to... Is I'm new to metrics gathering/reporting and I need some assistance. I spent all day yesterday researching number/tally tables and didn't find one good explanation or clear reason for using them. Just hocus pocus examples of sql.

    I need an example that is clearly explained. I'm an Object guy, not an Relational/SQL guy. Just handing me the sql won't help me with the next problems I'm going to be facing. The reason I asked the two questions I did, is because those are use cases I've already been asked to solve. The second one to me is especially tricky because of the addition of adding the change of years.

    If you don't feel comfortable helping me out, point me to some online or book resources and I'll figure it out on my own.

    Thank you for your time and reply.

  9. #9
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Dude - I posted a link for a numbers table. It also includes a couple of links to reasons\ examples of using a numbers table. Adam Machanic is one of the top SQL Server MVPs - if he says a tally table is useful you can be assured it is.

    blindman also gave you an example of using a tally table to generate a sequential set of dates.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  10. #10
    Join Date
    Oct 2007
    Posts
    8
    And clearly those examples helped me out.

    I'm not doubting the usefulness of a numbers table. That is why I researched them. I only asked my second question after not finding anything that was helpful to me.

    I've never had to do anything more with sql then write simple inner join statements. I've always used Hibernate to do our queries for us. Looking at those sql examples means nothing to me. Its just garbly gook.

    Thank you for your responses and time. I guess I have to figure this out on my own.

  11. #11
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    hooknc, the only reason I am reluctant to provide you a solution is because I am not familiar with postgres, so I would not be able to help you debug the syntax.
    Maybe you should repost your question in the postgres forum, or I could move this thread there if you would like.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  12. #12
    Join Date
    Oct 2007
    Posts
    8
    Blindman, thank you for that information. If you think this thread belongs in the postgres section please move it.

    I am now able to get a sequence of dates generated via a numbers table...

    Here are the two functions I created:

    Code:
    /*
        Concatenates a TEXT string together that will create an INTERVAL to use when working with a DATE/TIMESTAMP.
    */
    CREATE OR REPLACE FUNCTION metrics.get_date_interval (INTEGER, TEXT) RETURNS INTERVAL AS $$
    
        BEGIN
    
        return CAST( CAST($1 as TEXT) || ' ' || $2 as INTERVAL);
    
        END;
    
    $$ LANGUAGE 'plpgsql';
    and...

    Code:
    /*
        Creates a sequence of TIMESTAMPS using a specific INTERVAL
    */
    CREATE OR REPLACE FUNCTION metrics.get_timestamp_sequence (DATE, DATE, TEXT) RETURNS SETOF TIMESTAMP AS $$
    
        SELECT $1 + metrics.get_date_interval(number, $3) FROM metrics.Numbers WHERE $1 + metrics.get_date_interval(number, $3) <= $2 ORDER BY 1;
    
    $$ LANGUAGE 'sql';
    Then using the following query I'm able to get a years worth of uninterrupted sequenced months:

    Code:
    SELECT metrics.get_timestamp_sequence('20071201', '20081231', 'MONTH');
    
    "2008-01-01 00:00:00"
    "2008-02-01 00:00:00"
    "2008-03-01 00:00:00"
    "2008-04-01 00:00:00"
    "2008-05-01 00:00:00"
    "2008-06-01 00:00:00"
    "2008-07-01 00:00:00"
    "2008-08-01 00:00:00"
    "2008-09-01 00:00:00"
    "2008-10-01 00:00:00"
    "2008-11-01 00:00:00"
    "2008-12-01 00:00:00"
    The above output is of type TIMESTAMP.

    Now, I'm stuck on how to use the contents of the above query in my original query above... (reposted here for readability and removed some text)

    Code:
    SELECT COUNT(dvf.id) AS dvf_count, pd.name, td.year, td.month, FROM metrics.dataset_view_fact AS dvf
    JOIN metrics.time_dimension AS td ON dvf.time_id = td.id
    JOIN metrics.programs_dimension AS pd ON dvf.program_id = pd.id
    WHERE
        td.year=<< year from a row in the above query>> AND
        td.month=<< month from a row in the above query >> AND
        (
          pd.name = 'program0' OR
          pd.name = 'program1'
        )
    GROUP BY pd.name, td.year, td.month
    I've looked into the subquery logical operators, but I'm not exactly sure if they would help me here or not.

    Should I consider changing my time_dimension table? Or should I consider adding another function that will allow me to loop through the result set returned from the metrics.get_timestamp_sequence function and creates my where clause on the fly? Or is there a better idea?

    Any friendly guidance is appreciated.

  13. #13
    Join Date
    Oct 2007
    Posts
    8
    I've made some progress, but I've hit another road block that I'm not sure how to get by.

    Here is a majority of my build script again for reference (plus it has been updated a little bit, I also had to pre-populate my time_dimension table with all the day between 2007 and 2009.):

    Code:
    CREATE TABLE users_dimension (
        id uuid NOT NULL,
        firstname character varying(100) NOT NULL,
        lastname character varying(100) NOT NULL,
        username character varying(100) NOT NULL,
        CONSTRAINT users_dimension_pkey PRIMARY KEY (id)
    );
    
    CREATE TABLE time_dimension (
        id uuid NOT NULL,
        timestamp timestamp NOT NULL,
        day_of_week smallint NOT NULL,
        day_of_month smallint NOT NULL,
        day_of_year smallint NOT NULL,
        month smallint NOT NULL,
        quarter smallint NOT NULL,
        year smallint NOT NULL,
        holiday boolean NOT NULL,
        weekend boolean NOT NULL,
        CONSTRAINT time_dimension_pkey PRIMARY KEY (id)
    );
    
    CREATE TABLE programs_dimension (
        id uuid NOT NULL,
        name character varying(255) NOT NULL,
        CONSTRAINT program_dimension_pkey PRIMARY KEY (id)
    );
    
    CREATE SEQUENCE dataset_view_fact_seq;
    
    CREATE TABLE dataset_view_fact (
        id INTEGER NOT NULL PRIMARY KEY DEFAULT nextVal('dataset_view_fact_seq'),
        program_id uuid,
        time_id uuid,
        user_id uuid,
        CONSTRAINT fkey_dataset_view_fact_0 FOREIGN KEY (program_id) REFERENCES programs_dimension(id) MATCH FULL ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED,
        CONSTRAINT fkey_dataset_view_fact_1 FOREIGN KEY (time_id) REFERENCES time_dimension(id) MATCH FULL ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED,
        CONSTRAINT fkey_dataset_view_fact_2 FOREIGN KEY (user_id) REFERENCES users_dimension(id) MATCH FULL ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    );
    
    
    CREATE TABLE metrics.numbers 
    ( 
        number INTEGER PRIMARY KEY 
    );
    
    CREATE FUNCTION metrics.increment_numbers (BIGINT) RETURNS INTEGER AS $$
      
      BEGIN
        
        FOR i IN 1..$1 LOOP
    
          INSERT INTO metrics.numbers values (i);
        
        END LOOP;
        
        RETURN (SELECT COUNT(*) from metrics.numbers);
        
      END;
    $$ LANGUAGE 'plpgsql';
    
    SELECT metrics.increment_numbers(1000);
    
    DROP FUNCTION metrics.increment_numbers(BIGINT);
    
    /*
     *  Concatenates a TEXT string together that will create an INTERVAL to use when working with a DATE/TIMESTAMP.
     */
    CREATE OR REPLACE FUNCTION metrics.get_date_interval (INTEGER, TEXT) RETURNS INTERVAL AS $$
    
        BEGIN
    
        return CAST( CAST($1 as TEXT) || ' ' || $2 as INTERVAL);
    
        END;
    
    $$ LANGUAGE 'plpgsql';
    
    /*
     *  Creates a sequence of TIMESTAMPS using a specific INTERVAL
     */
    CREATE OR REPLACE FUNCTION metrics.get_timestamp_sequence (DATE, DATE, TEXT) RETURNS SETOF TIMESTAMP AS $$
    
        SELECT $1 + metrics.get_date_interval(number, $3) FROM metrics.Numbers WHERE $1 + metrics.get_date_interval(number, $3) <= $2 ORDER BY 1;
    
    $$ LANGUAGE 'sql';
    
    CREATE INDEX dataset_view_fact_idx1 ON metrics.dataset_view_fact(program_id);
    
    CREATE INDEX dataset_view_fact_idx2 ON metrics.dataset_view_fact(time_id);
    
    CREATE INDEX dataset_view_fact_idx3 ON metrics.dataset_view_fact(user_id);
    My goal is to get how many times a specific program has been viewed in a sequence of months. Even if that program was not viewed for in a month we want to report 0 views.

    Here is the current select statement I've come up with:

    Code:
    SELECT count(dvf.id) as dvf_count, dvf.name, td.year, td.month
    FROM metrics.time_dimension AS td
    LEFT OUTER JOIN (
        SELECT dvf.id, dvf.time_id, pd.name
        FROM metrics.dataset_view_fact AS dvf
        JOIN metrics.programs_dimension AS pd ON dvf.program_id = pd.id
        WHERE
            pd.name = 'program0'
        ) AS dvf ON dvf.time_id = td.id
    WHERE
        date_trunc('MONTH', td.timestamp) IN
            (SELECT metrics.get_timestamp_sequence('20061201', '20091201', 'MONTH'))
    GROUP BY dvf.name, td.year, td.month
    ORDER BY td.year, td.month
    and here is a sample of what is being returned:

    dvf_count, name, year, month
    Code:
    2;"program0";2007;0
    0;"";2007;0
    4;"program0";2007;1
    0;"";2007;1
    4;"program0";2007;2
    0;"";2007;2
    7;"program0";2007;3
    0;"";2007;3
    4;"program0";2007;4
    0;"";2007;4
    6;"program0";2007;5
    0;"";2007;5
    0;"";2007;6
    3;"program0";2007;7
    0;"";2007;7
    5;"program0";2007;8
    0;"";2007;8
    7;"program0";2007;9
    0;"";2007;9
    5;"program0";2007;10
    0;"";2007;10
    4;"program0";2007;11
    0;"";2007;11
    3;"program0";2008;0
    0;"";2008;0
    3;"program0";2008;1
    0;"";2008;1
    4;"program0";2008;2
    0;"";2008;2
    5;"program0";2008;3
    0;"";2008;3
    6;"program0";2008;4
    0;"";2008;4
    2;"program0";2008;5
    0;"";2008;5
    4;"program0";2008;6
    0;"";2008;6
    7;"program0";2008;7
    0;"";2008;7
    4;"program0";2008;8
    0;"";2008;8
    7;"program0";2008;9
    0;"";2008;9
    3;"program0";2008;10
    0;"";2008;10
    7;"program0";2008;11
    0;"";2008;11
    0;"";2009;0
    0;"";2009;1
    0;"";2009;2
    0;"";2009;3
    0;"";2009;4
    0;"";2009;5
    0;"";2009;6
    0;"";2009;7
    0;"";2009;8
    0;"";2009;9
    0;"";2009;10
    0;"";2009;11
    A few things are messing up my report. When there aren't any values for a month the name of the program is null or blank. I really want to make sure that every program name is filled in, but shows a 0. So getting something more like the following instead of what is above:

    dvf_count, name, year, month
    Code:
    2;"program0";2007;0
    4;"program0";2007;1
    4;"program0";2007;2
    7;"program0";2007;3
    4;"program0";2007;4
    6;"program0";2007;5
    0;"program0";2007;6
    3;"program0";2007;7
    5;"program0";2007;8
    7;"program0";2007;9
    5;"program0";2007;10
    4;"program0";2007;11
    3;"program0";2008;0
    3;"program0";2008;1
    4;"program0";2008;2
    5;"program0";2008;3
    6;"program0";2008;4
    2;"program0";2008;5
    4;"program0";2008;6
    7;"program0";2008;7
    4;"program0";2008;8
    7;"program0";2008;9
    3;"program0";2008;10
    7;"program0";2008;11
    0;"program0";2009;0
    0;"program0";2009;1
    0;"program0";2009;2
    0;"program0";2009;3
    0;"program0";2009;4
    0;"program0";2009;5
    0;"program0";2009;6
    0;"program0";2009;7
    0;"program0";2009;8
    0;"program0";2009;9
    0;"program0";2009;10
    0;"program0";2009;11
    What do i need to do to my query to get the second result set?

    Am I approaching this problem incorrectly? Would some other way work better?

    Thanks for your time.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •