Results 1 to 11 of 11
  1. #1
    Join Date
    Aug 2008
    Posts
    54

    Group By Question

    Hello Everyone,

    I have a table which contains customers with multiple process_dates. I would like to select the max process_date for each customer so the output has one entry for each customer, no dups.

    The query below returns this error:
    "Column 'cust_tab.Group name' is invalid in the select list because it is not contained in either aggregate function or the Group By clause"

    This table contains hundreds of columns; I couldn't possibly list all the columns in the select statement. And I need to keep all of the columns.
    Does anyone have any suggestions on resolving this issue?

    Thank you in advance for any input


    select cust_id, cust_class, process_date, *
    from cust_tab
    where principal_bal > 0
    group by cust_id, cust_class, process_date
    having process_date = max(process_date)

  2. #2
    Join Date
    Apr 2012
    Posts
    212
    Hi,

    Try something like this:

    Code:
    with CTE_R as
    (
        select 
            *,
            ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY process_date DESC) as RowNum
        from cust_tab 
        where principal_bal > 0
    )
    
    select * from CTE_R
    where RowNum = 1
    Hope this helps.

  3. #3
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,564
    The CTE may be overkill for this task, depending upon your requirements. What you described (one record for each customer) does not match the logic of your code (only the record with the latest process date).

    And part of your problem is that you have a "*" in your SELECT statement with a GROUP BY. No can do.

    Try this:
    Code:
    select	cust_id,
    	cust_class,
    	max(process_date) as MaxProcessDate
    from	cust_tab
    where	principal_bal > 0
    group by cust_id,
    	cust_class
    ...which may satisfy your requirements. If neither my nor imex's solution works for you, describe your business requirements more clearly and I bet either I or imex can roll a CTE that will suit your needs.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  4. #4
    Join Date
    Jan 2013
    Posts
    305
    WITH X
    AS
    (SELECT *,
    MAX(process_date) OVER (PARTITION BY cust_id)
    AS process_date_max
    FROM Customers
    WHERE principal_bal > 0.00)
    SELECT *
    FROM X
    WHERE process_date_max = process_date;

    The advantage of MAX() over ROW_NUMBER() is it has no need to do a sort and if there are ties, it will find them. I really do not like using SELECT *, but it is the best I can do with the specs; how did you wind up with hundreds of columns? The most I have seen in the real world was ~120, where each one was an allergy test for a medical study

  5. #5
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,564
    That would appear to perform the same logic as the CTE that imex already posted.

    Any reason to prefer:
    "MAX(process_date) OVER (PARTITION BY cust_id)...WHERE process_date_max = process_date;"
    over:
    "ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY process_date DESC)...where RowNum = 1"?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  6. #6
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,564
    Actually the code that imex post is preferred, as the code that Joe posted could return more than one record for a customer, if they have two records with identical process_date values.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  7. #7
    Join Date
    Aug 2008
    Posts
    54
    Thank you all very much for your input!
    I tried all of your suggestions, and it seems that imex's approach worked best for the data I have.

    Thank you again for your help. This forum is a wealth of knowledge

  8. #8
    Join Date
    Jan 2013
    Posts
    305
    Any reason to prefer:
    "MAX(process_date) OVER (PARTITION BY cust_id)...WHERE process_date_max = process_date;"
    over:
    "ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY process_date DESC)...where RowNum = 1"?
    1. The "MAX(..) OVER (PARTITION BY ..)” requires a scan within each partition and not sorting like “ROW_NUMBER() OVER(PARTITION BY .. ORDER BY ..)”, so it should run faster.
    2. Each partition can execute in parallel, because it is not referencing an outer SELECT; all the relevant data is in the partition.
    3. MAX() returns ties while ROW_NUMBER() picks one at random and thus destroys information and might not be repeatable.

  9. #9
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,564
    This will indeed run faster in the rare scenarios where Process dates are unique for each cust_id. But I don't see how that method is extensible to secondary sorting. For instance, if a customer has two records with identical process dates and the business rule is to use the last one entered (the one with the highest sequential "cust_tab_id"), can that be rolled into your existing logic, or would you have to revert back to the "ROW_NUMBER() over (Partition by cust_ID order by processed_date desc, cust_tab_id desc) as RowNum" method?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  10. #10
    Join Date
    Jan 2013
    Posts
    305
    This will indeed run faster in the rare scenarios where Process dates are unique for each cust_id.
    Unh? A sort is O(n * log2(n)) while a MAX(n) is O(n). The uniqueness of the date will not change that. Also, how do you know that duplicate dates are common from the specs? Why not assume that this process is daily?

    In fact, the MAX() has another advantage, if we can fold the “process_date_max = process_date” into the scan. Make it easy with integers. Do the first scan, and since you have only one row, it is the max so far:

    Value Value_max_so_far
    100 100 ◀ keep
    110
    100
    99

    Step two, we now have a new greatest value, so replace it. You could now delete all the rows that are strictly less the new current max value, but let us keep scanning and dropping as we go.

    Value Value_max_so_far
    =================
    100 100 drop
    110 110 ◀ keep
    100
    99

    Value Value_max_so_far
    ==================
    100 100
    110 110 keep
    100 110 ◀ drop
    99


    Value Value_max_so_far
    ==================
    100 100 drop
    110 110 keep
    100 110 drop
    99 110 ◀ drop

    If Conner is at the Cactuss meetings tonight and tomorrow, I will ask how MS does it.

    But I don't see how that method is extensible to secondary sorting. For instance, if a customer has two records [sic: rows are not records] with identical process dates and the business rule is to use the last one entered (the one with the highest sequential "cust_something_seq"), can that be rolled into your existing logic, or would you have to revert back to the "ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY processed_date DESC, cust_something_seq DESC) AS row_nbr" method?
    Yes, if you change the specs and create new columns, you might need a new query And I would probably go with the ROW_NUMBER(). However, one trick I have seen was to use a computed column. The students took tests that had three parts that carried different weights that gave a final score. Skeleton DDL:

    CREATE TABLE Tests
    (student_id CHAR (10) NOT NULL,
    test_nbr SMALLINT NOT NULL,
    PRIMARY KEY (student_id, test_nbr),
    x_score INTEGER NOT NULL,
    y_score INTEGER NOT NULL,
    z_score INTEGER NOT NULL,
    grand_score COMPUTED AS (10*x_score + 15*y_score + z_score)
    );

    Since the computed score was being used for lots of stuff and the formula is simple, materialization made sense. It was waiting in the data, not computed on the fly. But combining mixed data types might not be such a good idea.

    WITH X
    AS
    (SELECT student_id, grand_score,
    MAX(grand_score)
    OVER (PARTITION BY student_id) AS grand_score_max
    FROM Tests)

    SELECT student_id, grand_score
    FROM X
    WHERE grand_score_max = grand_score;

  11. #11
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,564
    Quote Originally Posted by Celko View Post
    This will indeed run faster in the rare scenarios where Process dates are unique for each cust_id.
    Unh? A sort is O(n * log2(n)) while a MAX(n) is O(n).
    Sorry, I meant the method was only "applicable" to those scenarios. Not that the cardinality of the data would affect the relative processing speed of the two methods.
    I hope that misunderstanding didn't put you through too much trouble.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •