If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > Microsoft SQL Server > Group By Question

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old
Registered User
 
Join Date: Aug 2008
Posts: 54
Group By Question

Hello Everyone,

I have a table which contains customers with multiple process_dates. I would like to select the max process_date for each customer so the output has one entry for each customer, no dups.

The query below returns this error:
"Column 'cust_tab.Group name' is invalid in the select list because it is not contained in either aggregate function or the Group By clause"

This table contains hundreds of columns; I couldn't possibly list all the columns in the select statement. And I need to keep all of the columns.
Does anyone have any suggestions on resolving this issue?

Thank you in advance for any input


select cust_id, cust_class, process_date, *
from cust_tab
where principal_bal > 0
group by cust_id, cust_class, process_date
having process_date = max(process_date)
Reply With Quote
  #2 (permalink)  
Old
Registered User
 
Join Date: Apr 2012
Posts: 211
Hi,

Try something like this:

Code:
with CTE_R as
(
    select 
        *,
        ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY process_date DESC) as RowNum
    from cust_tab 
    where principal_bal > 0
)

select * from CTE_R
where RowNum = 1
Hope this helps.
Reply With Quote
  #3 (permalink)  
Old
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 12,560
The CTE may be overkill for this task, depending upon your requirements. What you described (one record for each customer) does not match the logic of your code (only the record with the latest process date).

And part of your problem is that you have a "*" in your SELECT statement with a GROUP BY. No can do.

Try this:
Code:
select	cust_id,
	cust_class,
	max(process_date) as MaxProcessDate
from	cust_tab
where	principal_bal > 0
group by cust_id,
	cust_class
...which may satisfy your requirements. If neither my nor imex's solution works for you, describe your business requirements more clearly and I bet either I or imex can roll a CTE that will suit your needs.
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
www.LobsterShot.blogspot.com
Reply With Quote
  #4 (permalink)  
Old
Registered User
 
Join Date: Jan 2013
Posts: 298
WITH X
AS
(SELECT *,
MAX(process_date) OVER (PARTITION BY cust_id)
AS process_date_max
FROM Customers
WHERE principal_bal > 0.00)
SELECT *
FROM X
WHERE process_date_max = process_date;

The advantage of MAX() over ROW_NUMBER() is it has no need to do a sort and if there are ties, it will find them. I really do not like using SELECT *, but it is the best I can do with the specs; how did you wind up with hundreds of columns? The most I have seen in the real world was ~120, where each one was an allergy test for a medical study
Reply With Quote
  #5 (permalink)  
Old
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 12,560
That would appear to perform the same logic as the CTE that imex already posted.

Any reason to prefer:
"MAX(process_date) OVER (PARTITION BY cust_id)...WHERE process_date_max = process_date;"
over:
"ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY process_date DESC)...where RowNum = 1"?
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
www.LobsterShot.blogspot.com
Reply With Quote
  #6 (permalink)  
Old
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 12,560
Actually the code that imex post is preferred, as the code that Joe posted could return more than one record for a customer, if they have two records with identical process_date values.
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
www.LobsterShot.blogspot.com
Reply With Quote
  #7 (permalink)  
Old
Registered User
 
Join Date: Aug 2008
Posts: 54
Thank you all very much for your input!
I tried all of your suggestions, and it seems that imex's approach worked best for the data I have.

Thank you again for your help. This forum is a wealth of knowledge
Reply With Quote
  #8 (permalink)  
Old
Registered User
 
Join Date: Jan 2013
Posts: 298
Quote:
Any reason to prefer:
"MAX(process_date) OVER (PARTITION BY cust_id)...WHERE process_date_max = process_date;"
over:
"ROW_NUMBER() OVER(PARTITION BY cust_id ORDER BY process_date DESC)...where RowNum = 1"?
  1. The "MAX(..) OVER (PARTITION BY ..)” requires a scan within each partition and not sorting like “ROW_NUMBER() OVER(PARTITION BY .. ORDER BY ..)”, so it should run faster.
  2. Each partition can execute in parallel, because it is not referencing an outer SELECT; all the relevant data is in the partition.
  3. MAX() returns ties while ROW_NUMBER() picks one at random and thus destroys information and might not be repeatable.
Reply With Quote
  #9 (permalink)  
Old
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 12,560
This will indeed run faster in the rare scenarios where Process dates are unique for each cust_id. But I don't see how that method is extensible to secondary sorting. For instance, if a customer has two records with identical process dates and the business rule is to use the last one entered (the one with the highest sequential "cust_tab_id"), can that be rolled into your existing logic, or would you have to revert back to the "ROW_NUMBER() over (Partition by cust_ID order by processed_date desc, cust_tab_id desc) as RowNum" method?
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
www.LobsterShot.blogspot.com
Reply With Quote
  #10 (permalink)  
Old
Registered User
 
Join Date: Jan 2013
Posts: 298
Quote:
This will indeed run faster in the rare scenarios where Process dates are unique for each cust_id.
Unh? A sort is O(n * log2(n)) while a MAX(n) is O(n). The uniqueness of the date will not change that. Also, how do you know that duplicate dates are common from the specs? Why not assume that this process is daily?

In fact, the MAX() has another advantage, if we can fold the “process_date_max = process_date” into the scan. Make it easy with integers. Do the first scan, and since you have only one row, it is the max so far:

Value Value_max_so_far
100 100 ◀ keep
110
100
99

Step two, we now have a new greatest value, so replace it. You could now delete all the rows that are strictly less the new current max value, but let us keep scanning and dropping as we go.

Value Value_max_so_far
=================
100 100 drop
110 110 ◀ keep
100
99

Value Value_max_so_far
==================
100 100
110 110 keep
100 110 ◀ drop
99


Value Value_max_so_far
==================
100 100 drop
110 110 keep
100 110 drop
99 110 ◀ drop

If Conner is at the Cactuss meetings tonight and tomorrow, I will ask how MS does it.

Quote:
But I don't see how that method is extensible to secondary sorting. For instance, if a customer has two records [sic: rows are not records] with identical process dates and the business rule is to use the last one entered (the one with the highest sequential "cust_something_seq"), can that be rolled into your existing logic, or would you have to revert back to the "ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY processed_date DESC, cust_something_seq DESC) AS row_nbr" method?
Yes, if you change the specs and create new columns, you might need a new query And I would probably go with the ROW_NUMBER(). However, one trick I have seen was to use a computed column. The students took tests that had three parts that carried different weights that gave a final score. Skeleton DDL:

CREATE TABLE Tests
(student_id CHAR (10) NOT NULL,
test_nbr SMALLINT NOT NULL,
PRIMARY KEY (student_id, test_nbr),
x_score INTEGER NOT NULL,
y_score INTEGER NOT NULL,
z_score INTEGER NOT NULL,
grand_score COMPUTED AS (10*x_score + 15*y_score + z_score)
);

Since the computed score was being used for lots of stuff and the formula is simple, materialization made sense. It was waiting in the data, not computed on the fly. But combining mixed data types might not be such a good idea.

WITH X
AS
(SELECT student_id, grand_score,
MAX(grand_score)
OVER (PARTITION BY student_id) AS grand_score_max
FROM Tests)

SELECT student_id, grand_score
FROM X
WHERE grand_score_max = grand_score;
Reply With Quote
  #11 (permalink)  
Old
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 12,560
Quote:
Originally Posted by Celko View Post
Quote:
This will indeed run faster in the rare scenarios where Process dates are unique for each cust_id.
Unh? A sort is O(n * log2(n)) while a MAX(n) is O(n).
Sorry, I meant the method was only "applicable" to those scenarios. Not that the cardinality of the data would affect the relative processing speed of the two methods.
I hope that misunderstanding didn't put you through too much trouble.
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
www.LobsterShot.blogspot.com
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On