Page 1 of 2 12 LastLast
Results 1 to 15 of 16
  1. #1
    Join Date
    May 2004
    Posts
    184

    Unanswered: DISTINCT causes aggregate to double in value

    Happy New Year everyone.

    I have a fairly complex query that aggregates several subqueries, views and tables. However when I add the DISTINCT modifer to the outside select statement, all of the aggregated values double in value, i.e. returns a result of 4 instead of 2.

    I've looked for ambiguous field names that might impact the group by clause, but nothing makes sense as to why this is happening. I should also point out that no matter whether I use distinct or not, the query and subquery only return one row.

    Rather than posting the very long query, below is a query that mimics what I'm doing:

    Code:
    select DISTINCT id, sum(enrollmentCount)
    from (
       select id,enrollmentCount
       from myView...) //view uses aggregate COUNT(*) to calc enrollmentCount
    group by id;
    Adding the DISTINCT modifier to the sub query doesn't change the result, only adding it to the outer query.

    I would have expected distinct to reduce my aggregate values, not increase them.

    Has anyone ever seen this before? Any suggestions or thoughts of what to look for?

    Thanks.

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    i don't think that's possible with what you've shown us

    GROUP BY id produces one row per id, so no matter what sum(enrollmentCount) is, there will only be one row per id, so DISTINCT is useless and should not change the results

    of course, there are two factors here:

    1 what you posted isn't what you've got
    2 there might be a bug in the database management system

    guess which one i think it is
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    May 2004
    Posts
    184
    I have to agree with you, that it's probably somewhere in my code and that DISTINCT should NOT impact a group by clause nor should it need to be in the statement. Still, it's weird.

    Thanks.

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    can you show a small subset of data that illustrates the problem?

    have you tracked down the ids?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    May 2004
    Posts
    184
    Rudy,

    The data comes from too many tables/subqueries to actually be able to clearly display it here, but thank you. However, let me try to at least provide something...

    The view is retreiving counts from 9 separate LEFT joins, that are then being passed through DECODE to set their values to zero if null. The view is then dynamically joined again with 3 more LEFT joins and a WHERE clause is added to limit the records. Lastly all of the above is wrapped in an outer query which is where I'm trying to discover the problem.

    Yes, I've gone down through the layers of the query/subqueries/views to make sure that the ID field is fully qualified with the table/alias to remove any ambiguity if that is what you mean by 'have you tracked down the ids'. I've also tried to look at the subqueries to see what might return more than one row to cause the problem. Unfortunately, almost all of my table use ID as their pk so this could be being picked up from any number of tables.

    Where I suspect that problem to be coming from is the 9 LEFT joins and that the Oracle parser is picking up a NULL from one of these records and causing the problem. This is just a shot in the dark though.

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    wow, no offence, but that's a mess

    i have been in similar situations and it's no fun

    try picking a small number of ids and tracking them through the subqueries and views individually, either by searching within reports or using WHERE clauses

    if you are familiar with the totals that they should be producing, it should eventually tell you where the doubling is occurring

    i often use temp tables going forward from the lowermost detailed data, then do the outer joins to lookup tables, save again, run an accumulation, save again, and so on
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  7. #7
    Join Date
    May 2004
    Posts
    184
    Rudy,

    Thanks. I was thinking of creating a tmp table from the view and testing with that, so thanks for the prodding. In this case, I do know the ID number as I chose it for the testing, so now I guess it's just time to spend some more time in the guts of the thing.

    Thanks for helping me think this through.

    Robert

  8. #8
    Join Date
    May 2004
    Posts
    184
    Okay, I've been able to narrow this down by stripping out the excess and using a tmp table (tmpDebug) instead of the underlying view.

    The problem is in the EXISTS clause. I know that my EXISTS clause contains two matches from the domainMembership table. (The domainmembership table is the middle table in a many to many join between the course and other tables.) The correct value for the sum(enrolledCount) is supposed to be 2. But using the EXISTS clause with two matching records, returns 4. If there are 3 records in the domainMembership table, then the query is returning 6.

    But, just because a record is found more than one time in the EXIST subquery clause, why would this impact the exterior aggregates - once for each record contained in the EXISTS subquery? This is acting more like a JOIN than an EXISTS criteria. I even tried adding DISTINCT to the where subquery, and that didn't help.

    Code:
    select distinct  id,
    			 courseTitle,
    			 sum(enrolledCount) as enrolled
    
      from (SELECT id,
    			courseTitle,
    			synchenrollmentCount + asynchEnrollmentCount as enrolledCount
    		FROM tmpDebug fcl
              where UPPER(coursetitle) LIKE '%DISOLVE%'
    		 AND status = 'A'
    		 AND (session_status IN ('C') OR session_status is null)) rep
     WHERE EXISTS (SELECT  entityID
    		FROM DomainMembership DM, DomainMap MAP
    	    WHERE DM.entityTypeID = 4
    		 AND DM.domainID = MAP.childID
    		 AND (MAP.parentID IN (1))
    		 AND DM.entityID = REP.ID)
     group by courseTitle,
    		id
    Thanks.

  9. #9
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    try this and see what it returns --
    Code:
    SELECT id                                                               
         , count(*)
      FROM tmpDebug fcl                                                     
     where UPPER(coursetitle) LIKE '%DISOLVE%'                              
       AND status = 'A'                                                     
       AND (session_status IN ('C') OR session_status is null)              
    GROUP
        BY id
    HAVING count(*) > 1
    see if those ids correlate to your dupes
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  10. #10
    Join Date
    May 2004
    Posts
    184
    Rudy,

    the result as written is null. Without the having clause, it returns just one record with a count of one for the result.

    Thanks.

  11. #11
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    um, you have only one id in your temp table?

    i am confused again
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  12. #12
    Join Date
    May 2004
    Posts
    184
    Rudy,

    Sorry for the confusion. I have only one row that matches the criteria of:
    Code:
    UPPER(coursetitle) LIKE '%DISOLVE%'                              
       AND status = 'A'                                                     
       AND (session_status IN ('C') OR session_status is null)
    However,
    Code:
     SELECT id                                                               
         ,status,session_status, count(*)
      FROM tmpDebug fcl                                                     
     where UPPER(coursetitle) LIKE '%DISOLVE%'                             
            
    GROUP
        BY id,status,session_status
    returns.

    RowID ID STATUS SESSION_STATUS COUNT(*)
    1 1001 A A 46
    2 1001 A C 1

    Thanks.

  13. #13
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    hmmm...

    still no idea why it's duplicating the sums in the outer query...
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  14. #14
    Join Date
    May 2004
    Posts
    184
    Okay, I was able to reproduce this in scott/tiger...

    Note the difference in the aggregates between these two queries.

    Code:
    select  deptno,sal
    from (
    select deptno,sum(sal) as sal
    from emp emp1
    where exists
    (select * from emp emp2
            where emp1.deptno=emp2.deptno)
    group by deptno  )
    Generates
    DEPTNO SAL
    1 10 8750
    2 20 10875
    3 30 9400

    but

    Code:
    select  DISTINCT deptno,sal
    from (
    select deptno,sum(sal) as sal
    from emp emp1
    where exists
    (select * from emp emp2
            where emp1.deptno=emp2.deptno)
    group by deptno  )
    DEPTNO SAL
    1 10 26250
    2 20 54375
    3 30 56400

    Consider the following deptno counts in emp:
    Code:
    select deptno,count(*)
    from emp
    group by deptno
    Returns:
    DEPTNO COUNT(*)
    1 10 3
    2 20 5
    3 30 6

    There is obviously something that Oracle does differently with how it handles WHERE clauses in combination with DISTINCT and aggregates one value per record found in the WHERE clause.



    Thanks.

  15. #15
    Join Date
    Jan 2004
    Location
    Croatia, Europe
    Posts
    4,094
    Provided Answers: 4
    ... and your database version is ...?
    Code:
    SQL> select banner from v$version;
    
    BANNER
    ----------------------------------------------------------------
    Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
    PL/SQL Release 10.2.0.1.0 - Production
    CORE    10.2.0.1.0      Production
    TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
    NLSRTL Version 10.2.0.1.0 - Production
    
    SQL> -- the first query:
    SQL>
    SQL> select  deptno,sal
      2  from (
      3  select deptno,sum(sal) as sal
      4  from emp emp1
      5  where exists
      6  (select * from emp emp2
      7          where emp1.deptno=emp2.deptno)
      8  group by deptno  )
      9  order by deptno;
    
        DEPTNO        SAL
    ---------- ----------
            10       8750
            20      10875
            30       1250
            40       8150
    
    SQL>
    Code:
    SQL> -- the second query:
    SQL>
    SQL> select  DISTINCT deptno,sal
      2  from (
      3  select deptno,sum(sal) as sal
      4  from emp emp1
      5  where exists
      6  (select * from emp emp2
      7          where emp1.deptno=emp2.deptno)
      8  group by deptno  )
      9  order by deptno;
    
        DEPTNO        SAL
    ---------- ----------
            10       8750
            20      10875
            30       1250
            40       8150
    
    SQL>
    Unless I did something wrong, it seems to work correctly on 10g.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •