  1. #1
    Join Date
    Feb 2010
    Posts
    75

    Unanswered: Data manipulation in SQL or C#?

    I have a pretty complicated query where I basically select a ton of data into a temp table and then loop over that table, massaging the data - doing things like removing outliers and other weird values.
    The query is very slow, and I know it's mostly because of the loop/massage, not the initial select.

    My question is: would it significantly speed things up to do the loop/massage in the back end in C# instead of in T-SQL? I could just try it and see, but as I said it's a pretty complicated algorithm, so it's not trivial to port over to C#. I wanted to get your feelings on it first.

    To me it seems that manipulating data is SQL Server's specialty, so T-SQL should be the fastest place to do it. Not to mention that since I'm pruning the data before sending it back, it saves on bandwidth as well.

    Thoughts?

  2. #2
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    The first question would typically be whether you could make this set based instead of looping. However, if the algorithm is that complex, then converting it to set based might not have the ROI you would need.

    Can you tell us something about the algorithms? Apart from looping, SQL is also very slow with text manipulation and complex maths compared to C#, so that might influence your decision as well.

    In short, though, I would expect a C# loop to outperform a SQL one pretty well every time.
    Final point - you don't actually need to perform a loop in C#. You could write a CLR scalar function and access that via a query as you would a normal function.
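    Something along these lines, for example (just a sketch - the assembly, class, function and column names here are all made up), once the DLL is built and deployed to the server:

    Code:
    -- Catalogue the assembly containing the C# method (all names are placeholders)
    CREATE ASSEMBLY EventMaths
    FROM 'C:\clr\EventMaths.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    -- Expose the static C# method as a T-SQL scalar function
    CREATE FUNCTION dbo.CleanDuration (@seconds FLOAT)
    RETURNS FLOAT
    AS EXTERNAL NAME EventMaths.[EventMaths.Calculations].CleanDuration;
    GO

    -- It can then be called from an ordinary set based query
    SELECT EventId, dbo.CleanDuration(DurationSeconds)
    FROM dbo.Events;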
    Testimonial:
    pootle flump
    ur codings are working excelent.

  3. #3
    Join Date
    Feb 2010
    Posts
    75
    Thanks flump.
    If I get the time, I will port the ugly part of the algorithm from SQL to C# and post back here with the results. I'm expecting a significant speed increase, not for any logical reason other than that it's really slow now and couldn't possibly get any slower.

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I have yet to see any processing done via temp tables and loops that can't be done better, cleaner, and orders of magnitude faster using set based operations. I'd be willing to bet that if you can post your algorithm, we can do it using sets so that it is easier for you to maintain and runs somewhere between ten and ten thousand times faster.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  5. #5
    Join Date
    Feb 2010
    Posts
    75
    I am all for set operations instead of loops.
    The logic is very complicated though, and I would be blown away if you could turn this into a set operation.
    Unfortunately I have a feeling that posting the SQL could get me into trouble due to the nature of this project. If I describe the data and provide pseudocode, would that be enough for you to help me?

  6. #6
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    It depends entirely on how well you can describe the problem (so that we can understand the important parts), and how well you can interpret our responses and apply them to your real problem. I'm one of the first people to get angry when someone comes here and says "here's my roller skate, tell me how to fix my dump truck" and then complains when we describe how to fix the problem they described rather than the actual problem they have.

    I understand what it is like to have to "keep the kimono closed" in public, and how that can complicate finding a solution, but I'm willing to give it a shot.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  7. #7
    Join Date
    Feb 2010
    Posts
    75
    I'm asking for help now, but I've been the one giving help many times as well. I completely understand what you mean. I'll do my best to describe it, and if that's not sufficient (which wouldn't surprise me), then thanks for trying anyway!


    We have a giant table of events. Each event has a timestamp for when it was completed, as well as a duration for how long it took. Each event also has a type. The task is to run a report on the durations of events between @startTime and @endTime, for each event type. We want the average and maximum durations, after outliers are removed.

    The fun part is removing outliers. An outlier is an event with a duration that is way bigger or way smaller than the other durations around the same time. For example, if most durations between 11:30 and 12:00 are around 3-4 minutes, and one of them is 58 minutes, then the 58 is dropped and not included in the average or maximum.

    I currently have a separate table-valued function, SP_NO_OUTLIERS(), to remove the outliers. It takes parameters @startTime, @endTime and @eventType, and returns a table containing all the events of that type between those times, but without the outliers. Outliers are defined as follows: in the 10% of events with the shortest durations, any events separated from the 'main chunk' of data by 1 minute or more are considered outliers; likewise, in the 10% of events with the longest durations, any events separated from the 'main chunk' by 1 minute or more are considered outliers. The function itself uses cursors and loops as well.

    So my main algorithm uses this function heavily. The report should show the average and max durations (minus outliers) in 30-minute intervals. I.e., for each 30-minute interval, what were the average and maximum durations, with outliers not included.

    Here is some pseudocode. Generally a foreach means a cursor - which is the kind of thing we want to turn into SET operations instead (I think).

    Code:
    foreach eventtype
        foreach 30-minute-interval
            insert into #TEMP exec SP_NO_OUTLIERS(eventtype, 30-minute-interval)
            insert into #RESULTS
                select eventtype, 30-minute-interval, AVG(seconds), MAX(seconds) from #TEMP
        end
    end
    select * from #RESULTS



    The actual algorithm is much, much more complicated than this, but I'm just going to have to hope that this will be enough to help me out.
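    In case it helps, here is a stripped-down version of what the outer loops look like in actual T-SQL. The table and column names are made up (I can't post the real ones), @startTime and @endTime are the proc's parameters, #TEMP and #RESULTS are created earlier in the proc, and the real code does a lot more than this - but the shape is the same:

    Code:
    -- Simplified sketch of the current cursor approach; names are placeholders
    DECLARE @type INT, @slotStart DATETIME, @slotEnd DATETIME;

    DECLARE type_cur CURSOR FOR
        SELECT DISTINCT EventType FROM dbo.Events;
    OPEN type_cur;
    FETCH NEXT FROM type_cur INTO @type;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @slotStart = @startTime;
        WHILE @slotStart < @endTime
        BEGIN
            SET @slotEnd = DATEADD(MINUTE, 30, @slotStart);

            -- SP_NO_OUTLIERS is the table-valued function described above
            INSERT INTO #TEMP
            SELECT * FROM dbo.SP_NO_OUTLIERS(@slotStart, @slotEnd, @type);

            INSERT INTO #RESULTS (EventType, SlotStart, AvgSeconds, MaxSeconds)
            SELECT @type, @slotStart, AVG(DurationSeconds), MAX(DurationSeconds)
            FROM #TEMP;

            TRUNCATE TABLE #TEMP;   -- reset for the next interval
            SET @slotStart = @slotEnd;
        END;
        FETCH NEXT FROM type_cur INTO @type;
    END;

    CLOSE type_cur;
    DEALLOCATE type_cur;

    SELECT * FROM #RESULTS;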

  8. #8
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    Provided Answers: 1
    SP_NO_OUTLIERS is a blackbox in this example. What does it do?
    oh yeah... documentation... I have heard of that.

    *** What Do You Want In The MS Access Forum? ***

  9. #9
    Join Date
    Sep 2001
    Location
    Chicago, Illinois, USA
    Posts
    601
    It is natural to think that your code is complicated, but it may not be all that complicated.

    I think you need to post the actual code here.
    Ken

    Maverick Software Design

    (847) 864-3600 x2

  10. #10
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    How do you define your intervals? Does the data define them, an external table, an argument value, or something else?

    This could be done with a single Machiavellian SELECT operation, but that's bad design and poor coding practice.

    What version of SQL Server are you using? Establishing a minimum version could significantly reduce the complexity of the code needed to implement your request.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  11. #11
    Join Date
    Feb 2010
    Posts
    75
    The intervals are defined by parameters to the stored procedure, so there's no way of knowing in advance what they will be.


    Quote Originally Posted by PatP
    This could be done with a single Machiavellian SELECT operation, but that's bad design and poor coding practice.
    Not sure what you mean by this. I thought that SET operations were always preferable to doing loops, and I thought a single 'Machiavellian' SELECT operation would be a SET operation. What's bad design about that?

    SQL Server 2008

    Thanks!

  12. #12
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by BeerOclock View Post
    I thought that SET operations were always preferable to doing loops
    And 99.9% of the time you would be right. But every rule has exceptions.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  13. #13
    Join Date
    Jun 2005
    Posts
    319
    It sounds like you want to split the data by event_type and interval, then figure out what bounds you are dealing with, using code similar to the below, and take it from there. Given your requirements, you'll also have to join against a set of intervals (you could build those with a recursive CTE - sketched after the code).

    Code:
    -- Boundary durations: the top of the shortest 10% of events and the
    -- bottom of the longest 10% (durations assumed to be stored in seconds)
    DECLARE @MIN_EVENT_DURATION INT, @MAX_EVENT_DURATION INT;

    SELECT
        @MIN_EVENT_DURATION = MAX(EVENT_DURATION)
    FROM
        (
            SELECT TOP 10 PERCENT
                EVENT_DURATION
            FROM
                EVENT
            ORDER BY
                EVENT_DURATION
        ) Q1;

    SELECT
        @MAX_EVENT_DURATION = MIN(EVENT_DURATION)
    FROM
        (
            SELECT TOP 10 PERCENT
                EVENT_DURATION
            FROM
                EVENT
            ORDER BY
                EVENT_DURATION DESC
        ) Q2;
    Once you have established your bounds for any given event_type and interval, it should be pretty straightforward to proceed from there given your business logic.
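    For the interval set, a recursive CTE along these lines should do it (just a sketch - it assumes @startTime and @endTime come from your proc's parameters and that the slots are 30 minutes wide):

    Code:
    -- Sketch: build the list of 30-minute slots between @startTime and @endTime
    DECLARE @startTime DATETIME = '2010-09-29 00:00',
            @endTime   DATETIME = '2010-09-29 12:00';

    WITH INTERVALS AS
    (
        SELECT @startTime AS SLOT_START,
               DATEADD(MINUTE, 30, @startTime) AS SLOT_END
        UNION ALL
        SELECT SLOT_END,
               DATEADD(MINUTE, 30, SLOT_END)
        FROM INTERVALS
        WHERE SLOT_END < @endTime
    )
    SELECT SLOT_START, SLOT_END
    FROM INTERVALS
    OPTION (MAXRECURSION 0);   -- allow more than the default 100 recursion levels
    Join EVENT to that on the completion timestamp falling within each slot, group by event type and slot, and then apply the bounds logic per group.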
    Last edited by Gagnon; 09-29-10 at 15:23.
