Results 1 to 6 of 6
  1. #1
    Join Date
    Nov 2010
    Posts
    3

    Unanswered: data Cleansing-standarizing column,s data

    i have converted many text,acess,acess,oracle data to sql now i have to perform data cleansing.

    i ,m new to SSIS and i don`t know how to use SSIS to perform my data cleansing before ETL. i have used some queries in my SQL to check dirty data,

    for gender i found 8 values with (00,11,01,10,001,M) .

    similarly, for date of birth
    12 invalid dates. 22-Jal-75,1/27/75,27-Apl-77..... 29-Feb-75,31-jun-89 etc.

    WHAT I WANT????

    1)first SSIS read data.

    2)if don`t find 0 or 1 in gender send erronous data(00,11,01,10,001,M) to "error table".

    3)in "error table", dirty data (00,11,01,10,001,M) should be converted in 0,1

    4)after standarization, error table data records should go to original table at its place.

    5)Now i will change 0 to male and 1 to female.

    6)now i will send complete table of student(stid,name,father,gender,adress,last degree,reg data,dob) to a new table STD_INFO.

    7) i will repeat same 1-6 steps for all databases(lahore,karachi,islamabad,peshawar) and then i will integrate STD_INFO tables of all databases into one.

    i hope now you can understand what i have to do. Similar case with DOB first dirty data to erronous table after conversion there to original table and then i will check all records at the same time.

  2. #2
    Join Date
    Nov 2010
    Posts
    3
    i need answer ASAPplzzzz help plzzzz

  3. #3
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Personally, I would use SSIS to load a staging table only. Do all cleaning in T-SQL.

    Load in to a totally unconstrained staging table (all columns NVARCHAR). Do all your transformations & cleaning in T-SQL. Create tables to keep an audit trail (primary key, original value, new value, boolean indicating if the row\ value could be cleaned etc.). When done load into final destination table.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  4. #4
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    Do not put any logic or data-related coding in your ETL packages. Use them for E, use them for L, but for God's sake don't use them for T.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  5. #5
    Join Date
    Nov 2010
    Posts
    3
    how can i do cleaning in T-SQL.??
    any toturial /suggestions plzzz

    i have used some queries in my SQL to check dirty data,


    for gender

    SELECT



    [Gender]



    FROM [Isb_practice].[dbo].[BS_Students]

    Where (Gender <>'0') AND ( Gender <>'1')

    i found 8 values with (00,11,01,10,001,M) .

    similarly, for date of birth

    SELECT *
    FROM Student
    WHERE ([Date of Birth] NOT LIKE '%_-Jan-__')
    AND ([Date of Birth] NOT LIKE '%_-Feb-__')
    AND ([Date of Birth] NOT LIKE '%_-Mar-__')
    AND ([Date of Birth] NOT LIKE '%_-Apr-__')
    AND ([Date of Birth] NOT LIKE '%_-May-__')
    AND ([Date of Birth] NOT LIKE '%_-Jun-__')
    AND ([Date of Birth] NOT LIKE '%_-Jul-__')
    AND ([Date of Birth] NOT LIKE '%_-Aug-__')
    AND ([Date of Birth] NOT LIKE '%_-Sep-__')
    AND ([Date of Birth] NOT LIKE '%_-Oct-__')
    AND ([Date of Birth] NOT LIKE '%_-Nov-__')
    AND ([Date of Birth] NOT LIKE '%_-Dec-__')
    AND ([Date of Birth] <> '') AND ([Date of Birth] IS NOT NULL)

    The output of the above query shows 3 invalid dates. 22-Jal-75,1/27/75,27-Apl-77.

    SELECT * FROM Student
    WHERE ([Date of Birth] LIKE '29-Feb-%') OR
    ([Date of Birth] LIKE '3_-%')

    The output of the above query shows 9 invalid dates. 29-Feb-75,31-jun-89 etc.

  6. #6
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    You need to code the rules. You have several options:
    1) Hard code each change, e.g. use an UPDATE statement to change "00" -> "0", "11" -> "1" etc.
    2) Or you can put these in to a table with a "bad" column and a "convert into" column and use an UPDATE statement with a join. This is the same as 1, but better (since you could get other users to make these conversion judgements).
    For 1 & 2 only values you set are converted.
    3) Write algorithms for the server to try to work out what to convert values to. This is very complex and likely beyond your skill level. It could include, however, distance algorithms and metaphone.
    Testimonial:
    pootle flump
    ur codings are working excelent.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •