  1. #1
    Join Date
    Feb 2008
    Posts
    5

    Research into data quality in large databases

    Hi,

    I am currently conducting a research project for a Master's programme, looking at how data quality issues affect large (2 million+ rows) marketing databases and CRM systems, and investigating how automated solutions can improve data quality.

    As you are practitioners in the field, I was hoping some of you might be able to spare some time to discuss the issues or to fill in a questionnaire. I can forward you full details of my research if you wish.

    This work will produce actionable recommendations about how to effectively deploy automated processing solutions, particularly in data loading and data cleaning. The research in this area is currently very sparse and therefore you would be contributing a great deal to the body of knowledge in this area. If you can help, I will cite you as a source if you like, and your name might forever be in lights on the references pages of an ACM journal or similar!

    If you are interested in contributing, please contact me at 0313223 AT chester DOT ac DOT uk, PM me, or reply here.

    Thanks a lot.

    Dave
    Last edited by dasiddall; 02-22-08 at 05:43.

  2. #2
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    dasiddall - unless you want loads of spam I would obfuscate your email address.

    Sounds interesting - please could I request that people respond to the thread, rather than email or pm. This way you gain from the exposure on the forum and the forum gains by keeping much of your source material here for others to reference. It would also be jolly nice if you could somehow link to your eventual article, or at least paste in the recommendations, upon completion.

  3. #3
    Join Date
    Feb 2008
    Posts
    5
    Quote Originally Posted by pootle flump
    dasiddall - unless you want loads of spam I would obfuscate your email address.
    Thanks.

  4. #4
    Join Date
    Feb 2008
    Posts
    5
    Quote Originally Posted by pootle flump
    Sounds interesting - please could I request that people respond to the thread, rather than email or pm. This way you gain from the exposure on the forum and the forum gains by keeping much of your source material here for others to reference. It would also be jolly nice if you could somehow link to your eventual article, or at least paste in the recommendations, upon completion.
    Thanks for looking kindly on the thread, especially if you are a mod!

    Anyway, I will happily post further details here. After collating the ideas of people like you about these things, I will be able to produce a framework for an automated solution.

    Background/Purpose Of The Study
    Poor data quality costs enterprises money by making business processes less efficient, by increasing the cost of maintaining contact with their customer base, and through loss of customers due to poor customer service. The purpose of this research is to help demonstrate that automation remains worthwhile, bringing many perceptible benefits to an enterprise, particularly in terms of the quality of service it provides.

    There has been much investigation into the benefits of automation in other areas of computing, such as automation of testing in software development. As data loading and cleaning involves similarly programmable recurring processes, and automation programmes are being initiated on this basis, I believe it is now worth fully investigating whether the benefits of automation can be replicated in this area. Although there is some research on the benefits of automation in other areas, the research into the benefits of automation in data loading and cleaning is meagre.

    Thanks for taking the time to have a look over my research. If you do respond, you are of course free to withdraw at any time, without giving a reason. The results will be written up and a copy will be made available in the university library. I will also, following pootle flump's request, link to the paper if and when it is published. If you would like a copy, please don’t hesitate to contact me. All results will be made anonymous unless you would like me to cite you as a source; if so, please say and I will be happy to do so.

    The questions I am looking to discuss are as follows.

    1) What are the main data quality problems in marketing databases?
    2) What are the main factors that cause these issues to arise?
    3) What are the main costs of poor data quality?
    4) How would you go about measuring data quality? What are the key indicators?
    5) Do you believe that automated or manual loading and/or cleaning can best remedy these problems?
    6) How possible do you think it is to achieve fully automated processes to improve data quality?
    7) Are there any major impediments to being able to improve data quality using automated processes?
    8) What change do you think automated data loading and cleaning solutions may have on an organisation?
    9) Are there any other side benefits of automated processing in improving data quality?
    10) Do you think the impact of these is felt to be beneficial to an organisation's clients/stakeholders?
    11) Therefore, are manual or automated solutions preferable?

    Many thanks.

  5. #5
    Join Date
    Aug 2003
    Location
    Toronto, Ontario, Canada
    Posts
    203
    I haven't actually worked with "marketing" databases, though many of the database structures I have worked with have done some marketing as well. I'll try to answer at least some of your questions.

    1. The main data quality problems:
    - Transposition of numbers (reversed key strokes or entering a date in an amount field and the amount in a date field)
    - Mis-spelling or inconsistent spelling of names (ON, Ont, ONT., ONTARIO, Onatrio, etc.)
    - Skipping of necessary fields (even if you make a field required people will enter nonsense in it to get past it)
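As an illustration of the inconsistent-spelling problem above, here is a minimal sketch of how variants such as "Ont", "ONT." and "Onatrio" might be normalised automatically. The `CANONICAL` lookup table, the `normalise_province` function and the 0.8 similarity cutoff are all assumptions made for this example, not anything taken from a real system; values that cannot be matched are returned as `None` so they can go to manual review.

```python
import difflib

# Hypothetical lookup of known variants to a canonical code (illustrative, not exhaustive).
CANONICAL = {
    "on": "ON",
    "ont": "ON",
    "ontario": "ON",
    "qc": "QC",
    "que": "QC",
    "quebec": "QC",
}


def normalise_province(raw, cutoff=0.8):
    """Return a canonical province code, or None if the value needs manual review."""
    # Ignore case, surrounding whitespace and a trailing full stop ("ONT." -> "ont").
    key = raw.strip().rstrip(".").lower()
    if key in CANONICAL:
        return CANONICAL[key]
    # Fall back to fuzzy matching for simple typos such as "Onatrio".
    close = difflib.get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[close[0]] if close else None
```

With this table, `normalise_province("ONT.")` and `normalise_province("Onatrio")` both give `"ON"`, while an unrecognisable value gives `None`. The cutoff is a trade-off: lower it and more typos are caught, but more wrong guesses slip through.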

    2. The main factors that cause these issues to arise:
    - Lack of training
    - Lack of time (a person in a hurry makes more mistakes)
    - Lack of data checks / cross checks / validation

    3. The main costs of poor data quality:
    - Lost business
    - Reshipping (if incorrect product is sent or incorrect address was used)
    - Incorrect billing (refunds)

    4. How to go about measuring data quality. The key indicators are:
    - Amounts that are outside an acceptable range
    - Dates that are outside an acceptable range
    - Quantities that are outside an acceptable range
    - Returned mail or Postal / Zip codes that do not match the area being shipped / billed to
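The range-based indicators listed above could be sketched roughly as follows. The field names, the limits in `RULES` and both function names are hypothetical; in practice the acceptable ranges would have to come from whoever owns the data.

```python
from datetime import date

# Hypothetical acceptable ranges; real limits would come from the business owner of the data.
RULES = {
    "amount": (0.01, 10_000.00),
    "quantity": (1, 500),
    "order_date": (date(2000, 1, 1), date(2008, 12, 31)),
}


def out_of_range_fields(record):
    """Return the names of fields whose values are missing or outside the allowed range."""
    failures = []
    for field, (low, high) in RULES.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            failures.append(field)
    return failures


def failure_rate(records):
    """Share of records with at least one failing field: one simple quality indicator."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if out_of_range_fields(r))
    return bad / len(records)
```

Tracking `failure_rate` per load or per data source over time gives a crude but comparable number. It says nothing about values that are in range but still wrong, which is where manual cross-checking comes in.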

    5. The best remedy for these problems:
    - Neither automated nor manual loading is the best remedy
    - Cleaning either load can help (but how do you define cleaning?)
    - Cross checking should always be done by both a manual and an automated process if at all possible

    6. How possible do you think it is to achieve fully automated processes to improve data quality?
    - At some point a user interface is required for any system; nothing is EVER fully automated. Even with a "fully automated process" you will want some sort of manual override in case of problems in either the data or the equipment being used.

    7. Major impediments to being able to improve data quality using automated processes:
    - Expense (the more cross checking, the higher the cost to do it)
    - Time (the more cross checking, the longer it takes to do it)
    - Exceptions (no matter what the rule is, there is always the case of the "exceptions")

    8. Changes automated data loading and cleaning solutions may have on an organisation:
    - It could save time and money, as well as improving customer satisfaction, if done correctly

    9. Other side benefits of automated processing in improving data quality:
    - Less stressful for employees
    - Less stressful for clients / customers
    - Easier to generate and build reports

    10. Yes, I believe the impact of these is felt to be beneficial to an organisation's clients/stakeholders.

    11. Automated solutions are preferable; however, manual interaction must be allowed and encouraged for a truly useful system.
    When it rains, it pours.

  6. #6
    Join Date
    Feb 2008
    Posts
    5
    Thanks very much for your input rockingred.

  7. #7
    Join Date
    Nov 2004
    Location
    on the wrong server
    Posts
    8,835
    Code:
    1) What are the main data quality problems in marketing databases?
    You can only do so much programmatically to validate input, but if the users are going to enter junk, then that is what they should expect.

    Code:
    2) What are the main factors that cause these issues to arise?
    Databases full of junk data happen most often when there is no business owner of the data.

    Code:
    3) What are the main costs of poor data quality?
    bad business decisions and actions.

    Code:
    4) How would you go about measuring data quality? What are the key indicators?
    Beyond validating input in the application code and ensuring referential and domain integrity with constraints, if the data is really that critical, some business person needs to validate the information and hold the people entering junk accountable.
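To make "referential and domain integrity with constraints" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `customer` and `purchase` tables and the helper names `make_db` and `try_insert` are invented for illustration; a production DBMS would express the same idea with its own DDL.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a foreign key and two CHECK constraints."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL CHECK (length(trim(name)) > 0)
        );
        CREATE TABLE purchase (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            amount      REAL NOT NULL CHECK (amount > 0)
        );
    """)
    return conn


def try_insert(conn, sql, params):
    """Run an INSERT; return False if a constraint rejects the row."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

Here a purchase pointing at a non-existent customer is rejected by the foreign key (referential integrity), and a negative amount or a blank name is rejected by the `CHECK` constraints (domain integrity). None of this can tell whether a well-formed row is actually true, which is why the business-side validation mentioned above is still needed.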

    Code:
    5) Do you believe that automated or manual loading and/or cleaning can best remedy these problems?
    You need both usually.

    Code:
    6) How possible do you think it is to achieve fully automated processes to improve data quality?
    It's impossible. I can set up a datatype of varchar on a name field and even prevent the entering of numbers and special characters but that does not stop people from entering Mickey Mouse.
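The "Mickey Mouse" point can be shown in a few lines. This is only a sketch: `NAME_PATTERN` and `looks_like_name` are made-up names, and the rule itself (ASCII letters separated by single spaces, hyphens or apostrophes) is a deliberately simple assumption that would also wrongly reject real names containing accented characters.

```python
import re

# Illustrative rule only: ASCII letters, optionally joined by a space, hyphen or apostrophe.
NAME_PATTERN = re.compile(r"[A-Za-z]+(?:[ '\-][A-Za-z]+)*")


def looks_like_name(value):
    """Return True if the value is syntactically plausible as a name."""
    return NAME_PATTERN.fullmatch(value.strip()) is not None
```

`looks_like_name("J0hn Sm1th")` and `looks_like_name("1234")` are rejected, but `looks_like_name("Mickey Mouse")` passes, because the check only sees the shape of the value and not whether it is real.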

    Code:
    7) Are there any major impediments to being able to improve data quality using automated processes?
    see # 6.

    Code:
    8) What change do you think automated data loading and cleaning solutions may have on an organisation?
    if you went fully automated, which is impossible, you would end up with a junky database.

    Code:
    9) Are there any other side benefits of automated processing in improving data quality?
    Code:
    10) Do you think the impact of these is felt to be beneficial to an organisation's clients/stakeholders?
    Code:
    11) Therefore, are manual or automated solutions preferable?
    you need both.
    “If one brings so much courage to this world the world has to kill them or break them, so of course it kills them. The world breaks every one and afterward many are strong at the broken places. But those that will not break it kills. It kills the very good and the very gentle and the very brave impartially. If you are none of these you can be sure it will kill you too but there will be no special hurry.” Ernest Hemingway, A Farewell To Arms.

  8. #8
    Join Date
    Feb 2008
    Posts
    5
    Thanks Thrasymachus.
