  #1 (permalink)  
Old 02-22-08, 04:24
dasiddall dasiddall is offline
Registered User
 
Join Date: Feb 2008
Posts: 5
Research into data quality in large databases

Hi,

I am currently conducting a research project for a Master's programme, looking at how data quality issues affect large (2 million+ rows) marketing databases and CRM systems, and investigating how automated solutions can improve data quality.

As you are practitioners in the field, I was hoping some of you might be able to spare some time to discuss the issues, or to fill in a questionnaire. I can forward you full details of my research if you wish.

This work will produce actionable recommendations about how to effectively deploy automated processing solutions, particularly in data loading and data cleaning. The research in this area is currently very sparse and therefore you would be contributing a great deal to the body of knowledge in this area. If you can help, I will cite you as a source if you like, and your name might forever be in lights on the references pages of an ACM journal or similar!

Please could you contact me at mailto:0313223 AT chester DOT ac DOT uk, pm me or reply on here if you are interested in contributing?

Thanks a lot.

Dave

Last edited by dasiddall; 02-22-08 at 04:43.
  #2 (permalink)  
Old 02-22-08, 04:42
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
dasiddall - unless you want loads of spam I would obfuscate your email address.

Sounds interesting - please could I request that people respond to the thread, rather than email or pm. This way you gain from the exposure on the forum and the forum gains by keeping much of your source material here for others to reference. It would also be jolly nice if you could somehow link to your eventual article, or at least paste in the recommendations, upon completion.
  #3 (permalink)  
Old 02-22-08, 04:44
dasiddall dasiddall is offline
Registered User
 
Join Date: Feb 2008
Posts: 5
Quote:
Originally Posted by pootle flump
dasiddall - unless you want loads of spam I would obfuscate your email address.
Thanks.
  #4 (permalink)  
Old 02-22-08, 05:07
dasiddall dasiddall is offline
Registered User
 
Join Date: Feb 2008
Posts: 5
Quote:
Originally Posted by pootle flump
Sounds interesting - please could I request that people respond to the thread, rather than email or pm. This way you gain from the exposure on the forum and the forum gains by keeping much of your source material here for others to reference. It would also be jolly nice if you could somehow link to your eventual article, or at least paste in the recommendations, upon completion.
Thanks for looking kindly on the thread, especially if you are a mod!

Anyway, I will happily post further details here. After collating the ideas of people like you about these things, I will be able to produce a framework for an automated solution.

Background/Purpose Of The Study
Poor data quality costs enterprises money: by making business processes less efficient, by increasing the cost of maintaining contact with their customer base, and through the loss of customers due to poor customer service. The purpose of this research is to help demonstrate that automation remains worthwhile, bringing many perceptible benefits to an enterprise, particularly in terms of the quality of service it provides.

There has been much investigation into the benefits of automation in other areas of computing, such as automated testing in software development. As data loading and cleaning involve similarly programmable, recurring processes, and automation programmes are already being initiated on this basis, I believe it is now worth fully investigating whether the benefits of automation can be replicated in this area. Research into the benefits of automation in data loading and cleaning specifically remains meagre.

Thanks for taking the time to have a look over my research. If you do respond, you are of course free to withdraw at any time, and without giving a reason. The results will be written up and a copy will be made available in the university library. I will also, following pootle flump's request, link to the paper if and when it is published. If you would like a copy please don't hesitate to contact me. I would like to make clear that all results will be made anonymous. If you would rather I cite you as a source, please say so and I would be happy to do so.

The questions I am looking to discuss are as follows.

1) What are the main data quality problems in marketing databases?
2) What are the main factors that cause these issues to arise?
3) What are the main costs of poor data quality?
4) How would you go about measuring data quality? What are the key indicators?
5) Do you believe that automated or manual loading and/or cleaning can best remedy these problems?
6) How possible do you think it is to achieve fully automated processes to improve data quality?
7) Are there any major impediments to being able to improve data quality using automated processes?
8) What change do you think automated data loading and cleaning solutions may have on an organisation?
9) Are there any other side benefits of automated processing in improving data quality?
10) Do you think the impact of these is felt to be beneficial to an organisation's clients/stakeholders?
11) Therefore, are manual or automated solutions preferable?

Many thanks.
  #5 (permalink)  
Old 02-22-08, 10:49
rockingred rockingred is offline
Registered User
 
Join Date: Aug 2003
Location: Toronto, Ontario, Canada
Posts: 203
I haven't actually worked with "marketing" databases, though many of the database structures I have worked with have done some marketing as well. I'll try to answer at least some of your questions.

1. The main data quality problems:
- Transposition of numbers (reversed key strokes or entering a date in an amount field and the amount in a date field)
- Mis-spelling or inconsistent spelling of names (ON, Ont, ONT., ONTARIO, Onatrio, etc.)
- Skipping of necessary fields (even if you make a field required people will enter nonsense in it to get past it)
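
To illustrate the inconsistent-spelling problem above, here is a minimal Python sketch that maps the variants listed (ON, Ont, ONT., ONTARIO, Onatrio) to one canonical value. The lookup table, the function name `normalise_province` and the 0.8 similarity cutoff are assumptions made for the example, not something from this thread; anything the function cannot resolve comes back as `None` so that a person can look at it.

```python
import difflib

# Hypothetical lookup of known variants; a real list would come from a
# reference table maintained by whoever owns the data.
KNOWN_VARIANTS = {
    "on": "Ontario",
    "ont": "Ontario",
    "ontario": "Ontario",
}
CANONICAL = ["Ontario"]


def normalise_province(raw, cutoff=0.8):
    """Return the canonical province name, or None if it needs manual review."""
    key = raw.strip().rstrip(".").lower()
    if key in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[key]
    # Fall back to fuzzy matching to catch transposed letters ("Onatrio").
    matches = difflib.get_close_matches(key.title(), CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for value in ["ON", "Ont", "ONT.", "ONTARIO", "Onatrio", "Quebec"]:
    print(value, "->", normalise_province(value))
```

The cutoff is a trade-off: lowering it catches more misspellings but also raises the risk of silently "correcting" a value that was actually something else.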

2. The main factors that cause these issues to arise:
- Lack of training
- Lack of time (a person in a hurry makes more mistakes)
- Lack of data checks / cross checks / validation

3. The main costs of poor data quality:
- Lost business
- Reshipping (if incorrect product is sent or incorrect address was used)
- Incorrect billing (refunds)

4. How to go about measuring data quality. The key indicators are:
- Amounts that are outside an acceptable range
- Dates that are outside an acceptable range
- Quantities that are outside an acceptable range
- Returned mail or Postal / Zip codes that do not match the area being shipped / billed to
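
The range-based indicators above can be sketched roughly as follows. This is only an illustration: the field names, the limits in `LIMITS` and the two helper functions are hypothetical, and real thresholds would have to come from the business. A value of the wrong type (for example a date keyed into the amount field) is flagged as well.

```python
from datetime import date

# Hypothetical acceptable ranges; real limits would be set by the business.
LIMITS = {
    "amount": (0.01, 10_000.00),
    "quantity": (1, 500),
    "order_date": (date(2000, 1, 1), date(2008, 12, 31)),
}


def out_of_range_fields(record):
    """Return the names of fields that are missing, mistyped or out of range."""
    flagged = []
    for field, (low, high) in LIMITS.items():
        value = record.get(field)
        try:
            ok = value is not None and low <= value <= high
        except TypeError:
            # e.g. a date entered in an amount field cannot be compared
            ok = False
        if not ok:
            flagged.append(field)
    return flagged


def failure_rate(records):
    """Share of records with at least one flagged field."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if out_of_range_fields(r))
    return bad / len(records)
```

Tracking `failure_rate` per load or per data source over time gives one simple indicator of whether quality is improving or getting worse.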

5. The best remedy for these problems:
- Neither automated nor manual loading is the best remedy
- Cleaning either load can help (but how do you define cleaning?)
- Cross checking should always be done by both a manual and an automated process if at all possible

6. How possible do you think it is to achieve fully automated processes to improve data quality?
- At some point a user interface is required for any system; nothing is EVER fully automated. Even with a "fully automated process" you will want some sort of manual override in case of problems in either the data or the equipment being used.

7. Major impediments to being able to improve data quality using automated processes:
- Expense (the more cross checking, the higher the cost to do it)
- Time (the more cross checking, the longer it takes to do it)
- Exceptions (no matter what the rule is, there is always the case of the "exceptions")

8. Changes automated data loading and cleaning solutions may have on an organisation:
- It could save time and money, as well as improving customer satisfaction, if done correctly

9. Other side benefits of automated processing in improving data quality:
- Less stressful for employees
- Less stressful for clients / customers
- Easier to generate and build reports

10. Yes, I believe the impact of these is felt to be beneficial to an organisation's clients/stakeholders.

11. Automated solutions are preferable; however, manual interaction must be allowed and encouraged for a truly useful system.
__________________
When it rains, it pours.
  #6 (permalink)  
Old 02-22-08, 12:00
dasiddall dasiddall is offline
Registered User
 
Join Date: Feb 2008
Posts: 5
Thanks very much for your input rockingred.
  #7 (permalink)  
Old 02-22-08, 13:15
Thrasymachus Thrasymachus is offline
SQL Server Street Fighter
 
Join Date: Nov 2004
Location: Down The Rabbit Hole
Posts: 7,979
Code:
1) What are the main data quality problems in marketing databases?
You can only do so much programmatically to validate input, but if the users are going to enter junk, then that is what they should expect.

Code:
2) What are the main factors that cause these issues to arise?
Databases full of junk data happen most often when there is no business owner of the data.

Code:
3) What are the main costs of poor data quality?
bad business decisions and actions.

Code:
4) How would you go about measuring data quality? What are the key indicators?
Beyond validating input in the application code and ensuring referential and domain integrity with constraints, if the data is really that critical, some business person needs to validate the information and hold the people entering junk accountable.
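
As a rough illustration of what domain and referential integrity constraints do and do not catch, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `province` and `customer` tables and the `try_insert` helper are made up for the example; the constraint syntax in another DBMS may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE province (code TEXT PRIMARY KEY);
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    -- domain integrity: the name must not be empty or blank
    name TEXT NOT NULL CHECK (length(trim(name)) > 0),
    -- referential integrity: the code must exist in province
    province_code TEXT NOT NULL REFERENCES province(code)
);
INSERT INTO province (code) VALUES ('ON');
""")


def try_insert(name, province_code):
    """Insert a customer; return False if a constraint rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (name, province_code) VALUES (?, ?)",
                (name, province_code),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("Jane Doe", "ON"))      # valid row
print(try_insert("   ", "ON"))           # blank name, CHECK fails
print(try_insert("Jane Doe", "Ont"))     # unknown province, FK fails
print(try_insert("Mickey Mouse", "ON"))  # well-formed junk
```

The last row satisfies every constraint, which is why someone on the business side still has to review data that really matters.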

Code:
5) Do you believe that automated or manual loading and/or cleaning can best remedy these problems?
You need both usually.

Code:
6) How possible do you think it is to achieve fully automated processes to improve data quality?
It's impossible. I can set up a datatype of varchar on a name field and even prevent the entering of numbers and special characters but that does not stop people from entering Mickey Mouse.
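
A small Python sketch of that limitation, under some assumptions: the pattern below only allows ASCII letters, spaces, hyphens and apostrophes (so it would wrongly reject accented names), and `SUSPECT_NAMES` is a hypothetical list of known placeholder names that can never be complete.

```python
import re

# Letters, spaces, hyphens and apostrophes only: no digits or other symbols.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z '\-]*$")

# Hypothetical list of known placeholder names; it can never be complete.
SUSPECT_NAMES = {"mickey mouse", "test test"}


def check_name(name):
    """Classify a name as 'rejected', 'needs manual review' or 'accepted'."""
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    if not NAME_PATTERN.match(cleaned):
        return "rejected"
    if cleaned.lower() in SUSPECT_NAMES:
        return "needs manual review"
    return "accepted"


for candidate in ["J0hn Sm1th", "Mickey Mouse", "Donald Duck"]:
    print(candidate, "->", check_name(candidate))
```

Format rules reject the first name and the list catches the second, but "Donald Duck" is accepted, so the automated check narrows the problem without closing it.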

Code:
7) Are there any major impediments to being able to improve data quality using automated processes?
see # 6.

Code:
8) What change do you think automated data loading and cleaning solutions may have on an organisation?
If you went fully automated, which is impossible, you would end up with a junky database.

Code:
9) Are there any other side benefits of automated processing in improving data quality?
Code:
10) Do you think the impact of these is felt to be beneficial to an organisation's clients/stakeholders?
Code:
11) Therefore, are manual or automated solutions preferable?
you need both.
__________________
software development is where smart people go to waste their lives
  #8 (permalink)  
Old 02-22-08, 14:17
dasiddall dasiddall is offline
Registered User
 
Join Date: Feb 2008
Posts: 5
Thanks Thrasymachus.