Dear Sirs et Madames,
I have recently been asked to combine an institutes varied and disparate health and demographic data sources into a centralized database which can be (eventually) queried via a query interface. Am trying to design the schema at the moment but would be glad of some advice before I move forward (My background is more as a Java programmer, I have never had to spend a significant amount of time on the database side of things). Scenario:
I have 3 applications which are the sources of data, HRS1, the original Data Collection application,HRS2 an upgrade of the original application (the data model changes somewhat) and the third is completely separate new application (OpenHDS) written in a different language (Java) and using MySQL as the database (the first 2 use VisualFoxPro), and the datamodel once again changes.
An additional complication is that the institute has several Demographic Surveillance sites. Over time, some sites have added columns to certain tables in order for them to capture more information , and as such most tables will have variations across site even when the datacollection application was the same. As an example let me just look at the Individual table from 2 separate sites:
HRS1 (Site A) Individual:
PERM_ID
REGION
OLD_REG
FAMILY_NUM
MEMBER_NUM
NAME
SEX
BIRTH_DATE
RELATIONSHIP_TO_HEAD
MOTHER_ID
FATHER_ID
HUSBAND_ID
ENTRY_TYPE
ENTRY_DATE
IS_PRESENT
EXIT_TYPE
EXIT_DATE
FIELD_WRKR
DATA_CLERK
DATAENTRY_DATE
STATUS_DAT
AGE
HRS1 (Site 2) Individual:
PERM_ID
REGION
FAMILY_NUM
MEMBER_NUM
NAME
SEX
BIRTH_DATE
RELATIONSHIP_TO_HEAD
MOTHER_ID
FATHER_ID
HUSBAND_ID
OCCUPATION
OCCUPATION_OTHER
EDUCATION
ENTRY_TYPE
ENTRY_DATE
EXIT_TYPE
EXIT_DATE
FIELD_WRKR
DATA_CLERK
DATAENTRY_DATE
STATUS_DAT
MARRIED
MARRIED_OTHER
HRS2 (Site 1) Individual:
INDIVIDID
NAME
GENDER
BIRTH_DATE
MOTHERID
FATHERID
FIELD_WRKR
DATA_CLERK
DATAENTRY_DATE
STATUS_DAT
OpenHDS (Site 1 & 2) Individual:
uuid
deleted
insertDate
status
voidDate
voidReason
dob
externalId
firstName
gender
lastName
middleName
pD
partialDate
collectedBy_uuid
insertedBy_uuid
voidedBy_uuid
father_uuid
mother_uuid
mother_uuid
Using this one Entity (Individual) as an example, how may one go about the steps of centralizing this data? The institute insists that no data should be left behind no matter how trivial it may seem. Am using PostgreSQL as the DBMS for the centralized database.
Please assist in any way, or let me know if there is any more information I can provide.
Thanks in advance