Hello,
I am trying to get all duplicates records from a Tab Delimited TXT file using ODBC with the Microsoft Text Driver and schema.ini.
The problem is getting the extra fields with the duplicated fields.
To obtain the duplicates records I did the following query:
SELECT CITYSTD, ZIPCODESTD, DIRECCION1STD, DIRECCION2STD, COUNT(*) AS CT FROM SSNUNIQUES.TXT WHERE VALIDSCORE >= " + limit.ToString
+ " GROUP BY CITYSTD, ZIPCODESTD, DIRECCION1STD, DIRECCION2STD
HAVING Count(*) > 1
The VALIDSCORE column values range from -1 to 100 integers, I used 0 to get all records, since the records with VALIDSCORE = -1 are NULL in the grouping columns.
This query results in aproximatelly 50,000 records out of 400,000.
To clarify, the CITYSTD, ZIPCODESTD, DIRECCION1STD, DIRECCION2STD columns may have a DIRECCION1STD, DIRECCION2STD with NULL values.
Now I am executing the following query in order to obtain each full duplicate record from the same TXT file:
SELECT T1.RECID, T1.ETC, T1.SS, T2.CT AS DUPCT, T1.NOMBRE,
T1.`PRIMER APELLIDO`, T1.`SEGUNDO APELLIDO`, T1.`DIRECCION 1`,
T1.`DIRECCION 2`, T1.`CIUDAD/PUEBLO`, T1.ZIPCODE,
T1.`FECHA DE SUBSCRIPCION`, T1.REFID, T1.VALIDSCORE,
T1.DIRECCION1STD, T1.DIRECCION2STD, T1.CITYSTD, T1.ZIPCODESTD
FROM SSNUNIQUES.TXT AS T1,
(SELECT CITYSTD, ZIPCODESTD, DIRECCION1STD, DIRECCION2STD,
COUNT(*) AS CT FROM SSNUNIQUES.TXT WHERE VALIDSCORE >= " + limit.ToString + " GROUP BY CITYSTD, ZIPCODESTD, DIRECCION1STD, DIRECCION2STD HAVING Count(*) > 1) AS T2
WHERE T2.DIRECCION1STD = T1.DIRECCION1STD
AND T2.DIRECCION2STD = T1.DIRECCION2STD
AND T2.CITYSTD = T1.CITYSTD
AND T2.ZIPCODESTD = T1.ZIPCODESTD
ORDER BY T1.CITYSTD, T1.ZIPCODESTD, T1.DIRECCION1STD,
T1.DIRECCION2STD, T1.ETC, T1.RECID
This results in aproximatelly 8,000 records out of 400,000 by 50,000.
The COUNT(*) value in the first query in average is 2 so I would expect to get 100,000 records, but instead I get only a subset of the records where the VALIDSCORE equals 0.
I tryed removing the ORDER BY clause so to reduce Microsof Text Driver memory problems, but still get the same record set.
So I have two questions:
1. Is the query well written to obtain what I intend to get, what should it be?
2. Is there any system setting that I could make to the
VB.NET connection, command codes or to the computer registry?
Cordially,
Jose