Subject
I have been asked to look at a rather peculiar (web page) tracking system which has two parts. One tracks people by IP address as they view pages (Visitors), and another which tracks people by IP address as they view a totally different set of pages (Clients).
My job is to match the two together - by IP
Given that my external code often extracts more than one IP address from an individual hit, I have constructed an Addresses table which stores uses a CRC32() of the IP Address (a comma separated list in fact) which also permits faster insertion:
Incoming data is :
* $ip_list -- a comma separated list of IPv4 addresses
* $user_id -- a simple integer value
$ip_crc = crc32( $ip_list );
INSERT INTO psAddress ( ipID, ipString, ipHits, ipLast )
VALUES ( $ip_crc, $ip_list, 1, UNIX_TIMESTAMP() )
ON DUPLICATE KEY UPDATE
ipHits = ipHits + 1, ipLast = VALUES(ipLast);
$id = generate_new_unique_log_id( );
INSERT INTO psVisitors ( logID, logUser, logIP, logTime )
VALUES ( $id, $user_id, $ip_crc, UNIX_TIMESTAMP() );
An identical system is in place for tracking clients hits to the secondary table psClients.
The quantity of data coming into the two main tables is reasonable:
psVisitors : ~2,000 .. ~10,000 rows per hour
psClients : ~200 .. ~800 rows per hour
Now to the actual meat of the matter...
Clients log in and need to know which clients "visited" the remote site.
So assuming we have a $clientUser
#1 Retrieve a list of ip addresses from hits relevant to that client:
SELECT GROUP_CONCAT(DISTINCT ipString) AS ip_list
FROM psVisitors
LEFT JOIN psAddresses ON ipID = logIP
WHERE (logUser = $clientUser) AND (logTime {in range})
#2 Match those IP's ($ip_list) against the clients:
SELECT DISTINCT logUser
FROM psClients
LEFT JOIN psAddresses ON ipID = logIP
WHERE MATCH_IP( $ip_list, ipString) >= 28 AND (logTime {in range})
The MATCH_IP function was written in C and installed as a UDF. (Source available if required)
It takes two comma separated lists and returns the highest number of matching bits (counting from left to right) in any IP between the two lists.
My trouble is that this is *way* too slow.
Using direct IP matching doesn't work well, as we know that certain ISPs provide rotating IPs allocated from a large pool, so every page hit results in a different IP.
There's also the issue of people using proxy servers to access the web. In some cases, I correctly pick up their IP address, but it is often detected as part of a list of IP addresses, again any component of which can (and does) change.
I'm beginning to think that structurally, I may have made some mistakes, but am at a loss as to how to improve this system.
Any suggestions?
(Table Definitions)
CREATE TABLE psAddresses
(
ipID INT(10) UNSIGNED NOT NULL,
ipString VARCHAR(250) NOT NULL,
ipHits INT(10) UNSIGNED NOT NULL,
ipLast INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (ipID),
UNIQUE KEY ipString (ipString)
)
ENGINE=MyISAM;;
CREATE TABLE psVisitors
(
logID INT(10) UNSIGNED NOT NULL,
logUser INT(10) UNSIGNED NOT NULL,
logIP INT(10) UNSIGNED NOT NULL,
logTime INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (logID)
)
ENGINE=MyISAM;
CREATE TABLE psClients
(
logID INT(10) UNSIGNED NOT NULL,
logUser INT(10) UNSIGNED NOT NULL,
logIP INT(10) UNSIGNED NOT NULL,
logTime INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (logID)
)
ENGINE=MyISAM;