Results 1 to 3 of 3
  1. #1
    Join Date
    Nov 2008
    Posts
    2

    Question Unanswered: A little restructuring needed?

    Subject

    I have been asked to look at a rather peculiar (web page) tracking system which has two parts. One tracks people by IP address as they view pages (Visitors), and another which tracks people by IP address as they view a totally different set of pages (Clients).

    My job is to match the two together - by IP

    Given that my external code often extracts more than one IP address from an individual hit, I have constructed an Addresses table which stores uses a CRC32() of the IP Address (a comma separated list in fact) which also permits faster insertion:

    Incoming data is :
    * $ip_list -- a comma separated list of IPv4 addresses
    * $user_id -- a simple integer value

    $ip_crc = crc32( $ip_list );

    INSERT INTO psAddress ( ipID, ipString, ipHits, ipLast )
    VALUES ( $ip_crc, $ip_list, 1, UNIX_TIMESTAMP() )
    ON DUPLICATE KEY UPDATE
    ipHits = ipHits + 1, ipLast = VALUES(ipLast);

    $id = generate_new_unique_log_id( );

    INSERT INTO psVisitors ( logID, logUser, logIP, logTime )
    VALUES ( $id, $user_id, $ip_crc, UNIX_TIMESTAMP() );

    An identical system is in place for tracking clients hits to the secondary table psClients.

    The quantity of data coming into the two main tables is reasonable:

    psVisitors : ~2,000 .. ~10,000 rows per hour
    psClients : ~200 .. ~800 rows per hour

    Now to the actual meat of the matter...

    Clients log in and need to know which clients "visited" the remote site.

    So assuming we have a $clientUser

    #1 Retrieve a list of ip addresses from hits relevant to that client:

    SELECT GROUP_CONCAT(DISTINCT ipString) AS ip_list
    FROM psVisitors
    LEFT JOIN psAddresses ON ipID = logIP
    WHERE (logUser = $clientUser) AND (logTime {in range})

    #2 Match those IP's ($ip_list) against the clients:

    SELECT DISTINCT logUser
    FROM psClients
    LEFT JOIN psAddresses ON ipID = logIP
    WHERE MATCH_IP( $ip_list, ipString) >= 28 AND (logTime {in range})

    The MATCH_IP function was written in C and installed as a UDF. (Source available if required)

    It takes two comma separated lists and returns the highest number of matching bits (counting from left to right) in any IP between the two lists.

    My trouble is that this is *way* too slow.

    Using direct IP matching doesn't work well, as we know that certain ISPs provide rotating IPs allocated from a large pool, so every page hit results in a different IP.

    There's also the issue of people using proxy servers to access the web. In some cases, I correctly pick up their IP address, but it is often detected as part of a list of IP addresses, again any component of which can (and does) change.

    I'm beginning to think that structurally, I may have made some mistakes, but am at a loss as to how to improve this system.

    Any suggestions?


    (Table Definitions)

    CREATE TABLE psAddresses
    (
    ipID INT(10) UNSIGNED NOT NULL,
    ipString VARCHAR(250) NOT NULL,
    ipHits INT(10) UNSIGNED NOT NULL,
    ipLast INT(10) UNSIGNED NOT NULL,

    PRIMARY KEY (ipID),
    UNIQUE KEY ipString (ipString)
    )
    ENGINE=MyISAM;;

    CREATE TABLE psVisitors
    (
    logID INT(10) UNSIGNED NOT NULL,
    logUser INT(10) UNSIGNED NOT NULL,
    logIP INT(10) UNSIGNED NOT NULL,
    logTime INT(10) UNSIGNED NOT NULL,

    PRIMARY KEY (logID)
    )
    ENGINE=MyISAM;

    CREATE TABLE psClients
    (
    logID INT(10) UNSIGNED NOT NULL,
    logUser INT(10) UNSIGNED NOT NULL,
    logIP INT(10) UNSIGNED NOT NULL,
    logTime INT(10) UNSIGNED NOT NULL,

    PRIMARY KEY (logID)
    )
    ENGINE=MyISAM;

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    start by redesigning your tables so that you never store a comma-separated list of values inside a single column

    do a search for first normal form

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Nov 2008
    Posts
    2
    pfft - of course.

    My apologies for missing the bleeding obvious.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •