  1. #1
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697

    Obfuscating Personal Information

    I'm in the army and I'm working on a database for company level... stuff. Admin information. This is partly a call for advice from anyone who has experience working with these issues, and partly me writing out my thoughts for schitzengiggles. (And it got a little long... mea culpa.)

    We store all kinds of personal information on soldiers and no one has really thought through the problem of how you secure the data and keep it readily available to the people who need it. Companies usually track everything in Excel, often by printing out a list and tediously keying in the changes. Multiple copies of information on shared drives are the norm, and most of the IT guys don't know what an ACL is. (You heard the joke about military intelligence? That goes double for S-6 Automations.)

    My app is based on a typical DBMS with a fairly typical schema. Some of the schema has information that needs to be secured, and there are three ways to go about this, as I see it:

    1. Completely lock down the system. This tends to encourage user behavior like sharing passwords and passwords-on-stickies that defeat the original intent.

    2. Two separate schemas. This isn't as complicated as it sounds... most databases will let you link to tables on the fly, effectively giving you the ability to switch schemas on the fly. And the difference would only be one or two tables.

    3. One schema, but use obfuscated data when the privileged data isn't available. This also relies on linking in tables on the fly. It has the advantage that most reports and such will still work.

    As you probably guessed by the thread title, I'm going with 3. The main concerns are usability and making sure that someone writing ad hoc queries or reports doesn't have to care whether the system is in privileged mode. There's also the matter of what kind of attacker I'm defending against. This is supposed to be a small system storing data on 100 or maybe 1000 people. My understanding is that people will grab databases of personal information and sell them on the black market, and that's the type of attack I'm looking at. I don't need to completely prevent the theft, just make sure that what the attacker can get is pretty much worthless. A stalker is far more likely to use HUMINT.
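    To make option 3 concrete, here is a minimal Python sketch using the standard `sqlite3` module. It only illustrates the "same report query in both modes" idea; it is not the actual app, and it is not a security boundary by itself. The table names `soldier_real` and `soldier_masked`, the view name `soldier`, the helper `link_soldier_table`, and the sample row are all hypothetical.

```python
import sqlite3


def link_soldier_table(conn: sqlite3.Connection, privileged: bool) -> None:
    """Point the `soldier` view at the real or the masked table."""
    # The source name comes from a fixed pair, never from user input.
    source = "soldier_real" if privileged else "soldier_masked"
    conn.execute("DROP VIEW IF EXISTS soldier")
    conn.execute(f"CREATE VIEW soldier AS SELECT ako_id, ssn FROM {source}")


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with one made-up record in each table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE soldier_real (ako_id TEXT PRIMARY KEY, ssn TEXT)")
    conn.execute("CREATE TABLE soldier_masked (ako_id TEXT PRIMARY KEY, ssn TEXT)")
    conn.execute("INSERT INTO soldier_real VALUES ('jane.doe', '123-45-6789')")
    conn.execute("INSERT INTO soldier_masked VALUES ('jane.doe', 'XXX-XX-6789')")
    return conn


# The report never changes; only the view it reads from does.
REPORT = "SELECT ako_id, ssn FROM soldier ORDER BY ako_id"

conn = build_demo()
link_soldier_table(conn, privileged=False)
print(conn.execute(REPORT).fetchall())  # [('jane.doe', 'XXX-XX-6789')]
link_soldier_table(conn, privileged=True)
print(conn.execute(REPORT).fetchall())  # [('jane.doe', '123-45-6789')]
```

    In a real deployment the privileged table would presumably live somewhere the unprivileged user cannot reach at all; keeping both tables in one file, as this demo does, would defeat the purpose.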

    So next there's the question of what data needs to be obfuscated and how.

    For each type of data, I want to design a function y=f(x) such that:

    1. Many possible x's map to the same y.
    2. The value y is stable, that is, for a given x there is one and only one possible y. y can not depend on any side effects or data other than x.
    3. When possible, textual representations of y are obviously not valid values of x. That is, a competent observer will immediately be able to distinguish between a form that has correct data and one with obfuscated data.
    4. f(x) is easy to compute and the algorithm is more or less explainable in plain English.

    One thing f(x) explicitly doesn't have to do, since it directly contradicts requirement 1, is produce a value that is useful as a candidate key.
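    Requirements 1 through 3 can be spot-checked mechanically against sample data. The following is a rough Python sketch, not a proof: `check_obfuscator` and the toy masking function are hypothetical names, and the check is only as good as the samples you feed it.

```python
from typing import Callable, Dict, Iterable


def check_obfuscator(
    f: Callable[[str], str],
    samples: Iterable[str],
    is_valid_input: Callable[[str], bool],
) -> Dict[str, bool]:
    """Spot-check a candidate f(x) against requirements 1-3 on sample data."""
    samples = list(samples)
    outputs = [f(x) for x in samples]
    return {
        # Requirement 1: at least two distinct inputs collapse to one output.
        "many_to_one": len(set(outputs)) < len(set(samples)),
        # Requirement 2: calling f again gives the same answers.
        "stable": outputs == [f(x) for x in samples],
        # Requirement 3: no output could be mistaken for real data.
        "visibly_obfuscated": not any(is_valid_input(y) for y in outputs),
    }


# Toy example: mask the first three characters of a seven-digit string.
result = check_obfuscator(
    f=lambda s: "XXX" + s[3:],
    samples=["1234567", "9994567", "5550000"],
    is_valid_input=str.isdigit,
)
print(result)
```

    Requirement 4 (explainable in plain English) is a judgment call and is not something a script can check.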

    Here are some major personal details I'm hiding:

    Social Security Number. The biggest issue was finding another primary key. Since everyone in the Army or working for the Army has an Army Knowledge Online email account, their AKO id is a good substitute. It's human readable and public knowledge. (There's a white pages feature on the AKO website.) Currently I'm obfuscating the SSN by replacing the first 5 digits with 0's or, when possible, X's.
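    The SSN masking described above might look like this in Python. This is a sketch under assumptions: the function name `obfuscate_ssn`, the `mask` parameter, and the dashed output format are mine, and the sample numbers are made up.

```python
import re


def obfuscate_ssn(ssn: str, mask: str = "X") -> str:
    """Replace the first five digits of an SSN with a mask character."""
    digits = re.sub(r"\D", "", ssn)  # tolerate dashes or spaces in the input
    if len(digits) != 9:
        raise ValueError("expected nine digits")
    return f"{mask * 3}-{mask * 2}-{digits[5:]}"


print(obfuscate_ssn("123-45-6789"))          # XXX-XX-6789
print(obfuscate_ssn("123456789", mask="0"))  # 000-00-6789
```

    The X form is preferable where the column type allows it, since 000-00-6789 could still be mistaken for data by someone not paying attention.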

    Date of birth. This one is hard to do without since PT tests are graded partly on how old you are.

    This is the algorithm I'm considering: Convert the date to a Julian day number, that is, to an integer value x. Take x - (x mod 10). Convert that back to a Gregorian date. Replace the last digit in the day with an X.
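    A minimal Python sketch of these steps, for illustration only. Python's `date.toordinal()` counts days from 0001-01-01, so the sketch adds a constant to get the conventional integer day count; the names `obfuscate_dob`, `JDN_OFFSET`, and `bucket`, and the ISO-style output format, are assumptions of mine.

```python
from datetime import date

# Integer Julian day number = proleptic Gregorian ordinal + this constant
# (2000-01-01 has ordinal 730120 and Julian day number 2451545).
JDN_OFFSET = 1721425


def obfuscate_dob(dob: date, bucket: int = 10) -> str:
    """Floor the day count to a multiple of `bucket`, then mask the last day digit."""
    x = dob.toordinal() + JDN_OFFSET          # date -> integer x
    floored = x - (x % bucket)                # x - (x mod 10)
    d = date.fromordinal(floored - JDN_OFFSET)  # back to a Gregorian date
    day = f"{d.day:02d}"
    return f"{d.year:04d}-{d.month:02d}-{day[0]}X"


print(obfuscate_dob(date(2000, 1, 1)))  # 1999-12-2X
print(obfuscate_dob(date(2000, 1, 6)))  # 2000-01-0X
```

    Note that the ten-day buckets follow the running day count, not the calendar, so a bucket can straddle a month or year boundary, as the first example shows. That matters if the obfuscated value feeds an age calculation.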

    The x - (x mod 10) essentially throws out the least significant digit; thus the 10th through the 19th might all be represented as the 10th. Obviously I could throw out more data, but it becomes an issue of how I represent

    There's other information, like driver's license number, that I wouldn't even bother with if our post didn't insist on it. We're expected to fill out a form that has, among other things, VIN and driver's license number. Some people misinterpret the field that says "verification of insurance" to mean that they have to put down their policy number. This is then handled by over a dozen different people, as well as copies being filed in multiple places because it's such a pain gathering all the data just to go out of town for a four day weekend. Given that the military police already store this information, I'm considering sending a request up to HQ to ask that the forms be altered to remove that info.

    We store "next of kin" information... it's important that this be partly accessible as the recent fires in California demonstrated when I had to compile a list of soldiers with family there. (In my troop, there weren't any, thankfully.) My function for addresses is to reduce them to city and state.
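    As a sketch of the address function, assuming addresses are stored as separate fields: the `Address` type, the "(hidden)" placeholder, the helper names, and the sample records below are all hypothetical. The point is that a "who has family in this state" query gives the same answer on obfuscated data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    state: str
    postal_code: str


def obfuscate_address(addr: Address) -> Address:
    """Keep city and state; replace everything else with an obvious placeholder."""
    return Address("(hidden)", addr.city, addr.state, "(hidden)")


def kin_in_state(kin: Dict[str, Address], state: str) -> List[str]:
    """Return the AKO ids whose next of kin live in the given state."""
    return sorted(ako for ako, addr in kin.items() if addr.state == state)


kin = {
    "jane.doe": Address("1 Main St", "San Diego", "CA", "00000"),
    "john.roe": Address("2 Oak Ave", "Columbus", "OH", "00000"),
}
masked = {ako: obfuscate_address(addr) for ako, addr in kin.items()}

print(kin_in_state(masked, "CA"))  # ['jane.doe']
```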

    I store personal emails for when soldiers get out, in case we have a late award to send them. Most people never check their AKO once they get out, and oftentimes they move around a bit before settling down. I obfuscate that by replacing it with email@hidden. (I have a constraint on the email field that requires that emails be validated according to the RFC.)

    So far all this information is relatively simple to obfuscate. But this approach doesn't provide real security and real compartmentalization of data, which you need for any kind of really private information like disciplinary records. I'd like to add features like counseling (army counseling combines performance review, career development and the most basic level of disciplinary action) because it's highly desirable that junior NCOs be able to easily share their notes on subordinates with their bosses. Platoon sergeants generally like to try to solve problems at the lowest level, so they'd be reluctant to use a system for counseling that allowed the first sergeant or commander to look over their shoulders at their day to day notes. I've got ideas for a system that could handle that, but I'll save those for a later post.

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592

    Quote Originally Posted by sco08y
    For each type of data, I want to design a function y=f(x) such that:

    1. Many possible x's map to the same y.
    2. The value y is stable, that is, for a given x there is one and only one possible y. y can not depend on any side effects or data other than x.
    3. When possible, textual representations of y are obviously not valid values of x. That is, a competent observer will immediately be able to distinguish between a form that has correct data and one with obfuscated data.
    4. f(x) is easy to compute and the algorithm is more or less explainable in plain English.

    I believe this function satisfies all four requirements:
    Code:
    y="Obfuscated"
    ...so I'm not sure why you are over-engineering this. Besides, doesn't the military have volumes of standards and regulations regarding such things?
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com
