I am trying to add structure to the equipment data kept by the Defense Logistics Agency. Currently, they use an index called the Line Item Number, which is (hopefully) unique to each item of equipment DoD owns. The piece of equipment is described in one field in a marvelously undisciplined manner. (How many ways are there to abbreviate Trailer???)
I intend to add about five fields, such as Category, Class, Type, Model, and Version. The last two are specific to the item on that line. The first three are more for the purpose of future analysis - slicing and dicing the data - as well as for clarity and consistency.
My question involves the number of distinct Categories, Classes, and Types I should shoot for. Are there any rules of thumb for categorization (or decomposing if that's a more proper term)?
For example, if there are 10,000 separate line item numbers, and I am using five "organizing" fields, should I shoot for the fifth root of 10,000 at each level (about 6 Categories, about 40 Classes...)? Or, since I am reserving the last two fields strictly for description, rather than for organizing the data, should it be the cube root of 10,000 (~20 Categories)?
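The arithmetic behind that rule of thumb is easy to check. What follows is only a minimal sketch, assuming the items spread evenly over every level (real equipment data almost certainly will not); the function name `branching_factor` is made up for illustration.

```python
# Rough branching-factor arithmetic: if n_items are spread evenly over
# n_levels organizing fields, each level needs about n_items ** (1 / n_levels)
# distinct values under its parent. The even spread is an assumption.
def branching_factor(n_items: int, n_levels: int) -> float:
    return n_items ** (1.0 / n_levels)


n_items = 10_000
for levels in (3, 5):
    b = branching_factor(n_items, levels)
    # Cumulative count of distinct labels at each depth (Categories, Classes, ...)
    cumulative = [round(b ** depth) for depth in range(1, levels + 1)]
    print(f"{levels} levels: ~{b:.1f} values per level, cumulative {cumulative}")
```

With three organizing levels this comes out near 20 Categories; with five it is closer to 6, so the choice of how many fields actually organize the data changes the target noticeably.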
First, is categorization the right word for this attempt to add structure to unstructured data? Second, are there any rules of thumb for this sort of thing?
Is this going to be a Bill of Materials Process (BOMP)?
If the item were an airplane, would you track the "tail number" or all of the subsystems that make up the weapons system individually? I guess I'm asking what level of detail you want to track. Also, at some level many of the same subassemblies are no longer limited to one particular component - they become usable on many.
I think my requirement is to apply a consistent categorization scheme to unstructured data, in order to allow effective analysis. For example, Line item number T61494 is described: TRUCK UTILITY: CARGO/TROOP CARRIER 1-1/4 TON 4X4 W/E (HMMWV).
Given a limit of ten Categories, I might categorize this as a Vehicle, class it as a Truck, Type it as a HMMWV, give it a model name M998 (and I don't have or seemingly need a version). Given more Categories, I might set the Category to Truck.
I am not trying to restrict myself too tightly with respect to the number of Categories (for instance, in a drop-down list), but obviously 10,000 Categories is useless. That's why I was looking for (loose) rules of thumb.
At the moment, not a BOMP, but it is very possible someone else might run that way with this concept. Right now I'm using it to aggregate results from a combat simulation. Queries might seek the fuel used by all the vehicles, or all the trucks, or just the HMMWVs.
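Roll-up queries like those can be sketched with flat fields. This is only an illustration: T61494 and its Vehicle/Truck/HMMWV labels come from the description above, while the other line item numbers, their labels, and every fuel figure are invented placeholders, and the `fuel_by` helper is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows. Only T61494 and its labels come from the thread;
# the X0000n LINs, their labels, and all fuel figures are placeholders.
ITEMS = [
    {"lin": "T61494", "category": "Vehicle", "class": "Truck", "type": "HMMWV", "fuel": 120.0},
    {"lin": "X00001", "category": "Vehicle", "class": "Truck", "type": "Cargo 5-Ton", "fuel": 300.0},
    {"lin": "X00002", "category": "Vehicle", "class": "Trailer", "type": "Cargo Trailer", "fuel": 0.0},
    {"lin": "X00003", "category": "Vehicle", "class": "Tank", "type": "Main Battle Tank", "fuel": 900.0},
    {"lin": "X00004", "category": "Generator", "class": "Diesel", "type": "10 kW", "fuel": 40.0},
]


def fuel_by(level, rows=ITEMS):
    """Sum the fuel figure over every row, grouped by one organizing field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[level]] += row["fuel"]
    return dict(totals)


print(fuel_by("category")["Vehicle"])  # all the vehicles
print(fuel_by("class")["Truck"])       # all the trucks
print(fuel_by("type")["HMMWV"])        # just the HMMWVs
```

Each organizing field gives one level of aggregation, which is why the number of distinct values per field matters more for analysis than for data entry.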
My simulation doesn't get down to individual vehicles, but a sister organization does. They would want to expand the Table of Organization and Equipment (TOE) data I am using, but that is a separate project. Given that someone goes to the effort to detail individual bumper numbers, the BOMP or other inventory control operations become possible.
It does seem like you are describing a Bill Of Materials type of scenario, in which case I would steer you away from a fixed set of arbitrary categories, and instead recommend a model that does not limit the number of nesting levels. The adjacency list model, in which each row stores a reference to its parent, is the most popular, easiest to understand, and best for datasets that are modified frequently. If your data is going to be very static and very large, then you might consider a nested set model.
In short, from what you describe, I don't think you are going to be happy down the road with whatever arbitrary category counts you come up with. It's not going to model your requirements, and so eventually you'll run into requirements you can't support.
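To make the adjacency idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `node`, its columns, and the `descendants` helper are assumptions for illustration, and the small hierarchy simply reuses the Vehicle / Truck / HMMWV / M998 example from earlier in the thread.

```python
import sqlite3

# Adjacency list: every row points at its parent, so depth is unlimited.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node(id),"
    " name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)",
    [
        (1, None, "Vehicle"),  # root has no parent
        (2, 1, "Truck"),
        (3, 2, "HMMWV"),
        (4, 3, "M998"),
        (5, 1, "Trailer"),
    ],
)


def descendants(root_name):
    """Return (name, depth) for the named node and everything beneath it."""
    return conn.execute(
        """
        WITH RECURSIVE subtree(id, name, depth) AS (
            SELECT id, name, 0 FROM node WHERE name = ?
            UNION ALL
            SELECT n.id, n.name, s.depth + 1
            FROM node AS n JOIN subtree AS s ON n.parent_id = s.id
        )
        SELECT name, depth FROM subtree ORDER BY depth, name
        """,
        (root_name,),
    ).fetchall()


print(descendants("Truck"))
```

Adding a new level later (say, a subassembly under M998) is just another row with a `parent_id`, with no schema change, which is the flexibility a fixed set of category columns lacks.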
If it's not practically useful, then it's practically useless.
Time for some research on the models you mentioned. The data set is quite static, but I don't think that 10,000 items qualifies as "very large". I'll check the adjacency model first, as flexibility seems more important at the moment.