Thread: Data structure and Design help
04-20-10, 07:50 #1Registered User
- Join Date
- Apr 2010
I'm going to keep this simple and not go into deep detail; I'll stick to the data I wish to store and how I want to access it.
For this specific dataset I am storing a very long number string, which I will call the key.
the string is something like this
Each piece of the key is important to the project, meaning each segment of it holds specific importance. A segment can only be one of 3 different integers: 0, 1 or 2.
The data stored against the key is simply an observation,
and there can only be 3 different observations:
" 0 " , " 1 " , " -1 "
To access the keys quickly I simply MD5 them, so access times are very fast.
But here is the problem.
The scope of this is that I need to track each segment of the key.
so basically for example purposes
Now let's say I want to check the score of that key on segment "3" (meaning the third integer of the key listed) where the segment = "2", and return the number of observations and what happened (" 0 ", " 1 " or " -1 ") across the entire system.
What this specifically means is that I want to search every single key for a third segment with a value of "2" and return all of the observations.
Now here is the big problem.
My keys are 128 chars long,
and I have millions of keys in the database at any one time, across many different spans of time.
Using MD5 or other hashing algorithms presents a problem because it destroys the specific segments of the key. Also, say for example that each key is 100 chars long: each segment of the key has 3 different possibilities, and there are millions of keys generated. That equals a whole bunch of key possibilities. Tracking each segment of each key and storing that in an effective manner is what needs to be accomplished, while still accessing or processing the data in a fast, efficient manner.
Thanks to whoever helps.
Storing the keys and their observations is not a problem, especially when everything is MD5 hashed. Where I run into the problem is finding an effective way to call the data using MySQL via a query, or storing the data in an efficient manner so that I do not have excessive redundant data across the database.
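The trade-off described in this post can be shown in a few lines. This is only a minimal sketch in Python: the two keys are made up, and `md5_hex` is a hypothetical helper name, not something from the original project. It shows that a digest works well for exact-match lookup of a whole key, while the value of an individual segment can only be read from the raw key.

```python
import hashlib

def md5_hex(key: str) -> str:
    """Return the MD5 hex digest of a key string (an exact-match lookup id)."""
    return hashlib.md5(key.encode("ascii")).hexdigest()

# Two made-up 128-char keys that differ only in the third segment.
key_a = "0120" * 32
key_b = key_a[:2] + "1" + key_a[3:]

digest_a = md5_hex(key_a)
digest_b = md5_hex(key_b)

# The digest is fine for "have I seen this exact key before?" ...
seen = {digest_a: 1}               # digest -> observation (hypothetical score)
print(seen.get(md5_hex(key_a)))    # 1

# ... but the third segment is only readable from the raw key.
print(key_a[2], key_b[2])          # 2 1
print(digest_a == digest_b)        # False: one changed segment, unrelated digest
```

Nothing in the digest says which segment changed, so any per-segment question has to be answered from the raw key or from a structure derived from it.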
04-20-10, 08:20 #2vaguely human
- Join Date
- Jun 2007
04-20-10, 11:18 #3Registered User
Provided Answers: 5
- Join Date
- Dec 2007
- Richmond, VA
As Mike says, additional info would help. Just a thought: you might put the different meaningful pieces of this key into separate columns. It's just like a phone number, which a lot of folks make a single column rather than the multiple columns that it should be stored in.
04-20-10, 20:29 #4Registered User
- Join Date
- Apr 2010
OK, what I am creating is an object tracking model, but I am just at the proof-of-concept stage. The hope is to take a number of objects that move through time and space at different rates and determine what the end result is going to be: how did the object move through time and space? Did it travel up or down, or did it have a collision with another object, and what caused this?
Each object is considered an individual in a swarm. The swarm can affect a single individual very easily, but it is hard for one individual to affect a whole swarm. Now let's track the movement on a swarm level: how will Swarm A affect Swarm B? If we think about this logically, an individual from either swarm will have a very hard time affecting either Swarm A or Swarm B, through loss or gain, death or life. But what is noticed is that when Swarm A does the same exact thing, Swarm B will have a specific reaction, be it to stand there and undergo a collision, or to move according to the individuals of either swarm (home swarm or foreign swarm).
But here is where things get tricky. Swarms A and B are each affected by the individuals inside them and by the other swarm, but what finally affects the overall conduct, movement or action is controlled by the environment each swarm is in. Say the area or world has a drastic change: this is going to affect each and every swarm on a drastic level, all the way down to the individual. If enough individuals are affected by the environment, swarms will become affected, and so on. But there is not just one environment; there are a number of them, which together make up a world. So if one environment changes it will not affect the entire world drastically, but it could have an effect on other environments, changing the world.
The point is essentially to track and see, if worlds are changed through environments, how the individuals will change, and whether individual change will dictate the course of the world. Or, if one individual's pattern of behavior changes because of things the individual cannot control, will that individual make a change that affects the entire world? These are just a few of the questions I hope to find the answers to in my models, and they are more expansive than what I am going to go into here.
These are all questions that I hope to model out, and I am only at one small part of this project. The question I am asking today is simply how to store one speck of the data which the scope of this project is going to require. I have already created 3 other smaller test models that take days to compute outcomes. My goal in this rewrite is simply to take out the bottlenecks I was facing with my old storage concept, which did not allow me to store and retrieve data on the mass scale that MySQL or any database engine can handle.
So it's essentially this: I desire to predict the actions and movements of individuals in a specific swarm, but each swarm is affected by the others, by the environment which holds the swarm, and by the world the environments reside in. All of this helps determine the individual's success or peril.
Now this is a very complex task, and how I am going about it is a little different, but I am not going to get into that. Here is some starting data. These numbers are essentially coordinates of one individual's movement; this is the raw data of the predefined tracking system I created as the first part of this project. Storing the raw data is very easy, the queries are super fast, and the organization is perfect (not really, but it's fast enough). Each individual is essentially its own table, named after the individual, with time used as the key. Each individual is also given a key, which is stored in a separate table along with its swarm, environment and world. This allows me to do very quick queries and access large amounts of data really fast. For storing the raw data this is the best way of doing things that I have come up with and tested, because I can rapidly make a list of all members of a swarm, of an environment, and of the world, and get all that data as fast as my computer can dole it out, which is amazingly fast. So I am satisfied with the storage and access capabilities for all the raw coordinate data.
I could not have asked for a better way to get the data to make the keys from. A small form of it is essentially this:
Table ID = Individual#_Swarm#_Environment#_World#
Line A (time1) 583.44 586.82 572.25 582.98 560.75 582.98
Line B (time2) 585.98 585.98 575.29 580.41 558.22 580.41
OK, each of those numbers equates to an observation,
and this is how the key is created:
each number in those 2 lines is compared to the others.
The first value in Line A is compared to the second value in Line A, and so on, and also to each value in Line B. Lines A and B are just snapshots in time: Line A is what happened when things started and Line B is where they are now.
The key depends on the result of each comparison: if there is an increase, the number 1 is assigned in the key; if there is a decrease, the number 2 is assigned; if the numbers are one and the same, the number 0 is assigned.
12222112222 (from above data example)
would be just the first part of the key: the first number of the first line compared to every other number present, going in order from left to right, Line A then Line B. This operation is repeated across all numbers, but for simplicity's sake I am not going to create an entire key by eye alone (that is why I am programming it) LOL. You can imagine 0s, 1s and 2s repeated over 100+ chars.
Now if I did my math right with the 2 lines of data listed above, I am pretty sure the key would be 132 chars long from all the comparisons. How I got to this: there are 2 lines of data and each line has 6 observations, so there are 12 numbers; each number has 11 possible comparisons, and each comparison gives you one char of the key. Solve for X. LOL, I had to say it like that. 12 x 11 = 132 chars for 1 key, if this was the data I was using.
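The key construction described above can be sketched as follows. This is a minimal Python illustration, not the poster's actual program: the function names `compare` and `build_key` are invented here, and it assumes that every value is compared with every other value (earlier ones included), which is what gives 11 comparisons per value.

```python
from itertools import chain

def compare(a: float, b: float) -> str:
    """Code one comparison: '1' if b is higher than a, '2' if lower, '0' if equal."""
    if b > a:
        return "1"
    if b < a:
        return "2"
    return "0"

def build_key(line_a, line_b) -> str:
    """Compare each value with every other value, left to right, Line A then Line B."""
    values = list(chain(line_a, line_b))
    return "".join(
        compare(values[i], values[j])
        for i in range(len(values))
        for j in range(len(values))
        if i != j
    )

line_a = [583.44, 586.82, 572.25, 582.98, 560.75, 582.98]
line_b = [585.98, 585.98, 575.29, 580.41, 558.22, 580.41]

key = build_key(line_a, line_b)
print(key[:11])   # 12222112222 (the first value against the other eleven)
print(len(key))   # 132 (12 values x 11 comparisons each)
```

The first eleven characters match the fragment quoted in the post, and the total length matches the 132 chars worked out by hand.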
After that, each key is tagged with an end result: where did the object go? Up, down, or did it not move? If the object went up it gets a " 1 " score, if it went down it gets a " -1 " score, and if it ends in the same place as it started it gets a " 0 ".
Now each key is very important, because it tracks exactly how a specific individual moved through time and space. It's easy to hash the key, store the hash, and simply keep a tally of what happened. But that is not "tracking enough": it destroys each segment of the key, which is actually important. Each segment can only have 3 possibilities, 0, 1 and 2, but each segment of a specific key holds a great deal of information, and that is why I desire to store it. So I might have to store the key in raw form. The entire key gets a score of -1, 1 or 0, but each segment of the key also needs to be readily accessible, so I can essentially ask the database this question, or something like it:
04-20-10, 20:29 #5Registered User
- Join Date
- Apr 2010
Where segment_key_number = 1 and value = 2 return what_happened from (world)
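One way to make that question answerable is to store one row per segment next to the whole key. The following is a rough sketch only, using Python's built-in `sqlite3` as a stand-in for MySQL. The table and column names (`keys`, `segments`, `what_happened` and so on) and the short sample keys are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keys (
        key_id        INTEGER PRIMARY KEY,
        raw_key       TEXT NOT NULL,
        what_happened INTEGER NOT NULL      -- -1, 0 or 1
    );
    CREATE TABLE segments (
        key_id         INTEGER NOT NULL REFERENCES keys(key_id),
        segment_number INTEGER NOT NULL,    -- 1-based position in the key
        segment_value  INTEGER NOT NULL,    -- 0, 1 or 2
        PRIMARY KEY (segment_number, segment_value, key_id)
    );
""")

def store(raw_key: str, what_happened: int) -> None:
    """Store the whole key once, then one row per segment."""
    cur = conn.execute(
        "INSERT INTO keys (raw_key, what_happened) VALUES (?, ?)",
        (raw_key, what_happened),
    )
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?, ?)",
        [(cur.lastrowid, pos, int(ch)) for pos, ch in enumerate(raw_key, start=1)],
    )

# Made-up short keys; real ones would be 100+ characters.
for raw_key, outcome in [("2012", 1), ("2210", -1), ("1012", 1), ("2001", 0)]:
    store(raw_key, outcome)

# "Where segment_key_number = 1 and value = 2, return what_happened"
tally = conn.execute("""
    SELECT k.what_happened, COUNT(*)
    FROM segments AS s JOIN keys AS k USING (key_id)
    WHERE s.segment_number = 1 AND s.segment_value = 2
    GROUP BY k.what_happened
    ORDER BY k.what_happened
""").fetchall()
print(tally)   # [(-1, 1), (0, 1), (1, 1)]
```

The composite primary key leads with `segment_number` and `segment_value`, so the lookup can use it directly instead of reading every key.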
As a benchmark, I stored a database table of all keys for rapid access and lookup. Then I created a database that uses the hashed version of each key as a table name, and in that table I stored the entire key segmented out, along with what happened (-1, 0 or 1).
This quickly created too much information and bogged down the server, due to the excessive number of tables in the database and the excessive number of columns in each table, even though each table did not have a great many rows.
I have had to learn how to program to complete this project. I have been at it for about 4 years, and I have completed over 100 different projects while always working on this one in my free time. I have read books on SQL and things like that, so I understand how the data should be laid out, but for some reason I cannot come up with a good way to organize this portion of the data. It has to be perfect because there is a great deal of data. Each key is well over 100 chars long, and each char in the key needs to be tracked, organized down to the individual, which is a part of a swarm, which is contained within an environment, which makes up the world. And there are other swarms too. I have written this thing to be expandable to many different sizes, to handle many different swarms containing anything from a few individuals to many. But this is about what it works out to be, and it has always been my benchmark.
I am looking at creating:
A world > with 10-20 different environments > 100-300 different swarms > each swarm containing 10-100 individuals
So if we take all that into account, the overall amount of data generated is daunting at best.
Because, let's say:
each environment has
each swarm has
10 individuals per swarm
each individual has
100 snapshots in time
each snapshot in time has
a key has
each segment can be only 3 different numbers
each segment has 3 different scores that need to be tallied
So, using the above as an example, the database size works out to about this:
-10,000 individuals within all swarms
-1,000,000 total snapshots in time.
-each snapshot in time creates one key
-each key has over 100 segments
-100,000,000 different data points are created
-Each data point has 3 different scores which are tracked. -1 1 and 0 which is tallied
-Each data point needs to be organized and tallied by its corresponding World / Environment / Swarm / Individual.
-Each data point can and will only have 3 values 0 1 or 2
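The figures in this list can be checked with a few multiplications. The short Python sketch below uses only numbers taken from the list, with 100 segments used as a floor for "over 100". The last line rests on an assumption: if only tallies are needed, each group (an individual, a swarm, and so on) needs at most one counter per combination of segment, segment value and score.

```python
individuals = 10_000
snapshots_per_individual = 100
segments_per_key = 100          # "over 100" in the post; 100 used as a floor
segment_values = 3              # 0, 1 or 2
scores = 3                      # -1, 0 or 1

keys_total = individuals * snapshots_per_individual
data_points = keys_total * segments_per_key
counters_per_group = segments_per_key * segment_values * scores

print(keys_total)           # 1000000
print(data_points)          # 100000000
print(counters_per_group)   # 900
```

So raw storage grows with the number of snapshots, while a pure tally is capped at 900 counters per group for 100-segment keys, however many snapshots feed into it.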
So each key exists on 4 levels and affects each level in a different way, and that is why it needs to be tracked. The 4 levels are World, Environment, Swarm and Individual.
So with all the information I have presented, I ask my question: what is the best way to store this god-awful amount of data so that I can access it in an effective manner?
04-21-10, 06:34 #6vaguely human
- Join Date
- Jun 2007
Originally Posted by concept08
I'm still trying to get my head round what you're doing, so excuse the questions:
- Don't swarms have 1000's (often 100k's) of individuals and not just 10 as you state above?
- Why only 3 observations per individual (0,1,-1) - surely insects aren't that simple?
- Storing the swarm as a group key looks doomed to failure. At the least you're forcing your swarm to only have so many individuals. Also, accessing any individual within that key is likely to result in a table scan, as the index can't be used.
- But is storing the historic data on all individuals really necessary? Why would you want to know where individual x is at time y?
- Could you not just store the current position and state of individuals and then keep a history of the aggregate position and state of the swarm?
- Do you really need 1m historic entries for any individual? If a bee lives for 3 weeks then you're storing its position roughly every 2 seconds!
- I thought the behaviour of swarms was reasonably well understood - isn't it just each individual reacting to its immediate neighbours by position and pheromones, with the queen having the largest influence of all? Perhaps you could just store the history of the queen alone, and then you could create a swarm around her at any given point in time.
- Do swarms interact or is it just the individuals keeping away from foreign pheromones?
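The table-scan concern raised in the list above can be illustrated with a query planner. This is a hedged sketch only: it uses Python's built-in `sqlite3` rather than MySQL, and the table names, column names and index name are made up. It compares looking inside a packed key string with looking up a row in a table that stores one segment per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE whole_keys (raw_key TEXT PRIMARY KEY, observation INTEGER);
    CREATE TABLE segments (
        key_id INTEGER, segment_number INTEGER, segment_value INTEGER
    );
    CREATE INDEX idx_segment ON segments (segment_number, segment_value);
""")

def plan(sql: str) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column is the detail text

# Looking inside the packed string: no index can help, so every row is read.
packed = plan("SELECT observation FROM whole_keys WHERE substr(raw_key, 3, 1) = '2'")

# One row per segment: the index on (segment_number, segment_value) is usable.
split = plan("SELECT key_id FROM segments WHERE segment_number = 3 AND segment_value = 2")

print(packed)
print(split)
```

The exact wording of the plan text varies between SQLite versions, but the first query is reported as a scan and the second as a search using `idx_segment`. MySQL's `EXPLAIN` would be the equivalent check there.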
If you stored only a very limited amount of history for individuals but kept a complete history of the queen, then your data becomes more manageable. The insect x,y,z positions could be kept relative to the queen, which would allow you to move the swarm about by just moving the queen. You could have many more individuals per swarm. It might be faster to store a list of queen IDs in a separate table rather than searching the above table to find all the queens.
Individual id | type (ie queen) | time | x,y,z | state | parent_id (ie the queen's id) | observation
I remember reading some guy's PhD thesis on the idea of viewing an insect colony as a single entity but can't remember who wrote it. It was very famous at the time (I guess I read it 20 years ago) - would you know who wrote it? There's also that book "Gödel, Escher, Bach" that contains a fictional conversation between an anteater and an ant colony. I'm sure you're well acquainted with both but I thought I'd mention them just in case.
04-22-10, 03:59 #7Registered User
- Join Date
- Apr 2010
Answer to your questions
Before I start I want to take a second and thank you for your reply. It means a great deal to me, and you got my mind going in a different direction.
What I am going to do is go through and answer the questions from your response. And again, thank you for your time and efforts; it's greatly appreciated.
OK, first off, I am not tracking insects or the patterns of a swarm. And yes, we know a great deal about the swarming tendencies of insects and birds. The rule of the swarm is simple: stay X distance from the closest members at all times. Pheromones can also be released into the air to induce specific reactions in the swarm, caused by some form of outside stimulus, e.g. an attack by a predator or an obstruction in the path. Another big rule of the swarm is to mimic your closest partners or, like you said, a queen or alpha individual(s). This prevents collisions and creates order out of anarchy. It also helps induce protective patterns, and offensive ones too.
Yes, swarms usually have thousands of individuals, and this would create a problem due to the mass of individuals which comprise the swarm. Since my model does not incorporate birds, insects, fish, mammals or anything else, there is no need to get into the thousands of members that any of these swarms would include in the real world.
There are only 3 observations in my model because I only want to know the answer to 3 basic questions: did it, did it not, or did it stay exactly the same? What makes this project so hard is asking those questions and applying them to essentially everything that can be tracked through any numerical representation which consists of multiple levels of processing.
Given the scope of the project, there is much to be learned from storing a swarm key and, even further, an environment key. For example, like I said before, past testing has shown that creating keys of the swarm's actions helps predict the movement or actions of the individuals in the swarm. Tracking the actions of other swarms can also help determine the actions of others. For example, when the number of individuals in one swarm drops, one thing that might be assumed is that there is a shortage of food, making survival harder for one swarm than another. There is a great deal of information that can be used to predict things that might or might not happen. The concern that there might be usage issues is very valid. To date I have not encountered these problems, but I am aware that they might occur.
I store the historic data only for selected periods of time. After the time of interest has expired, all old data is expelled from the database. But within the scope of this project it is important to store a portion of past data on all individuals, swarms and environments, which are all a part of the world. This helps with pattern recognition and provides a great amount of data that can be used as a self-weighting system to help derive the end result for a specific individual of a swarm, for I am more concerned with the individual of the swarm than with the swarm itself. I use the data that swarms provide because there are many different individuals who survive and operate in a plethora of different environments. There are a great many similarities, and a large percentage of them hold across the world.
Now the reason I want to know where things were and how they got there is simple. The past always dictates the future. To know where we are headed we must look to the past. (There are my 2 quotes for you, hehehe.) I have done tests, and I have calculated a degradation rate which I have found to be very accurate for dealing with the decay in how past events affect everything as a whole, right down to the individual. I can illustrate this with a past event that changed humanity as those who were alive knew it: 9/11. This was a catastrophic event that changed life as we knew it and shaped life as we know it today. But as a natural effect of society, the situation changes as each day passes. Each day a little more is done; something is cleaned up or built, and people band together to help one another out. As time goes on these feelings fade and we adopt a new form of normalcy. If we disregarded the information from this one event, we would be unable to predict that it is going to take us at minimum one more hour to get through security at the airport, and we would be left helpless at the terminal, missing our flight. So yes, I do desire to know where an individual was, what they did, and how they got to where they are.
The time frames I am looking at are both micro and macro. I do compile these into an aggregated data structure, but the concern here is what happens before we get to that point.
Swarms do not interact with each other, but they have influence on one another. An interaction is an anomaly, but it does occur over long expanses of time, and when it does the event is taken into account. Usually what happens is that one member of a swarm goes over to another. Rarer still in my modeling, one swarm has enveloped a competing swarm. The most common observation is that when the world or environment is doing well or badly the swarm typically follows, but not always.
To give you a little more insight, the reason I desire to track a large amount of data like this is that I am creating a different form of AI that has not been done yet.
The desired end result is essentially this: an artificial intelligence system which can answer simple questions in a yes or no manner. A good example question would be: if a ball is thrown at me, should I catch it? But asking yourself even something as simple as that is not easy. A great many sub-questions go into it that need to be answered. If we are following a human model, the first question asked is: am I in danger? If no, then where is the ball relative to me, and given that data, can I catch the ball from its location alone? If yes, then is catching the ball worth the energy expenditure? These are all valid questions that fly through any cognitive thinker's mind, whether they are aware of it or not. There are also other thoughts that I myself will be unaware of and will never be able to identify from my perspective. That is why it's so important that everything is tracked. With the algorithms I have developed there is a way to essentially sniff out the needed data and aggregate the rest, but I let the equations make those decisions for themselves.
The reasons I have taken this approach go into my thesis: artificial intelligence is flawed because it's designed by humans. Humans need to give the artificial intelligence system the ability to think cognitively and to make its own decisions with no predisposed bias.
Before I set out on this task, which will provide no more reward than a job well done or a very good attempt, I interviewed many leading individuals in artificial intelligence and read papers by many now-dead people who developed equations on paper before computers were readily available. These equations are complete with self-weighting systems. But when you get down to it they work a great deal like quantum mechanics: just because you say something is going to happen, or something should behave in a specific way, does not make it so. What is noticed is that the act of observation essentially changes the outcome of any experiment conducted on the quantum level. The same thing happens in MCMC (Markov chain Monte Carlo). The probabilities of specific events are calculated based on what is known and what has happened, but what this fails to account for is free choice. Things are predicted from what has happened in the past alone, and from all the possible outcomes that might or will happen. But no matter how many instances you include, the equation begins to fall apart and the decision-making process is destroyed by the multitude of choices. I have attempted methods such as this, but ultimately I derive an erroneous number, or final probabilities that are never more than 9% likely to happen, and the decision is usually incorrect.
What I quickly learned is that the approach of using other people's work was doomed, because the methodology was built for their solution and could never be applied accurately to mine, because apples and TNT don't go together unless you are making apple sauce. And we are not making apple sauce here.
So here is what I think I am going to do.
I am going to make 2 different tables for this part of the project alone.
One is for the whole keys and who they belong to (swarm, environment, world and individual). There I will record the observations.
The second table, which is the one this whole question was based on, will be laid out like this:
Segment_number (as primary)
Segment_value (as key)
Does this sound about right to you, or should I hit the drawing board again?
04-22-10, 13:19 #8vaguely human
- Join Date
- Jun 2007
Originally Posted by concept08
It's difficult to say whether the design is correct, simply because all we've seen is your solution and we still have no clue what you're modelling. I'm definitely within my comfort zone talking about insect behaviour or AI but I'm guessing you're trying to model stock prices for groups of shares in similar sectors.
Your idea sounds a little like the neural networks used in AI programs. These seem to work well when working within a very limited world, like board games, but they haven't had much success in larger problem areas.
Major issues I can see with your design:
- Performance: Take the one hour delay at the airport - you mentioned that your system might know about 9/11 and that by following the history of 6 billion people over 9 years it could predict the one hour delay, but I suspect that by the time it's processed all that data the one hour delay will look like chicken feed.
- The end result: Would people be willing to build a system that holds billions of records yet only produces a result of 1, 0 or -1 as an answer? It reminds me of the computer in The Hitchhiker's Guide to the Galaxy.
- And will it work: my gut feeling tells me no, but it would be easy to test by entering the first 9 years of data and then seeing if it can predict the 10th year's results. If it does work then who knows, you might be the next Black-Scholes model.
I don't think I can offer any more without a real example of what you're solving.
04-23-10, 08:21 #9Registered User
- Join Date
- Apr 2010
No, I am not using stock markets or other financial things. Good guess though, I will give you that.
And yes, this does resemble neural networks. That was the premise for this project: to expand where current ones fail. But this is going to be for specific uses.
The real answer is this: it's not modelling anything specifically, it's more of an everything approach.
The reason I kept this as simple as possible in the beginning is that the question was not the plausibility of this project, or even what this project was about.
The subject matter was the best way to store this one (and yes, believe it or not) VERY SMALL PART of the data system this clusterf*ck of a program will use.
So, to rehash the question, because Mike, you are a very, very smart person, and I bet I am correct in my guess that you are hyper intelligent. I do value your input because you do have the answer to my question.
Now what i have is basically 4 levels of observations World -> Environment -> Swarm -> Individual
Each level has its own set of keys
The key is stored in whole form
It's stored with an observation (one of 3 different ones)
A tally of the observations is kept and tracked
Then the key is segmented out.
Each segment needs to be stored not as a key but identified by its position in the key
Observations need to be stored for individual segments.
A database structure needs to be created that can be rapidly accessed. and which is expandable.
Now this is how I was thinking about doing it. Over many days of thinking and reading about this problem of organizing data, I have not come up with anything better.
**this is for the whole key database
Segmented_key (there will be over 100 of these generated from each key)
**this is for the segmented key database
Now the problem with this is having 2 different tables, but mainly that there is redundant data.
What I was thinking is that we add an identification key to the first table,
remove everything but the segmented_key and observation_count fields, and simply add the identification key.
But the problem with adding that identification key is that the data will be corrupted on the higher levels. Let's say segment_0 with a value of 2 appears 100k times. By simple logic that segment would be tagged with the last identification key only, and not with the contributions from each and every individual. A side effect of this would be a different form of aggregating data. You would lose the intensive tracking on the segmented level, but all the data could still be reconstructed from the Table_1 data cluster.
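One possible way around the "last identification key wins" problem is to make the owner part of the tally row's identity, so that each individual keeps its own counters and the higher levels are summed on demand. This is a rough sketch under stated assumptions, not a recommendation for the final design: it uses Python's built-in `sqlite3` instead of MySQL, the table `segment_tally`, its columns, the helper names and the sample keys are all hypothetical, and segment numbering starts at 0 to match segment_0 above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE segment_tally (
        world_id INTEGER, environment_id INTEGER,
        swarm_id INTEGER, individual_id INTEGER,
        segment_number INTEGER, segment_value INTEGER,
        observation INTEGER,                       -- -1, 0 or 1
        observation_count INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (world_id, environment_id, swarm_id, individual_id,
                     segment_number, segment_value, observation)
    )
""")

def record(owner, raw_key, observation):
    """Add one key's segments to the counters of the individual that produced it."""
    for pos, ch in enumerate(raw_key):
        row = (*owner, pos, int(ch), observation)
        # Create the counter row if it does not exist yet, then bump it.
        conn.execute(
            "INSERT OR IGNORE INTO segment_tally "
            "(world_id, environment_id, swarm_id, individual_id, "
            " segment_number, segment_value, observation) VALUES (?,?,?,?,?,?,?)", row)
        conn.execute(
            "UPDATE segment_tally SET observation_count = observation_count + 1 "
            "WHERE world_id=? AND environment_id=? AND swarm_id=? AND individual_id=? "
            "AND segment_number=? AND segment_value=? AND observation=?", row)

# owner = (world, environment, swarm, individual); keys are made up and shortened.
record((1, 1, 1, 1), "2012", 1)
record((1, 1, 1, 2), "2210", 1)
record((1, 1, 2, 3), "2001", -1)

def count(where, params):
    """Total observations of segment_0 = 2 for whatever level the filter selects."""
    sql = ("SELECT COALESCE(SUM(observation_count), 0) FROM segment_tally "
           "WHERE segment_number = 0 AND segment_value = 2 AND " + where)
    return conn.execute(sql, params).fetchone()[0]

print(count("individual_id = ?", (1,)))   # 1
print(count("swarm_id = ?", (1,)))        # 2
print(count("world_id = ?", (1,)))        # 3
```

Because the counters are kept per individual, the swarm, environment and world totals are just sums over the matching rows, and no contributor is overwritten.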
So if you take away all the stuff, you are storing nothing but numbers, scores and other small keys. The largest string is the entire key. But when the data is broken down it becomes a great deal harder to manage, and that is the problem I am facing.
You are familiar with the data. You know the types, and you know the objectives: lots of it, lots of tracking, and things need to remain fast in some manner, or as fast as they can be by avoiding common data structure design pitfalls.
So please try to understand that this is storage of generated data, and that is essentially it. The types are defined, and they all need to be readily accessible. This is what I came here for help with, because I am not a database guru. I am not a guru at anything.
And my final line is this: I will tell you what the project is for if you guess the answer correctly. It's like a bounty hunter's wife. He was killed, and only he knew her name. After he died she went by the name V.T. People would try to guess what it stood for, but no one could, until one day she told the truth of what the letters stood for. So it is here, but the difference is it's from man to man. If you can guess the use of the program I have been working on, it will be something for you to hold as a metaphorical prize. But when you get down to it, what needs to be known is on the table. I am a real person who has a real problem, and you can help.
This article helped me. It shows intermediate database design and the queries that can be applied to make it work properly: "Managing Hierarchical Data in MySQL" (MySQL). It's short, sweet and to the point, and it was a good refresher.
Take care, and thanks for everything, Mike.
04-23-10, 09:33 #10Registered User
Provided Answers: 5
- Join Date
- Dec 2007
- Richmond, VA
Segmented_key (there will be over 100 of these generated from each key)