I recently started working as an assistant in an economics research department. My job will essentially be to prepare time series data that can be used to run regressions testing different hypotheses.
For example, we might want time series for all companies in a specific sector: the average education level of their workers, some characteristics of the company's new hires, its sales and assets that year, etc.
The underlying data consists of text files containing annual reports from all registered companies and all registered workers, about 5M a year, between 1990 and 2013. Extracting the time series can be problematic for several reasons: the standards by which companies are classified change, the IDs of companies change even though the company remains intact with basically the same employees, and companies split and merge.
So far, everything has been done in a program called Stata. I'm new to Stata, but basically you have a matrix with variables in columns and the same number of observations for every variable. So if you want a data set with both workers and companies, you need one observation per worker per year, and any variable relating to a single company has to be stored once for each worker in that company. Lookups are not easy either: even if I know a person who works at a company, I still have to search the whole data set to find the ID of that company.
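To illustrate the layout I mean, here is a toy sketch in Python (not Stata code, just the shape of the data as I understand it). All column names and values are made up for illustration.

```python
# Hypothetical flat layout: one row per worker per year.
# Firm-level variables (firm_sales) are repeated on every worker row.
rows = [
    {"year": 2010, "worker_id": 1, "education": 12, "firm_id": "A", "firm_sales": 500.0},
    {"year": 2010, "worker_id": 2, "education": 16, "firm_id": "A", "firm_sales": 500.0},
    {"year": 2010, "worker_id": 3, "education": 14, "firm_id": "B", "firm_sales": 80.0},
]


def find_firm(rows, worker_id, year):
    """Find a worker's firm by scanning rows one by one (no index)."""
    for row in rows:
        if row["worker_id"] == worker_id and row["year"] == year:
            return row["firm_id"]
    return None


def copies_of_firm_sales(rows, firm_id, year):
    """Count how many times the same firm-level value is stored."""
    return sum(1 for r in rows if r["firm_id"] == firm_id and r["year"] == year)
```

Here firm A's sales figure is stored twice, once per worker, and `find_firm` has to walk through the rows until it hits the right one.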
To me, doing all this in Stata seems like madness. I imagine it would be much easier to store the data in some object-like structure, with, say, an object called firm-year that links to workers whose characteristics are stored separately.
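A rough sketch of the kind of structure I have in mind, written with Python's built-in `sqlite3` module only because it is short to try out. The table names, column names, and sample values are all hypothetical, and I am assuming each worker has at most one employer per year, which may not hold in the real data.

```python
import sqlite3

# In-memory database, just for experimenting.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE firm_year (
    firm_id TEXT    NOT NULL,
    year    INTEGER NOT NULL,
    sector  TEXT,
    sales   REAL,
    assets  REAL,
    PRIMARY KEY (firm_id, year)
);
-- Assumption: one employer per worker per year.
CREATE TABLE worker_year (
    worker_id       INTEGER NOT NULL,
    year            INTEGER NOT NULL,
    firm_id         TEXT    NOT NULL,
    education_years INTEGER,
    PRIMARY KEY (worker_id, year),
    FOREIGN KEY (firm_id, year) REFERENCES firm_year (firm_id, year)
);
-- Index so that "all workers of a firm in a year" does not scan the table.
CREATE INDEX idx_worker_year_firm ON worker_year (firm_id, year);
""")

# Made-up sample data: firm-level values are stored once per firm-year.
conn.executemany(
    "INSERT INTO firm_year VALUES (?, ?, ?, ?, ?)",
    [("A", 2010, "retail", 500.0, 1200.0), ("B", 2010, "retail", 80.0, 150.0)],
)
conn.executemany(
    "INSERT INTO worker_year VALUES (?, ?, ?, ?)",
    [(1, 2010, "A", 12), (2, 2010, "A", 16), (3, 2010, "B", 14)],
)


def firm_of(conn, worker_id, year):
    """Look up a worker's firm through the primary key."""
    row = conn.execute(
        "SELECT firm_id FROM worker_year WHERE worker_id = ? AND year = ?",
        (worker_id, year),
    ).fetchone()
    return row[0] if row else None


def avg_education_by_firm(conn, year):
    """Join firms to their workers and average education per firm."""
    return conn.execute(
        """
        SELECT f.firm_id, f.sales, AVG(w.education_years)
        FROM firm_year AS f
        JOIN worker_year AS w
          ON w.firm_id = f.firm_id AND w.year = f.year
        WHERE f.year = ?
        GROUP BY f.firm_id, f.sales
        ORDER BY f.firm_id
        """,
        (year,),
    ).fetchall()
```

With something like this, `firm_of(conn, 2, 2010)` finds the firm directly, and a firm-level data set for a regression would come out of a query like `avg_education_by_firm(conn, 2010)`. I have no idea yet whether this scales to about 5M records a year, which is part of what I am asking.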
I don't know much about databases, but I have quite a lot of programming experience in C and Java. Should I build some sort of database? How long could that take? Also, do you have any tips on languages, platforms, etc., or a good guide I could read? All help is much appreciated!