Hi there folks. Short-time reader, first-time poster.
I'm working on a relatively simple application for displaying real estate listings and doing some other processing on the listing data.
Our current MLS database of listings is about 100k rows. About 15k of those listing rows are active and of the most interest to folks and, to be honest, to my application.
Our MLS system allows up to 12 photos per listing. Not every listing has all 12 photos, so if I were to download them all I'd end up with about 65k photos at roughly 70KB each.
The issue I'm having is figuring out how to store these files so that they're easy and fast for me to access programmatically, and so that when I'm manipulating the data on the server I don't have a single folder with 65k files in it. The current FTP process the MLS provides dumps all the files into one folder, and it sucks. They are at least kind enough to use a simple naming scheme: [MLS#]-[picture index].jpg
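Given that naming scheme, pulling the MLS# and picture index back out of a filename is straightforward. Here's a minimal sketch; the function name is my own invention, and it assumes the filenames really do follow [MLS#]-[picture index].jpg exactly:

```php
<?php
// Hypothetical helper: split a photo filename like "1234567-3.jpg"
// into its MLS number and picture index, per the naming scheme above.
function parsePhotoName(string $filename): ?array
{
    if (preg_match('/^(\d+)-(\d+)\.jpg$/i', $filename, $m)) {
        return ['mls' => $m[1], 'index' => (int) $m[2]];
    }
    return null; // not a listing photo
}

$info = parsePhotoName('1234567-3.jpg');
// $info['mls'] === '1234567' and $info['index'] === 3
```

A helper like this lets a batch import walk the FTP dump folder and file each photo under the right listing without touching the database first.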
I can do a fresh pull from the master MLS system anytime, so the data doesn't need to be backed up or redundant. It usually doesn't change, though on occasion it will if someone changes the photos for a given listing. Writes to the files typically happen only in a batch process, as incremental updates occur daily.
I searched the forum and noted some other suggestions, but to be honest I'm not sure which system would be good for my application and for allowing it to grow as it does every single year with the addition of new listings.
What I came up with were the following:
1) Store all photos in the file system in a big directory. This sucks.
2) Store the photos in the file system, but take the MLS# mod 100 and spread the files across 100 separate folders, hoping that the MLS numbers are evenly distributed.
3) Store all files in the database.
4) Store a reference to each file in the database and leave the files in the file system.
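Option 2 can be sketched in a few lines. This is just an illustration; the base directory, the two-digit folder names, and the function name are all my own assumptions:

```php
<?php
// Sketch of option 2: bucket photos into 100 subfolders keyed by MLS# mod 100.
// The base path and zero-padded folder names are assumptions, not a spec.
function photoPath(string $baseDir, int $mlsNumber, int $pictureIndex): string
{
    $bucket = sprintf('%02d', $mlsNumber % 100); // "00" .. "99"
    return sprintf('%s/%s/%d-%d.jpg', $baseDir, $bucket, $mlsNumber, $pictureIndex);
}

echo photoPath('/var/photos', 1234567, 3); // /var/photos/67/1234567-3.jpg
```

With ~65k photos this averages around 650 files per folder, which is comfortable for both the filesystem and for browsing by hand. Since the bucket is derived from the MLS#, option 2 also combines cleanly with option 4: the database row only needs the MLS#, and the path can be computed on the fly.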
I am using MySQL, PHP, and Dojo to build my application, if that helps with your giving me insight. My server seems plenty capable for the limited load I'll have initially: a dual-core Xeon @ 2.33GHz, four Seagate SATA II drives of 400GB each, and 4GB RAM, running SUSE and Win2k3 64-bit.
Any help would be great. And if I'm leaving key info out to help just let me know.
I confess I didn't read the problem, just the solutions, and pieced the problem together from that.
Most here would go for 1) or 4). There's one regular on here, whose name I forget, who likes to store images in databases, but most of us don't.
We had a lively discussion on this a while back which is worth reading. I believe most Unixes will work with any number of files in a directory, but you will run into performance problems with 65k files in one directory; even doing an ls will be slow.
I assume you can't access the image files on the original server directly (i.e. without FTPing them across), because life would be much easier for you if you could. A slightly different way of doing this would be to pull only the main picture for each listing; then, if a user wants to see more pictures of a place, check whether you already have them and FTP them if not. A nightly process could then remove the additional pictures that haven't been accessed in a month.
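That nightly cleanup might look roughly like the sketch below. It assumes the [MLS#]-[index].jpg naming, that photo 1 is the main picture to keep, and that your filesystem records access times (a noatime mount would break this); the function name and cache layout are my own:

```php
<?php
// Sketch of the nightly cleanup: delete cached extra photos (index > 1)
// that haven't been accessed in $maxAgeDays. The "keep photo 1" rule and
// the flat cache directory are assumptions based on the suggestion above.
function purgeStalePhotos(string $cacheDir, int $maxAgeDays = 30): int
{
    $cutoff  = time() - $maxAgeDays * 86400;
    $removed = 0;
    foreach (glob($cacheDir . '/*-*.jpg') as $file) {
        // Keep the main picture (index 1); only expire the extras.
        if (preg_match('/-(\d+)\.jpg$/i', $file, $m) && (int) $m[1] > 1) {
            $atime = fileatime($file);
            if ($atime !== false && $atime < $cutoff) {
                unlink($file);
                $removed++;
            }
        }
    }
    return $removed;
}
```

Run from cron after the incremental MLS pull, this keeps the local cache bounded to roughly the photos people are actually looking at.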
How do you currently intend to remove the images on properties that have now been sold? What does MLS mean?
Thanks. I think number 4 sounds the best to me. When I'm debugging and such, I like to spot-check by hand that I have the right photo (either from our FTP or from the application), so the 65k-files-in-one-folder thing is really annoying.
Perfect post. I had found another post, but it didn't really go into the pros and cons as much. I liked the point about being able to edit and access the files easily in the file system. solution 4 files in FS and links in DB sounds right to me.
Regarding the MLS: that is broker-speak for Multiple Listing Service, another fancy, sounds-more-important-than-it-is word for "database". I hadn't given much thought to removing the data or archiving it. I was really thinking more along the lines of a separate table for "old" data: listings that are expired, sold, or withdrawn from the market.
Then for most of the users (i.e. prospects looking at the data from who knows where), the relevant data is the active, unsold listings. That data would be optimized to handle the load with respect to searching and fetching photos.
The cool thing is that I can use the sold data to let agents do things like estimate market value based on searches and aggregates that I'll create. The only catch is that our MLS restricts viewing of sold data to agents and brokers of the MLS; Texas is a non-disclosure state, so property sale prices are not public record.
I suppose that would work so long as I didn't make the force-of-habit mistake of opening the root directory in a GUI. I'll consider that as well, because it seems pretty organized too.