Quote:
Originally Posted by
CommonTater
Why do you care if the file is sorted anyway? Can't you just read the whole thing into memory and sort it as you read it for whatever it is you need to do with it?
Yes, this is an option. But just think how easy and efficient things would be if the file itself were already sorted! For example, in a transaction-summary report, if I am looking for trades that happened on 01/15/2008, then the moment I come across a trade dated 01/16/2008 I can simply stop reading the file, because I know for a fact there are not going to be any more records for 01/15/2008.
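To illustrate the early-exit idea, here is a minimal sketch. The `struct trade` layout, field sizes, and names are my own assumptions (nothing from this thread); note it uses ISO "YYYY-MM-DD" dates so `strcmp()` orders them chronologically, which the MM/DD/YYYY format above would not.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-length record; "YYYY-MM-DD" dates sort
   chronologically under strcmp(). */
struct trade {
    char date[11];
    char account[9];
    double amount;
};

/* Count trades for one date in a date-sorted file, stopping
   as soon as a later date is seen. */
long count_trades_on(FILE *fp, const char *target)
{
    struct trade t;
    long hits = 0;
    while (fread(&t, sizeof t, 1, fp) == 1) {
        int cmp = strcmp(t.date, target);
        if (cmp > 0)            /* sorted file: no more matches possible */
            break;
        if (cmp == 0)
            hits++;
    }
    return hits;
}
```

On an unsorted file the same loop would have to scan every record; the early `break` is exactly the payoff of keeping the file in date order.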
Quote:
...there are very fast in-memory sorting techniques like quicksort supplied with modern versions of C that can be amazingly fast. The strategy would be to load the entire file into memory, sort it there, then write it back out in order...
It is heartening to hear that there are “amazingly fast” sorting utilities in C. But sorting is “always slow and machine intensive,” is it not? I am trying to avoid it if possible. That is all.
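For what it is worth, the routine the other poster is most likely referring to is the standard library's `qsort()`, which is O(n log n) on average rather than uniformly "slow". A minimal sketch, again assuming the same hypothetical `struct trade` layout:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; "YYYY-MM-DD" compares chronologically. */
struct trade {
    char date[11];
    char account[9];
    double amount;
};

/* Comparison callback for qsort(): order by date, then account. */
static int by_date_then_account(const void *pa, const void *pb)
{
    const struct trade *a = pa;
    const struct trade *b = pb;
    int c = strcmp(a->date, b->date);
    return c ? c : strcmp(a->account, b->account);
}

/* Sort an in-memory array of trades in place. */
void sort_trades(struct trade *buf, size_t n)
{
    qsort(buf, n, sizeof *buf, by_date_then_account);
}
```

The load-sort-rewrite strategy would read the whole file into an array, call `sort_trades()`, and write the array back out.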
Quote:
However, there are other problems... Keeping a separate file for each client while superficially more efficient is not ultimately your best method. If you want an amalgamated list of "all transactions done June 14th 2009", you are going to have to search every single one of hundreds --maybe thousands-- of files, write intermediate files, sort them and shuffle the whole thing into a final report file and that's just way too error prone. This is going to be painfully slow even beside a sequential search of a single transaction file.
I wholeheartedly agree! This is “the” drawback of the one-file-per-account methodology. But there are packages out there where this concept has been implemented, and such a system is very reliable, though it could very well be slow. I should say that such across-the-accounts reports are not all that common. Usually, in asset management, you might create 10 accounts for one family, and it is across those 10 or 20 accounts that you would find yourself searching for records. But advisors might very well choose to run a report across “all” accounts, and the system should not conk out in such rare scenarios. Again, I agree it will be slow, but it would definitely be more reliable.
Quote:
Indexing is once again your best answer... when transactions arrive, just tack them onto the end of your main file and as with your clients etc. you don't actually care what the record number is, so long as the darned thing is in the file.
I am hesitant to keep all transactions in a single mother-of-all transaction file and create index files for it. What would happen if that one big file were lost?! If each account is in its own file, reading it back from backup and getting the system online again would be relatively simple. Moreover, in the latter case I would upset one customer, whereas in the former case the whole client base would be screaming for my head!
Quote:
Brand them yourself! Simply assign a sequence number to each transaction as it arrives on your system
Yes, this is “the” only option! You are correct that the fields in the records cannot be used as unique keys.
Quote:
A further refinement would be to find all events for that date *for that client*... meaning that you would have to have the timestamp, client id and record number in your index file.
I agree totally! Not necessarily the time stamp, but the transaction date in our case.
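Putting the last few points together, an index entry would carry the transaction date, the account code, and the record number in the main transaction file. A possible layout and lookup, sketched with assumed field names and sizes (nothing here is from the thread): if the index is kept sorted by (date, account), a plain binary search finds the first matching entry.

```c
#include <string.h>

/* Hypothetical index entry: one per transaction, kept sorted by
   (date, account) so a binary search can find the first match. */
struct index_entry {
    char date[11];      /* "YYYY-MM-DD" */
    char account[9];    /* client/account code */
    long recno;         /* record number in the main transaction file */
};

/* Find the first entry matching (date, account) in a sorted index of
   n entries; returns its position, or -1 if absent. */
long idx_find(const struct index_entry *idx, long n,
              const char *date, const char *account)
{
    long lo = 0, hi = n - 1, found = -1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int c = strcmp(idx[mid].date, date);
        if (c == 0)
            c = strcmp(idx[mid].account, account);
        if (c < 0)
            lo = mid + 1;
        else {
            if (c == 0)
                found = mid;    /* remember it, keep looking left */
            hi = mid - 1;
        }
    }
    return found;
}
```

From the position returned, one would walk forward through consecutive matching entries and use each `recno` to seek directly into the main file.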
Quote:
Another technique you can use when transactions arrive out of order is to dump them into a temporary file.
This is something we routinely do. We copy just the transaction-date field into a separate file, make sure there is no date other than the one for which we are posting trades, and then do the post. If other dates are found, the earlier-dated transaction is moved to a temp file, as you suggested, and processed separately. (This is not done to keep the transaction dates sorted; rather, posting an earlier-dated transaction upsets previously reconciled position records and hence has to be manually cross-checked.)
Quote:
Then after hours run a routine that sorts the temp file (much smaller) then rewrites your transaction file record by record inserting the out of order transactions as it goes.
Yes, I would have to do this record-by-record reading and writing, and that is exactly what I am trying to avoid! Just as in my earlier example, where the OS allocates physical disk space for the 5 rows to be inserted but keeps track of the order in which those locations should be read via the FAT: is there a way for me to leave an out-of-date-order record in its newly allocated physical space, yet change the “FAT” so that the locations are still read back in the correct order?
I know this might sound weird. But, just imagine if a way can be found... ;-)
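For reference, the record-by-record rewrite being avoided here is a plain two-way merge: since both the main file and the after-hours temp file are date-sorted, one sequential pass produces a sorted combined file. A sketch with the same hypothetical record layout as before:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-length record, dates as "YYYY-MM-DD". */
struct trade {
    char date[11];
    char account[9];
    double amount;
};

/* Merge two date-sorted files into `out`, preserving order.
   Returns the number of records written. */
long merge_trades(FILE *main_f, FILE *temp_f, FILE *out)
{
    struct trade a, b;
    int have_a = fread(&a, sizeof a, 1, main_f) == 1;
    int have_b = fread(&b, sizeof b, 1, temp_f) == 1;
    long written = 0;

    while (have_a || have_b) {
        if (have_a && (!have_b || strcmp(a.date, b.date) <= 0)) {
            fwrite(&a, sizeof a, 1, out);
            have_a = fread(&a, sizeof a, 1, main_f) == 1;
        } else {
            fwrite(&b, sizeof b, 1, out);
            have_b = fread(&b, sizeof b, 1, temp_f) == 1;
        }
        written++;
    }
    return written;
}
```

It is sequential I/O on both inputs, so while it does rewrite the whole main file, it avoids any seeking; the cost is one full pass per after-hours run.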
Quote:
...but even this will be orders of magnitude faster than searching literally thousands of individual customer files.
I agree. File I/O is always expensive. But individual account files do improve the system's reliability. I totally agree that for a system that runs reports across accounts very often, I would not recommend an individual file for every account. As you said earlier, analysis of the situation, and hence the requirements, drives the solution.
Quote:
Since you've not shown me a unique transaction number (as I showed you in message #50) and since these transactions do not appear to be branded with any kind of client code or sequence code... this is some serious bad juju even for a small business. Because of that fundimental flaw, it may simply be that your information is far too incomplete for this type of filing system. Random access files and their indices rely upon unique identifiers for their effiency...
Sorry if my colleague's earlier postings were not totally clear. Every trade comes with an account code; without it, when I download transaction data from the brokerage, I would not know which account it is meant for. If I were to choose a system with one mother-of-all transaction file, I would use the account code plus an internally generated sequence number as the key for the index file.
Quote:
Quite honestly, I'm thinking you're on a road to noplace unless each transaction is somehow uniquely identified...
Nope! You have no idea how much you have helped us crystallize our thoughts here! Thank you, sir. I do appreciate you taking the time to write to us. Next time I am visiting Niagara Falls, I would very much like to take your family to lunch.
Quote:
Exactly how many transactions are we talking about on a given day?
If I can design a system that can handle, say, 300,000 accounts and around 1,000,000 transactions per day, I would indeed be a happy man. Looks like I will be going down the “one account – one transaction file” road. I would be using around 20 machines to divvy up and post trades, which comes to roughly 15,000 accounts and 50,000 transactions per machine. Pentium dual-cores in a Linux environment ought to be able to handle that, right?
Quote:
And... FWIW, nobody uses FAT filesystems anymore... they're too slow, insecure and limited to 4gb file sizes.
I am “NOT” an expert in file systems at all, so I may not know what I am talking about! But I gave the FAT example only to ask whether the OS's way of allocating disk space, and of keeping track of the order in which those locations should be read back and presented, can be mimicked in our own solution.
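To make the FAT analogy concrete: FAT stores a file's clusters in arbitrary physical order and chains them together with next-cluster links. The same trick works for fixed-length records: give each record a `next` field, append new records wherever there is room, and splice them into date order by rewriting a single link. A minimal in-memory sketch of the idea (everything here is an illustration I made up; on disk, each link update would become one small `fseek()`/`fwrite()`):

```c
#include <string.h>

/* Record with an embedded "next record in date order" link,
   mimicking a FAT cluster chain.  -1 marks end of chain. */
struct linked_trade {
    char date[11];      /* "YYYY-MM-DD" */
    long next;          /* slot index of the next record in date order */
};

/* Splice the record at slot n into the chain starting at `head`,
   keeping ascending date order without moving any record.
   Returns the (possibly new) head slot. */
long chain_insert(struct linked_trade *rec, long head, long n)
{
    if (head < 0 || strcmp(rec[n].date, rec[head].date) < 0) {
        rec[n].next = head;     /* new record becomes the first one */
        return n;
    }
    long cur = head;
    while (rec[cur].next >= 0 &&
           strcmp(rec[rec[cur].next].date, rec[n].date) <= 0)
        cur = rec[cur].next;
    rec[n].next = rec[cur].next;    /* splice: only two links change */
    rec[cur].next = n;
    return head;
}
```

The catch is the same one FAT suffers from: a sequential read must follow the chain, which on disk means random seeks, and finding the splice point is itself a linear walk. A sorted index file buys the same logical ordering with better read behavior, which is presumably why indexing keeps coming up in this thread.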
Quote:
I don't know about what our friend is doing... but the inventory files I deal with that would be impossible... on one system the main file is almost 18gb and it's indexes run nearly 800megs each.
I am scared by the mere thought of handling an 18 GB file! What would happen if I were to lose it?! Your index alone is 800 MB! I would be scared if my main file itself grew past 50 MB! How reliable is your system? Do you sleep well?! ;-)
Quote:
If this is what I think it is (stock trading) it's very likely his transaction file will grow at a rate of megabytes per day so he's going to have to look into temp files and rewrite-updates to solve his problems.
Yes, asset management. Keeping track of what an asset manager does with investment accounts that he manages for investors.