Thread: Hash Table with Simple record in files

  1. #46
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by infantheartlyje View Post
    Okay. Now I have another doubt. Every record is stored in random order in that file.
    First of all you need to understand this... The records are not stored in random order. They are stored in the order you added them. These are called "random access files" because you can access them in any order ... these are crucial concepts, they are not difficult and you do have to understand them.


    Just forget my client's Transaction file.
    These are the records stored in that file.
    one name and date of birth
    There are 200,000 records. OK. Now I want to search for the people who were born on a specific date. With your method I have to search all 200,000 records.
    Yep... there's just no other way to do it. Indexing can be used but your code will have to be able to deal with multiple identical fields in your indexes... and that means you will have to search the index sequentially. One way or the other, a birthday search requires looking at every record in the file.

    But if the records are sorted by date of birth, we can check records until our date is reached; after that we can ignore the remaining records.
    Yes. As I said above, extracting a birthday index from the main client records can be done... but this file then has to be sorted, which isn't much of a chore...

    But with random-order access, we have to search all records. Please explain how to rectify the problem.
    By not seeing it as a problem... Stop and think about your daily business practices... what information do you look up the most often?

    I'm betting it's the client's account number... so you have 2 choices here...

    1) Record number in account number...
    Give your customers sequential account numbers like this...
    232345-1000000-4454378
    245344-1000001-9870869
    908798-1000002-8798798
    and so on...
    It's a very simple scheme: the first and last groups are randomly generated (i.e. meaningless); the actual account number is in the second group... take the second group, subtract 1000000, and you have the record number in the client file.
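
    Roughly, the lookup is something like this sketch (a hypothetical helper; it assumes the account is handed around as a string and that 1000000 is the starting offset used above):
    Code:
    // Hypothetical sketch: pull the record number out of an account string
    // like "245344-1000001-9870869" -- the middle group minus the offset.
    #include <stdlib.h>
    #include <string.h>

    long AccountToRecNum(const char *acct)
    {
        char buf[64];
        char *mid;

        strncpy(buf, acct, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        strtok(buf, "-");             // skip the first (meaningless) group
        mid = strtok(NULL, "-");      // the real account number
        if (mid == NULL)
            return -1;                // malformed account string

        return atol(mid) - 1000000;   // back to a 0-based record number
    }
    So "245344-1000001-9870869" maps straight to record 1 of the client file... no searching at all.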

    2) Previously assigned account numbers... (eg. Credit card number)
    If you are working with previously assigned numbers you can simply put the account number in the main client file as a field. Extract and sort an index and use a binary search of the index to locate the record number....

    If you have other *unique* bits of information, like a phone number you could build an index for phone numbers.

    However, when you have information that can easily be duplicated many times in unknown places throughout the main file, you are relegated to searching sequentially and possibly dealing with multiple return values from the search... FindFirst, FindNext... and yes, it can be a bit slowish.

    In my client transaction file I want to search for records with a specific date. If it is ordered by date, we can search only up to the specific date. With the random access method we have to search all records; the processing is wasted on some records. I hope you understand my problem? Please help me!
    This is just so simple... if you add the transaction records in the order they are created (i.e. always add to end of file) they will automatically be in chronological order. Since it is most likely that inquiries will be about recent (as opposed to ancient) transactions, just search the file *backwards* from end to beginning.
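
    A rough sketch of that backwards scan (the struct here is a placeholder, not your real record layout):
    Code:
    // Hypothetical sketch: walk a fixed-size-record file from newest to oldest
    // and return the record number of the most recent match for a date.
    #include <stdio.h>
    #include <time.h>

    typedef struct t_TRANS { long transnum; time_t date; } TRANS;

    long FindLatestByDate(FILE *fp, time_t wanted)
    {
        TRANS rec;
        long count, i;

        fseek(fp, 0, SEEK_END);
        count = ftell(fp) / (long)sizeof(TRANS);   // records in the file

        for (i = count - 1; i >= 0; i--)           // newest record first
        {
            fseek(fp, i * (long)sizeof(TRANS), SEEK_SET);
            if (fread(&rec, sizeof(TRANS), 1, fp) == 1 && rec.date == wanted)
                return i;                          // record number of the hit
        }
        return -1;                                 // nothing for that date
    }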

    You can optimize this further by breaking the file at specific intervals... yearly or perhaps monthly... simply create a new file on the first day of each year so you have files named 2000.trans, 2001.trans, 2002.trans, etc. ... If your transaction numbers embed a date code, for example...
    13334342011
    13334352011
    13334362011
    and so on, it's a fairly easy matter to open the 2011 file and search there.
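
    For instance, if the year really is the last four digits of the transaction number (as in the samples above), picking the file is just string work... something like this hypothetical helper:
    Code:
    // Hypothetical sketch: "13334342011" -> "2011.trans"
    #include <stdio.h>
    #include <string.h>

    void TransFileName(const char *transnum, char *out, size_t outsize)
    {
        size_t len = strlen(transnum);
        const char *year = (len >= 4) ? transnum + len - 4 : transnum;

        snprintf(out, outsize, "%s.trans", year);   // build the yearly filename
    }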

    However, to correlate them with clients you will need to add a customer account number field (see above) to each transaction record. In that way the transaction file also becomes an index into the main client file... so in the process of finding a transaction you also have the means to open the customer's file as well.

    Each transaction record will need fields for...
    1) Transaction number (sequential)
    2) Customer account number (or record # in main file)
    3) Transaction date

    As well as any transaction specific information.

    Now you can do binary searches for transaction numbers... which will be very fast. You can do sequential searches based on multiple criteria... Customer, date, transaction type etc...

    The thing is... before you start pounding out code you need to do an analysis of the requirements and plan your database layout accordingly... This isn't something you can do "off the cuff"... you HAVE to analyse and plan this kind of data structure or it will eventually fail.
    Last edited by CommonTater; 10-10-2011 at 01:22 PM.

  2. #47
    Registered User
    Join Date
    Oct 2011
    Location
    India
    Posts
    53
    Hi... I generated the record number using the date.
    Code:
    long int getpos(char *date)
    {
        /* date must have room for one extra character: at least 10 bytes for "MMDDYYYY0" */
        strcat(date, "0");      /* append a trailing 0, i.e. multiply the date number by 10 */
        return atol(date);      /* e.g. "04042000" -> "040420000" -> 40420000 */
    }
    This is my Record Write calling
    Code:
        case 'A' :
                printf("Enter date (MMDDYYYY) : ");
                scanf("%8s", date);              /* date[] must hold at least 10 chars */
                pos = getpos(date);
                strcpy(Record.S_trancode, "dp");
                WriteRecord(File, pos);
                printf("Record Added\n\n");
                break;
    This is my Structure definition
    Code:
    struct t_Record
      { 
        char S_trancode[3];
    }Record;
    This is my Writing Record Definition

    Code:
    // Write a record to the file
    long int WriteRecord(FILE *File, long int RecNum)
    {
        if (!ReadRecord(File, RecNum))      /* slot is free, write here */
        {
            if (fseek(File, RecNum * sizeof(Record), SEEK_SET) == 0)
                if (fwrite(&Record, sizeof(Record), 1, File))
                    return RecNum;
            return RecNum;                  /* seek or write failed */
        }

        /* slot already occupied: try the next record number and return
           the record number that was actually written */
        return WriteRecord(File, RecNum + 1);
    }
    After writing only three records to the file (abc.cli), the file is 118 MB. But I stored only 3 records. Why is it like this? Is there something wrong with the record number?

  3. #48
    Registered User
    Join Date
    Oct 2011
    Location
    India
    Posts
    53
    This is my Reading Record Definition :

    Code:
    // read a record from the file
    int ReadRecord(FILE *File, long int RecNum)
    {
        if (fseek(File, RecNum * sizeof(Record), SEEK_SET) == 0)
            if (fread(&Record, sizeof(Record), 1, File))
                return 1;
        return 0;
    }
    This is Viewing Record Definition :

    Code:
    int ViewRecord(FILE *File, long int RecNum)
    {
        int cnt = 0;

        if (ReadRecord(File, RecNum))
        {
            cnt = 1;
            //printf("-----\n");
            printvalues(RecNum);
            ViewRecord(File, RecNum + 1);
        }
        if (cnt == 0)
        {
            printf("\n No Records \n");
            return 0;
        }
        return 1;
    }


    Problem No 1:
    After writing only three records to the file (abc.cli), the file is 118 MB. But I stored only 3 records. Why is it like this? Is there something wrong with the record number?

    Problem No 2:
    I inserted the first record for 04042000:
    pos : 40420000
    S_trancode : dp
    Then the second record for 04042000:
    pos : 40420001
    S_trancode : dp
    If I view records for date 04042000, it prints the output correctly.

    Now I added the third record for date 04052000:
    pos : 40520000
    S_trancode : dp
    Now the 4th record for date 04052000:
    pos : 40520001
    S_trancode : dp
    If I view records for date 04052000, it shows records from 40420000 to 40520001. Why does it print like this?

    (If you want to see my full code, tell me. I'll post it in the next reply.)

  4. #49
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Yes... there is a problem...

    1) What happens if you have more than one record for the same date? (Hint: in a busy office there could be thousands!)

    2) You are not generating sequential record numbers; you are writing 3 records *very far apart* in the file.

    3) Given the way dates increment you will not be writing in chronological order since the calculated record number for the first of one month can be lower than that for the 30th of the month before.

    4) You will get overwrites since there can be multiple dates that calculate to the same record number.

    Have it display the calculated record numbers... I think you'll be in for something of a surprise.

    For Random Access Files to work efficiently the records must be placed one right after the other... look at my last sample code in message 33 of this thread... see how it adds new records? Seek to end of file... write new record sequentially...

    And yes... So you know... if you took a file with 0 data in it and told it to write record 100,000 you're going to have a single entry at position 100,000 of the file... that could be gigabytes away depending on your record size.
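
    To put rough numbers on it (assuming sizeof(Record) really is 3 bytes for your 3-character struct): getpos() turns 04042000 into a record number around 40,420,000, and 40,420,000 * 3 bytes is roughly 120 MB of file just to reach that slot... which is right in the neighbourhood of the 118 MB you're seeing. The exact figure depends on the highest record number you actually wrote.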

    Once again... you need to stop randomly trying stuff... sit down and plan your structs and their data, design a storage system on some sensible basis... all this monkeying around isn't going to get you to a solution any time soon.

    For example, here's the file structure of one of my inventory programs ... ("Inventory" could be "Clients" or just about anything else)

    MainData...
    Code:
    typedef struct t_PARTSLIST
      { long long int partnum;
         char description[64];
         char location[4];
         int instock;
         int minstock;
         int maxstock;
         int cost;
         int retail;
         int supplier;
         time_t lastorder; }
      PARTSLIST, *pPARTSLIST;
    From this I take 1 index file
    Code:
    typedef struct t_PARTSIDX
      { long long int partnum;
         int recnum; }
      PARTSIDX, *pPARTSIDX;
    This gives me everything I need to enter parts, add them sequentially to the end of the file, extract the index file and sort it by part numbers... This allows the operator to enter a part number and get the main inventory information on screen.
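
    In rough code, the extract-sort-search cycle looks something like this sketch (it uses the PARTSLIST and PARTSIDX structs above; error handling is kept to a minimum):
    Code:
    // Hypothetical sketch: build the PARTSIDX index in memory, sort it by part
    // number, then binary search it to find a record number in the main file.
    #include <stdio.h>
    #include <stdlib.h>

    static int CompareIdx(const void *a, const void *b)
    {
        const PARTSIDX *x = a, *y = b;
        return (x->partnum > y->partnum) - (x->partnum < y->partnum);
    }

    // Read every main record, keep only partnum + record number, then sort.
    long BuildIndex(FILE *mainfile, PARTSIDX **out)
    {
        PARTSLIST rec;
        PARTSIDX *idx;
        long count = 0, total;

        fseek(mainfile, 0, SEEK_END);
        total = ftell(mainfile) / (long)sizeof(PARTSLIST);
        if (total == 0)
        {
            *out = NULL;
            return 0;
        }

        idx = malloc(total * sizeof(PARTSIDX));
        if (idx == NULL)
            return -1;

        rewind(mainfile);
        while (count < total && fread(&rec, sizeof(PARTSLIST), 1, mainfile) == 1)
        {
            idx[count].partnum = rec.partnum;
            idx[count].recnum  = (int)count;
            count++;
        }
        qsort(idx, count, sizeof(PARTSIDX), CompareIdx);
        *out = idx;
        return count;
    }

    // Binary search the sorted index; returns the main-file record number or -1.
    int FindPart(const PARTSIDX *idx, long count, long long partnum)
    {
        PARTSIDX key, *hit;

        key.partnum = partnum;
        hit = bsearch(&key, idx, count, sizeof(PARTSIDX), CompareIdx);
        return hit ? hit->recnum : -1;
    }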

    I do the same with suppliers...
    Code:
    typedef struct t_SUPPLIER
      { char name[64];
         char street[64];
         char city[32];
         char zip[12];
         char phone[15];
         int minorder;
         time_t lastorder;
         int taxstatus;
         int courier; }
         SUPPLIER, *pSUPPLIER;
    Like the other two files these records are simply added sequentially to the end of the file. When a new supplier is added the record number becomes the supplier code ... So for example: Mercury Enterprises, 123 Anystreet, Anytown, Canada, A1A 1A1 might be record #6... To this I add an offset of 1230 (or such) so that supplier numbers range from 1230 to whatever... Mercury Enterprises would be seen as 1236 in the main file. To look up their information I simply subtract 1230 and open that record...
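
    In code, that lookup is about as small as it gets... a sketch, assuming the SUPPLIER struct above and the same fixed-record reads used elsewhere in this thread:
    Code:
    // Hypothetical sketch: supplier code 1236 -> record 6 of the supplier file.
    #include <stdio.h>

    #define SUPPLIER_OFFSET 1230

    int ReadSupplier(FILE *fp, int suppliercode, SUPPLIER *out)
    {
        long recnum = suppliercode - SUPPLIER_OFFSET;

        if (recnum < 0 || fseek(fp, recnum * (long)sizeof(SUPPLIER), SEEK_SET) != 0)
            return 0;
        return fread(out, sizeof(SUPPLIER), 1, fp) == 1;
    }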

    The same tactic is used with the couriers referenced in the Supplier file... producing an easy look up of a supplier's preferred courier.

    Searches for part descriptions are done sequentially upon the main file as are searches for items in a given stock location... this is because there can be multiple returns and a sequential search allows me to list them all. More complex searches can be written: for example list all parts in a given location, list parts due for reorder from a given supplier... etc.

    This represents a complete filing system for an electronic supplier's warehouse inventory... These files typically carry a load of 100,000 or more parts.

    Access times when entering part numbers for customers, are in fractions of a second.

    A re-order list (parts where instock < minstock) takes less than a minute, the parts manager edits the list and the system actually prints the purchase orders to be sent to the suppliers.

    The core code used to read, write, and add to these files is essentially that given in my two examples, those three simple functions...

    Again, I implore you to stop blundering around like a lost bull...
    ANALYSE what you need...
    PLAN how to make it happen...
    WRITE your code...
    TEST your code using dummy information...
    The four steps in my signature...

    Now... you need to sit down and figure out what your system needs, and work out a sensible plan for providing it.
    Last edited by CommonTater; 10-11-2011 at 02:07 AM.

  5. #50
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Ok, now... How do you write a transactions file?

    Here's the struct based on your information...

    Code:
    typedef struct t_TRANSACTION
      { long long int transnum;    // a numerical transaction number
        long long int client;      // numerical customer number
        time_t date;               // the record date and time in seconds
        char transcode[3];         // ascii transaction code
        char securtype[5];         // ascii security code
        char secursymb[5];         // ascii security symbol
        time_t traded;             // timestamp of trade
        time_t settled;            // timestamp of settlement
        int qtytraded;             // number of items traded
        int amount;                // value of trade
        char sourcetype[10];       // source of trade
        char sourcesymb[5]; }      // source symbol
      TRANSACTION, *pTRANSACTION;
    Simply add each new trade to the end of the file... don't go calculating any screwy record numbers... tack it on the end of the file and you don't care what record number it's in... you really don't.

    Now you have two choices...

    1) If transaction numbers are not assigned elsewhere, have your software assign transnum so that each new record is simply the transaction number of the previous record + 1 ... Now you can simply do a binary search for the transaction number and retrieve the information lightning fast.
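
    Since option 1 means the file itself is already in transaction-number order, you can binary search the file directly... a rough sketch (using the TRANSACTION struct above):
    Code:
    // Hypothetical sketch: binary search a transaction file that is ordered by
    // transnum because numbers were assigned sequentially as records were added.
    #include <stdio.h>

    long FindTransaction(FILE *fp, long long transnum, TRANSACTION *out)
    {
        TRANSACTION rec;
        long lo = 0, hi, mid;

        fseek(fp, 0, SEEK_END);
        hi = ftell(fp) / (long)sizeof(TRANSACTION) - 1;

        while (lo <= hi)
        {
            mid = lo + (hi - lo) / 2;
            fseek(fp, mid * (long)sizeof(TRANSACTION), SEEK_SET);
            if (fread(&rec, sizeof(TRANSACTION), 1, fp) != 1)
                return -1;                    // read error

            if (rec.transnum == transnum)
            {
                *out = rec;
                return mid;                   // record number of the match
            }
            if (rec.transnum < transnum)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return -1;                            // not in the file
    }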

    2) if the transaction numbers are assigned for you... extract an index file...
    Code:
    typedef struct t_TRANSIDX
      { long long int transnum;
         int recnum; }
        TRANSIDX, *pTRANSIDX;
    ... and sort it into order by transaction number and use it to access your records.

  6. #51
    Registered User
    Join Date
    Oct 2011
    Location
    India
    Posts
    53
    First of all you need to understand this... The records are not stored in random order. They are stored in the order you added them. These are called "random access files" because you can access them in any order ... these are crucial concepts, they are not difficult and you do have to understand them.



    Quite true, the records are stored in the same order that I add them. If/when financial houses send “all” their records for “every day” without any interruption, the appended records would always be in chronological order. But this is hardly the case. There are instances when 2-year-old transactions are sent, and we are trying to figure out a way to store the records chronologically by “trade/transaction” date. (Of course, we are going to have a Post Date, which will be the current/system date, to keep track of when exactly we got the record, but that is beside the point of our discussion).


    Yep... there's just no other way to do it. Indexing can be used but your code will have to be able to deal with multiple identical fields in your indexes... and that means you will have to search the index sequentially. One way or the other, a birthday search requires looking at every record in the file.

    Yes, looking/searching “every” record is going to be the solution. But, it is going to be “very” expensive in terms of machine resources. This is “exactly” what we are trying to avoid.


    Yes. As I said above, extracting a birthday index from the main client records can be done... but this file then has to be sorted, which isn't much of a chore...


    I believe that you are asking me to sort the “index” and not the main file. I believe that is why you are saying that it would not be much of a chore. I agree. On the other hand, sorting the main file itself, I believe, would be expensive.



    By not seeing it as a problem... Stop and think about your daily business practices... what information do you look up the most often?
    However, when you have information that can easily be duplicated many times in unknown places throughout the main file, you are relegated to searching sequentially and possibly dealing with multiple return values from the search... FindFirst, FindNext... and yes, it can be a bit slowish.

    Yes, I agree. If we have to look up the whole file, we need to read the entire file. As you said correctly, this is going to be machine expensive or slow. This is what we are trying to overcome.


    This is just so simple... if you add the transaction records in the order they are created (i.e., always add to end of file) they will automatically be in chronological order. Since it is most likely that inquiries will be about recent (as opposed to ancient) transactions, just search the file *backwards* from end to beginning.

    I totally agree. The issue is that financial institutions send past records (which they left out earlier) at later dates, and that throws a wrench into a well-oiled machine!



    The thing is... before you start pounding out code you need to do an analysis of the requirements and plan your data base layout accordingly... This isn't something you can do "off the cuff"... you HAVE to analyse and plan this kind of data structure or it will eventually fail.

    I totally agree. After due analysis, we have come to the following conclusions.



    Our conclusions based on analysis:
    Let me state the conclusions first and then I will list our reasons for doing so.
    1) Keep every account's transactions in its own individual files. For e.g., account numbers, 100, 101, 102... and so on will have transaction files 100.trn, 101.trn, 102.trn and so on.
    2) We are not going to keep individual indexes for each account/transaction file.
    3) Whenever a file, for e.g., 100.trn, is opened and records are appended/edited to it, we want to figure out a way to sort this file by transaction date and then save the sorted records back to the transaction file.


    Reasons for our conclusions:
    (The reasons are numbered to correspond to the conclusions numbered above).
    1) Whenever reports are generated, it is usually for individual accounts. Of course, there are times when multiple accounts are combined “on the fly” to create consolidated reports but, when you look at the “total” number of accounts and the “accessed” number of accounts in a typical report, the number of accessed accounts is very few. So, instead of keeping all accounts in one mother of all file, let us keep every account in individual files to enable fast record access.
    2) If we are going to keep every account in its own .trn file, it becomes infinitely cumbersome to maintain “one index per one transaction file”. Looking up records in considerably smaller individual account files is also much faster/more efficient compared to looking up in a mother of all account file. So, index files are unnecessary.
    3) Even though individual account/transaction files are considerably small, some of these files can be considerably large, for e.g., financial institutions' house accounts. So, if at all we can figure a way to sort and keep the transaction files in chronological transaction date order, it would be easier when reports for specific dates are run. Reports almost always are for specific dates only and hence, keeping the files in chronological transaction date order would be immensely efficient for transaction reading. Whenever the transaction date of a record is later than the date range, record access can be terminated under the assumption that the remaining records would also fall out of date range.


    So, the question is:
    How can we find the best way to sort each account file by transaction date whenever transactions are appended/edited?


    We looked at sort utilities that we can buy, and there are also open source sort utilities. But is there a way to bypass this sort to begin with? After all, sorts are machine expensive too. Is there a way to bypass the sort but still be able to retrieve the records in chronological transaction date order?


    In a typical File Allocation Table(FAT) for a file, whenever records are inserted/edited/appended, instead of sorting the entire data set, the Operating System would simply allocate memory for the added records and then update the FAT. This will make sure that when the file is opened, retrieval of the data would be in the same order kept in the FAT. For e.g., in a spreadsheet(ss) when 5 new records are inserted in the middle of the existing rows, the OS would allocate memory space for the 5 new records wherever it can find memory. But, the OS would revise the FAT to account for the order in which the records would have to be retrieved in order to maintain the same order in which the rows of the spreadsheet would have to be displayed when the spreadsheet is opened later.


    Is there a way to mimic the OS's way of updating the FAT to maintain record order in files, when we append/edit records in our transaction files also?


    Ultimate question:
    Is there a way to keep the records in chronological transaction date order without sorting the file?

  7. #52
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by infantheartlyje View Post
    In a typical File Allocation Table(FAT) for a file, whenever records are inserted/edited/appended, instead of sorting the entire data set, the Operating System would simply allocate memory for the added records and then update the FAT. This will make sure that when the file is opened, retrieval of the data would be in the same order kept in the FAT. For e.g., in a spreadsheet(ss) when 5 new records are inserted in the middle of the existing rows, the OS would allocate memory space for the 5 new records wherever it can find memory. But, the OS would revise the FAT to account for the order in which the records would have to be retrieved in order to maintain the same order in which the rows of the spreadsheet would have to be displayed when the spreadsheet is opened later.

    Is there a way to mimic the OS's way of updating the FAT to maintain record order in files, when we append/edit records in our transaction files also?

    Ultimate question:
    Is there a way to keep the records in chronological transaction date order without sorting the file?
    I don't think you know what you are talking about here. But, to answer your ultimate question: No.

    You can't magically insert into the middle of a file unless there is blank space set aside for it, which you can't really do, since you have no idea how much blank space you would have had to have set aside ahead of time. Why do you care if the file is sorted anyway? Can't you just read the whole thing into memory and sort it as you read it for whatever it is you need to do with it?


    Quzah.
    Hope is the first step on the road to disappointment.

  8. #53
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quzah asks a good question... unless you have ancient machines with limited memory (less than a gigabyte) there are in-memory sorting routines, like the qsort() supplied with every modern C library, that can be amazingly fast. The strategy would be to load the entire file into memory, sort it there, then write it back out in order...
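
    The whole load-sort-rewrite cycle is only a few lines... here's a bare-bones sketch (the record layout is a stand-in, not your real one):
    Code:
    // Hypothetical sketch: read a whole fixed-record file into memory, sort it
    // by date with the standard library qsort(), and write it back in order.
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct t_TRANS { long long transnum; time_t date; } TRANS;

    static int ByDate(const void *a, const void *b)
    {
        const TRANS *x = a, *y = b;
        return (x->date > y->date) - (x->date < y->date);
    }

    int SortFileByDate(const char *filename)
    {
        FILE *fp = fopen(filename, "r+b");
        TRANS *recs;
        long count;

        if (fp == NULL)
            return 0;

        fseek(fp, 0, SEEK_END);
        count = ftell(fp) / (long)sizeof(TRANS);
        if (count == 0) { fclose(fp); return 1; }       // nothing to sort

        recs = malloc(count * sizeof(TRANS));
        if (recs == NULL) { fclose(fp); return 0; }

        rewind(fp);
        count = (long)fread(recs, sizeof(TRANS), count, fp);   // load everything

        qsort(recs, count, sizeof(TRANS), ByDate);              // in-memory sort

        rewind(fp);
        fwrite(recs, sizeof(TRANS), count, fp);                 // write back in order

        free(recs);
        fclose(fp);
        return 1;
    }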

    However, there are other problems... Keeping a separate file for each client, while superficially more efficient, is not ultimately your best method. If you want an amalgamated list of "all transactions done June 14th 2009", you are going to have to search every single one of hundreds --maybe thousands-- of files, write intermediate files, sort them and shuffle the whole thing into a final report file, and that's just way too error prone. This is going to be painfully slow even beside a sequential search of a single transaction file.

    Indexing is once again your best answer... when transactions arrive, just tack them onto the end of your main file and as with your clients etc. you don't actually care what the record number is, so long as the darned thing is in the file.

    If the records are branded with sequential transaction numbers, your job will get a LOT easier. Each source should have its own sequence code, for example... Kelly's has code 1123000000 and sequentially assigns numbers, Frank's has 212100000000, and so on. Now you can index these transactions according to a *unique* key. If these transactions are not uniquely branded, you are basically screwed because of the very high likelihood of duplication and overlaps... Your plan dies because you don't have enough information.

    However; there is still a way to recover from such a problem... Brand them yourself! Simply assign a sequence number to each transaction as it arrives on your system... this sequence number can --easily as not-- be the transaction's record number in the file. This would then have to be reported back to source in some way as the transaction number would then be your reference for the event.

    Even failing that, you can still index the file by dates. The catch is you need to prepare yourself to deal with the ambiguities. If you extract dates and record numbers from the file you are going to get a lot of duplication. You will thus have to write an index and sort it so that similar records (same date) appear sequentially in your sort... You can then use a customized binary search to find *ANY* record from the desired date... then roll down the index until you find the beginning of the sequence and list all events for that date... A further refinement would be to find all events for that date *for that client*... meaning that you would have to have the timestamp, client id and record number in your index file. From there you could present a list of records, that can be manually checked by the operator.
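
    That "find any hit, then roll back to the first one" trick looks roughly like this (the index layout is a placeholder; yours would carry the timestamp, client id and record number as described):
    Code:
    // Hypothetical sketch: binary search a date-sorted index that may contain
    // duplicates, then back up so every record for that date can be listed.
    #include <time.h>

    typedef struct t_DATEIDX
      { time_t date;
        long long client;
        long recnum; }
      DATEIDX;

    // Returns the position of the FIRST index entry matching 'wanted', or -1.
    long FindFirstByDate(const DATEIDX *idx, long count, time_t wanted)
    {
        long lo = 0, hi = count - 1, hit = -1, mid;

        while (lo <= hi)                    // ordinary binary search first...
        {
            mid = lo + (hi - lo) / 2;
            if (idx[mid].date == wanted) { hit = mid; break; }
            if (idx[mid].date < wanted)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        if (hit < 0)
            return -1;

        while (hit > 0 && idx[hit - 1].date == wanted)
            hit--;                          // ...then roll back to the first duplicate
        return hit;
    }
    From there you just walk forward through the index, listing (or client-filtering) records until the date changes.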

    Your problem is that you are not dealing with *unique* information. For example, in my inventory packages (detailed in message #49) every part number is guaranteed to be unique, thus it is easily indexed. It makes no difference whatsoever what order they appear in the main file; the index sort takes care of that. Were that not the case, I would be faced with writing hybrid searches that could deal with ambiguities by finding the exact part number, then checking the one before and the one after for duplicates and listing multiple hits for the operator to sort out... It can be done, but it's a royal pain in the backside to get it right. A simple binary search is fast and elegant, very easy to program... the ambiguities complicate the thing to no end.

    Another technique you can use when transactions arrive out of order is to dump them into a temporary file. Then after hours run a routine that sorts the temp file (much smaller), then rewrites your transaction file record by record, inserting the out of order transactions as it goes. This again mandates a smarter search algorithm in that you will have to first search the main file and then look in the temp file... but even this will be orders of magnitude faster than searching literally thousands of individual customer files.

    Since you've not shown me a unique transaction number (as I showed you in message #50) and since these transactions do not appear to be branded with any kind of client code or sequence code... this is some serious bad juju even for a small business. Because of that fundamental flaw, it may simply be that your information is far too incomplete for this type of filing system. Random access files and their indices rely upon unique identifiers for their efficiency...

    Quite honestly, I'm thinking you're on a road to noplace unless each transaction is somehow uniquely identified...

    Exactly how many transactions are we talking about on a given day?

    And... FWIW, nobody uses FAT filesystems anymore... they're too slow, insecure and limited to 4gb file sizes.
    Last edited by CommonTater; 10-11-2011 at 07:05 AM.

  9. #54
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by quzah View Post
    I don't think you know what you are talking about here. But, to answer your ultimate question: No.
    You can't magically insert into the middle of a file unless there is blank space set aside for it, which you can't really do, since you have no idea how much blank space you would have had to have set aside ahead of time. Why do you care if the file is sorted anyway? Can't you just read the whole thing into memory and sort it as you read it for whatever it is you need to do with it?
    Quzah.
    I don't know about what our friend is doing... but for the inventory files I deal with that would be impossible... on one system the main file is almost 18GB and its indexes run nearly 800 MB each. I can do in-memory sorts of the indexes but wouldn't stand a chance of doing it with the main file.

    If this is what I think it is (stock trading) it's very likely his transaction file will grow at a rate of megabytes per day so he's going to have to look into temp files and rewrite-updates to solve his problems.
    Last edited by CommonTater; 10-11-2011 at 06:59 AM.

  10. #55
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by infantheartlyje View Post
    In a typical File Allocation Table(FAT) for a file, whenever records are inserted/edited/appended, instead of sorting the entire data set, the Operating System would simply allocate memory for the added records and then update the FAT. This will make sure that when the file is opened, retrieval of the data would be in the same order kept in the FAT.
    Sorry, but that's just not how filing systems work. The File Allocation Table (when used) is a table that tracks which disk clusters are allocated to which file and which are free. This has literally nothing to do with A) the order of things in memory or B) the logical order of records inside the file. It is only about allocation units on the disk.

    Is there a way to mimic the OS's way of updating the FAT to maintain record order in files, when we append/edit records in our transaction files also?

    No. Plain and simple... no there isn't.

  11. #56
    Registered User
    Join Date
    Oct 2011
    Location
    India
    Posts
    53
    Quote Originally Posted by quzah View Post
    I don't think you know what you are talking about here. But, to answer your ultimate question: No.

    You can't magically insert into the middle of a file unless there is blank space set aside for it, which you can't really do, since you have no idea how much blank space you would have had to have set aside ahead of time.
    I absolutely agree that there is no way to insert data into the middle of an existing file “physically”; what I mean is that it cannot be done on the recording media, the magnetic disk. Yes, there is no blank space that was left earlier, and there is no way I can insert any data without erasing already existing data.
    Again, please take the example that I provided: inserting 5 rows into a spreadsheet. (Obviously, the sample given below is hypothetical. I am NOT an expert in memory allocation or FAT, but, based on what I studied, I tried to give an example).
    Let us say the FAT would look like something like this prior to insertion:
    File       Block   Mem. Location Begin   Mem. Location End
    File.xls   1       80                    90
    File.xls   2       560                   600
    File.xls   3       440                   490
    File.xls   4       120                   140
    In this example, the file File.xls was written to the magnetic media in 4 different locations. When the file is accessed, based on what is in the FAT, block 1 is read, then block 2, block 3 and finally block 4, to put together the file in the same order in which it was intended to be retrieved and presented.
    When 5 new records/rows are inserted, the FAT would be changed, say, as follows:
    File       Block   Mem. Location Begin   Mem. Location End
    File.xls   1       80                    90
    File.xls   2       560                   600
    File.xls   3       440                   490
    File.xls   4       120                   140
    File.xls   2.5     510                   540
    When the file is accessed, based on what is in the FAT, block 1 is read, then block 2, block 2.5, block 3 and finally block 4. Obviously, the newly inserted 5 records would be in memory locations 510 to 540. When the ss editor reads the file, it would put the read data in order and we would have the record order we originally intended.
    Now, all that I am asking is whether this method could somehow be implemented through programming, so that when records are posted for earlier dates, the newly arrived record would be nicely inserted into the FAT, thereby eliminating the necessity for a) sorting the file based on transaction date and then b) writing the sorted records back to disk as a file.
    I am no expert in file systems. Again, I am not saying this is going to be feasible, but asking if this could be done.
    File fragmentation elimination utilities rearrange non-contiguous disk blocks into contiguous blocks so that files can be read faster. Can a similar concept be used in what I am trying to do, where the record with the old transaction date will be written to a new location (similar to the newly inserted 5 rows in the ss), but the FAT would be changed in such a way that when this file is read, all records will be in chronological transaction date order?

  12. #57
    Registered User
    Join Date
    Oct 2011
    Location
    India
    Posts
    53
    Quote Originally Posted by quzah View Post
    Why do you care if the file is sorted anyway? Can't you just read the whole thing into memory and sort it as you read it for whatever it is you need to do with it?
    Yes, this is an option. But, just think about how easy/efficient things would be if the file itself is already sorted! For e.g., in a transaction summary kind of report, if I am looking for trades that happened on 01/15/2008, as I am reading, when I come across a trade on 01/16/2008, I would simply stop reading the file as I know for a fact there are not going to be any more records for 01/15/2008.

    ...there are very fast in-memory sorting techniques like quicksort supplied with modern versions of C that can be amazingly fast. The strategy would be to load the entire file into memory, sort it there, then write it back out in order...

    It is heartening to hear that there are “amazingly fast” sorting utilities in C. But, sorting is “always slow and machine intensive.” Is it not? I am trying to avoid this, if possible. That is all.



    However, there are other problems... Keeping a separate file for each client while superficially more efficient is not ultimately your best method. If you want an amalgamated list of "all transactions done June 14th 2009", you are going to have to search every single one of hundreds --maybe thousands-- of files, write intermediate files, sort them and shuffle the whole thing into a final report file and that's just way too error prone. This is going to be painfully slow even beside a sequential search of a single transaction file.

    I wholeheartedly agree! This is “the” drawback in the one file - one account methodology. But there are packages out there where this concept has been implemented. This system is very reliable, but could very well be slow. I should say that such across-the-accounts kinds of reports are not all that many. Usually, in asset management, for one family you may create 10 accounts. It is across these 10 or 20 accounts that you would find yourself searching for records. But advisors might very well choose to run a report across “all” accounts, and the system should not conk out in such rare scenarios. Again, I agree it will be slow, but it would definitely be more reliable.


    Indexing is once again your best answer... when transactions arrive, just tack them onto the end of your main file and as with your clients etc. you don't actually care what the record number is, so long as the darned thing is in the file.

    I am hesitant when it comes to keeping all transactions in a single mother of all transaction file. And, creating index files for it. What would happen if the one big file is lost?! If each account is in a single file, reading it from backup and getting the system back on line would be relatively simple. Moreover, in the latter case, I would upset one customer whereas in the former case, the whole client base would be screaming for my head!


    Brand them yourself! Simply assign a sequence number to each transaction as it arrives on your system

    Yes, this is “the” only option! You are correct that the fields in the records can not be used as unique keys.



    A further refinement would be to find all events for that date *for that client*... meaning that you would have to have the timestamp, client id and record number in your index file.

    I agree totally! Not necessarily the time stamp, but the transaction date in our case.


    Another technique you can use when transactions arrive out of order is to dump them into a temporary file.

    This is something we routinely do. We copy the transaction date field only into a separate file, make sure there is no date other than the date for which we are posting trades, and then do the post. If dates other than the date for which we post trades are found, the earlier-dated transaction is moved to a temp file as you suggested and then processed separately. (This is not done in order to keep the transaction dates sorted, but because posting an earlier-dated transaction upsets previously reconciled position records and hence has to be manually cross-checked).


    Then after hours run a routine that sorts the temp file (much smaller) then rewrites your transaction file record by record inserting the out of order transactions as it goes.

    Yes, I have to do this record-by-record reading and then writing. This is what I am trying to avoid! Just like my earlier example where the OS allocates physical space for the 5 rows to be inserted but keeps track of the order of the locations to be read in the FAT, is there a way for me to keep the out-of-date-order record in newly allocated physical space but change the FAT to maintain the correct order of locations to be read?
    I know this might sound weird. But, just imagine if a way can be found... ;-)


    ...but even this will be orders of magnitude faster than searching literally thousands of individual customer files.

    I agree. File I/O is always expensive. But individual account files do improve the system's reliability. I totally agree that for a system where you run reports that access data “across” accounts “very often”, I would not recommend individual files for every account. As you said earlier, analysis of the situation, and, hence, the requirements drive the solution.


    Since you've not shown me a unique transaction number (as I showed you in message #50) and since these transactions do not appear to be branded with any kind of client code or sequence code... this is some serious bad juju even for a small business. Because of that fundimental flaw, it may simply be that your information is far too incomplete for this type of filing system. Random access files and their indices rely upon unique identifiers for their effiency...

    Sorry if my colleague's earlier postings were not totally clear. Every trade would have an account code with it. Without it, when I download transaction data from the brokerage, I would not know which account it is meant for. If I were to choose a system where I keep one mother-of-all file for transactions, I will use the account code and an (internally generated) sequence number for the index file.


    Quite honestly, I'm thinking you're on a road to noplace unless each transaction is somehow uniquely identified...

    Nope! You have no idea how much you have helped us crystallize our thoughts here! Thank you, sir. I do appreciate you taking the time to write to us. Next time when I am visiting Niagara Falls, I would very much like to take your family for lunch.


    Exactly how many transactions are we talking about on a given day?

    If I can design a system that can handle, say, 300,000 accounts and around 1,000,000 transactions per day, I would indeed be a happy man. Looks like I will be going down the “one account – one transaction file” road. I would be using around 20 machines to divvy up and post trades. So, that comes about roughly 15,000 accounts per machine and 50,000 transactions per machine. Looks like Pentium dual cores in Linux environment ought to be able to handle that, right?


    And... FWIW, nobody uses FAT filesystems anymore... they're too slow, insecure and limited to 4gb file sizes.

    I am “NOT” an expert in file systems at all! So, I do not know what I am talking about! But, I am providing you with an example using FAT; and I am asking if the OS's way of allocating memory and keeping track of the order in which disk memory locations should be read and presented, can be mimicked in our solution or not.


    I don't know about what our friend is doing... but the inventory files I deal with that would be impossible... on one system the main file is almost 18gb and it's indexes run nearly 800megs each.

    I am scared by the mere thought of handling an 18GB file! What would happen if I were to lose it?! Your index alone is 800MB! I would be scared if my main file itself reached more than 50MB! How reliable is your system? Do you sleep well?! ;-)


    If this is what I think it is (stock trading) it's very likely his transaction file will grow at a rate of megabytes per day so he's going to have to look into temp files and rewrite-updates to solve his problems.

    Yes, asset management. Keeping track of what an asset manager does with investment accounts that he manages for investors.

  13. #58
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by infantheartlyje View Post
    I absolutely agree that there is no way to insert data into the middle of an existing file “physically”; what I mean is that it cannot be done on the recording media, the magnetic disk. Yes, there is no blank space that was left earlier and there is no way I can insert any data without erasing already existing data.
    You are overthinking this by great leaps and bounds... how the file sits on disk is not about FATs; in fact it's very unlikely the software is even aware of what filing system is in use on the hard disk. It is a mistake to extend your thinking that far, as you will end up lost in the minutiae.

    Yes, you can insert records into a file. What you do, as I described earlier, is write the out-of-order records to a temporary file. Then you take your main file and start copying it record by record; when you get to a spot where one of the records in the temp file belongs, you write the temp record to the new file, then carry on copying the main records over... when you're done, you delete the old file and rename the new one... all done. The effect is to insert the out-of-order records during the file copy...
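
    As a rough sketch -- assuming both files hold the same fixed-size records, the temp file has already been sorted by date, and the record layout here is just a stand-in -- the copy is a classic two-way merge:
    Code:
    // Hypothetical sketch: merge a small, date-sorted temp file into the main
    // (already ordered) file by copying record by record into a new file, then
    // deleting the old file and renaming the new one.
    #include <stdio.h>
    #include <time.h>

    typedef struct t_TRANS { long long transnum; time_t date; } TRANS;

    int MergeTemp(const char *mainname, const char *tempname)
    {
        FILE *fmain = fopen(mainname, "rb");
        FILE *ftemp = fopen(tempname, "rb");
        FILE *fout  = fopen("merged.tmp", "wb");
        TRANS m, t;
        int have_m, have_t;

        if (fmain == NULL || ftemp == NULL || fout == NULL)
        {
            if (fmain) fclose(fmain);
            if (ftemp) fclose(ftemp);
            if (fout)  fclose(fout);
            return 0;
        }

        have_m = fread(&m, sizeof(TRANS), 1, fmain) == 1;
        have_t = fread(&t, sizeof(TRANS), 1, ftemp) == 1;

        while (have_m || have_t)            // classic two-way merge
        {
            if (have_m && (!have_t || m.date <= t.date))
            {
                fwrite(&m, sizeof(TRANS), 1, fout);
                have_m = fread(&m, sizeof(TRANS), 1, fmain) == 1;
            }
            else
            {
                fwrite(&t, sizeof(TRANS), 1, fout);
                have_t = fread(&t, sizeof(TRANS), 1, ftemp) == 1;
            }
        }

        fclose(fmain);
        fclose(ftemp);
        fclose(fout);

        remove(mainname);                             // delete the old file...
        return rename("merged.tmp", mainname) == 0;   // ...and rename the new one
    }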

  14. #59
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by infantheartlyje View Post
    Sorry, if my colleague's earlier postings were not totally clear.
    Wanna try confused and confusing? "Not totally clear" doesn't even get close.


    Nope! You have no idea how much you have helped us crystallize our thoughts here! Thank you, sir. I do appreciate you taking the time to write to us. Next time when I am visiting Niagara Falls, I would very much like to take your family for lunch.
    That's very nice of you but we don't do that... Online is online... private is private... the two never meet.

    If I can design a system that can handle, say, 300,000 accounts and around 1,000,000 transactions per day, I would indeed be a happy man. Looks like I will be going down the “one account – one transaction file” road. I would be using around 20 machines to divvy up and post trades. So, that comes about roughly 15,000 accounts per machine and 50,000 transactions per machine. Looks like Pentium dual cores in Linux environment ought to be able to handle that, right?
    ROFL... not even bloody close... a million transactions a day?

    You're probably talking about 15 or more Xeon or Itanium blade servers running a 64-bit OS, with gobs of memory, redundant RAID arrays, continuous backup, routing software, load balancing and all the horrific complexity that comes with it.

    You're talking about --> THIS


    My god man... you need to stop asking for advice on public forums and hire a proper consultant... Really... I don't think you have the first clue how big a job this is. The principles I've outlined will be at the core of it... but with that much traffic, you aren't going to cobble together some quicky C proggys and cross your fingers.



    I am scared by the mere thought of handling a 18GB file! What would happen if I were to lose it?! Your index alone is 800MB! I would be scared if my main file itself reached more than 50MB! How reliable is your system? Do you sleep well?! ;-)
    I sleep well enough... I've never had a data loss --ever-- because I make sure the software is running on a proper server, with mirrored drives and daily backup... The worst case data loss in any of my systems will be one business day... which these systems can easily tolerate. If these were handling even a small fraction of a million transactions a day, I'd be running double backups and a whole lot better hardware...

    It's time to drop this... you need to hire a consultant and get hooked up with a software house capable of such a job.

  15. #60
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by infantheartlyje View Post

    It is heartening to hear that there are “amazingly fast” sorting utilities in C. But, sorting is “always slow and machine intensive.” Is it not? I am trying to avoid this, if possible. That is all.
    I don't know how you can possibly think manipulating it on disk is going to be faster than in memory.


    Quzah.
    Hope is the first step on the road to disappointment.
