Thread: Best route to the result

  1. #1
    Registered User
    Join Date
    Sep 2010
    Posts
    16

    Best route to the result

    I have a 100 MB file which contains 10 million entries, and I need to append 3 characters to the end of each line.

    I know I can do this in DOS.... but way too slow....

    Does anyone have an idea how long this would take to complete in C..... Is C the best programming language for this?

    I have read I can use strcat to do this, but is this the best route for dealing with such large volumes....?

    Any help appreciated....

  2. #2
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Does anyone have an idea how long this would take to complete in C
    If you already know C, this should take about 5-10 minutes for you to write the code.

    Is C the best programming language for this?
    I guess. For such a simple application, I would say that the "best" language is the one you feel most comfortable with.

    I have read I can use strcat to do this, but is this the best route for dealing with such large volumes....?
    There's nothing wrong with strcat in this situation as long as you know your buffer is big enough to accommodate the string you are appending.
    bit∙hub [bit-huhb] n. A source and destination for information.

  3. #3
    Making mistakes
    Join Date
    Dec 2008
    Posts
    476
    Depends mostly on IO speed. Appending three characters to a line (even if it's ten million lines) shouldn't be too slow. Read one line at a time, strcat the characters and write.
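
    As a rough sketch of that read / strcat / write loop (not anyone's tested code): the function name append_suffix, the "XYZ" suffix and the 256-byte line limit are all made up for illustration, and a line longer than that limit would get split and pick up a suffix in the middle.

```c
#include <stdio.h>
#include <string.h>

#define SUFFIX "XYZ"       /* hypothetical 3-character suffix */
#define LINE_MAX_LEN 256   /* assumed upper bound on line length */

/* Copies in to out, appending SUFFIX to the end of every line.
   Returns the number of lines written. */
unsigned long append_suffix(FILE *in, FILE *out)
{
    /* room for the line, the suffix, a newline and the terminator */
    char buf[LINE_MAX_LEN + sizeof(SUFFIX) + 1];
    unsigned long count = 0;

    while (fgets(buf, LINE_MAX_LEN, in) != NULL) {
        size_t len = strcspn(buf, "\r\n");  /* length without the line ending */
        buf[len] = '\0';                    /* strip the line ending */
        strcat(buf, SUFFIX "\n");           /* safe: buf was sized for this */
        fputs(buf, out);
        ++count;
    }
    return count;
}
```

    To use it, open the input with fopen(..., "r") and the output with fopen(..., "w"), pass both streams in, and fclose them afterwards. The buffer is sized up front so that strcat always has room for the suffix, which is the condition mentioned in post #2.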

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Strings in any language are not the fastest way to work with letters. The fastest way would be simply to use blocks of a char array, and forget mucking about with the end-of-string char, strlen(), and the like.

    These things are for convenience's sake, and are some of the slower parts of the C language. C is certainly one of the fastest languages on the planet - if not the fastest - but string handling is rather slow in any language.

    Can you post up a sample of the file, say 50 lines or so, and what you need to append onto the lines, as well? What is possible, and fastest, depends on the specific details of the data.

    By "DOS", you mean using a bat file, right? This is on a Windows system then?
    Last edited by Adak; 09-22-2010 at 01:12 PM.

  5. #5
    Registered User
    Join Date
    Sep 2010
    Posts
    16
    @Adak, thanks for helping...

    1st txt file contains 10million lines of

    AAAAA
    AAAAB
    AAAAC
    AAAAD
    etc

    2nd file contains five 3-character codes
    XXX
    YYY
    ZZZ

    output required is 3 new text files containing
    FILE1
    XXXAAAAA
    XXXAAAAB
    XXXAAAAC
    etc

    FILE2
    YYYAAAAA
    YYYAAAAB
    YYYAAAAC
    etc

    So in effect I need to append 3 characters to the front of each line in a file of 10 million, in a reasonably efficient manner...

    I can do this in my sleep in DOS (yes Windows) but output is limited to 200 per minute, hoping C will complete the task in an hour or 2?

    Thanks for any help...

  6. #6
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Write the prefix to the new file
    Read a line from the existing file
    Write the line from the existing file to the new file

    Repeat.

    By the way.... you are not "appending", you are "prepending"...
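
    Those steps might look something like this in C. It's only a sketch: prepend_prefix is a name invented here, and the 4096-byte chunk size is arbitrary. Lines longer than the buffer are handled by only writing the prefix when the previous chunk ended in a newline.

```c
#include <stdio.h>
#include <string.h>

/* Copies in to out, writing prefix in front of every line.
   Returns the number of lines that were given a prefix. */
unsigned long prepend_prefix(FILE *in, FILE *out, const char *prefix)
{
    char buf[4096];          /* chunk size is arbitrary */
    int at_line_start = 1;   /* true when the next chunk begins a new line */
    unsigned long count = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        if (at_line_start) {
            fputs(prefix, out);   /* write the prefix to the new file */
            ++count;
        }
        fputs(buf, out);          /* then the line (or part of it) itself */

        /* a chunk without a trailing newline means the line continues */
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }
    return count;
}
```

    Called once per prefix with a fresh output file (and a rewind() of the input in between), this would produce one prefixed copy of the list per code.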

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Only 200 per minute with a bat file? You're in for a VERY pleasant surprise with C.

    About the system you'll be running this on:

    cpu is?:

    Amount of memory?:

    Your compiler is?:

    Your operating system is Windows XP?

  8. #8
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    I don't see how anything is going to be much faster than this:

    Code:
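    const char *suffix = "foo";
    int ch;  /* int, not char, so EOF can be told apart from real data */
    while ((ch = fgetc(input)) != EOF)
    {
        if (ch == '\n')
            fwrite(suffix, 1, 3, output);  /* emit the suffix just before the newline */
        fputc(ch, output);
    }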

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    As it turns out, this should be fast, all around. It appears the 5 letters per row data, is a 5 letter permutation list. Since I didn't have enough sample data to work with, I generated my own list, up through AZZZZ - that took just a couple seconds.

    Which made me think that maybe re-generating the entire permutation file, WITH the extra 3-char prefixes we're trying to prepend, would take less time than coding up a prepending program with the file handling the OP wanted, etc.

    In any case, this is only slightly tested for accuracy, and not optimized, or tested against other algorithms or data structures. I avoided using strcat, because I thought it would be a slow down, and wasn't necessary. It's fast enough, imo - about 2.2 Million records prepended, in about 2.5 seconds.

    As Brafil mentioned earlier in the thread, its running speed is largely bound by the IO throughput.

    Code:
    /* 
    prepends 3 char's from perms3.txt file, (which has 5 rows of char's), 
    each row having 3 char's and a newline we don't use. These are 
    written out into the front of each row before the 5 chars 
    (6 counting the newline), in the perms6.txt file, are written out
    to five sequentially numbered files, (allperm1.txt - allperm5.txt).
    
    It's not optimized, or tested against other algorithms, but it's fast. 
    Try it and C. ;)
    
    */
    
    #include <stdio.h>
    
    typedef struct {
      char char3[3];
      char newline;
    }record;
    
    int main(void) {
      FILE *fpin3, *fpin6, *fpout;
      int i;
      const char *filename6="perms6.txt"; //reminder to take the newline char #6
      const char *filename3="perms3.txt";//leave the newline behind
      char fileOut[]="allperm1.txt";
      record rec;
      char char6[6];
      unsigned long int count = 0;
    
      fpin3 = fopen(filename3, "rt");
      fpin6 = fopen(filename6, "rt");
      if(fpin3 == NULL || fpin6 == NULL) {
        printf("\nError opening input files");
        return 1;
      }
      if((fpout =fopen(fileOut, "wb"))== NULL) {
        printf("\nError opening output file");
        return 1;
      }
      printf("\n\n\n");
      for(i=0;i<5;i++) {
        if(fread(&rec, sizeof(rec), 1, fpin3) != 1)  //stop if there are no more prefixes
          break;
        
        while(fread(char6, 6, 1, fpin6) >0) {
          fwrite(rec.char3, 3, 1, fpout);     //fpout or stdout (for debug)
          fwrite(char6, 6, 1, fpout);         //ditto
          ++count;
          //getch();
        }
        if(fileOut[7]=='5')
          break;
        rewind(fpin6);
        fclose(fpout);
        printf("\n closing file %s", fileOut);
        fileOut[7]++; //increment the file number in the name
        printf("\n opening file %s", fileOut);
        if((fpout =fopen(fileOut, "wb"))== NULL) {
          printf("\nError opening output file");
          return 1;
        }
      }
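      fclose(fpout);   //fcloseall() is a non-standard extension,
      fclose(fpin3);   //so close each stream explicitly
      fclose(fpin6);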
      printf("\n\n %lu\n\t\t\t    press enter when ready", count);
      i=getchar();
      return 0;
    }

  10. #10
    Registered User
    Join Date
    Sep 2010
    Posts
    16
    @Adak and Brewbuck.....

    Thanks, great help. Trialling Adak's version at the mo, and it is massively quicker than I thought it would be.

    I take your point about generating the whole file from within the program. I hadn't really considered that approach, and it's well out of my skills at the moment.... I suspect maybe one for me to try in the future!!

    Also going to amend code to have an append version as I think this will be useful for me shortly.

    Thanks again....

  11. #11
    Registered User
    Join Date
    Sep 2010
    Posts
    16
    @Adak...

    Got it working to append rather than prepend....

    The program/terminal requires a key to be pressed to exit the terminal window, is there a way round that? I would like the program to run, create the files, then exit.

    I have searched, looked at break, return and exit but none seem to work?

    Thanks again...

  12. #12
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Remove or REM out the i=getchar() line of code. Second from the last line. By REM I mean "REMark" by putting either a // in front of the first letter of the line, or by surrounding the line, like so:

    /* i=getchar(); */

    What is the longest permutation length you need? 8 char's is still pretty quick, but around 15, it REALLY begins taking up more time than you'd probably like.

    May I ask what you want these permutations for?

  13. #13
    Registered User
    Join Date
    Sep 2010
    Posts
    16
    (I had tried that but it doesn't close properly, i.e. with no user intervention)

    When I do

    Code:
    /*i=getchar();*/
    I get.....

    closing file dic1.txt
    opening file dic2.txt
    closing file dic2.txt
    opening file dic3.txt
    closing file dic3.txt
    opening file dic4.txt
    closing file dic4.txt
    opening file dic5.txt
    Process returned 0 (0x0) execution time : 11.421 s
    Press any key to continue.

    As to the why I am doing this....

    I am building a process to utilise aircrack to crack the WPA key of SKY routers over a cluster of machines (i.e. I have 100 machines at work, so I need them all to have their own dictionaries, and it is more efficient to have each machine build its own (hence the 3 character index txt file) than to have one master copy and copy them across). This is possible on the SKY routers (V2) as the algorithm is known that generates the default WPA key. I am still looking in total at 130 days of process (hence the need for 100 machines) but this is a considerable improvement on 1511 days without the algorithm.... This may be familiar to you....

  14. #14
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    That "press any key to continue" message, is from your console window, not the program.

    The program message is "press enter when ready", and if you've REM'd out the getchar line, then you will not have any pause there.

    I have not been able to find the Windows control for the behavior of the console window when its program terminates.

    I remember when they were working to attack the RSA (small key) challenge message, using gobs of computers, over the internet. Very successful, I believe. Your 100 machine "army" sounds formidable.

    Generating the permutations is (relatively) fast, but what about the time needed to test if a string is correct or not? That could REALLY slow the search to a crawl.

    I'm not familiar with WPA or SKY routers, and like all crypto problems, it's interesting. How long is the WPA key?

    Whatever you do, don't do anything illegal.

  15. #15
    Registered User
    Join Date
    Sep 2010
    Posts
    16
    The generation of a 100mb dictionary takes about 4 minutes, aircrack requires approx 2 hours to test this. Aircrack compares the dictionary permutation one after another, on my machine processing about 1500 per second.

    100 is reasonable but it's very CPU consuming.... my aim is to try and develop this into a SETI type process (where lots of users share the load i.e. take it from 100 computers to xxxxx). This would make most WPA keys breakable.

    The WPA key is 8 characters (A-Z), therefore there are 26 ^ 8 permutations = 208 827 064 576 (208 billion). This would take 1 machine about 1611 days to complete (at 1500 per sec). This is what supports the security on wireless WPA keys.

    Using the algorithm, which is based on the SSID (i.e. SKY12345), reduces the permutations to 16 billion, which would again take one machine 124 days to crack.

    Take 124 days over 100 machines and the SKY WPA is crackable in just over 1 day.

    It's by no means a breakthrough (the knowledge of this is well spread over the internet), but I am interested in seeing if it can be put into practice (please note I have no interest in breaking people's keys, why would I spend 124 days of processing to get internet access....). It's a challenge to myself; if it works I am sure someone will look to improve and develop onwards....

    On a programming note.... I can work round the terminal character for the time being (I'll just generate the 5 dics on my machine and pass them to the slave machine).... I am trying to build the 3 character dics (there will be 1000 of these each containing approx 1521 3 character codes).

    In effect if the SSID code is 123 it will either need to go and get "dic 123" or generate it. To generate it, effectively character 1 is from a pool of 13 letters, character 2 is from a pool of 13 letters, and character 3 is from a pool of 8 or 9 letters. This gives 13 * 13 * 9 = 1521 3 character permutations.

    Just trying to decide if to build the 1000 files as static data, or try and code this.

    i.e enter SSID = 12345
    convert to short code 135
    1 = letters pool ABHYJIODF
    3 = letters pool FHDFEHGFD
    5 = letters pool SDGGVDF
    generate permutations to txt file (for use in code built earlier)

    Which route do you think is easiest?
