Thread: efficient strings reading

  1. #1
    Registered User
    Join Date
    Apr 2006
    Posts
    1

    efficient strings reading

    Hi, I'm building an algorithm for very huge graphs, over several millions of edges (e.g. 5*10^6). A graph is represented in a text file where the edges are in this way:

    "string_origin"|"string_destination"|integer

    where "string_origin" and "string_destination" are the name of the initial and ending nodes of each edge, respectively (such names can have characters, blanks and numbers). integer is the weight of the edge. Now I read this edges as:

    Code:
     while (fd.getline(my_string, SIZE)) {
     istringstream fichlin(my_string);
     fichlin.getline(node1, SIZE, '|'); 
     fichlin.getline(node2, SIZE, '|');
     fichlin >> Weight;
     //process node1 (char *), node2 (char *) and Weight (int)...
    //...
     }
    My problem is that it is very slow!! In my computer the execution of this loop runs in over 30 sec., so the entry of the input data of the problem supposses more running time than the performance of the underlying algorithm used after.

    I would like to know if there's a more efficient way of doing this same operation of strings reading, either in C or in C++ and although it were less clear, so that I could decrease the running time of the program.

    Thank you very much in advance!

    ----------------------------------------------------

  2. #2
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Disk I/O is inherently very slow, but perhaps reading data in large chunks at a time would help? For example, allocate a buffer of maybe 10kb, use fd.read() to fill the buffer, parse appropriate information from it until you get to a fragmented line, shift the leftovers to the left, fill the buffer again, etc.

    Otherwise, probably the best way to speed things up is to represent the data in a more efficient manner in the file. For example, instead of storing the name of each vertex at each edge, store 2 or 4 bytes corresponding to an unsigned long ID for the vertex, and have a lookup table at the head of the file mapping ID's to names. Similarly, the 'integer' could be stored as an unsigned short or unsigned long.

    If this is not possible, I don't know how else you can improve speed; I'm not familiar with the lower-level workings of file I/O.
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  3. #3
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    By using getline() so often, you're reading the string in more times than you have to. You also don't have to use stringstream to assign node1 and node2 their values, it is only necessary to use it to assign Weight's value as an integer type.

    Code:
    int parts = 1;
    while(fd.getline(my_string, SIZE, '|') && parts!=3)
    {
      switch(parts) {
             case 1: node1 = my_string;
             break;
             case 2: node2 = my_string;
             break;
             case 3:      // fall through
             default: istringstream fichlin(my_string)>>Weight;
      }
      ++parts;
    }
    
    // now that you have all the data... process it outside the loop
    That should be a lot easier to understand and faster. You can always use the c_str() function to get a character array if you need one.

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Some results from a 5M line file.
    Code:
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <sstream>
    using namespace std;
    const int SIZE = 100;
    void process ( char *node1, char *node2, int w ) {
      *node1 = '\0';
      *node2 = '\0';
      w = 0;
    }
    int main ( ) {
      ifstream fd("tmp");
      char my_string[SIZE];
      while (fd.getline(my_string, SIZE)) {
        char node1[100], node2[100]; int Weight;
        istringstream fichlin(my_string);
        fichlin.getline(node1, SIZE, '|');
        fichlin.getline(node2, SIZE, '|');
        fichlin >> Weight;
        process( node1, node2, Weight );
      }
      return 0;
    }
    # Using standard C++
    $ g++ -W -Wall -ansi -pedantic -O2  foo.cpp
    $ time ./a.out
    real    0m23.032s
    user    0m16.821s
    sys     0m0.457s
    
    
    
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    void process ( char *node1, char *node2, int w ) {
      *node1 = '\0';
      *node2 = '\0';
      w = 0;
    }
    int main ( ) {
      char buff[BUFSIZ];
      FILE *fp = fopen("tmp","r");
      while ( fgets( buff, sizeof buff, fp ) != NULL ) {
        char  *node1 = strtok(buff,"|");
        char  *node2 = strtok(NULL,"|");
        int   Weight = atoi(strtok(NULL,"\n"));
        process( node1, node2, Weight );
      }
      fclose(fp);
      return 0;
    }
    # Using standard C
    $ gcc -W -Wall -ansi -pedantic -O2 bar.c
    $ time ./a.out
    real    0m15.179s
    user    0m5.492s
    sys     0m0.431s
    
    
    
    #include <stdlib.h>
    #include <stdio.h>
    int main ( ) {
      char buff[BUFSIZ];
      FILE *fp = fopen("tmp","r");
      while ( fread ( buff, 1, sizeof buff, fp ) != 0 ) {
      }
      fclose(fp);
      return 0;
    }
    # Using standard C API to read blocks of data
    $ gcc -W -Wall -ansi -pedantic -O2 foo.c
    $ time ./a.out
    real    0m12.633s
    user    0m0.022s
    sys     0m0.448s
    
    
    
    #include <unistd.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    int main ( ) {
      char buff[BUFSIZ];
      int fd = open("tmp", O_RDONLY );
      while ( read( fd, buff, BUFSIZ ) > 0 ) {
      }
      close( fd );
      return 0;
    }
    # Using implementation-specific API to read blocks of data
    $ gcc -W -Wall -ansi -pedantic -O2 baz.c
    $ time ./a.out
    real    0m12.210s
    user    0m0.015s
    sys     0m0.510s
    I didn't bother to write the tokeniser for the last two cases.

    But as Hunter2 says, disk I/O is expensive, so it's hard to get away from it.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Help with reading strings from a file
    By ldcel in forum C Programming
    Replies: 3
    Last Post: 12-01-2007, 01:31 AM
  2. Reading and Writing Strings
    By howdy_hound in forum C++ Programming
    Replies: 5
    Last Post: 01-17-2007, 11:28 PM
  3. Reading strings input by the user...
    By Cmuppet in forum C Programming
    Replies: 13
    Last Post: 07-21-2004, 06:37 AM
  4. question about reading in strings from a file :>
    By bball887 in forum C Programming
    Replies: 8
    Last Post: 04-13-2004, 06:24 PM
  5. Reading Strings
    By blsst40 in forum C Programming
    Replies: 1
    Last Post: 11-20-2001, 12:29 PM