Hi, I'm building an algorithm for very large graphs with several million edges (e.g. 5*10^6). A graph is represented in a text file where each edge has the form:

"string_origin"|"string_destination"|integer

where "string_origin" and "string_destination" are the names of the start and end nodes of the edge, respectively (these names may contain letters, blanks and digits), and integer is the weight of the edge. Currently I read the edges like this:

Code:
```
while (fd.getline(my_string, SIZE)) {
    istringstream fichlin(my_string);
    fichlin.getline(node1, SIZE, '|');
    fichlin.getline(node2, SIZE, '|');
    fichlin >> Weight;
    // process node1 (char *), node2 (char *) and Weight (int)...
    // ...
}
```
My problem is that this is very slow! On my machine this loop alone takes over 30 seconds, so reading the input data costs more running time than the algorithm that runs afterwards.

I would like to know if there is a more efficient way of doing this string reading, in either C or C++, even if the code ends up less clear, so that I can reduce the program's running time.

Thank you very much in advance!

----------------------------------------------------

2. Disk I/O is inherently slow, but perhaps reading the data in large chunks at a time would help? For example, allocate a buffer of maybe 10 KB, use fd.read() to fill it, parse complete lines out of it until you reach a fragmented line, shift the leftover bytes to the front of the buffer, fill it again, and so on.

Otherwise, probably the best way to speed things up is to store the data in a more efficient format. For example, instead of storing the name of each vertex at each edge, store a 2- or 4-byte unsigned integer ID for the vertex, and put a lookup table at the head of the file mapping IDs to names. Similarly, the 'integer' weight could be stored as an unsigned short or unsigned long.

If this is not possible, I don't know how else you can improve speed; I'm not familiar with the lower-level workings of file I/O.

3. By using getline() so often, you're reading the string more times than you have to. You also don't need a stringstream to assign node1 and node2 their values; it is only needed to parse Weight as an integer.

Code:
```
int parts = 1;
// node1 and node2 are assumed to be std::string here;
// note the third field ends at the newline, not at another '|'
while (parts <= 3 && fd.getline(my_string, SIZE, parts < 3 ? '|' : '\n'))
{
    switch (parts) {
    case 1: node1 = my_string;
            break;
    case 2: node2 = my_string;
            break;
    default: istringstream(my_string) >> Weight;
    }
    ++parts;
}

// now that you have one edge's data... process it after the loop
```
That should be a lot easier to understand and faster. You can always use the c_str() function to get a character array if you need one.

4. Some results from a 5M line file.
Code:
```
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
using namespace std;

const int SIZE = 100;

void process ( char *node1, char *node2, int w ) {
    *node1 = '\0';
    *node2 = '\0';
    w = 0;
}

int main ( ) {
    ifstream fd("tmp");
    char my_string[SIZE];
    while (fd.getline(my_string, SIZE)) {
        char node1[SIZE], node2[SIZE];
        int Weight;
        istringstream fichlin(my_string);
        fichlin.getline(node1, SIZE, '|');
        fichlin.getline(node2, SIZE, '|');
        fichlin >> Weight;
        process( node1, node2, Weight );
    }
    return 0;
}

# Using standard C++
$ g++ -W -Wall -ansi -pedantic -O2 foo.cpp
$ time ./a.out
real    0m23.032s
user    0m16.821s
sys     0m0.457s

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void process ( char *node1, char *node2, int w ) {
    *node1 = '\0';
    *node2 = '\0';
    w = 0;
}

int main ( ) {
    char buff[BUFSIZ];
    FILE *fp = fopen("tmp","r");
    while ( fgets( buff, sizeof buff, fp ) != NULL ) {
        char *node1 = strtok(buff,"|");
        char *node2 = strtok(NULL,"|");
        int   Weight = atoi(strtok(NULL,"\n"));
        process( node1, node2, Weight );
    }
    fclose(fp);
    return 0;
}

# Using standard C
$ gcc -W -Wall -ansi -pedantic -O2 bar.c
$ time ./a.out
real    0m15.179s
user    0m5.492s
sys     0m0.431s

#include <stdlib.h>
#include <stdio.h>

int main ( ) {
    char buff[BUFSIZ];
    FILE *fp = fopen("tmp","r");
    while ( fread ( buff, 1, sizeof buff, fp ) != 0 ) {
    }
    fclose(fp);
    return 0;
}

# Using standard C API to read blocks of data
$ gcc -W -Wall -ansi -pedantic -O2 foo.c
$ time ./a.out
real    0m12.633s
user    0m0.022s
sys     0m0.448s

#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main ( ) {
    char buff[BUFSIZ];
    int fd = open("tmp", O_RDONLY );
    while ( read( fd, buff, BUFSIZ ) > 0 ) {
    }
    close( fd );
    return 0;
}

# Using implementation-specific API to read blocks of data
$ gcc -W -Wall -ansi -pedantic -O2 baz.c
$ time ./a.out
real    0m12.210s
user    0m0.015s
sys     0m0.510s
```
I didn't bother to write the tokeniser for the last two cases.

But as Hunter2 says, disk I/O is expensive, so it's hard to get away from it.