# Thread: Performance problem building a list

1. ## Performance problem building a list

Hi, I'm sorry to be asking for help in my very first topic on the forum, but I really need a hand or two, and please forgive my poor English.

I have to read "symbols" from a file and build a list, storing in each node a symbol and the number of times that symbol appears. It may seem easy, but the program has to finish in under 3 seconds and there are 26 million characters to be read. As you will see, I'm a novice, so my code is very rudimentary, and of course it's not even close to those 3 seconds I need.

You may be wondering why I wrote "symbols": I get from the command line how many ASCII characters form a "symbol". For example, if I get "2", each symbol will be a pair of ASCII characters like "7(" or "P;".

That being said, I googled a lot with no success, so I hope you guys can help me figure out how to speed this program up. Thanks a lot for your time.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int num_param;
char *param[4];
struct symbolic_node{
    int counter;
    short status;
    struct symbolic_node *sig;
    struct symbolic_node *ant;
    char *symb;
};
int sintax_check();
int symbol_obtainer(char **);
void list_builder(char **,int,struct symbolic_node **);

int main(int argc, char *argv[]){

    int i,sintax_error,lSize;
    char *buffer;
    struct symbolic_node *l_symb;

    num_param=argc-1;
    for(i=1;i<=num_param&&i<4;i++){
        param[i]=argv[i];
    }
    sintax_error=sintax_check();
    if(sintax_error==1){
        fputs("Sintax error\n",stderr);
    }else{
        lSize=symbol_obtainer(&buffer);
        list_builder(&buffer,lSize,&l_symb);
        free(buffer);
        printf("Done\n");
    }
    return 0;
}

int sintax_check(){

    int error=0;

    if(num_param!=2){
        error=1;
    }else if(isdigit((unsigned char)param[2][0])==0){
        error=1;
    }
    return error;
}

int symbol_obtainer(char **buff){

    FILE *pfile;
    int n,num_bytes;
    char *char_storage;

    n=atoi(param[2]);
    pfile=fopen(param[1],"rb");
    if(pfile==NULL){
        exit(1);
    }
    fseek(pfile,0,SEEK_END);
    num_bytes=ftell(pfile);
    rewind(pfile);
    num_bytes=num_bytes/n;    /* number of whole symbols in the file */
    char_storage=(char *)calloc(num_bytes,n);
    if(char_storage==NULL){
        fputs("Memory allocation error\n",stderr);
        exit(2);
    }
    if(fread(char_storage,n,num_bytes,pfile)!=(size_t)num_bytes){
        fputs("Read error\n",stderr);
        exit(3);
    }
    *buff=char_storage;
    fclose(pfile);
    return num_bytes;
}

void list_builder(char **buff,int tamanio,struct symbolic_node **header){

    struct symbolic_node *aux,*aux1,*aux2,*aux3;
    short n,run_level;
    short num_nodos=0;
    int i;
    char *char_storage;
    n=atoi(param[2]);

    /* build the first node from the first symbol */
    aux=(struct symbolic_node *)malloc(sizeof(struct symbolic_node));
    aux->ant=NULL;
    aux->sig=NULL;
    char_storage=*buff;
    aux3=aux;
    aux->symb=(char *)malloc(n);
    memcpy(aux->symb,char_storage,n);
    aux->counter=1;
    num_nodos++;
    char_storage+=n;
    for(i=1;i<tamanio;i++){
        run_level=0;
        /* allocate a candidate node before searching the list */
        aux1=(struct symbolic_node *)malloc(sizeof(struct symbolic_node));
        aux1->symb=(char*)malloc(n);
        memcpy(aux1->symb,char_storage,n);
        aux1->counter=1;
        aux2=aux;

        /* linear search for this symbol in the list so far */
        while(aux2!=NULL&&(run_level!=1)){
            if(memcmp(aux2->symb,aux1->symb,n)==0){
                free(aux1->symb);
                free(aux1);
                aux2->counter+=1;
                run_level=1;
            }
            aux3=aux2;
            aux2=aux2->sig;
        }
        if(run_level==0){    /* not found: append the new node at the tail */
            num_nodos++;
            aux1->ant=aux3;
            aux3->sig=aux1;
            aux3=aux3->sig;
            aux1->sig=NULL;
        }
        char_storage+=n;
    }
    *header=aux;    /* hand the finished list back to the caller */
}
Thanks a lot; please let me know if you don't understand something or if you need anything else.

2. Sounds like you want to implement a hash table or a tree?

3. > aux1=(struct symbolic_node *)malloc(sizeof(struct symbolic_node));
> aux1->symb=(char*)malloc(n);
> memcpy(aux1->symb,char_storage,n);
You should definitely do this AFTER the search has failed to find what you're looking for.
It's expensive to keep allocating and freeing things you're not going to keep.

Also, check your editor isn't using spaces and TABS for indentation (switch to spaces only, and make sure all hard tabs are replaced).
Your editor can cope, but forums usually mess it up.

4. Originally Posted by Bayint Naung
Sounds like you want to implement hash table or tree?
Hi, do you mean building a tree or a hash table instead of building the list? Are those data structures faster than a list? I have never used them before.

Originally Posted by Salem
> aux1=(struct symbolic_node *)malloc(sizeof(struct symbolic_node));
> aux1->symb=(char*)malloc(n);
> memcpy(aux1->symb,char_storage,n);
You should definitely do this AFTER the search has failed to find what you're looking for.
It's expensive to keep allocating and freeing things you're not going to keep.

Also, check your editor isn't using spaces and TABS for indentation (switch to spaces only, and make sure all hard tabs are replaced).
Your editor can cope, but forums usually mess it up.
Hi, OK, I'm going to try allocating memory after the search is done, and thanks for the advice about the tabs; gedit displayed the code fine but, as you said, the forum messed it up.

EDIT: MOTHER OF GOD, I did as you said and the time dropped dramatically from 6.3-6.5 seconds to 3.6-3.8. It's amazing. Thinking about it, what I was doing was silly: creating and destroying a node when I can compare directly from the array. Thanks a lot, I don't think I'll ever forget this mistake.

5. A balanced binary tree is definitely faster than a linked list.
A binary tree can be searched in O(log2(N)) time, whereas a linked-list search is O(N), averaging N/2 comparisons. So for 1000 elements, you're looking at about 10 comparisons vs. 500.

An alternative is an array of lists, like
struct symbolic_node *lists[128];

Rather than having one global list, have a separate list keyed on the first letter of the symbol, say
memcpy(lists[char_storage[0]]->symb,char_storage,n);

You'll have many shorter lists, but you only need to search one of them at any one time.

6. Are you guaranteed to have 26 million bytes of memory? Why not just read the entire file with one fread, then sort and count duplicates using the 'width' as specified?
I don't understand the need for lists, especially when in the worst case the list can be just as big as the entire file, plus all that linking junk, plus it complicates everything.

7. Originally Posted by nonoob
Are you guaranteed to have 26 million bytes of memory? Why not just read the entire file with one fread. Then sort and count duplicates using the 'width' as specified.
I don't understand the need for lists especially when worst case it can be just as big as the entire file, plus all that linking junk, plus it complicates everything.
Well, i use lists becouse we are just starting with C at college so im unfamiliar with any other data estructures. Thankfully memory usage doesnt matter so yes, we can spend as much memory as we want. So you would keep the symbols in the array, sort them, and use the memory width to know how many of them are there right? sounds good, im going to try your idea, wish me luck and thx a lot for your help.