Sorting a very large list

**EdwardCandel** · 05-16-2018

I'm quite new to programming, please excuse my mistakes.

So for my programming project this semester I got a file containing temperature data from all countries in the world, the uncertainty related to that measurement and the date of measure.

I have to upload all of that data to a double linked sorted list. Basically the list has around 600 000 elements (nodes). Each node contains a next pointer, a prev pointer and some data ( as usual, i guess). I have to sort the list by year, so basically it goes node->payload.date.year.

I tried to use the insertion sort alone, but it took me more than 20 minutes to sort it.

I came up with this "algorithm" where I would have an array of structures containing pointers to the middle of the list, so that I wouldn't have to iterate from the head, every single time. I thought saving pointers for each year was reasonable (there are around 200 years).So the structure has a pointer of the type node and it also has an integer (year) as a tag.

If, by now, you have spotted some huge mistake or a hole in my idea, maybe you don't have to look at the code. If it's minimally reasonable, I would appreciate if you did.

The thing is, my new array of structures is messing with the original list. On top of the segmentation fault which ends the execution, the original list dates are being changed.

Code:

void InsertSort(node_t ** _head, dados_temp* _fcountries, pointers_years **pointers, int *count){
    node_t *new_elem = NULL;
    node_t *iterador = NULL;
    new_elem = NewNode(_fcountries);


    int aux=0, i=0, j=0, found=0;


    if(*_head == NULL)
    {
        *_head = new_elem;
    
        *pointers = (pointers_years*)checkedMAlloc(sizeof(pointers_years));


        pointers[0]->year = new_elem->payload.dt.year;
        pointers[0]->pointer = new_elem;


        (*count)++;




        return;
    }


    for(i=0;i<(*count);i++)
    {
        if( (new_elem->payload.dt.year) == (pointers[i]->year))
        {
            iterador = pointers[i]->pointer;
            found=1;
            break;
        }
        else
        {
            found=0;
        }
    }
    if(found==0)
    {
        *pointers = (pointers_years*)realloc(*pointers,sizeof(pointers_years)*(i+1));


        pointers[i]->year = new_elem->payload.dt.year;
        pointers[i]->pointer = new_elem;


        aux = abs(pointers[0]->year-pointers[i]->year);
        iterador=*_head;
        for(j=0;j<*count;j++)
        {
            if(abs(pointers[j]->year-pointers[i]->year)<aux && (pointers[j]->year)<(pointers[i]->year))
            {
                aux=abs(pointers[j]->year-pointers[i]->year);
                iterador=pointers[j]->pointer;
            }
        }
        (*count)++;
    }




    while(date_cmp(iterador,new_elem)<0) // if iterator < new element
    {
        if(iterador->next!=NULL)
            iterador=iterador->next;
        else // if we have reached the last node of the list
        {
            iterador->next=new_elem;
            new_elem->prev = iterador;
            return;
        }
    }


    if(iterador==*_head )
    {
        new_elem->next=iterador;
        iterador->prev=new_elem;
        (*_head)=new_elem;
    }
    else if (iterador==pointers[j]->pointer)
    {
        new_elem->prev=iterador->prev;
        iterador->prev->next=new_elem;
        new_elem->next=iterador;
        iterador->prev=new_elem;
        pointers[j]->pointer=new_elem;
    }
    else
    {
         new_elem->next=iterador;
         new_elem->prev=iterador->prev;
         iterador->prev->next=new_elem;
         iterador->prev=new_elem;
    }
}

I realise it may be very confusing, I am sorry.

If you need to see the function where I call this one (the function that reads from the file and send the data to an auxiliar structure named _fcountries), just say it.

Also here's put a print from the terminal after i compile this.

Sorting a very large list-captura-de-ecra-2018-05-16-s-12-54-27-png

Sorting a very large list-captura-de-ecra-2018-05-16-s-12-54-27-png

The first two ones are okay, the third is obviously not okay. Also, it should print way more than three!

**john.c** · 05-16-2018

Unfortunately, it isn't scanning through the list that is making the sort slow. Selection sort is N-squared, so it sucks even with an array. It's only good for small lists. And it's useless for linked lists. So even if your code worked, it won't be much faster (maybe 18 minutes instead of 20, if you're lucky).

Merge sort is excellent for linked lists.

**Salem** · 05-16-2018

Rather than read into a list, I'd read into an array expanding using realloc. If you double the size each time you need to, it is quite efficient. You can then use qsort to sor it, again efficiently.

**john.c** · 05-16-2018

Salem makes a good point. If you don't need to use a linked list, use an array and sort that (with qsort ... not selection sort!).

I whipped up an imp of a merge sort on a linked list.
Unfortunately it has a bug that, on my machine at least, causes it to segfault if you ask for a list of about 270000 or more.
Haven't been able to ferret it out.
Anyone?

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct Node {
    int n;
    struct Node *next;
} Node;

Node *merge(Node *a, Node *b) {
    Node *result = NULL;
    if (!a) return b;
    if (!b) return a;
    if (a->n <= b->n) {
        result = a;
        result->next = merge(a->next, b);
    }
    else {
        result = b;
        result->next = merge(a, b->next);
    }
    return result;
}

void split(Node *head, Node **front, Node **back) {
    Node *slow = head;       // head is never NULL
    Node *fast = head->next; // fast moves twice as fast as slow
    while (fast) {
        fast = fast->next;
        if (fast) {
            slow = slow->next;
            fast = fast->next;
        }
    }
    *front = head;
    *back = slow->next;
    slow->next = NULL;
}

void merge_sort(Node **head) {
    Node *nd = *head, *a, *b;
    if (!nd || !nd->next)
        return;
    split(nd, &a, &b);
    merge_sort(&a);
    merge_sort(&b);
    *head = merge(a, b);
}

void print(Node *nd) {
    for ( ; nd; nd = nd->next)
        printf("%d ", nd->n);
    printf("\n\n");
}

void add(Node **nd, int n) {
    Node *newnd = malloc(sizeof *newnd);
    if (!newnd) {
        perror("add:malloc");
        exit(EXIT_FAILURE);
    }
    newnd->n = n;
    newnd->next = *nd;
    *nd = newnd;
}

int main(int argc, char **argv) {
    int n = 100;
    if (argc == 2) n = atoi(argv[1]);

    srand(time(NULL));

    Node *head = NULL;
    for (int i = 0; i < n; i++)
        add(&head, rand() % (n * 10));

    if (n <= 100) print(head);

    merge_sort(&head);

    if (n <= 100) print(head);

    // allowing OS to free memory...
}

**EdwardCandel** · 05-16-2018

Originally Posted by Salem

Rather than read into a list, I'd read into an array expanding using realloc. If you double the size each time you need to, it is quite efficient. You can then use qsort to sor it, again efficiently.

Unfortunetely, I'm stuck with having to use a list as it's required for this project. Thanks anyway!

**EdwardCandel** · 05-16-2018

Alright thanks! I have tried using the merge sort, but for some reason when I try to sort more than 180 000 it causes a segmentation fault...

**EdwardCandel** · 05-16-2018

Originally Posted by john.c

Salem makes a good point. If you don't need to use a linked list, use an array and sort that (with qsort ... not selection sort!).

I whipped up an imp of a merge sort on a linked list.
Unfortunately it has a bug that, on my machine at least, causes it to segfault if you ask for a list of about 270000 or more.
Haven't been able to ferret it out.
Anyone?

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct Node {
    int n;
    struct Node *next;
} Node;

Node *merge(Node *a, Node *b) {
    Node *result = NULL;
    if (!a) return b;
    if (!b) return a;
    if (a->n <= b->n) {
        result = a;
        result->next = merge(a->next, b);
    }
    else {
        result = b;
        result->next = merge(a, b->next);
    }
    return result;
}

void split(Node *head, Node **front, Node **back) {
    Node *slow = head;       // head is never NULL
    Node *fast = head->next; // fast moves twice as fast as slow
    while (fast) {
        fast = fast->next;
        if (fast) {
            slow = slow->next;
            fast = fast->next;
        }
    }
    *front = head;
    *back = slow->next;
    slow->next = NULL;
}

void merge_sort(Node **head) {
    Node *nd = *head, *a, *b;
    if (!nd || !nd->next)
        return;
    split(nd, &a, &b);
    merge_sort(&a);
    merge_sort(&b);
    *head = merge(a, b);
}

void print(Node *nd) {
    for ( ; nd; nd = nd->next)
        printf("%d ", nd->n);
    printf("\n\n");
}

void add(Node **nd, int n) {
    Node *newnd = malloc(sizeof *newnd);
    if (!newnd) {
        perror("add:malloc");
        exit(EXIT_FAILURE);
    }
    newnd->n = n;
    newnd->next = *nd;
    *nd = newnd;
}

int main(int argc, char **argv) {
    int n = 100;
    if (argc == 2) n = atoi(argv[1]);

    srand(time(NULL));

    Node *head = NULL;
    for (int i = 0; i < n; i++)
        add(&head, rand() % (n * 10));

    if (n <= 100) print(head);

    merge_sort(&head);

    if (n <= 100) print(head);

    // allowing OS to free memory...
}

yep same! though it takes less for me to get a segfault, I can't reach the 200 k.

**jimblumberg** · 05-16-2018

It's possibly because your merge() function has infinite recursion.

**john.c** · 05-16-2018

Originally Posted by jimblumberg

It's possibly because your merge() function has infinite recursion.

No. It's just hitting the stack limit. Increase the limit and it works.
I'm going to try redesigning it to use less stack space.

EDIT: This merge fixes the problem.

Code:

Node *merge(Node *a, Node *b) {
    Node *result = NULL, **p = &result;
    while (a && b) {
        if (a->n <= b->n) {
            *p = a;
            p = &a->next;
            a = a->next;
        }
        else {
            *p = b;
            p = &b->next;
            b = b->next;
        }
    }
    if (a) *p = a;
    if (b) *p = b;
    return result;
}

**EdwardCandel** · 05-17-2018

Originally Posted by john.c

No. It's just hitting the stack limit. Increase the limit and it works.
I'm going to try redesigning it to use less stack space.

EDIT: This merge fixes the problem.

Code:

Node *merge(Node *a, Node *b) {
    Node *result = NULL, **p = &result;
    while (a && b) {
        if (a->n <= b->n) {
            *p = a;
            p = &a->next;
            a = a->next;
        }
        else {
            *p = b;
            p = &b->next;
            b = b->next;
        }
    }
    if (a) *p = a;
    if (b) *p = b;
    return result;
}

Wow that works amazing! I'm down to 10 seconds, 8 seconds to read the file, and just 2 seconds to sort it! Could you explain how doing that prevents stack overflow , though? It's just that I don't want to put something in my project that I don't really understand.

**john.c** · 05-17-2018

Originally Posted by EdwardCandel

I'm down to 10 seconds, 8 seconds to read the file, and just 2 seconds to sort it!

That's the magic of an NlogN algorithm (e.g., mergesort) compared to an N-squared algorithm (e.g., selection sort). Time complexity - Wikipedia

Could you explain how doing that prevents stack overflow , though?

The first version of the merge was needlessly recursive in the sense that it's easy to write a non-recursive version of it. And it was not "tail-call" recursive, so it was kind of recursion that piles up stack frames. In particular, it would use stack space proportional to the size of the list. So for a large list it would easily overflow the default stack limit.

The new version isn't recursive so it only uses one stack frame no matter how large the list.

**EdwardCandel** · 05-17-2018

Originally Posted by john.c

That's the magic of an NlogN algorithm (e.g., mergesort) compared to an N-squared algorithm (e.g., selection sort). Time complexity - Wikipedia

The first version of the merge was needlessly recursive in the sense that it's easy to write a non-recursive version of it. And it was not "tail-call" recursive, so it was kind of recursion that piles up stack frames. In particular, it would use stack space proportional to the size of the list. So for a large list it would easily overflow the default stack limit.

The new version isn't recursive so it only uses one stack frame no matter how large the list.

Alright thanks man! I'll make sure to give you some credit if this project goes well

Thread: Sorting a very large list

Thread Tools

Search Thread

Display

Need help in sorting a very large list

Similar Threads

Problem extracting data from a large file and passing it to a linked list

Need help with list sorting

help with sorting sequence in Unix C:sort -t ':' +0 -1 -n +1 -2 +2 -3 -o list list

help with sorting a list :(

std::list<*> sorting

Tags for this Thread