Thread: Sorting a very large list

  1. #1
    Registered User
    Join Date
    May 2018
    Posts
    7

    Question Need help in sorting a very large list

    I'm quite new to programming, please excuse my mistakes.

    So for my programming project this semester I got a file containing temperature data from all countries in the world, the uncertainty related to that measurement and the date of measure.

    I have to upload all of that data to a double linked sorted list. Basically the list has around 600 000 elements (nodes). Each node contains a next pointer, a prev pointer and some data ( as usual, i guess). I have to sort the list by year, so basically it goes node->payload.date.year.

    I tried to use the insertion sort alone, but it took me more than 20 minutes to sort it.

    I came up with this "algorithm" where I would have an array of structures containing pointers to the middle of the list, so that I wouldn't have to iterate from the head, every single time. I thought saving pointers for each year was reasonable (there are around 200 years).So the structure has a pointer of the type node and it also has an integer (year) as a tag.

    If, by now, you have spotted some huge mistake or a hole in my idea, maybe you don't have to look at the code. If it's minimally reasonable, I would appreciate if you did.

    The thing is, my new array of structures is messing with the original list. On top of the segmentation fault which ends the execution, the original list dates are being changed.

    Code:
    void InsertSort(node_t ** _head, dados_temp* _fcountries, pointers_years **pointers, int *count){
        node_t *new_elem = NULL;
        node_t *iterador = NULL;
        new_elem = NewNode(_fcountries);
    
    
        int aux=0, i=0, j=0, found=0;
    
    
        if(*_head == NULL)
        {
            *_head = new_elem;
        
            *pointers = (pointers_years*)checkedMAlloc(sizeof(pointers_years));
    
    
            pointers[0]->year = new_elem->payload.dt.year;
            pointers[0]->pointer = new_elem;
    
    
            (*count)++;
    
    
    
    
            return;
        }
    
    
        for(i=0;i<(*count);i++)
        {
            if( (new_elem->payload.dt.year) == (pointers[i]->year))
            {
                iterador = pointers[i]->pointer;
                found=1;
                break;
            }
            else
            {
                found=0;
            }
        }
        if(found==0)
        {
            *pointers = (pointers_years*)realloc(*pointers,sizeof(pointers_years)*(i+1));
    
    
            pointers[i]->year = new_elem->payload.dt.year;
            pointers[i]->pointer = new_elem;
    
    
            aux = abs(pointers[0]->year-pointers[i]->year);
            iterador=*_head;
            for(j=0;j<*count;j++)
            {
                if(abs(pointers[j]->year-pointers[i]->year)<aux && (pointers[j]->year)<(pointers[i]->year))
                {
                    aux=abs(pointers[j]->year-pointers[i]->year);
                    iterador=pointers[j]->pointer;
                }
            }
            (*count)++;
        }
    
    
    
    
        while(date_cmp(iterador,new_elem)<0) // if iterator < new element
        {
            if(iterador->next!=NULL)
                iterador=iterador->next;
            else // if we have reached the last node of the list
            {
                iterador->next=new_elem;
                new_elem->prev = iterador;
                return;
            }
        }
    
    
        if(iterador==*_head )
        {
            new_elem->next=iterador;
            iterador->prev=new_elem;
            (*_head)=new_elem;
        }
        else if (iterador==pointers[j]->pointer)
        {
            new_elem->prev=iterador->prev;
            iterador->prev->next=new_elem;
            new_elem->next=iterador;
            iterador->prev=new_elem;
            pointers[j]->pointer=new_elem;
        }
        else
        {
             new_elem->next=iterador;
             new_elem->prev=iterador->prev;
             iterador->prev->next=new_elem;
             iterador->prev=new_elem;
        }
    }
    I realise it may be very confusing, I am sorry.

    If you need to see the function where I call this one (the function that reads from the file and send the data to an auxiliar structure named _fcountries), just say it.

    Also here's put a print from the terminal after i compile this.

    Sorting a very large list-captura-de-ecra-2018-05-16-s-12-54-27-png

    The first two ones are okay, the third is obviously not okay. Also, it should print way more than three!
    Last edited by EdwardCandel; 05-16-2018 at 06:15 AM.

  2. #2
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    344
    Unfortunately, it isn't scanning through the list that is making the sort slow. Selection sort is N-squared, so it sucks even with an array. It's only good for small lists. And it's useless for linked lists. So even if your code worked, it won't be much faster (maybe 18 minutes instead of 20, if you're lucky).


    Merge sort is excellent for linked lists.
    Simplicity is the ultimate sophistication.

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    36,559
    Rather than read into a list, I'd read into an array expanding using realloc. If you double the size each time you need to, it is quite efficient. You can then use qsort to sor it, again efficiently.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    344
    Salem makes a good point. If you don't need to use a linked list, use an array and sort that (with qsort ... not selection sort!).

    I whipped up an imp of a merge sort on a linked list.
    Unfortunately it has a bug that, on my machine at least, causes it to segfault if you ask for a list of about 270000 or more.
    Haven't been able to ferret it out.
    Anyone?
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    typedef struct Node {
        int n;
        struct Node *next;
    } Node;
    
    Node *merge(Node *a, Node *b) {
        Node *result = NULL;
        if (!a) return b;
        if (!b) return a;
        if (a->n <= b->n) {
            result = a;
            result->next = merge(a->next, b);
        }
        else {
            result = b;
            result->next = merge(a, b->next);
        }
        return result;
    }
    
    void split(Node *head, Node **front, Node **back) {
        Node *slow = head;       // head is never NULL
        Node *fast = head->next; // fast moves twice as fast as slow
        while (fast) {
            fast = fast->next;
            if (fast) {
                slow = slow->next;
                fast = fast->next;
            }
        }
        *front = head;
        *back = slow->next;
        slow->next = NULL;
    }
    
    void merge_sort(Node **head) {
        Node *nd = *head, *a, *b;
        if (!nd || !nd->next)
            return;
        split(nd, &a, &b);
        merge_sort(&a);
        merge_sort(&b);
        *head = merge(a, b);
    }
    
    void print(Node *nd) {
        for ( ; nd; nd = nd->next)
            printf("%d ", nd->n);
        printf("\n\n");
    }
    
    void add(Node **nd, int n) {
        Node *newnd = malloc(sizeof *newnd);
        if (!newnd) {
            perror("add:malloc");
            exit(EXIT_FAILURE);
        }
        newnd->n = n;
        newnd->next = *nd;
        *nd = newnd;
    }
    
    int main(int argc, char **argv) {
        int n = 100;
        if (argc == 2) n = atoi(argv[1]);
    
        srand(time(NULL));
    
        Node *head = NULL;
        for (int i = 0; i < n; i++)
            add(&head, rand() % (n * 10));
    
        if (n <= 100) print(head);
    
        merge_sort(&head);
    
        if (n <= 100) print(head);
    
        // allowing OS to free memory...
    }
    Simplicity is the ultimate sophistication.

  5. #5
    Registered User
    Join Date
    May 2018
    Posts
    7
    Quote Originally Posted by Salem View Post
    Rather than read into a list, I'd read into an array expanding using realloc. If you double the size each time you need to, it is quite efficient. You can then use qsort to sor it, again efficiently.
    Unfortunetely, I'm stuck with having to use a list as it's required for this project. Thanks anyway!

  6. #6
    Registered User
    Join Date
    May 2018
    Posts
    7
    Alright thanks! I have tried using the merge sort, but for some reason when I try to sort more than 180 000 it causes a segmentation fault...

  7. #7
    Registered User
    Join Date
    May 2018
    Posts
    7
    Quote Originally Posted by john.c View Post
    Salem makes a good point. If you don't need to use a linked list, use an array and sort that (with qsort ... not selection sort!).

    I whipped up an imp of a merge sort on a linked list.
    Unfortunately it has a bug that, on my machine at least, causes it to segfault if you ask for a list of about 270000 or more.
    Haven't been able to ferret it out.
    Anyone?
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    typedef struct Node {
        int n;
        struct Node *next;
    } Node;
    
    Node *merge(Node *a, Node *b) {
        Node *result = NULL;
        if (!a) return b;
        if (!b) return a;
        if (a->n <= b->n) {
            result = a;
            result->next = merge(a->next, b);
        }
        else {
            result = b;
            result->next = merge(a, b->next);
        }
        return result;
    }
    
    void split(Node *head, Node **front, Node **back) {
        Node *slow = head;       // head is never NULL
        Node *fast = head->next; // fast moves twice as fast as slow
        while (fast) {
            fast = fast->next;
            if (fast) {
                slow = slow->next;
                fast = fast->next;
            }
        }
        *front = head;
        *back = slow->next;
        slow->next = NULL;
    }
    
    void merge_sort(Node **head) {
        Node *nd = *head, *a, *b;
        if (!nd || !nd->next)
            return;
        split(nd, &a, &b);
        merge_sort(&a);
        merge_sort(&b);
        *head = merge(a, b);
    }
    
    void print(Node *nd) {
        for ( ; nd; nd = nd->next)
            printf("%d ", nd->n);
        printf("\n\n");
    }
    
    void add(Node **nd, int n) {
        Node *newnd = malloc(sizeof *newnd);
        if (!newnd) {
            perror("add:malloc");
            exit(EXIT_FAILURE);
        }
        newnd->n = n;
        newnd->next = *nd;
        *nd = newnd;
    }
    
    int main(int argc, char **argv) {
        int n = 100;
        if (argc == 2) n = atoi(argv[1]);
    
        srand(time(NULL));
    
        Node *head = NULL;
        for (int i = 0; i < n; i++)
            add(&head, rand() % (n * 10));
    
        if (n <= 100) print(head);
    
        merge_sort(&head);
    
        if (n <= 100) print(head);
    
        // allowing OS to free memory...
    }
    yep same! though it takes less for me to get a segfault, I can't reach the 200 k.

  8. #8
    Registered User
    Join Date
    May 2010
    Posts
    4,207
    It's possibly because your merge() function has infinite recursion.

  9. #9
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    344
    Quote Originally Posted by jimblumberg View Post
    It's possibly because your merge() function has infinite recursion.
    No. It's just hitting the stack limit. Increase the limit and it works.
    I'm going to try redesigning it to use less stack space.

    EDIT: This merge fixes the problem.
    Code:
    Node *merge(Node *a, Node *b) {
        Node *result = NULL, **p = &result;
        while (a && b) {
            if (a->n <= b->n) {
                *p = a;
                p = &a->next;
                a = a->next;
            }
            else {
                *p = b;
                p = &b->next;
                b = b->next;
            }
        }
        if (a) *p = a;
        if (b) *p = b;
        return result;
    }
    Last edited by john.c; 05-16-2018 at 07:55 PM.
    Simplicity is the ultimate sophistication.

  10. #10
    Registered User
    Join Date
    May 2018
    Posts
    7
    Quote Originally Posted by john.c View Post
    No. It's just hitting the stack limit. Increase the limit and it works.
    I'm going to try redesigning it to use less stack space.

    EDIT: This merge fixes the problem.
    Code:
    Node *merge(Node *a, Node *b) {
        Node *result = NULL, **p = &result;
        while (a && b) {
            if (a->n <= b->n) {
                *p = a;
                p = &a->next;
                a = a->next;
            }
            else {
                *p = b;
                p = &b->next;
                b = b->next;
            }
        }
        if (a) *p = a;
        if (b) *p = b;
        return result;
    }
    Wow that works amazing! I'm down to 10 seconds, 8 seconds to read the file, and just 2 seconds to sort it! Could you explain how doing that prevents stack overflow , though? It's just that I don't want to put something in my project that I don't really understand.

  11. #11
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    344
    Quote Originally Posted by EdwardCandel View Post
    I'm down to 10 seconds, 8 seconds to read the file, and just 2 seconds to sort it!
    That's the magic of an NlogN algorithm (e.g., mergesort) compared to an N-squared algorithm (e.g., selection sort). Time complexity - Wikipedia

    Could you explain how doing that prevents stack overflow , though?
    The first version of the merge was needlessly recursive in the sense that it's easy to write a non-recursive version of it. And it was not "tail-call" recursive, so it was kind of recursion that piles up stack frames. In particular, it would use stack space proportional to the size of the list. So for a large list it would easily overflow the default stack limit.

    The new version isn't recursive so it only uses one stack frame no matter how large the list.
    Simplicity is the ultimate sophistication.

  12. #12
    Registered User
    Join Date
    May 2018
    Posts
    7
    Quote Originally Posted by john.c View Post
    That's the magic of an NlogN algorithm (e.g., mergesort) compared to an N-squared algorithm (e.g., selection sort). Time complexity - Wikipedia


    The first version of the merge was needlessly recursive in the sense that it's easy to write a non-recursive version of it. And it was not "tail-call" recursive, so it was kind of recursion that piles up stack frames. In particular, it would use stack space proportional to the size of the list. So for a large list it would easily overflow the default stack limit.

    The new version isn't recursive so it only uses one stack frame no matter how large the list.
    Alright thanks man! I'll make sure to give you some credit if this project goes well

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 2
    Last Post: 11-18-2010, 09:56 PM
  2. Need help with list sorting
    By dnguyen1022 in forum C++ Programming
    Replies: 1
    Last Post: 04-23-2009, 10:46 PM
  3. Replies: 16
    Last Post: 08-25-2008, 11:08 PM
  4. help with sorting a list :(
    By Axel in forum C Programming
    Replies: 54
    Last Post: 10-16-2006, 07:10 AM
  5. std::list<*> sorting
    By ziruz in forum C++ Programming
    Replies: 18
    Last Post: 06-25-2003, 03:23 PM

Tags for this Thread