Thread: BIG Sorting Problem!

  1. #1
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868

    BIG Sorting Problem!

    A program I'm working on has records (structs), each holding an array of 52 integers. So it's:

    Code:
    struct person {
      char name[30];
      int data[52];
    };
    The BIG part is the number of records going into an array of these - at least 100 million have to be processed, in several large "chunks".

    My problem is the sorting: the data integers can't be moved from one index to another - their position within each record's data array is fixed. Whole records have to be sorted, with the first column as the primary key, then the second column, then the third, etc. Time and memory constraints are obviously a consideration, but it will be running on strong desktop- or server-class systems.

    The first 50 records of the data look like this, when sorted properly:
    Code:
     0:  0   1 474 753 478 945 456 255   1 198 847 945 551  24 396 986 242  85 154 
     1:  0   3 299 465 382 737 460 803 301 771 179  50 801 262 483 330 815 905 349 
     2:  0   3 345  44 796  23 873  36 883 551 419 744 744 659 336 470 891 699  16 
     3:  0   3 849 827 747 316 243 938 517 197 597 700 653 425 613 307 298 446 293 
     4:  0   3 925  53 709 784 756 745 423 178 670  93 311 837 321 362 612 460 455 
     5:  0   4 873  86 923 163 182 771 644 454 353 918 845 682 210 997 429 755 772 
     6:  0   6 417 387 904  42 848  60 345 464 642 849 576 143 770 716 331 564 186 
     7:  0   6 832 185 660 326 499 338 910  34 978 664 625 260 107 298 186 979 132 
     8:  0   8 348 612 127 559 495 254 805 122 677 394 373 613 243 810 686 111 285 
     9:  0  10 154 192 615 536 896  45  15 445 418 637 784 145 829 849 351 313 243 
    10:  0  11 522 647 514 445 715 990  98 406 592 955 431 266 864 113 107 958 138 
    11:  0  12 333 455 930 249 427 629 609 426 967 981 928 488 462  97 778 363 331 
    12:  0  12 361  19 428 960 195  48 826 329 672 520 579 620 729 422 291 659 187 
    13:  0  16 767 659 471 118 680 993 124  49 583 252 279 107 962 435 192 486 419 
    14:  0  16 966 710 733 518 635 786 823  48 252 343 111 925 984 362 707 572 591 
    15:  0  17 654 931 719 475 350 774 603 248  61 441 451 212 806 450 635 904 114 
    16:  0  19 691 146 561 202 679   1 752 965 701  25 437 870 898 537 858 575 464 
    17:  0  20 976 306 427 291 532 412  19 157 639 885 976 321  56 268 570 518 302 
    18:  0  21 193 144 243 660  18 852 615 723 330 703 663 576  50 298 584 875 131 
    19:  0  22 398  74 875 882  38 249 931 199 732  32 154 573 960 950 950 909 364 
    20:  0  22 405 307 198 711 412 774 394 432 396 319  93 783 994 134 963 504 622 
    21:  0  24 115 423 464 806   1 457 739 253 135 909 456 590 381 893 357 742 608 
    22:  0  25 450 853 240 990 249 169 798  10 533 545 534 449 522 559 104 926 153 
    23:  0  25 463 335 908 489 405 391  51 640 763 720 221 598 413 634 383 690 243 
    24:  0  27 533 441 604 117 202  83  94 775 308 738 551 248 638 733 393 156 543 
    25:  0  27 796 943 401 592 894 783 425 326 940 523 286   8 198 229 636 702 509 
    26:  0  28 110 295  95 114  66 871  57 989 475 766 940 929 759 376 940 943 688 
    27:  0  29 557 109 343 650 105 219 889 460 148  18 183 213 313 700 998 231 515 
    28:  0  30 296 196 704 905 178 308 377  51 542 352 819 922 371 855 996 923 542 
    29:  0  32 442 885  57 918 597 860 546 727 783 814  97  67 365 817 839 852  74 
    30:  0  32 771 597 742  99 255 632 643 316 317 644 298 256 532  41 594 638 565 
    31:  0  33 829 853 851 720 993 822   6 615 207 313 566 822 317  69 665 429  27 
    32:  0  33 920 848 920 180 651 515 264 420 461 143  67 827 522 216 382 989 748 
    33:  0  35 130 640 960 325 881 218 315 962 453 390 269 124 489 881 648 291 888 
    34:  0  36 149 297 104 248 995 599 644 505 938 811 788 485 302 425 807 345 124 
    35:  0  38 145 488  51 719 269 136 572 914 193 874 845 189 351 786 411 689   6 
    36:  0  38 837 687 725 958 108 502 753 944 794 298 799 326 428 118 636 828 392 
    37:  0  39  17 818 886  43 629 135 656 657 563 340 437 664 281 735 105 870 434 
    38:  0  39 215 110 391 412 970 531 564 589 967 247 158 259 569 331  14 598 607 
    39:  0  39 312 757 916 764  84 497 901 660  18  99 268 832 946 102 857 302 966 
    40:  0  39 735 769 860  92  76 150 852 699 646 618 938 542 895 829 256 258 225 
    41:  0  41 996 250 385 420  35 814  80 551 931 869 269 322 467 955 265 367 652 
    42:  0  42 600  55 374 355 294 176 893 338 284  30 887 894 742 663 411 157 589 
    43:  0  43 668 426 329 264 299 577 309 623 788 447 224  71 933 250 800 474   8 
    44:  0  43 989  71 332 696 433 770 549 501 581 131 731  51  62 243 609 396 285 
    45:  0  44 485 672 957  33 198 423 190 280  32 379 907 409 854 971 399 827 798 
    46:  0  46 565 684 262 715 952 357 769 142   1 865 681 928 930 606 817 346 441 
    47:  0  47 282 598 719  73 363  76 541 103  96  96 224 104 926 446 330  17 861 
    48:  0  47 389 253 895 940 753  57 486 288 265 443 918 581 678 782 572 132 293 
    49:  0  47 559 174 135 228 645 777 669 345 798 491 742 553 907 401 179 818 539 
    50:  0  48 840 318  31  66 518 893 533 158 922 718 419 981 538 777 462 363 702
    Records sorted: 1,000,000. Columns: 52. Ints are limited to 999 here, but will range up to at least 300,000 in the actual data.

    This takes me about 18 seconds (on an i7), with Quicksort handling the first column; since Quicksort isn't aware of the fixed-position constraint, I have a custom code block that handles all the rest of the columns' data.

    I can't use an algorithm that requires a large amount of auxiliary memory. "Large" is not defined, however.

    I tried converting the data integers into strings and sorting them that way (using an optimized Quicksort + Insertion sort), but the times were no better. Radix should do a better job, but Radix has some peculiarities that make me review my list of four-letter words.

    Column sorting like this is new to me, and Google searches have not been helpful. What's needed here is not the same thing as a "stable" sort. If you know the correct terminology for this, please let me know. It's a multi-key sort with 51 sub-keys, I guess.

    Any algorithmic tips? I'm not looking for code.

    Thanks, and Best Wishes to all in 2013!
    Last edited by Adak; 01-03-2013 at 12:17 AM.

  2. #2
    Registered User
    Join Date
    Dec 2012
    Posts
    45
    A stable sort has some relation to this. If you use a stable sort, you can sort everything by the last column, then sort everything again by the second-to-last column, and so on, finishing with the first.
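
    (A concrete sketch of that multi-pass idea - an illustration, not the poster's code: a stable counting sort applied per column, last column first. It assumes the struct from post #1 and the 0..999 mock values, and it needs an O(n) scratch array, which may or may not fit the memory constraints.)
    Code:
    #include <string.h>

    #define COLS  52
    #define RANGE 1000    /* mock data: ints limited to 999 */

    struct person { char name[30]; int data[COLS]; };

    /* One stable counting-sort pass on column 'col'. */
    static void counting_pass(struct person *a, struct person *tmp,
                              size_t n, int col) {
       size_t count[RANGE + 1] = {0};
       for (size_t i = 0; i < n; i++) count[a[i].data[col] + 1]++;
       for (int v = 0; v < RANGE; v++) count[v + 1] += count[v];
       for (size_t i = 0; i < n; i++)       /* stable scatter */
          tmp[count[a[i].data[col]]++] = a[i];
       memcpy(a, tmp, n * sizeof *a);
    }

    /* LSD order: last column first, so earlier columns dominate. */
    void lsd_sort(struct person *a, struct person *tmp, size_t n) {
       for (int col = COLS - 1; col >= 0; col--)
          counting_pass(a, tmp, n, col);
    }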

    That said, I would recommend sorting only once, with a special comparison operation (similar to the comparison of strings).

    Regarding the algorithm, what are your constraints? Where are the data? Do they fit in RAM?

    I guess you will need to implement some sorting algorithm that works on disk, resorting to RAM for sizes that fit... It will take a lot of time!

    Can you at least map the file in memory (mmap)? Or are the data in some database?

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    What's needed here is not the same thing as a "stable" sort.
    O_o

    Actually, using a stable sort here would work just fine; you just need to be creative in its application.

    That isn't really relevant, though, because you (at least, it seems that way) have strictly ordered keys; you only need to sort by C1, then sort each range of C1 elements - all data with the same C1 value - by C2, and continue that process until you've consumed all the data.
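
    (A sketch of that recursive idea, assuming the struct from post #1: qsort() one column, then recurse into each run of equal values on the next column. The global for the column index is the usual workaround for qsort() having no context parameter.)
    Code:
    #include <stdlib.h>

    #define COLS 52
    struct person { char name[30]; int data[COLS]; };

    static int g_col;    /* column for the current pass (not thread-safe) */

    static int cmp_col(const void *a, const void *b) {
       int x = ((const struct person *)a)->data[g_col];
       int y = ((const struct person *)b)->data[g_col];
       return (x > y) - (x < y);
    }

    void sort_by_columns(struct person *a, size_t n, int col) {
       if (n < 2 || col >= COLS) return;
       g_col = col;
       qsort(a, n, sizeof *a, cmp_col);
       for (size_t lo = 0, hi; lo < n; lo = hi) {    /* each equal run */
          for (hi = lo + 1; hi < n && a[hi].data[col] == a[lo].data[col]; hi++)
             ;
          sort_by_columns(a + lo, hi - lo, col + 1); /* refine on col+1 */
       }
    }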

    Honestly though, that isn't likely to be as effective as a simple application of "Quicksort" using multiple keys in order of "weight", essentially treating the array as a single larger value - but even that isn't likely to be much of an improvement.

    However, your performance may be trumped by cache misses, which means that a variation of merge sort (the algorithm I have in mind is generally academic in nature, but it has been proven to have good performance in some cases) may work better.

    That said, my feeling is that your problem is your choice of data structure. (You have a simple array of data that is itself an array of keys.) If sorting this data is significant, I assert that you should be keeping it in a data structure where sorting would not be an issue.

    And all of that said, I question the need to sort such a large chunk of data more than a few times in the first place. What are you doing with this data that it needs to be sorted all at once, often enough that the cost of sorting is significant? If this isn't the case, which I felt was implied, then I'd like a better explanation of how the data is generated/consumed.

    Soma

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by comocomocomo View Post
    A stable sort has some relation to this. If you use a stable sort, you can sort everything by the last column, then sort everything again by the second-to-last column, and so on, finishing with the first.
    Yes, I've learned that's true. Insertion sort will do this beautifully. But it takes FOREVER - WAY too long - and correct me if I'm wrong, but isn't this going to entail a HUGE number of extra swaps, because what's "sorted" by the last column bears the least relation to the final sorted order?

    That said, I would recommend sorting only once, with a special comparison operation (similar to the comparison of strings).
    Yes, I've learned this IS the same thing as multi-key sorting, so I'm working with that now. I've done that in simple sorting functions, but never in Quicksort -- practically sinful to tear up those beautiful tight loops in its partitioning and swapping.

    Regarding the algorithm, what are your constraints? Where are the data? Do they fit in RAM?
    I'm a helper bee, only. The data is on servers, and no, it can't all fit in RAM at once - but I have coded up a multi-tier Quicksort-and-Merge function that is da bomb, which I'm thinking of using.
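
    (Not the poster's code - but for readers wondering what the "merge" half of such a Quicksort-and-Merge tiering usually looks like, a guessed-at sketch: RAM-sized chunks get qsort()ed and written out as sorted run files, then runs are merged stream-wise. cmp1() is a whole-record lexicographic comparator like the one Salem posts below.)
    Code:
    #include <stdio.h>

    #define LEN 52
    struct person { char name[30]; int data[LEN]; };

    int cmp1(const void *a, const void *b);  /* lexicographic compare */

    /* Merge two sorted run files into one sorted output file. */
    void merge_runs(FILE *in1, FILE *in2, FILE *out) {
       struct person a, b;
       int has_a = fread(&a, sizeof a, 1, in1) == 1;
       int has_b = fread(&b, sizeof b, 1, in2) == 1;
       while (has_a && has_b) {
          if (cmp1(&a, &b) <= 0) {
             fwrite(&a, sizeof a, 1, out);
             has_a = fread(&a, sizeof a, 1, in1) == 1;
          } else {
             fwrite(&b, sizeof b, 1, out);
             has_b = fread(&b, sizeof b, 1, in2) == 1;
          }
       }
       while (has_a) {                /* drain whichever run is left */
          fwrite(&a, sizeof a, 1, out);
          has_a = fread(&a, sizeof a, 1, in1) == 1;
       }
       while (has_b) {
          fwrite(&b, sizeof b, 1, out);
          has_b = fread(&b, sizeof b, 1, in2) == 1;
       }
    }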

    I guess you will need to implement some sorting algorithm that works on disk, resorting to RAM for sizes that fit... It will take a lot of time!
    Nope - "lotta time" won't cut the mustard on this. RAM it or perish.

    Can you at least map the file in memory (mmap)? Or are the data in some database?
    A database.

    Well, ah - that would be NO. I intend to work everything in RAM as much as possible, so that mmap function sounds like a winner. Thanks, I'll read up on it.

    Right now I'm writing up some multi-key comparison code in Quicksort.

    Thanks for your reply.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > Records sorted: 1,000,000. Columns: 52. Ints are limited to 999 here
    18 seconds?!

    My machine (Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz) takes just over 1/2 second to sort 1Mx52 records with naive qsort.
    If I use -O2 and sort pointers to records (to cut out massive memcopies), it's down to 1/4 second.

    My compare function is just this
    Code:
    int cmp1 ( const void *a, const void *b ) {
      const struct person *pa = (const struct person *)a;
      const struct person *pb = (const struct person *)b;
      for ( int i = 0 ; i < LEN ; i++ ) {
        if ( pa->data[i] < pb->data[i] ) return -1;
        if ( pa->data[i] > pb->data[i] ) return +1;
      }
      return 0;
    }
    I think you need to post testable reference code, because this level of discrepancy suggests something else is going on.
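
    (For reference, a hedged sketch of the pointer-sorting variant mentioned above: build an array of struct person * and let qsort() swap 8-byte pointers instead of ~240-byte records. The comparator is cmp1 with one extra level of indirection.)
    Code:
    #include <stdlib.h>

    #define LEN 52
    struct person { char name[30]; int data[LEN]; };

    static int cmp_ptr(const void *a, const void *b) {
       const struct person *pa = *(const struct person * const *)a;
       const struct person *pb = *(const struct person * const *)b;
       for (int i = 0; i < LEN; i++) {
          if (pa->data[i] < pb->data[i]) return -1;
          if (pa->data[i] > pb->data[i]) return +1;
       }
       return 0;
    }

    /* Returns a malloc'd array giving the sorted order; recs[] is untouched. */
    struct person **make_sorted_view(struct person *recs, size_t n) {
       struct person **view = malloc(n * sizeof *view);
       if (!view) return NULL;
       for (size_t i = 0; i < n; i++) view[i] = &recs[i];
       qsort(view, n, sizeof *view, cmp_ptr);
       return view;
    }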
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    A database.
    Define "database".

    If this data already lives in a professional database engine, create an index and a view so you can have it fed to you a bit at a time. You will, believe me, not get a faster implementation.

    I think you need to post testable reference code, because this level of discrepancy suggests something else is going on.
    I'm thinking that he probably measured the time to read the million records from storage as part of the sorting process.

    Soma
    Last edited by phantomotap; 01-03-2013 at 02:20 AM.

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by phantomotap View Post
    O_o

    Actually, using a stable sort here would work just fine; you just need to be creative in its application.
    Yes, in experimenting with Insertion sort, I've learned how to do it -- but that's the wrong way to sort this, is it not? Don't I want to go from the left column to the right column, to get the priority right with the fewest comparisons and swaps?


    That isn't really relevant, though, because you (at least, it seems that way) have strictly ordered keys; you only need to sort by C1, then sort each range of C1 elements - all data with the same C1 value - by C2, and continue that process until you've consumed all the data.

    Honestly though, that isn't likely to be as effective as a simple application of "Quicksort" using multiple keys in order of "weight", essentially treating the array as a single larger value - but even that isn't likely to be much of an improvement.
    Not so far. I changed all the keys into a single string for each record for one test, but the sort was not much better. I know I could improve the string sorting time with MSD Radix, but it still leaves the problem of changing the key data over to strings, and then back again. Quite doable, but I hate the idea of changing A to B, only to have to change B back to A, with this big a job. Chafes my hide.
    However, your performance may be trumped by cache misses, which means that a variation of merge sort (the algorithm I have in mind is generally academic in nature, but it has been proven to have good performance in some cases) may work better.
    I thought of merge sort right away, but the fast ones I know of want a large auxiliary array - which is out of the question.
    That said, my feeling is that your problem is your choice of data structure. (You have a simple array of data that is itself an array of keys.) If sorting this data is significant, I assert that you should be keeping it in a data structure where sorting would not be an issue.
    It's quite unsorted at the moment, and the numbers will (hopefully) serve to mark similar characteristics of a large group of people. This was my bright idea: keep these characteristics, indicated by the key numbers, with the record, and then sort all the records by that array of integers. Hopefully, people with similar interests will sort nearly adjacent to each other in the final output array.

    All those numbers are actually indices into a huge array, which will be kept on some servers.

    And all of that said, I question the need to sort such a large chunk of data more than a few times in the first place. What are you doing with this data that it needs to be sorted all at once, often enough that the cost of sorting is significant?
    Mum's the word for now, but the project won't be approved if it can't meet good performance timelines. As the unofficial helper bee, I couldn't be more in the dark about the specifics, but I'm learning a lot and having fun. It's not a money gig for me.

    If this isn't the case, which I felt was implied, then I'd like a better explanation of how the data is generated/consumed.
    It's data mining, like what we have for this forum - maybe you saw the post on it. It's like that, except instead of a forum's clientele, it's a much bigger group. I'm strictly on the outside, trying to assist with some ideas and maybe code. I have a fascination with big computer projects.

  8. #8
    Registered User
    Join Date
    Dec 2012
    Posts
    45
    Quote Originally Posted by Adak View Post
    [...]

    A database.

    Well, ah - that would be NO. I intend to work everything in RAM as much as possible, so that mmap function sounds like a winner. Thanks, I'll read up on it.

    Right now I'm writing up some multi-key comparison code in Quicksort.

    Thanks for your reply.
    Then mmap() won't help, unless you want to use a raw binary file in your computer as a big cache of the database.

    mmap() would be the way to go if the data were in a huge binary file instead of a database.
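
    (If it does turn out to be a raw binary file of structs, a minimal sketch of that route - standard POSIX calls, with error handling trimmed, and the file layout is an assumption.)
    Code:
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define LEN 52
    struct person { char name[30]; int data[LEN]; };

    int cmp1(const void *a, const void *b);   /* lexicographic compare */

    /* Map the whole file and sort the records in place. */
    void sort_mapped(const char *path) {
       int fd = open(path, O_RDWR);
       struct stat st;
       fstat(fd, &st);
       struct person *recs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
       qsort(recs, st.st_size / sizeof *recs, sizeof *recs, cmp1);
       munmap(recs, st.st_size);   /* changes are flushed back to the file */
       close(fd);
    }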

    Since you have a database, the ideal would be to read/write large blocks of data and work on them. If there is some transparent way to do this buffering, use it.

    Since you are using quicksort, you might want to have two buffers (or sliding windows): one for the lower extreme and one for the upper extreme.

    One stupid question: Can't your database sort the data for you?

  9. #9
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    I have a fascination with big computer projects.
    This isn't a big computer project.

    Sure, it isn't small, but at least one member of the board has bragged about having a development desktop where the entire chunk of data could fit into RAM.

    When a single consumer machine can crush your data, you don't have a big problem.

    I hate the idea of changing A to B, only to have to change B back to A, with this big a job.
    Don't!?

    You don't need a particular flavor of "digit" for "MSD Radixsort" to work. The "digits" you have will work fine.
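
    (A sketch of MSD radix with the columns themselves as digits: counting-sort on one column into a scratch array, then recurse into each bucket on the next column. RANGE matches the 0..999 mock data, and the O(n) scratch buffer is an assumption the memory budget would have to allow.)
    Code:
    #include <string.h>

    #define COLS  52
    #define RANGE 1000    /* mock data; the real keys go higher */

    struct person { char name[30]; int data[COLS]; };

    void msd_radix(struct person *a, struct person *tmp, size_t n, int col) {
       if (n < 2 || col >= COLS) return;
       size_t count[RANGE + 1] = {0};
       for (size_t i = 0; i < n; i++) count[a[i].data[col] + 1]++;
       for (int v = 0; v < RANGE; v++) count[v + 1] += count[v];
       for (size_t i = 0; i < n; i++)          /* stable scatter */
          tmp[count[a[i].data[col]]++] = a[i];
       memcpy(a, tmp, n * sizeof *a);
       for (int v = 0; v < RANGE; v++) {       /* recurse into buckets */
          size_t lo = v ? count[v - 1] : 0;    /* bucket v is [lo, count[v]) */
          if (count[v] - lo > 1)
             msd_radix(a + lo, tmp + lo, count[v] - lo, col + 1);
       }
    }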

    This was my bright idea, to keep these characteristics indicated by the key numbers, with the record, and then sort all the records by that array of integers. Hopefully, people with similar interests will sort nearly adjacent to each other, in the final output array.
    Your thinking in that direction was flawed.

    An array of arrays is the wrong data structure for analyzing multiple collisions, especially if you venture into fuzzy matching.

    You would use more memory, but with a variation of a "B-Tree" structure you "buy" the relationships you say you are interested in at the cost of building and updating the structure. If you build it in reverse (factors as parent nodes referencing the identifiers of the people interested in those factors), you have the potential to find people who share multiple interests by examining only the interests of the target individual.
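
    (One way to picture that reversed structure - not a B-tree, just a flat inverted index from factor to people, with all names made up; a real version would want the tree and proper error handling.)
    Code:
    #include <stdlib.h>

    #define MAX_FACTOR 300000        /* keys range up to 300,000 */

    struct postings {                /* the people sharing one factor */
       int *ids;
       size_t len, cap;
    };

    static struct postings idx[MAX_FACTOR + 1];

    void index_add(int factor, int person_id) {
       struct postings *p = &idx[factor];
       if (p->len == p->cap) {       /* grow geometrically */
          p->cap = p->cap ? p->cap * 2 : 4;
          p->ids = realloc(p->ids, p->cap * sizeof *p->ids);
       }
       p->ids[p->len++] = person_id;
    }

    /* To find people sharing interests with person X, walk X's 52
       factors and collect idx[factor].ids for each one - no global
       sort of the 100M records required. */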

    Can't your database sort the data for you?
    I thought the same thing.

    I think he is using "database" to mean "huge binary file" rather than "database engine" (like OracleDB or something).

    Soma

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by Salem View Post
    > Records sorted: 1,000,000. Columns: 52. Ints are limited to 999 here
    18 seconds?!

    My machine (Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz) takes just over 1/2 second to sort 1Mx52 records with naive qsort.
    If I use -O2 and sort pointers to records (to cut out massive memcopies), it's down to 1/4 second.

    My compare function is just this
    Code:
    int cmp1 ( const void *a, const void *b ) {
      const struct person *pa = (const struct person *)a;
      const struct person *pb = (const struct person *)b;
      for ( int i = 0 ; i < LEN ; i++ ) {
        if ( pa->data[i] < pb->data[i] ) return -1;
        if ( pa->data[i] > pb->data[i] ) return +1;
      }
      return 0;
    }
    I think you need to post testable reference code, because this level of discrepancy suggests something else is going on.
    The problem isn't with Quicksort - it will sort the million records in just under half a second - but that's only one column. I couldn't get my head around the term "multi-key" with Quicksort.

    And of course, this one doesn't have a compare function - but it will!

    Code:
    void quicksortOpt(int lo, int hi, int c) {
       int i, j, pivot;
       user temp;
       if (lo >= hi) return;

       if (hi - lo < INSERT) {        //the optimizer: the best value of
          insertionSort(lo, hi+1, c); //INSERT varies with your system
          return;
       }
       i = lo;
       j = hi;
       pivot = users[lo+(hi-lo)/2].n[c];

       do {                           // Partition the array 
          while (users[i].n[c] < pivot) i++; 
          while (users[j].n[c] > pivot) j--;
          if (i<=j) {  
             temp=users[i];           //swap two structs
             users[i]=users[j];
             users[j]=temp;
             i++;
             j--;
          }
       } while(i<=j);

       if (lo < j) quicksortOpt(lo, j, c);  //recurse into both partitions
       if (i < hi) quicksortOpt(i, hi, c);
    }
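
    (For illustration, a sketch of what that multi-key version might look like - a guess at the plan, not the poster's actual code: the same partition scheme, but comparing whole records against a pivot record and bailing out at the first column that differs.)
    Code:
    #define COLS 52
    typedef struct { char name[30]; int n[COLS]; } user;
    extern user users[];

    /* Lexicographic compare, first column to last. */
    static int cmp_rec(const user *a, const user *b) {
       for (int i = 0; i < COLS; i++)
          if (a->n[i] != b->n[i])
             return (a->n[i] > b->n[i]) - (a->n[i] < b->n[i]);
       return 0;
    }

    void quicksortMK(int lo, int hi) {
       if (lo >= hi) return;
       int i = lo, j = hi;
       user pivot = users[lo + (hi - lo) / 2];  /* copy, so it can't move */
       do {
          while (cmp_rec(&users[i], &pivot) < 0) i++;
          while (cmp_rec(&users[j], &pivot) > 0) j--;
          if (i <= j) {
             user temp = users[i];               /* swap two structs */
             users[i] = users[j];
             users[j] = temp;
             i++; j--;
          }
       } while (i <= j);
       if (lo < j) quicksortMK(lo, j);
       if (i < hi) quicksortMK(i, hi);
    }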
    The "data" I'm working with is just mock-up stuff that I've created for this. The entire slowdown is the "multi-key" part I coded up, which has the speed of turtles on the Galapagos. It was just a placeholder, but "Column sorting" didn't work on Google, for sure. I was thinking of the way the data was being represented, instead of how it was being used.

    The time was accurate, I assure you. No disk load time was included, because there was no disk load at all - the data is generated first, using rand(). The timer starts just before the call to quicksort(). It stops the moment the function completes AND my "pile driver" block of code has fixed all the column-sorting issues.

    But I won't trouble you with the "pile driver" block of code. It's just a workaround, until I get more time.

    And good ideas - thanks, Salem!

  11. #11
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    But I won't trouble you with the "pile driver" block of code.
    O_o

    Well, now that we are here, my curiosity is acting up.

    How was your sorting invocation complete before the "pile driver" did its job?

    Really, this is kind of bugging me now: where did the "pile driver" live, such that it did the job of helping your "Quicksort" implementation sort multiple keys - however you approached it - without being part of the sorting process from the sorting invocation?

    Is that the "convert a string back to the number sequence" from earlier?

    Soma

  12. #12
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    So basically, you were sorting all 50+ columns in sequence, rather than bailing out at the first different column.

    Given the data distribution, it's pretty much all over by the 3rd column - with a million records and column values 0..999, on average only about one record shares any given first-two-column prefix.
    Code:
      0   1 144 944 427 527 937 456 551 242 795 573 218 674 809 842 752 463 252 859 
      0   1 688 363 222 350 903 121  36 920 213 539 389 774 325 292 164 994 312 369 
      0   1 875 937 107 528 418 201 686 462 864 737 957  10 972 472 832 912 827 755 
      0   3  49 326 108 425  82  36 417 455 995 560  87 230 650 460 422 440 998 841 
      0   4 266 143 354 983  49 749 628 153 413  16 473 971 833 642 132 728 505 500 
      0   4 545 748 563 385 818 830 243 257 605 979 852 503 378 167 139 907 133  76 
      0   4 863 536 140 393 648 747 101  10 789  77 565 492 350 932 703 812 923 691 
      0   7  34 514   1 577 893 841 930  38  19 474   6 179 502 142 525 219 608 502 
      0   7 354 639 481 913   7 834 300  39   0 259 420 192 510 656 666 933 356  95 
      0   8 397 649 516 915 287 393 738 321 806  83 231 325 941 224  46 675 998 620 
      0   8 802 816 964 307 442   4  99 346 333  59  32 802 805 209 210 492   1 489
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  13. #13
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    So basically, you were sorting all 50+ columns in sequence, rather than bailing out at the first different column.
    OOoohhh.

    Well... I'll just be over there...

    Soma

  14. #14
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by phantomotap View Post
    O_o

    Well, now that we are here my curiosity is acting up.

    How was your sorting invocation complete before the "pile driver" did its job?
    Only the first column was correctly sorted.

    Really, this is kind of bugging me now: where did the "pile driver" live, such that it did the job of helping your "Quicksort" implementation sort multiple keys - however you approached it - without being part of the sorting process from the sorting invocation?

    First, Quicksort handled the first column. Quicksort was done. Then the pile driver started in, went down all the remaining 51 columns, and handled any keys out of place, swapping records as it went. It's slow and definitely not something to brag about. I call it "pile driver" because it works just like a real pile driver as it goes down the columns.

    Such is the fertile imagination of the simple mind.

    Is that the "convert a string back to the number sequence" from earlier?

    Soma
    No. The string sorting of the keys was an idea I wanted to try, but since the sorting time was just so-so, I knew it wasn't what I wanted.

    @Salem: Yes, all the remaining 51 columns. My contact mentioned up to 300 columns, but I didn't see how that would be possible. She also mentioned millions more possible records. I remain skeptical of that.

    The data you're looking at is just mock-up random ints from me, so I had SOMETHING to work with. All I've seen so far is a few pretty pictures. No data.
    Last edited by Adak; 01-03-2013 at 05:04 AM.

  15. #15
    Registered User
    Join Date
    Dec 2012
    Posts
    45
    Quote Originally Posted by Adak View Post
    The problem isn't with Quicksort - it will sort the million records in just under half a second - but that's only one column. I couldn't get my head around the term "multi-key" with Quicksort.


    And of course, this one doesn't have a compare function - but it will!


    Code:
    void quicksortOpt(int lo, int hi, int c) {  
       //...
    
    
       do {                           // Partition the array 
          while (users[i].n[c] < pivot) i++; 
          while (users[j].n[c] > pivot) j--;
          if (i<=j) {  
             temp=users[i];           //swap two structs
             //...
    The "data" I'm working with is just mock-up stuff that I've created for this. The entire slowdown is the "multi-key" part I coded up, which has the speed of turtles on the Galapagos. It was just a placeholder, but "Column sorting" didn't work on Google, for sure. I was thinking of the way the data was being represented, instead of how it was being used.


    The time was accurate, I assure you. No disk load time was included, because there was no disk load at all - the data is generated first, using rand(). The timer starts just before the call to quicksort(). It stops the moment the function completes AND my "pile driver" block of code has fixed all the column-sorting issues.


    But I won't trouble you with the "pile driver" block of code. It's just a workaround, until I get more time.


    And good ideas - thanks, Salem!

    First, you should use the standard qsort(). Your code searches for pairs of elements and swaps them, taking three assignments per swap. The most widespread implementation of quicksort manages to reduce the number of assignments. The trick is similar to the one used in insertion sort: instead of doing swaps, you copy one item into an auxiliary variable, creating a 'hole' in the array; then you copy another value of the array into this hole, creating a new hole, and so on; at the end you copy the initial value into the final location of the hole. On average, you save up to 33% of the assignments.


    Code:
           #include <stdlib.h>
    
    
           void qsort(void *base, size_t nmemb, size_t size,
                      int(*compar)(const void *, const void *));
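
    (The "hole" trick in insertion-sort form, as a sketch on plain ints: one load into a temporary opens the hole, larger elements slide right with one assignment each instead of a three-assignment swap, and one store closes it - hence the roughly one-third saving.)
    Code:
    #include <stddef.h>

    void insertion_sort_hole(int *a, size_t n) {
       for (size_t i = 1; i < n; i++) {
          int temp = a[i];                 /* lift out: hole at i */
          size_t j = i;
          while (j > 0 && a[j - 1] > temp) {
             a[j] = a[j - 1];              /* slide the hole left */
             j--;
          }
          a[j] = temp;                     /* drop into the hole */
       }
    }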

    In addition, the standard qsort() might be more careful with the stack by using tail recursion. Your code uses too much stack in the 'evil' worst case of quicksort.


    Second, the structures are quite large, so don't sort them. Sort an array of indices (or pointers) instead. This will speed up the assignments a lot (maybe up to x50).


    Third, quicksort is not a stable sort. You can't use the trick of sorting by one column, then sorting by another column... You have to sort only once, but with a slightly more complicated comparison. Don't worry: it will be way faster, since most of the time the comparison will be decided after comparing the first element. Again, an additional x50 speed-up.


    The function provided by Salem looks fine to me. I would just add one level of indirection in order to sort an array of pointers to structs, and not the structs themselves.

    Quote Originally Posted by Adak View Post
    [..] It's quite unsorted at the moment, and the numbers will (hopefully) serve to mark similar characteristics of a large group of people. This was my bright idea: keep these characteristics, indicated by the key numbers, with the record, and then sort all the records by that array of integers. Hopefully, people with similar interests will sort nearly adjacent to each other in the final output array.


    All those numbers are actually indices into a huge array, which will be kept on some servers.
    [..]

    Ooooow. Then I have bad news for you :-(


    Suppose two individuals with nearly the same values in every field, except for the first one: they will end up far from each other, since every record whose first field falls between theirs sorts in between.


    You need some classification algorithm: Statistical classification - Wikipedia, the free encyclopedia


    The choice of what to use is completely dependent on the nature of your data and what you want to achieve.


    This is a BIG problem. Bigger than sorting ;-)
