Thread: Initializing a 4-D array

  1. #1
    Registered User
    Join Date
    Feb 2008
    Posts
    77

    Initializing a 4-D array

    Hello to all,

    I need help initializing a 4-D array that will hold all instances from [0,0,0,0] to [4,4,4,4] (only using 1-4). I need it for a comparison chart. What I am trying to do is scan a 5 million character string for each particular 4-mer that occurs and keep count. Any help would be great. Below is the code I have so far.

    Code:
    char t1, t2, t3, t4 ;
    int index ;
    int tupleCount[4][4][4][4] ;

    for( index = 0 ; index < strlen(seqData) - 3 ; ++index )
    {
        t1 = seqData[index] ;
        t2 = seqData[index + 1] ;
        t3 = seqData[index + 2] ;
        t4 = seqData[index + 3] ;
        // count the different 4-mers
    }
    Sorry, my 4-D array should hold (1,1,1,1) to (4,4,4,4).
    Last edited by gkoenig; 03-01-2008 at 05:53 PM. Reason: error in logic

  2. #2
    Hurry Slowly vart
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,786
    "from [0,0,0,0] to [4,4,4,4] (only using 1-4)."
    I don't get it.
    0-4 is 5 values;
    using only the values 1-4 you cannot get the [0,0,0,0] vector.

    So what do you want to do?
    How is seqData related to what you need?

    To fill a 4-D array, use four loops, one nested inside the other:
    Code:
    for(i=0;i<4;i++)
       for(j=0;j<4;j++)
          for(k=0;k<4;k++)
             for(l=0;l<4;l++)
                tupleCount[i][j][k][l] = ...;

  3. #3
    Frequently Quite Prolix dwks
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    "Sorry, my 4-D array should hold (1,1,1,1) to (4,4,4,4)."
    Not (0,0,0,0)? Hmm. In that case, you either waste space or subtract one every time you index the array.

    Code:
    for( index = 0 ; index < strlen(seqData) - 3 ; ++index )
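    strlen() has to walk the entire string every time it is called, so it is quite expensive on a 5-million-character sequence. You probably shouldn't call it in a loop condition. Consider calculating the length once beforehand and storing it in a variable.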

    Now, whether this code works depends on what seqData is.
    Code:
                    t1 = seqData[index] ;
                    t2 = seqData[index + 1] ;
                    t3 = seqData[index + 2] ;
                    t4 = seqData[index + 3] ;
    If seqData were a 4-D array like tupleCount, seqData[x] would give you a 3-D sub-array, seqData[x][y] a 2-D one, and so on, until you get a single element with seqData[x][y][z][a]. In that case you couldn't do this sort of assignment, because t1, t2, t3, and t4 are all chars.

    If seqData is an ordinary string (a char * or a char array), seqData[index] is already a char and the assignments are fine.

    I'm not really sure what you're trying to do here. Perhaps you could explain what a 4-mer is, for a start.
    dwk

  4. #4
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Sorry for the confusion; I was rushing when I posted. You are right. I should have said (0,0,0,0) to (3,3,3,3).


    I guess it will be easier if I elucidate a little. I am trying to scan a sequence of DNA, from the first letter to the last letter minus 3, for all the 4-mers that occur. I need to collect all the 4-mers so that I can see the total distribution. The array is to hold my count of each 4-mer as I progress through the sequence. I was also mapping 0->a, 1->c, 2->g, and 3->t. I am not sure if this is a redundant step, because when I print my distribution it will contain letters, not numbers. Hopefully this helps.

  5. #5
    uint64_t...think positive xuftugulus
    Join Date
    Feb 2008
    Location
    Pacem
    Posts
    355
    Of course it does. If you name your array tupleCount and use the suggested mapping of a,c,g,t to 0,1,2,3, which I will call acgt_to_int(), then:
    Code:
    /* TODO: Verify that the seqData pointer points to enough data */
    t1 = acgt_to_int(seqData);
    t2 = acgt_to_int(seqData + 1);
    t3 = acgt_to_int(seqData + 2);
    t4 = acgt_to_int(seqData + 3);
    /* TODO: Validate that t1..t4 are correctly in range */
    tupleCount[t1][t2][t3][t4]++;
    I leave the TODO tags to you.
    Code:
    ...
        goto johny_walker_red_label;
    johny_walker_blue_label: exit(-149$);
    johny_walker_red_label : exit( -22$);
    A typical example of ...cheap programming practices.

  6. #6
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Is there a simple C function to translate a,c,g,t to 0,1,2,3?

  7. #7
    uint64_t...think positive xuftugulus
    Join Date
    Feb 2008
    Location
    Pacem
    Posts
    355
    That is up to you to create. Using a function will mask the gory details of your conversion.
    You can do something like:
    Code:
    int acgt_to_int(char *x)
    {
        if(*x == 'a')
            return 0;
        ....
    }
    Guess the rest. Also note that you should handle both upper- and lower-case letters, and pick a return value to indicate an error.
    I'd suggest -1.
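
    One way the rest of that function could look, as a minimal sketch rather than the poster's actual code. It keeps the pointer parameter and the -1 error value suggested above; using tolower() for the case handling is an assumption about how to meet that requirement.

```c
#include <ctype.h>

/* Map the nucleotide letter that x points at to an array index.
 * Returns 0..3 for a, c, g, t in either case, and -1 for anything else
 * so the caller can decide how to treat unexpected characters. */
int acgt_to_int(const char *x)
{
    /* The cast avoids undefined behavior in tolower() for negative chars. */
    switch (tolower((unsigned char)*x)) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}
```

    Returning -1 instead of calling exit() lets the caller skip a bad window or report its position.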

  8. #8
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    I am not sure what I did wrong, but I am getting all kinds of errors and warnings. I've looked through the code and can't pick up on anything. Any help would be great.

    Code:
    // A Program to count the count of 4-mers in a nucleotide sequence.
    
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <stdlib.h>
    
    FILE *input ; 
            FILE *outpur ;
            char buffer[1000] ;        
            int i = 0 ;
            int c ;
            int w,x,y,z ;
            char tr ;
             
            char *seqData ;
           
            seqData = (char *)malloc(10000000) ;
            
           
            char t1, t2, t3, t4 ;
            int index ; 
    
    int acgt_to_0123(char *tr)
        {
            if(*tr == 'a')
                {
                return 0 ;
                }
            else if(*tr == 'c')
                {
                return 1 ;
                }
            else if(*tr == 'g')
                {
                return 2 ;
                }
            else if(*tr == 't')
                {
                return 3 ;
                }
            else
                {
                printf("Non acgt character\n") ;
                exit(0) ;
                }
        }
    
    main( int argc, char **argv )
        {
                    
            
            // Initialize 4-D array
            int tupleCount[4][4][4][4] ;
            for(w = 0 ; w < 4 ; w++ )
                for(x = 0 ; x < 4 ; x++ )
                    for(y = 0 ; y < 4 ; y++ ) 
                        for(z = 0 ; z < 4 ; z++ )
                            {
                                tupleCount[w][x][y][z] ;
                            }
            
            // Open input file to read from        
            if( ! ( input = fopen( argv[1], "r" ) ) )
                { 
                    printf( "COULD NOT OPEN FILE %s - Exit!\n", argv[1]) ; 
                    exit(1) ; 
                }        
            
            
            // Collect sequence from GenBank file
            while(fgets(buffer, 1000, input))
                            {
                            // start obtaining bases after ORIGIN
                            if(strstr(buffer, "ORIGIN")) 
                                {                                                  
                                   while((c=getc(input)) != '/' && c != EOF)
                                        {
                                        if(c >= 'a' && c <= 'z')
                                            {
                                            seqData[i++] = c ;
                                            }
                                        }                            
                                }           
                            }
                            
            
            
            
            // Scan DNA sequence for each 4-mer 
            for( index = 0 ; index < strlen(seqData) - 3 ; ++index )
                {
                    t1 = acgt_to_0123(seqData[index]) ;
                    t2 = acgt_to_0123(seqData[index + 1]) ;
                    t3 = acgt_to_0123(seqData[index + 2]) ;
                    t4 = acgt_to_0123(seqData[index + 3]) ;
                    
                    
                    // Accumulate a count to find distribution
                    tupleCount[t1][t2][t3][t4]++ ;
                    
                }
        
                       
            fclose(input) ;
        
            printf("Here is the distribution of 4-mers:\n\n%s", tupleCount ) ;     
            free(seqData) ; 
            return(0) ;
        }
    thanks

  9. #9
    uint64_t...think positive xuftugulus
    Join Date
    Feb 2008
    Location
    Pacem
    Posts
    355
    Code:
            char tr ;
    Unused variable. Remove it.
    Code:
            seqData = (char *)malloc(10000000) ;
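    This is not allowed. An assignment statement cannot appear outside a function, and a global can only be initialized with a constant expression, so the malloc() call has to go inside a function such as main().
    Why so many globals? Move those declarations to the start of the main() block.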
    Code:
            // Initialize 4-D array
            int tupleCount[4][4][4][4] ;
            for(w = 0 ; w < 4 ; w++ )
                for(x = 0 ; x < 4 ; x++ )
                    for(y = 0 ; y < 4 ; y++ ) 
                        for(z = 0 ; z < 4 ; z++ )
                            {
                                tupleCount[w][x][y][z] ;
                            }
    What exactly is that statement supposed to do? As written it reads the element and throws the value away. Did you forget the = 0?
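
    Besides adding = 0 inside the four loops, the table can be zeroed in one step. The sketch below assumes a 4x4x4x4 table of int like tupleCount; the function name clear_counts is made up for illustration. When the array is declared locally, writing int tupleCount[4][4][4][4] = {0}; at the declaration has the same effect.

```c
#include <string.h>

/* Reset every counter in a 4x4x4x4 table to zero. */
void clear_counts(int counts[4][4][4][4])
{
    /* sizeof(int[4][4][4][4]) covers the whole table (256 ints).
     * All-bits-zero represents the int value 0, so memset is safe here. */
    memset(counts, 0, sizeof(int[4][4][4][4]));
}
```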
    Code:
            // Scan DNA sequence for each 4-mer 
            for( index = 0 ; index < strlen(seqData) - 3 ; ++index )
                {
                    t1 = acgt_to_0123(seqData[index]) ;
                    t2 = acgt_to_0123(seqData[index + 1]) ;
                    t3 = acgt_to_0123(seqData[index + 2]) ;
                    t4 = acgt_to_0123(seqData[index + 3]) ;
                    
                    
                    // Accumulate a count to find distribution
                    tupleCount[t1][t2][t3][t4]++ ;
                    
                }
    In the definition of your function acgt_to_0123, I proposed passing a pointer to a character.
    You instead pass the character itself. Change the function to work with characters, as that seems more straightforward to you (hint: remove every * from acgt_to_0123).

    And a side note on correctness. Look at the for loop in your code and read aloud what is going on. You are moving the index ONE character at a time, but I presume the 4-mers are 4 characters long? Or is it valid for your exercise to do things like:
    YOUR INPUT: 'acaactgagatc'
    YOUR 4-MERS: 'acaa', 'caac', 'aact', 'actg', 'ctga', 'tgag', 'gaga', 'agat', 'gatc'
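
    The overlapping scan above can be sketched as follows. This is an illustration under assumptions, not the poster's program: base_index and count_4mers are made-up names, the table must be zeroed by the caller, and windows containing anything other than a/c/g/t are simply skipped.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: a, c, g, t (either case) -> 0..3, anything else -> -1. */
static int base_index(char base)
{
    switch (tolower((unsigned char)base)) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}

/* Count every overlapping 4-mer in seq[0..len-1].
 * counts must already be zeroed. Returns the number of windows counted. */
size_t count_4mers(const char *seq, size_t len, int counts[4][4][4][4])
{
    size_t counted = 0;

    /* i + 3 < len stops at the last full window and is safe when len < 4. */
    for (size_t i = 0; i + 3 < len; ++i) {
        int t1 = base_index(seq[i]);
        int t2 = base_index(seq[i + 1]);
        int t3 = base_index(seq[i + 2]);
        int t4 = base_index(seq[i + 3]);

        if (t1 < 0 || t2 < 0 || t3 < 0 || t4 < 0)
            continue;   /* not a pure a/c/g/t window */

        counts[t1][t2][t3][t4]++;
        counted++;
    }
    return counted;
}
```

    For the 12-character input 'acaactgagatc' this visits the nine windows listed above, each exactly once.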
    Last edited by xuftugulus; 03-02-2008 at 10:19 AM.

  10. #10
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Thanks for the input; will implement. What you said about the 4-mers is exactly right. The input sequence is 5 million bases, and I must collect all of those overlapping examples. For instance:

    count of 4-mers:

    aaaa = 3
    aaac = 2
    .
    .
    .
    tttt = 7

    That's what I need printed out at the end.
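
    A sketch of how that listing could be produced from the count table, assuming the 0->a, 1->c, 2->g, 3->t mapping used in this thread. The names format_4mer and print_counts are invented for the example.

```c
#include <stdio.h>

/* Turn four indices (each 0..3) back into letters, e.g. 0,0,0,1 -> "aaac". */
void format_4mer(int w, int x, int y, int z, char out[5])
{
    static const char letters[] = "acgt";

    out[0] = letters[w];
    out[1] = letters[x];
    out[2] = letters[y];
    out[3] = letters[z];
    out[4] = '\0';
}

/* Write one "name = count" line per 4-mer, in aaaa..tttt order. */
void print_counts(FILE *fp, int counts[4][4][4][4])
{
    char name[5];

    for (int w = 0; w < 4; w++)
        for (int x = 0; x < 4; x++)
            for (int y = 0; y < 4; y++)
                for (int z = 0; z < 4; z++) {
                    format_4mer(w, x, y, z, name);
                    fprintf(fp, "%s = %d\n", name, counts[w][x][y][z]);
                }
}
```

    Calling print_counts(stdout, tupleCount) after the scan would print all 256 lines; each counter is printed with its own %d.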

    Again, thanks for the help. I really appreciate it. When I am done this semester I will make sure to go back and get more fluent in the language.

  11. #11
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Okay, I've gotten it to compile but now it will not stop running. Did I enter an endless loop somewhere?

  12. #12
    uint64_t...think positive xuftugulus
    Join Date
    Feb 2008
    Location
    Pacem
    Posts
    355
    Exactly how big is the file being scanned? Post the size in bytes. Calling strlen() on every iteration of the for loop that gathers the statistics means that, if N is the length of the sequence, the program does on the order of N^2 memory accesses. It should finish, but if N is 1E+6, only after about 1E+12 memory accesses, and I think N is bigger in your case. Compute strlen(seqData) once and store it in a temporary variable; that takes you from quadratic to linear time in N. What an optimization!
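
    A minimal sketch of that change, with the made-up function name count_windows. It also writes the loop bound as i + 3 < len: strlen() returns an unsigned size_t, so strlen(seqData) - 3 wraps around to a huge value when the string is shorter than 3 characters.

```c
#include <stddef.h>
#include <string.h>

/* Count how many 4-character windows fit in seq. */
size_t count_windows(const char *seq)
{
    size_t len = strlen(seq);   /* walked once, outside the loop */
    size_t windows = 0;

    /* i + 3 < len never subtracts, so short strings yield zero windows. */
    for (size_t i = 0; i + 3 < len; ++i)
        ++windows;

    return windows;
}
```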

  13. #13
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Code:
    int lengthDNA = strlen(seqData) ;
            
            // Scan DNA sequence for each 4-mer 
            for( index = 0 ; index < (lengthDNA - 3) ; ++index )
                {
                    t1 = acgt_to_0123(seqData[index]) ;
                    t2 = acgt_to_0123(seqData[index + 1]) ;
                    t3 = acgt_to_0123(seqData[index + 2]) ;
                    t4 = acgt_to_0123(seqData[index + 3]) ;
                    
                    
                    // Accumulate a count to find distribution
                    tupleCount[t1][t2][t3][t4]++ ;                
                }
    The file I used as input was 11 MB and the sequence is 5 million chars long. It still keeps running endlessly.

  14. #14
    uint64_t...think positive xuftugulus
    Join Date
    Feb 2008
    Location
    Pacem
    Posts
    355
    Do a verification printf on lengthDNA, before the loop.

  15. #15
    Registered User
    Join Date
    Feb 2008
    Posts
    77
    Code:
     int lengthDNA = strlen(seqData) ;
            
            printf("here is the length of the DNA: %d", lengthDNA) ;
            
            // Scan DNA sequence for each 4-mer 
            for( index = 0 ; index < (lengthDNA - 3) ; ++index )
                {
                    t1 = acgt_to_0123(seqData[index]) ;
                    t2 = acgt_to_0123(seqData[index + 1]) ;
                    t3 = acgt_to_0123(seqData[index + 2]) ;
                    t4 = acgt_to_0123(seqData[index + 3]) ;
                    
                    
                    // Accumulate a count to find distribution
                    tupleCount[t1][t2][t3][t4]++ ;                
                }
    It stopped the loop and printed out the sentences. Why was that able to stop the loop?
    Also, it did not print out the distribution of 4-mers.

    Code:
    printf("Here is the distribution of 4-mers:\n\n%s", tupleCount ) ;
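
    On that last printf: %s tells printf to expect a pointer to a NUL-terminated string, so passing the int array tupleCount cannot print the counts; each counter has to be printed with its own %d. Separately, in the full listing earlier in the thread seqData is filled one character at a time and never terminated, so strlen() on it is unreliable. Whether that explains the long run is an assumption the thread does not confirm. A sketch of keeping the buffer terminated while collecting bases, using the invented helper name append_base:

```c
#include <stddef.h>

/* Append c to buf if there is room for it plus the terminating NUL.
 * *len is the current string length and is updated on success.
 * Returns 1 on success, 0 if the buffer is full. */
int append_base(char *buf, size_t cap, size_t *len, char c)
{
    if (*len + 1 >= cap)
        return 0;

    buf[(*len)++] = c;
    buf[*len] = '\0';   /* buf stays a valid C string after every append */
    return 1;
}
```

    With this, the length is known from *len once the file has been read, and no strlen() call is needed at all.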
