Generating DNA sequence

**Adak** · 04-06-2009

Listen?

Are you joking, Phillip? I'm singing in his choir!

That doesn't mean I can program as well as the two of you.

**ಠ_ಠ** · 04-06-2009

I'm just a 1st year Software Engineering student, I didn't even know how to do what I suggested

**Adak** · 04-06-2009

I've taken one class in BASIC, and one class in C. That was when dinosaurs ruled the earth, I think.

**Teiji** · 04-06-2009

Adak, oh it's a pretty long assignment. I have to read an input file containing thousands of DNA strands and count the value of each case (how many times it occur for base 1, base 2, base 3, etc.), get the min and max value of the case, and write them (the value as well as the case) to an output file.

The reason why I ask for help on generating the sequence is because I would need to compare all the sequence produced to the strands of DNA in the input file.

Btw Adak, in your program, how would you save it into an array to reference later instead of printing to stdout?

Anyway, thanks for all of you guys help.

**Adak** · 04-06-2009

I would use Snufist's or d_d's program, although I haven't checked them out yet.

For my program, right below each print statement, put in your assignment statement into the array, and rem out the print line.

Won't the array be moot due to the huge size of the strings?

**Teiji** · 04-06-2009

Originally Posted by Adak

I would use Snufist's or d_d's program, although I haven't checked them out yet.

For my program, right below each print statement, put in your assignment statement into the array, and rem out the print line.

Won't the array be moot due to the huge size of the strings?

Performance is not the issue here. I'm going for correctness and ease of understanding.

**Adak** · 04-06-2009

Originally Posted by Teiji

Performance is not the issue here. I'm going for correctness and ease of understanding.

You're going to use a char array to save space, with two dimensions? That lets each string be in it's own row.

Perhaps have the base number of the string to correspond with the index number of the char array. That's how I would do it.

You'll want a small auxillary unsigned long long int array of size 4 to count up your sums.

If you know the specifics of your letters array, and want some help getting the letters into it right, let me know.

**Teiji** · 04-06-2009

Originally Posted by Adak

You're going to use a char array to save space, with two dimensions? That lets each string be in it's own row.

Perhaps have the base number of the string to correspond with the index number of the char array. That's how I would do it.

You'll want a small auxillary unsigned long long int array of size 4 to count up your sums.

If you know the specifics of your letters array, and want some help getting the letters into it right, let me know.

This is what I did and it works.

Code:

/* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to
5. Adak, April, 2009
*/

#include <stdio.h>
#include <string.h>

char base1[4][1+1];           // +1 for the string termination char
char base2[16][2+1];
char base3[64][3+1];
char base4[256][4+1];
char base5[1024][5+1];

void showIt(int base);
void printBase(int base);

int main(void)  {
   int base;


   base = 1;
   showIt(base);
   printBase(base);
   
   base = 2;
   showIt(base);
   printBase(base);
   
   base = 3;
   showIt(base);
   printBase(base);
   
   base = 4;
   showIt(base);
   printBase(base);
   
   base = 5;
   showIt(base);
   printBase(base);

   return 0;
}
void showIt(int base) {
   int a, b, c, d, e, i; 
   unsigned long perm_num = 0;
   unsigned long perm_mill = 0;
   char dna[] = {"ACGT"};
   

   char tempSequence[base+1];        // +1 for the string termination char

   
   printf("\n"); 

   int k = 0;
   if(base) {
   for(a = 0; a < 4; a++)  {
      if(base == 1) {
         sprintf(tempSequence, "%c\0", dna[a]);
         strcpy(base1[k],tempSequence);
         k++;
         perm_num++;
      }
      if(base > 1) {
      for(b = 0; b < 4; b++) {
         if(base == 2) {
            sprintf(tempSequence, "%c%c\0", dna[a], dna[b]);
            strcpy(base2[k],tempSequence);
            k++;
            perm_num++;
         }
         if(base > 2) {
         for(c = 0; c < 4; c++)  {
            if(base == 3) {
               sprintf(tempSequence, "%c%c%c\0", dna[a], dna[b], dna[c]);
               strcpy(base3[k],tempSequence);
               k++;
               perm_num++;
            }
            if(base > 3) {
            for(d = 0; d < 4; d++) {
               if(base == 4) {
                  sprintf(tempSequence, "%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d]);  // dna[t]);
                  strcpy(base4[k],tempSequence);
                  k++;
                  perm_num++;
               }
               if(base > 4) {
               for(e = 0; e < 4; e++) {
                  if(base == 5) {
                     sprintf(tempSequence, "%c%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d], dna[e]); 
                     strcpy(base5[k],tempSequence); 
                     k++;
                     perm_num++;
                  }
               }
               }
//sprinkle this around where the permutations become > 1,000,000
               if(perm_num > 1000000) {
                  perm_num = 0;
                  perm_mill++;
               }
            }
            }

         }
         }
      }
      }
   }
   }

   
   
   if(perm_mill)
      printf("Permutations found: Millions: %lu %lu \n", perm_mill, perm_num);
   else
      printf("\n Permutations found: %lu \n", perm_num);
}

void printBase(int base) {

   // TEST PRINT BASE 1-5
   int i;
   
   switch(base) {
   
      case 1:
         for (i = 0; i < 4; i++)
            printf("base1[%d] = %s \n", i, base1[i]);
         break;
         
      case 2:
         for (i = 0; i < 16; i++)
            printf("base2[%d] = %s \n", i, base2[i]);
         break;
         
      case 3:
         for (i = 0; i < 64; i++)
            printf("base3[%d] = %s \n", i, base3[i]);
         break;
         
      case 4:
         for (i = 0; i < 256; i++)
            printf("base4[%d] = %s \n", i, base4[i]);
         break;
         
      case 5:
         for (i = 0; i < 1024; i++)
            printf("base5[%d] = %s \n", i, base5[i]);
         break;
         
      default:
         break;
   }
}

But if there is a better way, please point me out.

Also, I want to do thru base 15, so where would I insert another function like you said?

After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

Code:

               if(base > 4) {
               for(e = 0; e < 4; e++) {
                  if(base == 5) {
                     sprintf(tempSequence, "%c%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d], dna[e]); 
                     strcpy(base5[k],tempSequence); 
                     k++;
                     perm_num++;
                  }
                  xxx
               }
               }

Should I insert it at the xxx?

**Adak** · 04-06-2009

Originally Posted by Teiji

This is what I did and it works.

Code:

/* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to
5. Adak, April, 2009
*/

#include <stdio.h>
#include <string.h>

/* I suggest using one 2D char array[][]. and have malloc set aside the memory for each
row - which is great because each row will be a completely different length.

After lunch, I'll post up some code as an example of this. It's very clear, and easy.

That allows base 1 to be in array[row1][].

Yes, you want each row to have +1 spaces in it for the string terminator, if you want to
use the string functions, at all.
 
char base1[4][1+1];           // +1 for the string termination char
char base2[16][2+1];
char base3[64][3+1];
char base4[256][4+1];
char base5[1024][5+1];



void showIt(int base);
void printBase(int base);

int main(void)  {
   int base;


/* this should be done in a simple for loop */

   base = 1;
   showIt(base);
   printBase(base);
   
   base = 2;
   showIt(base);
   printBase(base);
   
   base = 3;
   showIt(base);
   printBase(base);
   
   base = 4;
   showIt(base);
   printBase(base);
   
   base = 5;
   showIt(base);
   printBase(base);

   return 0;
}
void showIt(int base) {
   int a, b, c, d, e, i; 
   unsigned long perm_num = 0;
   unsigned long perm_mill = 0;
   char dna[] = {"ACGT"};
   

   char tempSequence[base+1];        // +1 for the string termination char

   
   printf("\n"); 

   int k = 0;
   if(base) {
   for(a = 0; a < 4; a++)  {
      if(base == 1) {
         sprintf(tempSequence, "%c\0", dna[a]);
         strcpy(base1[k],tempSequence);
         k++;
         perm_num++;
      }
      if(base > 1) {
      for(b = 0; b < 4; b++) {
         if(base == 2) {
            sprintf(tempSequence, "%c%c\0", dna[a], dna[b]);
            strcpy(base2[k],tempSequence);
            k++;
            perm_num++;
         }
         if(base > 2) {
         for(c = 0; c < 4; c++)  {
            if(base == 3) {
               sprintf(tempSequence, "%c%c%c\0", dna[a], dna[b], dna[c]);
               strcpy(base3[k],tempSequence);
               k++;
               perm_num++;
            }
            if(base > 3) {
            for(d = 0; d < 4; d++) {
               if(base == 4) {
                  sprintf(tempSequence, "%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d]);  // dna[t]);
                  strcpy(base4[k],tempSequence);
                  k++;
                  perm_num++;
               }
               if(base > 4) {
               for(e = 0; e < 4; e++) {
                  if(base == 5) {
                     sprintf(tempSequence, "%c%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d], dna[e]); 
                     strcpy(base5[k],tempSequence); 
                     k++;
                     perm_num++;
                  }
               }
               }
//sprinkle this around where the permutations become > 1,000,000
               if(perm_num > 1000000) {
                  perm_num = 0;
                  perm_mill++;
               }
            }
            }

         }
         }
      }
      }
   }
   }

   
   
   if(perm_mill)
      printf("Permutations found: Millions: %lu %lu \n", perm_mill, perm_num);
   else
      printf("\n Permutations found: %lu \n", perm_num);
}

void printBase(int base) {

   // TEST PRINT BASE 1-5
   int i;
   
   switch(base) {
   
      case 1:
         for (i = 0; i < 4; i++)
            printf("base1[%d] = %s \n", i, base1[i]);
         break;
         
      case 2:
         for (i = 0; i < 16; i++)
            printf("base2[%d] = %s \n", i, base2[i]);
         break;
         
      case 3:
         for (i = 0; i < 64; i++)
            printf("base3[%d] = %s \n", i, base3[i]);
         break;
         
      case 4:
         for (i = 0; i < 256; i++)
            printf("base4[%d] = %s \n", i, base4[i]);
         break;
         
      case 5:
         for (i = 0; i < 1024; i++)
            printf("base5[%d] = %s \n", i, base5[i]);
         break;
         
      default:
         break;
   }
}

But if there is a better way, please point me out.

I've put my suggestions in the code, as remarks.

Also, I want to do thru base 15, so where would I insert another function like you said?

Code:

               if(base > 4) {
               for(e = 0; e < 4; e++) {
                  if(base == 5) {
                     sprintf(tempSequence, "%c%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d], dna[e]); 
                     strcpy(base5[k],tempSequence); 
                     k++;
                     perm_num++;
                  }
                  xxx
               }
               }

Should I insert it at the xxx?

You won't insert another function, you just call the 2nd function, and that would be the right spot.

Before you add another function, STOP!

Get one function working *exactly* as you want it, because later, when you have more functions, it's tedious to make code corrections!

**Teiji** · 04-06-2009

I suggest using one 2D char array[][]. and have malloc set aside the memory for each
row - which is great because each row will be a completely different length.

I haven't learn about malloc yet, but that sounds like a nice function. Since I feel tired of creating a 2d array for each base.

**Sharke** · 04-06-2009

This reminds me of a guitar player's app I wrote years ago in AMOS on the Amiga. One feature was that it would calculate every possible fingering of a chord given a certain number of strings with certain notes upon a certain range of frets. I ended up writing a nested loop for each number of strings: a 6 nested loop for 6 strings, 5 for 5 and so on. The innermost loop constructed text strings of digit combinations based on the current state of the loops.

What a monstrosity, though it worked. Yet at the time I remember wishing dearly that someone would explain to me how to do it generically for n strings without having to construct a separate loop structure for each n.

**Adak** · 04-06-2009

This is how to use calloc, which is just like malloc, except it sets the memory being allocated to 0, for you. You have to tell it what base number system you're using, atm.
(that's what the 10 is in the calloc parameters).

Code:

//add this include file:

#include <stdlib.h>    //for calloc


int main(void)  {
   int i, base;
   unsigned long int j;
   char **bases;

   if((bases = calloc(15 * sizeof(char*), 10)) == NULL) {  //pointers to char
      printf("bases 1d memory allocation failed - aborting\n");
      return 1;
   }

   printf("\n\n\n");
   for(i = 0, j = 1; i < 6; i++) {      //
      j *= 4; 
      if((bases[i] = calloc(j+1 * 1, 10)) == NULL) {     //sizeof(char) is always 1
         printf("bases 2d memory allocation failed - aborting\n");
         return 1;
      }
      bases[i][j] = '\0';               //set the end of string char
      printf("%lu \n", j);              
        /* for debug only: if(bases[i] == NULL)  break; */
   }

   //for freeing up the memory when it's not going to be used anymore:
   for(i = 0; i < 15; i++)   //free the 2nd dimension
      free(bases[i]);         

   free(bases);              //then the 1st dimension

The only decent way I see to use this program, is to write out the base letters being generated, to a file, and then read them back in from the file, into the allocated array.

I knew you would not be able to put these huge bases into an array, so I didn't structure the program for that.

Even base 15 will have 1,073,nnn,nnn char's in it. < eek! >

Since you're stopping at base 15, the output can all fit into one file.

**Adak** · 04-07-2009

This is the latest version. It puts the data into a file named "bases.txt", and then loads them all up into the bases 2D array. Finally, it prints out the bases array.

Code:

/* DNA.c generates the base letters, from the set of (a,c,g,t), from 1 to
5. Adak, April, 2009
Status: improving
*/
#include <stdio.h>
#include <stdlib.h>     //for calloc
#include <math.h>     //for pow()

void writeIt(FILE *pfile, int base);
void loadIt(FILE *pfile, int base, char **bases);

unsigned long perm_num = 0;
unsigned long perm_mill = 0;
char dna[] = {"ACGT"};


int main(void)  {
   FILE *pfile;
   int i, base, gar;
   unsigned long int j;
   char **bases;

   if((bases = calloc(15 * sizeof(char*), 10)) == NULL) {  //pointers to char
      printf("bases 1d memory allocation failed - aborting\n");
      return 1;
   }

   printf("\n\n\n");
   for(i = 0, j = 1; i < 6; i++) {      //
      j *= 4; 
      if((bases[i] = calloc(j+1 * 1, 10)) == NULL) {     //sizeof(char) is always 1
         printf("bases 2d memory allocation failed - aborting\n");
         return 1;
      }
      bases[i][j] = '\0';               //set the end of string char
      printf("%lu \n", j);              
      if(bases[i] == NULL)
         break;
   }
   //at i = 15, string length will be: 1,073,741,824 char's!

   if((pfile = fopen("bases.txt", "wt")) == NULL) {
      printf("\nError opening bases.txt file - terminating\n");
      exit(1);
   }

   for(base = 1; base < 6; base++)
      writeIt(pfile, base);

   fclose(pfile);
   pfile = fopen("bases.txt", "rt");

   for(base = 1; base < 4; base++)
      loadIt(pfile, base, bases);

   printf("\n\n");
   for(i = 0; i < 4; i++)  {
      printf("\nBase %d: %s\n", i, bases[i]);
      gar++;
   }
   for(i = 0; i < 15; i++)   //free the 2nd dimension
      free(bases[i]);         
   free(bases);              //then the 1st dimension
   fclose(pfile);
   if(perm_mill)
      printf("\n\nPermutations found: Millions: %lu %lu", perm_mill, perm_num);
   else
      printf("\n\nPermutations found: %lu ", perm_num);

   printf("\n\n\t\t\t     press enter when ready ");
   i = getchar(); i++;
   return 0;
}

void loadIt(FILE *pfile, int base, char **bases) {
   unsigned long total = base * pow(4,base);
   unsigned int i; 

   for(i = 0; i < total; i+=base)  {
      fscanf(pfile, "%s", &bases[base][i]); //odd, but it orksw :)
      //gar++;  //for debug only
   }
}

void writeIt(FILE *pfile, int base) {
   int a, b, c, d, e, gar; 

   printf("\n"); 

   for(a = 0; a < 4; a++)  {
      if(base == 1) {
         //printf("%c ", dna[a]);
         fprintf(pfile, "%c\n", dna[a]);
         perm_num++;
      }
      if(base > 1) {
      for(b = 0; b < 4; b++) {
         if(base == 2) {
            //printf("%c%c   ", dna[a], dna[b]);
            fprintf(pfile, "%c%c\n", dna[a], dna[b]);
            perm_num++;
         }
         if(base > 2) {
         for(c = 0; c < 4; c++)  {
            if(base == 3) {
               //printf("%c%c%c  ", dna[a], dna[b], dna[c]);
               fprintf(pfile, "%c%c%c\n", dna[a], dna[b], dna[c]);
               perm_num++;
            }
            if(base > 3) {
            for(d = 0; d < 4; d++) {
               if(base == 4) {
                  //printf("%c%c%c%c ", dna[a], dna[b], dna[c], dna[d]);  // dna[t]);
                  fprintf(pfile, "%c%c%c%c\n", dna[a], dna[b], dna[c], dna[d]);
                  perm_num++;
               }
               if(base > 4) {
               for(e = 0; e < 4; e++) {
                  if(base == 5) {
                     //printf("%c%c%c%c%c   ", dna[a], dna[b], dna[c], dna[d], dna[e]);  
                     fprintf(pfile, "%c%c%c%c%c\n", dna[a],dna[b],dna[c],dna[d],dna[e]);
                     perm_num++;
                  }
//               if(perm_num % 220 == 0) { gar = getchar(); gar++; }
               }//end of for
               }//end of if
//sprinkle this around where the permutations become > 1,000,000
               if(perm_num > 1000000) {
                  perm_num = 0;
                  perm_mill++;
               }
            }
            }

         }
         }
      }
      }
   }
}

This runs ok - the variable "total" in loadIt(), will have to be reduced by nesting another for loop in loadIt() where it's used. Otherwise, it may overflow.

Does it run ok for you?

**KCfromNC** · 04-07-2009

Originally Posted by Teiji

Adak, oh it's a pretty long assignment. I have to read an input file containing thousands of DNA strands and count the value of each case (how many times it occur for base 1, base 2, base 3, etc.), get the min and max value of the case, and write them (the value as well as the case) to an output file.

The reason why I ask for help on generating the sequence is because I would need to compare all the sequence produced to the strands of DNA in the input file.

Maybe I'm missing something or there's more detail in the assignment, but I'm not seeing a need to generate every possible combination of all the bases from 1-15 here. What you want is to go though each input file character by character. For each character, grab the N characters needed to make up the base you're looking at and increment a counter for that particular case. Unless you're making a list of valid and invalid sequences to filter against while you're collecting stats, I don't see any reason to know beforehand what you might see in the input file. The input file itself will tell you that (aside from some basic validation that there's nothing aside from ACGT in the data). Hopefully that makes sense.

You could do most of this with 2 functions similar to the one posted by Snafuist. You'd need something to convert a string of a certain length into a number, and another function to convert that number back into the string. Each base listed can be encoded using 2 bits, so you could store each case up to base 16 in a 32-bit integer. You'd have a hash table for each case which stored the counts for everything you've seen so far.

Code:

read contents of input file into file_buf

const char *cp;
int case;

for (case = 1; case <= 15; case++)
{
    for (cp = file_buf; *(cp + case - 1); cp ++)
    {
        unsigned int encoded_sequence = dna_string_to_number(cp, case);
        hash_increment(hash_map[case], encoded_sequence);
    }
}

dna_string_to_number() is the reverse of Snafuist's code. It takes a string and a case number and returns an integer, 2 bits per input character.

hash_increment will increment the count for the key encoded_sequence in the hash table hash_map. Basically, you'll be doing hash_map[case][encoded_sequence] += 1. The complication here is that the range of encoded_sequence is 4^max_base = lots of entries for the larger cases. Unless you've got a huge file with lots of different sequences, most of these entries will be 0, so it's more efficient to store them as a list of <key, count> pairs instead of just using a huge array. This could be an array holding a bunch of <key, count> pairs that gets dynamically resized as needed, or it could be a linked list or tree.

When you're done, hash_map[case] holds the count of each sequence encountered for that particular sequence length. Get the keys from the hash for a list of sequences seen. For each key, use a dna_number_to_string() function to print out the base, and the count for that key is the number of times you've seen it. You can keep track of max and min here as well and print that out once you've finished the loop, or you can do several passes over the keys of the hash - whatever works best for you.

Depending on file size you might also need to read the file in chunks instead of all at once. That will complicate things a bit but not too much.

Come to think of it, this might be the perfect application for a Trie. You'd be using one single Trie for all of the cases, and the depth of a node in the Trie would tell you the length of the sequence you were looking at. This would eliminate the need to have anything other than a base 1 covert to/from number function.

**Adak** · 04-07-2009

KC, you seem to have a real insight into what Teiji is trying to do. I hope you'll stick around and join into this thread, frequently.

I don't follow how you can encode a base with hundreds or thousands of letters, into just two bits. Can you post a small example of that with say, 16 letters?