Generating DNA sequence

**Teiji** · 04-05-2009

Originally Posted by Maz

I would take a different approach.

I would first calculate the amount of space the sequence requires - with base 1, its 4, otherwise 4^(base-1), right?

Then I would calculate the width of "one subsecuence". Eg with base 1, it is 1 (a or c or g or t). with base 2, it is 2 (aa or ac or ...)

And finally amount of the "subsecuences". For example with allocated space/width of sequence.

Then I would start filling up the space.

first I'd write a<empty for rest of subsec>c<empty for rest of subsec>g<empty for rest of subsec>t<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>....
then a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec> ...

just increasing the amount of repeated writes of same letter for each time - untill the whole area was covered. All of changing variables can be calculated from base. You can build all the loops in function, and then just decide how many loops you go through before breaking out depending on the base.

I don't understand the part in bold. Are you using an array? Can you give an example of how the codes would look like?

**Adak** · 04-05-2009

Maz, you should know that there is a special DNA sequence compression algorithm, made up just to store the info we're processing here.

I read about it briefly on Wikipedia when I was looking for info on generating these sequences. I would recommend that for the compression part.

But first, we need to generate it!

@lolguy: Base here, refers to DNA base #'s, not our binary, decimal or hex bases we usually work with.

So let me see if I understand our sequence, correctly:

The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?

For the "tail"group, the sequence is always to add the ACGT, and that sequence never varies, right?

No, Teiji, we don't need to add another for loop for every base number. We're just flushing out the algorithm here.

See I took physiology before Watson & Crick, and before microscopes were invented, so genetics was pretty much pea plants and Mendelson, iirc.

**ಠ_ಠ** · 04-05-2009

Originally Posted by Adak

The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?

I believe base 5 begins with AAAAA

**Adak** · 04-05-2009

Sounds inefficient for our DNA, but good for us.

AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?

This is getting easier by the minute.

Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.

**Teiji** · 04-05-2009

Originally Posted by Adak

Sounds inefficient for our DNA, but good for us.

AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?

This is getting easier by the minute.

Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.

No, you got it wrong.

Simple concept but hard to explain. Only 1 letter change at a time starting from the right. Once that letter complete its change (got all the way to T), then it "reset" and the letter left of it change.

I'll bold the letter that change and reset

Base 1: A, C, G, T (total of 4 occurrences)
Base 2: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT (total of 4^2 or 16 occurrences)
.....
Base 5: AAAAA, AAAAC, AAAAG, AAAAT, AAACA, AAACC, AAACG, AAACT, AAAGA, AAAGC, AAAGG, AAAGT, AAATA, AAATC, AAATG, AAATT, AACAA, AACAC, AACAG, AACAT, etc until the last one TTTTT (total of 4^5 or 1024 occurrences)

And I need to store them.

**ಠ_ಠ** · 04-05-2009

maybe use base 4 ( 0, 1, 2, 3 = A, C, G, T) math and add one to it to get the next value

then it would just be a problem of limiting the number of bits you are using (1 base4-bit for base 1, 2 for base 2 so that 0 gives you the appropriate number of 'A's)
which would also control when to stop (when it rolls over)

**Adak** · 04-05-2009

Originally Posted by Teiji

No, you got it wrong.

I got it right, now.

This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.

I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.

I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.

What do you intend doing with all this data? Seems a bit much for a class assignment.

Code:

/* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to
5. Adak, April, 2009
*/

#include <stdio.h>

void showIt(int base);

int main(void)  {
   int i, base;

   base = 5;
   showIt(base);
   
   printf("\n\n\t\t\t      press enter when ready ");
   i = getchar(); i++;
   return 0;
}
void showIt(int base) {
   int a,b, c, d, e, i; 
   unsigned long perm_num = 0;
   unsigned long perm_mill = 0;
   char dna[] = {"ACGT"};
   
   printf("\n"); 

   if(base) {
   for(a = 0; a < 4; a++)  {
      if(base == 1) {
         printf("%c ", dna[a]);
         perm_num++;
      }
      if(base > 1) {
      for(b = 0; b < 4; b++) {
         if(base == 2) {
            printf("%c%c  ", dna[a], dna[b]);
            perm_num++;
         }
         if(base > 2) {
         for(c = 0; c < 4; c++)  {
            if(base == 3) {
               printf("%c%c%c  ", dna[a], dna[b], dna[c]);
               perm_num++;
            }
            if(base > 3) {
            for(d = 0; d < 4; d++) {
               if(base == 4) {
                  printf("%c%c%c%c ", dna[a], dna[b], dna[c], dna[d]);  // dna[t]);
                  perm_num++;
               }
               if(base > 4) {
               for(e = 0; e < 4; e++) {
                  if(base == 5) {
                     printf("%c%c%c%c%c   ", dna[a], dna[b], dna[c], dna[d], dna[e]);  
                     perm_num++;
                  }
               if(perm_num % 100 == 0) i = getchar();
               }
               }
//sprinkle this around where the permutations become > 1,000,000
               if(perm_num > 1000000) {
                  perm_num = 0;
                  perm_mill++;
               }
            }
            }

         }
         }
      }
      }
   }
   }
   if(perm_mill)
      printf("Permutations found: Millions: %lu %lu", perm_mill, perm_num);
   else
      printf("\n Permutations found: %lu ", perm_num);
}

After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.

**ಠ_ಠ** · 04-05-2009

Originally Posted by Adak

I got it right, now.

This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.

I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.

I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.

What do you intend doing with all this data? Seems a bit much for a class assignment.

*CODE*

After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.

do you think you could take a crack at my solution (if i made any sense)? I have no idea how to actually code it and I think it would be a useful thing to know

***EDIT***
never mind, I think I got it
**********

**Adak** · 04-05-2009

You're on the right track, imo, because you need 4 (or 4-1), since we have 4 letters in the set to be permuted. Your masking of bits would correspond via bit testing, to my if(base...) statements, in my program, if I get your drift right.

Which is nice, but not much different, really. We'd be checking the bits, or values, instead of checking the array of base letters, through the index variables. I don't see how that gets us out of the nested 64 (eventually), for loops, however.

What would be elegant, would be to use either of our approaches, but make it work recursively. That would be brilliant, but I wouldn't try it under any kind of time pressure.

Teiji faces a lot of work still. Not only does the program have to be expanded to handle all 64 bases, but then there's the problem of output to a file(s), and maybe writing up that DNA compression and de-compression program.

If there is an elegant way to handle this, I'd sure like to see it.

EDIT: ***Could you post it?***

**ಠ_ಠ** · 04-06-2009

just finished, does up to base 3
i'm sure it can be done better

Code:

#include <stdio.h>

typedef struct {
	unsigned int DNA : 2;
}mytype;

int main(int argc, char* argv[])
{
  mytype one;
  mytype two;
  mytype three;
  mytype *foo[3];
  char bar[4] = "ACGT";
  foo[0]=&one;
  foo[1]=&two;
  foo[2]=&three;
  one.DNA = two.DNA = three.DNA = 0;



  do{

    do {
      do {
        do {
        
          printf("%c%c%c\n", bar[foo[0]->DNA], bar[foo[1]->DNA], bar[foo[2]->DNA]);
          foo[2]->DNA++;
        
        }while(foo[2]->DNA);
        foo[1]->DNA++;
       
      }while(foo[1]->DNA);
      foo[0]->DNA++;
        
    }while(foo[0]->DNA);

  }while(foo[0]->DNA && foo[1]->DNA && foo[2]->DNA);

  return 0;
}

**Adak** · 04-06-2009

That looks very clean, d_d! Far more concise than mine, also.

**ಠ_ಠ** · 04-06-2009

Originally Posted by Adak

That looks very clean, d_d! Far more concise than mine, also.

ಠ_ಠ

now to figure out how do do it recursively...

**Adak** · 04-06-2009

Originally Posted by ಠ_ಠ

ಠ_ಠ

now to figure out how do do it recursively...

Sandbagger!

I see how my program could be substantially cleaned up and shortened, as well.

I believe I'll leave that as "an exercise for the student", though. I want to dig into that maze puzzle posted up on the forum.

I will be checking back on your next version of the program of course, and I want to hear what Teiji is going to be doing with all this data!

**ಠ_ಠ** · 04-06-2009

here's an almost working recursive solution, too sleepy to finish today

Code:

#include <stdio.h>

typedef struct {
	unsigned int DNA : 2;
}mytype;

void printDNA(mytype *foo[], int base, int flag);

int main(int argc, char* argv[])
{
  mytype one;
  mytype two;
  mytype three;
  mytype *foo[3];
  foo[0]=&one;
  foo[1]=&two;
  foo[2]=&three;
  one.DNA = two.DNA = three.DNA = 0;

  printDNA(foo, 3, 0);
  system("pause");
  return 0;

}

void printDNA(mytype *foo[], int base, int flag)
{
   int temp1, temp2;
   char bar[4] = "ACGT";

   for (temp1 = 0; temp1 < base; temp1++) {
	  printf("%c",bar[foo[temp1]->DNA]);
   }
   printf("\n");

   for (temp2 = base - 1; temp2 > -1; temp2--) {
	 if (foo[temp2]->DNA == 3 && temp2 != 0) {
	   foo[temp2-1]->DNA++;
	 }else if (foo[temp2]->DNA == 3 && temp2 == 0 && flag == 0) {
	   flag = 1;
	 }else if (foo[temp2]->DNA == 3 && foo[temp2+1]->DNA == 3 && temp2 == 0 && flag == 1) {
	   flag = 2;
	 }
   }
   foo[base-1]->DNA++;
   if (flag != 2) {
	 printDNA(foo, base, flag);
   }
}

**Snafuist** · 04-06-2009

As far as I understand, you simply want to print all permutations of fixed-length strings consisting of the letters a, c, g and t.

Here's a solution:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

char *dnaconv(unsigned long num, char *str, size_t size)
{
        char *p;

        if(size < 2)
                return NULL;

        memset(str, 'a', size - 1);
        p = &str[size - 1];
        *p = '\0';

        do {
                *--p = "acgt"[num % 4];
                num /= 4;
        } while(num != 0);

        return str;
}

int main(int argc, char *argv[])
{
        int base, i, num;
        char *s;

        if(argc != 2) {
                fprintf(stderr, "USAGE: %s <base>\n", argv[0]);
                return EXIT_FAILURE;
        }

        base = atoi(argv[argc-1]);

        if((s = malloc(base + 1)) == NULL) {
                fprintf(stderr, "%s: out of memory\n", argv[0]);
                return EXIT_FAILURE;
        }

        num = pow(4, base);

        for(i = 0; i < num; i++) {
                printf("%s\n", dnaconv(i, s, base + 1));
        }

        free(s);
        return 0;
}

Compile with -lm, call with the desired base as the only argument.

dnaconv() is the interesting function. It converts a number to numeral base 4 and uses the letters a, c, g, t instead of the traditional digits. The memset() is merely for padding. main() then simply counts from 0 to 4^base, converts these numbers to numeral base 4 and prints them.

Why did nobody listen to ಠ_ಠ's advice?

Greets,
Philip