Thread: Generating DNA sequence

  1. #16
    Registered User
    Join Date
    Mar 2009
    Posts
    12
    Quote Originally Posted by Maz View Post
    I would take a different approach.

    I would first calculate the amount of space the sequence requires - with base 1, its 4, otherwise 4^(base-1), right?

    Then I would calculate the width of "one subsecuence". Eg with base 1, it is 1 (a or c or g or t). with base 2, it is 2 (aa or ac or ...)

    And finally amount of the "subsecuences". For example with allocated space/width of sequence.

    Then I would start filling up the space.

    first I'd write a<empty for rest of subsec>c<empty for rest of subsec>g<empty for rest of subsec>t<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>....
    then a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec> ...

    just increasing the amount of repeated writes of same letter for each time - untill the whole area was covered. All of changing variables can be calculated from base. You can build all the loops in function, and then just decide how many loops you go through before breaking out depending on the base.
    I don't understand the part in bold. Are you using an array? Can you give an example of how the codes would look like?

  2. #17
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Maz, you should know that there is a special DNA sequence compression algorithm, made up just to store the info we're processing here.

    I read about it briefly on Wikipedia when I was looking for info on generating these sequences. I would recommend that for the compression part.

    But first, we need to generate it!

    @lolguy: Base here, refers to DNA base #'s, not our binary, decimal or hex bases we usually work with.

    So let me see if I understand our sequence, correctly:

    The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?

    For the "tail"group, the sequence is always to add the ACGT, and that sequence never varies, right?

    No, Teiji, we don't need to add another for loop for every base number. We're just flushing out the algorithm here.

    See I took physiology before Watson & Crick, and before microscopes were invented, so genetics was pretty much pea plants and Mendelson, iirc.

  3. #18
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    Quote Originally Posted by Adak View Post
    The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?
    I believe base 5 begins with AAAAA
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  4. #19
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Sounds inefficient for our DNA, but good for us.

    AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?

    This is getting easier by the minute.

    Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.
    Last edited by Adak; 04-05-2009 at 05:52 PM.

  5. #20
    Registered User
    Join Date
    Mar 2009
    Posts
    12
    Quote Originally Posted by Adak View Post
    Sounds inefficient for our DNA, but good for us.

    AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?

    This is getting easier by the minute.

    Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.
    No, you got it wrong.

    Simple concept but hard to explain. Only 1 letter change at a time starting from the right. Once that letter complete its change (got all the way to T), then it "reset" and the letter left of it change.

    I'll bold the letter that change and reset

    Base 1: A, C, G, T (total of 4 occurrences)
    Base 2: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT (total of 4^2 or 16 occurrences)
    .....
    Base 5: AAAAA, AAAAC, AAAAG, AAAAT, AAACA, AAACC, AAACG, AAACT, AAAGA, AAAGC, AAAGG, AAAGT, AAATA, AAATC, AAATG, AAATT, AACAA, AACAC, AACAG, AACAT, etc until the last one TTTTT (total of 4^5 or 1024 occurrences)

    And I need to store them.

  6. #21
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    maybe use base 4 ( 0, 1, 2, 3 = A, C, G, T) math and add one to it to get the next value

    then it would just be a problem of limiting the number of bits you are using (1 base4-bit for base 1, 2 for base 2 so that 0 gives you the appropriate number of 'A's)
    which would also control when to stop (when it rolls over)
    Last edited by ಠ_ಠ; 04-05-2009 at 10:59 PM.
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  7. #22
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by Teiji View Post
    No, you got it wrong.
    I got it right, now.

    This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.

    I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.

    I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.

    What do you intend doing with all this data? Seems a bit much for a class assignment.


    Code:
    /* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to
    5. Adak, April, 2009
    */
    
    #include <stdio.h>
    
    void showIt(int base);
    
    int main(void)  {
       int i, base;
    
       base = 5;
       showIt(base);
       
       printf("\n\n\t\t\t      press enter when ready ");
       i = getchar(); i++;
       return 0;
    }
    void showIt(int base) {
       int a,b, c, d, e, i; 
       unsigned long perm_num = 0;
       unsigned long perm_mill = 0;
       char dna[] = {"ACGT"};
       
       printf("\n"); 
    
       if(base) {
       for(a = 0; a < 4; a++)  {
          if(base == 1) {
             printf("%c ", dna[a]);
             perm_num++;
          }
          if(base > 1) {
          for(b = 0; b < 4; b++) {
             if(base == 2) {
                printf("%c%c  ", dna[a], dna[b]);
                perm_num++;
             }
             if(base > 2) {
             for(c = 0; c < 4; c++)  {
                if(base == 3) {
                   printf("%c%c%c  ", dna[a], dna[b], dna[c]);
                   perm_num++;
                }
                if(base > 3) {
                for(d = 0; d < 4; d++) {
                   if(base == 4) {
                      printf("%c%c%c%c ", dna[a], dna[b], dna[c], dna[d]);  // dna[t]);
                      perm_num++;
                   }
                   if(base > 4) {
                   for(e = 0; e < 4; e++) {
                      if(base == 5) {
                         printf("%c%c%c%c%c   ", dna[a], dna[b], dna[c], dna[d], dna[e]);  
                         perm_num++;
                      }
                   if(perm_num % 100 == 0) i = getchar();
                   }
                   }
    //sprinkle this around where the permutations become > 1,000,000
                   if(perm_num > 1000000) {
                      perm_num = 0;
                      perm_mill++;
                   }
                }
                }
    
             }
             }
          }
          }
       }
       }
       if(perm_mill)
          printf("Permutations found: Millions: %lu %lu", perm_mill, perm_num);
       else
          printf("\n Permutations found: %lu ", perm_num);
    }
    After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

    perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.
    Last edited by Adak; 04-06-2009 at 02:23 AM.

  8. #23
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    Quote Originally Posted by Adak View Post
    I got it right, now.

    This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.

    I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.

    I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.

    What do you intend doing with all this data? Seems a bit much for a class assignment.


    *CODE*

    After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

    perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.
    do you think you could take a crack at my solution (if i made any sense)? I have no idea how to actually code it and I think it would be a useful thing to know

    ***EDIT***
    never mind, I think I got it
    **********
    Last edited by ಠ_ಠ; 04-05-2009 at 11:41 PM.
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  9. #24
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    You're on the right track, imo, because you need 4 (or 4-1), since we have 4 letters in the set to be permuted. Your masking of bits would correspond via bit testing, to my if(base...) statements, in my program, if I get your drift right.

    Which is nice, but not much different, really. We'd be checking the bits, or values, instead of checking the array of base letters, through the index variables. I don't see how that gets us out of the nested 64 (eventually), for loops, however.

    What would be elegant, would be to use either of our approaches, but make it work recursively. That would be brilliant, but I wouldn't try it under any kind of time pressure.

    Teiji faces a lot of work still. Not only does the program have to be expanded to handle all 64 bases, but then there's the problem of output to a file(s), and maybe writing up that DNA compression and de-compression program.

    If there is an elegant way to handle this, I'd sure like to see it.

    EDIT: ***Could you post it?***

  10. #25
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    just finished, does up to base 3
    i'm sure it can be done better

    Code:
    #include <stdio.h>
    
    typedef struct {
    	unsigned int DNA : 2;
    }mytype;
    
    int main(int argc, char* argv[])
    {
      mytype one;
      mytype two;
      mytype three;
      mytype *foo[3];
      char bar[4] = "ACGT";
      foo[0]=&one;
      foo[1]=&two;
      foo[2]=&three;
      one.DNA = two.DNA = three.DNA = 0;
    
    
    
      do{
    
        do {
          do {
            do {
            
              printf("%c%c%c\n", bar[foo[0]->DNA], bar[foo[1]->DNA], bar[foo[2]->DNA]);
              foo[2]->DNA++;
            
            }while(foo[2]->DNA);
            foo[1]->DNA++;
           
          }while(foo[1]->DNA);
          foo[0]->DNA++;
            
        }while(foo[0]->DNA);
    
      }while(foo[0]->DNA && foo[1]->DNA && foo[2]->DNA);
    
      return 0;
    }
    Last edited by ಠ_ಠ; 04-06-2009 at 12:10 AM.
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  11. #26
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    That looks very clean, d_d! Far more concise than mine, also.

  12. #27
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    Quote Originally Posted by Adak View Post
    That looks very clean, d_d! Far more concise than mine, also.
    ಠ_ಠ

    now to figure out how do do it recursively...
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  13. #28
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by ಠ_ಠ View Post
    ಠ_ಠ

    now to figure out how do do it recursively...
    Sandbagger!

    I see how my program could be substantially cleaned up and shortened, as well.

    I believe I'll leave that as "an exercise for the student", though. I want to dig into that maze puzzle posted up on the forum.

    I will be checking back on your next version of the program of course, and I want to hear what Teiji is going to be doing with all this data!

  14. #29
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    here's an almost working recursive solution, too sleepy to finish today
    Code:
    #include <stdio.h>
    
    typedef struct {
    	unsigned int DNA : 2;
    }mytype;
    
    void printDNA(mytype *foo[], int base, int flag);
    
    int main(int argc, char* argv[])
    {
      mytype one;
      mytype two;
      mytype three;
      mytype *foo[3];
      foo[0]=&one;
      foo[1]=&two;
      foo[2]=&three;
      one.DNA = two.DNA = three.DNA = 0;
    
      printDNA(foo, 3, 0);
      system("pause");
      return 0;
    
    }
    
    void printDNA(mytype *foo[], int base, int flag)
    {
       int temp1, temp2;
       char bar[4] = "ACGT";
    
       for (temp1 = 0; temp1 < base; temp1++) {
    	  printf("%c",bar[foo[temp1]->DNA]);
       }
       printf("\n");
    
       for (temp2 = base - 1; temp2 > -1; temp2--) {
    	 if (foo[temp2]->DNA == 3 && temp2 != 0) {
    	   foo[temp2-1]->DNA++;
    	 }else if (foo[temp2]->DNA == 3 && temp2 == 0 && flag == 0) {
    	   flag = 1;
    	 }else if (foo[temp2]->DNA == 3 && foo[temp2+1]->DNA == 3 && temp2 == 0 && flag == 1) {
    	   flag = 2;
    	 }
       }
       foo[base-1]->DNA++;
       if (flag != 2) {
    	 printDNA(foo, base, flag);
       }
    }
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  15. #30
    Complete Beginner
    Join Date
    Feb 2009
    Posts
    312
    As far as I understand, you simply want to print all permutations of fixed-length strings consisting of the letters a, c, g and t.

    Here's a solution:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    
    char *dnaconv(unsigned long num, char *str, size_t size)
    {
            char *p;
    
            if(size < 2)
                    return NULL;
    
            memset(str, 'a', size - 1);
            p = &str[size - 1];
            *p = '\0';
    
            do {
                    *--p = "acgt"[num % 4];
                    num /= 4;
            } while(num != 0);
    
            return str;
    }
    
    int main(int argc, char *argv[])
    {
            int base, i, num;
            char *s;
    
            if(argc != 2) {
                    fprintf(stderr, "USAGE: %s <base>\n", argv[0]);
                    return EXIT_FAILURE;
            }
    
            base = atoi(argv[argc-1]);
    
            if((s = malloc(base + 1)) == NULL) {
                    fprintf(stderr, "%s: out of memory\n", argv[0]);
                    return EXIT_FAILURE;
            }
    
            num = pow(4, base);
    
            for(i = 0; i < num; i++) {
                    printf("%s\n", dnaconv(i, s, base + 1));
            }
    
            free(s);
            return 0;
    }
    Compile with -lm, call with the desired base as the only argument.

    dnaconv() is the interesting function. It converts a number to numeral base 4 and uses the letters a, c, g, t instead of the traditional digits. The memset() is merely for padding. main() then simply counts from 0 to 4^base, converts these numbers to numeral base 4 and prints them.

    Why did nobody listen to ಠ_ಠ's advice?

    Greets,
    Philip
    All things begin as source code.
    Source code begins with an empty file.
    -- Tao Te Chip

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Generating a sequence of numbers in a random order
    By mirbogat in forum C Programming
    Replies: 15
    Last Post: 08-12-2008, 02:01 PM
  2. Doxygen failing
    By Elysia in forum A Brief History of Cprogramming.com
    Replies: 19
    Last Post: 04-16-2008, 01:24 PM
  3. Immediate programming help! Please!
    By xMEGANx in forum C++ Programming
    Replies: 6
    Last Post: 02-20-2008, 12:52 PM
  4. sequence
    By braddy in forum C Programming
    Replies: 2
    Last Post: 03-30-2006, 02:15 PM
  5. wsprintf and format specifiers
    By incognito in forum Windows Programming
    Replies: 2
    Last Post: 01-03-2004, 10:00 PM