Thread: Generating DNA sequence

  1. #1
    Registered User
    Join Date
    Mar 2009
    Posts
    12

    Generating DNA sequence

    I'm writing a program that generates the DNA sequence given a base number.

    For example, if I give it base 1, the sequence would be: a, c, g, t. If I give it base 2, the sequence would be: aa, ac, ag, at, ca, cc, cg, ct, ga, gc, gg, gt, ta, tc, tg, tt. Etc for other bases....

    I'm thinking the prototype could be something like: void generate (int base);

    Can anyone give me some ideas on how to approach this problem?

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I would be tempted to use the base number as the subscript number for the sequence, in an array of strings.
    Code:
    const int dnaNum = DNA number; 
    const int maxdnaLength = N;  
    
    char dna[dnaNum][maxdnaLength] = {
    { '\0' }, //base number 0 doesn't exist, so this row wouldn't be used
    { "acgt" },  //base number 1 is in subscript array row 1
    { "aaacagatcacccgctgagcgggttatctgtt" }, //base number 2 in array row 2
    //etc.
    };
    Last edited by Adak; 04-04-2009 at 10:44 PM.

  3. #3
    Registered User
    Join Date
    Mar 2009
    Posts
    12
    Quote Originally Posted by Adak View Post
    I would be tempted to use the base number as the subscript number for the sequence, in an array of strings.
    Code:
    const int dnaNum = DNA number; 
    const int maxdnaLength = N;  
    
    char dna[dnaNum][maxdnaLength] = {
    { '\0' }, //base number 0 doesn't exist, so this row wouldn't be used
    { "acgt" },  //base number 1 is in subscript array row 1
    { "aaacagatcacccgctgagcgggttatctgtt" }, //base number 2 in array row 2
    //etc.
    };
    Well, I want the program to use an algorithm to produce the sequence, instead of me manually input all the sequence. Imagine base 15, that would produce ~1 billion different sequences. That's impossible to manually write them.

  4. #4
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    sounds like recursion to me
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I Googled for it, and checked Wikipedia, but was unable to find an algorithm to use. Do you have a pattern that we can follow to generate these long sequences?

  6. #6
    Registered User
    Join Date
    Mar 2009
    Posts
    12
    Well, the pattern is...

    For base 2:
    First letter is A, then match with second letter ACGT (you'd get, AA, AC, AG, AT). Then first letter change to C, and match with second letter ACGT (you'd get CA, CC, CG, CT). Then first letter change to G, and match with second letter ACGT (you'd get GA, GC, GG, GT). Finally first letter change to T, and match with second letter ACGT (you'd get TA, TC, TG, TT).

    For base 3:
    First letter is A, second letter is A, match with third letter ACGT (you'd get AAA, AAC, AAG, AAT). First letter stay A, second letter change to C, match with third letter ACGT (you'd get ACA, ACC, ACG, ACT). First letter would change to C only after second letter finish with T. And so on....for a total of 64 different occurrences.

    And I'd like to generate for up to base 15.

  7. #7
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    at first i thought of string manipulation, but now i think you could use nested loops
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Yes, nested loops are the ticket.

    the outer loop would be the char you start with:

    Code:
    for(letr = 'A'; letr < 'U'; letr++)  {
       //skip non base letters
       if(letr == 'H')
          letr = 'T';
    
       if(letr != 'A' && letr != 'C' && letr != 'G && letr != 'T") 
          continue;
    
       for(letr2 = 'A'; letr2 < 'U'; letr2++)  {
          //skip non base letters
          if(letr == 'H')
                letr = 'T';
          if(letr != 'A' && letr != 'C' && letr != 'G && letr != 'T") 
             continue;
    
          printf("%c%c  ", letr, letr2);
       }
    }
    This is just rough pseudo-code, not ready to run code, and may be too naive to handle this problem.

    I have used this type of nested loop to handle 81 squares on Sudoku, and generate Trillions and Trillions (and much more than that), grids, so it is both robust and fast, but every problem is different.

  9. #9
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    Quote Originally Posted by Adak View Post
    Yes, nested loops are the ticket.

    the outer loop would be the char you start with:

    Code:
    for(letr = 'A'; letr < 'U'; letr++)  {
       //skip non base letters
       if(letr == 'H')
          letr = 'T';
    
       if(letr != 'A' && letr != 'C' && letr != 'G && letr != 'T") 
          continue;
    
       for(letr2 = 'A'; letr2 < 'U'; letr2++)  {
          //skip non base letters
          if(letr == 'H')
                letr = 'T';
          if(letr != 'A' && letr != 'C' && letr != 'G && letr != 'T") 
             continue;
    
          printf("%c%c  ", letr, letr2);
       }
    }
    This is just rough pseudo-code, not ready to run code, and may be too naive to handle this problem.

    I have used this type of nested loop to handle 81 squares on Sudoku, and generate Trillions and Trillions (and much more than that), grids, so it is both robust and fast, but every problem is different.
    i would have made letr a pointer to an array holding 'A' 'C' 'G' and 'T' for simplicity
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    yes, that's a very sensible idea.

    Carry on d_d!

  11. #11
    Registered User
    Join Date
    Mar 2009
    Posts
    12
    Ok here's what I got so far.

    This is only for base 2.

    Code:
    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>
    
    // test generate base 2
    int main() {
       
       char base[16][3];
       char temp[3];
       char choice[] = "acgt";
       
       int i, j, k = 0;
    	
       for (i = 0; i < 4; i++) {
          for (j = 0; j < 4; j++) {
             sprintf(temp, "%c%c\0\n", choice[i], choice[j]);    //save the string to temp
             strcpy(base[k],temp);            // save each occurrence
             k++;
          }
       }
       
       for (i = 0; i < 16; i++)
          printf("%s \n", base[i]);
          
       return 0;
    }
    And the output is:
    Code:
    aa 
    ac 
    ag 
    at 
    ca 
    cc 
    cg 
    ct 
    ga 
    gc 
    gg 
    gt 
    ta 
    tc 
    tg 
    tt
    Note: the reason why I saved the occurrences permanently in an array (base) instead of printing it right away (from temp) is because I'm planning on doing something to each of the occurrence later.

    The problem is, for each base number, the char base[16][3] and char temp[3] changes in number (ie, for base 3, it would have to be char base[64][4] and char temp[4]. And, the more bases, the more for loops (ie. base 2 has 2 for loops, base 3 would have 3 for loops, etc.).

    So,
    1) How should I make it a separate method instead of residing inside main()? What should the prototype method be?
    2) How should I make the method works for any base number (without repeatedly typing the same codes 15 times and only adding an additional for loop for each base)?

    P.S. I plan on avoiding continue, skip, break, etc. They're just not my style.
    Last edited by Teiji; 04-05-2009 at 03:20 PM.

  12. #12
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    put the for loops inside an if statement?


    Quote Originally Posted by Adak View Post

    Carry on d_d!
    ಠ_ಠ
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  13. #13
    Registered User Maz's Avatar
    Join Date
    Nov 2005
    Location
    Finland
    Posts
    194
    I would take a different approach.

    I would first calculate the amount of space the sequence requires - with base 1, its 4, otherwise 4^(base-1), right?

    Then I would calculate the width of "one subsecuence". Eg with base 1, it is 1 (a or c or g or t). with base 2, it is 2 (aa or ac or ...)

    And finally amount of the "subsecuences". For example with allocated space/width of sequence.

    Then I would start filling up the space.

    first I'd write a<empty for rest of subsec>c<empty for rest of subsec>g<empty for rest of subsec>t<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>....
    then a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec> ...

    just increasing the amount of repeated writes of same letter for each time - untill the whole area was covered. All of changing variables can be calculated from base. You can build all the loops in function, and then just decide how many loops you go through before breaking out depending on the base.

  14. #14
    Registered User Maz's Avatar
    Join Date
    Nov 2005
    Location
    Finland
    Posts
    194
    And later you could optimize required memory, by using bit shifts and masks to use only 2 bits to mark one character. (Naturally encoding & decoding will slow down the process, but storing handled data will be more efficient.)

  15. #15
    Registered User
    Join Date
    Dec 2008
    Posts
    183
    you can do it like this 2
    Code:
    void base_genrator(int num)
    {
         switch(num){
         case 2:
              binary_base
              codegoeshere
              break;
        case 7:
             hex_base
             codegoeshere;
             break;
        case 9:
             decimal_base;
             codegoeshere;
             break;}
    you specify the base in your main program and the write the code in the switch .

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Generating a sequence of numbers in a random order
    By mirbogat in forum C Programming
    Replies: 15
    Last Post: 08-12-2008, 02:01 PM
  2. Doxygen failing
    By Elysia in forum A Brief History of Cprogramming.com
    Replies: 19
    Last Post: 04-16-2008, 01:24 PM
  3. Immediate programming help! Please!
    By xMEGANx in forum C++ Programming
    Replies: 6
    Last Post: 02-20-2008, 12:52 PM
  4. sequence
    By braddy in forum C Programming
    Replies: 2
    Last Post: 03-30-2006, 02:15 PM
  5. wsprintf and format specifiers
    By incognito in forum Windows Programming
    Replies: 2
    Last Post: 01-03-2004, 10:00 PM