Thread: Generating DNA sequence

  1. #46
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Adak
    I don't follow how you can encode a base with hundreds or thousands of letters, into just two bits.
    KCfromNC stated that each base, not each sequence, can be encoded with two bits, and of course that is true since there are four possible bases.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  2. #47
    Registered User
    Join Date
    Mar 2009
    Posts
    344
    Quote Originally Posted by Adak View Post
    KC, you seem to have a real insight into what Teiji is trying to do. I hope you'll stick around and join into this thread, frequently.

    I don't follow how you can encode a base with hundreds or thousands of letters, into just two bits. Can you post a small example of that with say, 16 letters?
    I agree that you can't encode a base with that many letters into 2 bits. I was saying that it would take 2 bits per letter (4 possible options per letter == 2 bits), and it looked like the longest sequence he was looking for was 15 letters = 30 bits max. I might have misunderstood that goal from the original post, but that's the question I was answering so hopefully it at least makes sense in that context.

    Edit - should have read the next page before responding. Oh well. I'll atone for that by pointing out that if you use a trie, you're only worrying about the encoding of one letter at any particular time regardless of the total sequence length you're processing. Each level of the trie encodes just one letter so you only need 4 possible children for each node. You can add as much depth as you need to get the required sequence length, limited by memory of course.
    Last edited by KCfromNC; 04-07-2009 at 11:55 AM.

  3. #48
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Ok, I see what you meant. You were referring to base letters, and I was thinking base strings. I thought you had lost your mind, for a second there.

    Aw! So encode the letters into bits in 32 bit numbers, then do the work with them, with bit operators and masks.

    I'm not sure what the plan is for processing the strings from the input file.

  4. #49
    Registered User
    Join Date
    Mar 2009
    Posts
    12
    I'd like to thank everyone for all of the help. But the length of this assignment is growing and growing each day (and I have several other assignments to do) and the codes get more complex as it progresses, which is beyond my knowledge of barely a semester learning C. Therefore, I'll stop here and polish up what I currently have.

    You guys can keep on discussing this "project" if you like.

  5. #50
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    I probably don't understand the problem yet... but from the hints it sounds like the problem is to simply count the number of duplicate strings; strings being anywhere from 1 to 15 in length.

    It would be insane to generate all those permutations ahead of time... as if that wasn't bad enough, comparing thousands of input strings to this database would be a nightmare.

    "Base 15", as Teiji calls it for strings of 15 character length would be 1,073,741,824 permutations. It would take 16 gigabytes of memory just to store such an array. Now imagine trying to search that thing 1000s of times.

    It doesn't matter that they only consist letters among the following: 'A', 'C', 'G', and 'T' although that restriction should be checked for for each input string.

    Simply allocate some hash, or sorted list array of encountered strings and count them if there is a clash.
    Last edited by nonoob; 04-08-2009 at 08:44 PM.

  6. #51
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    ** Welcome to the "I don't really understand this whole thing", club! **

    As this assignment moves forward, Teiji may very well discover that she doesn't need these long permutations, after all.

    Nonetheless, Teiji asked for help with this, and we helped, and that's all we could do atm.

  7. #52
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    This will print out every sequence for all bases between 1 and 15
    I pretty much scraped my earlier code, didn't even use recursion
    Code:
    #include <stdio.h>
    #include <math.h>
    
    void printDNA(int base);
    
    int main(int argc, char* argv[])
    {
      int base;
      for (base = 1; base < 16; base ++) {
    	printDNA(base);
      }
      system("pause");
      return 0;
    
    }
    
    void printDNA(int base)
    {
      int pairs, i, tempPair;
      int digits[16];
      char DNA[4] = "ACGT";
      for (pairs = 0; pairs < pow(4, base); pairs++) {
    	tempPair=pairs;
    	for (i = base; i > -1; i--) {
    	  digits[i]=tempPair%4;
    	  tempPair /= 4;
    	}
    	for (i = 1; i <= base; i++) {
    	  printf("%c", DNA[digits[i]]);
    	}
    	printf("\n");
      }
    }
    Last edited by ಠ_ಠ; 04-08-2009 at 08:40 PM.
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  8. #53
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Quote Originally Posted by ಠ_ಠ View Post
    This will print out every sequence for all bases between 1 and 15
    I pretty much scraped my earlier code, didn't even use recursion
    Excellent!! Especially for a 1st year Software Engineering student. Recognizing it is a base-4 issue... encoding the "DNA sequence" as a multi digit number. 2-loops nested. Done.

    If the inputted strings are to be searched against those arrays it can be accomplished using binary searches in each respective string-length category. A counter of occurrence associated with each entry of course.

    One small bug:
    line
    Code:
    for (i = base; i > -1; i--) {
    should be
    Code:
    for (i = base; i > 0; i--) {
    Otherwise it will always assign a zero into [0] of digits - which does no harm but is useless. The [0]th element is not read in the subsequent loop. Also I would haul the pow(4, base) outside of the for loop so that it doesn't get re-evaluated zillions of times.
    Last edited by nonoob; 04-08-2009 at 09:06 PM.

  9. #54
    Banned ಠ_ಠ's Avatar
    Join Date
    Mar 2009
    Posts
    687
    Quote Originally Posted by nonoob View Post
    Excellent!! Especially for a 1st year Software Engineering student. Recognizing it is a base-4 issue... encoding the "DNA sequence" as a multi digit number. 2-loops nested. Done.

    If the inputted strings are to be searched against those arrays it can be accomplished using binary searches in each respective string-length category. A counter of occurrence associated with each entry of course.
    thanks, I had to ask my professor how to get C to use base 4 and he told me about the modulo operator and to divide by the base
    ╔╗╔══╦╗
    ║║║╔╗║║
    ║╚╣╚╝║╚╗
    ╚═╩══╩═╝

  10. #55
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Since the base is a nice power of 2, you could change

    Code:
    tempPair /= 4
    to
    Code:
    tempPair >>= 2
    Code:
    pow(4, base)
    can be simplified (maybe sped up is more accurate) to be
    Code:
    1 << (base << 1)
    Last edited by nonoob; 04-08-2009 at 09:15 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Generating a sequence of numbers in a random order
    By mirbogat in forum C Programming
    Replies: 15
    Last Post: 08-12-2008, 02:01 PM
  2. Doxygen failing
    By Elysia in forum A Brief History of Cprogramming.com
    Replies: 19
    Last Post: 04-16-2008, 01:24 PM
  3. Immediate programming help! Please!
    By xMEGANx in forum C++ Programming
    Replies: 6
    Last Post: 02-20-2008, 12:52 PM
  4. sequence
    By braddy in forum C Programming
    Replies: 2
    Last Post: 03-30-2006, 02:15 PM
  5. wsprintf and format specifiers
    By incognito in forum Windows Programming
    Replies: 2
    Last Post: 01-03-2004, 10:00 PM