Thread: Algorithm needed

  1. #1
    Registered User
    Join Date
    Sep 2001
    Posts
    4,912

    Algorithm needed

    I'm making a quick program to help me with another project. I need a program that takes two files, which will both be the DNA sequences of two separate viruses (I'm trying to isolate certain genes in viruses - I got a great idea for my doctoral thesis!), and then it goes through and look for 3 or more consecutive matching codes. For example, if I had:

    TGACCGATTATTG

    and

    GATTCCCATCGCA,

    the program would signal the GAT. It would function slightly different from this, but I'd have to explain the specifics of how cells code DNA/RNA into proteins. So could someone point me in the right direction for an algorithm that would do this kind of thing, and I could just do the appropraite changes. I've searched aihorizon.com and none of them seemed to help. I could probably write one, but it owuld be incredibly slow, so some of the ideas like theirs for search algorithms might help. Thanks for any advice you might be able to give, I appreciate it.

  2. #2
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Nested loops.
    Code:
    for( x = 0; x < strlen( s1 ) - 3; x++ )
    {
        for( y = 0; y < strlen( s2 ) - 3; y++
            if( s1[x] == s2[y] && s1[x+1] == s2[y+1] && s1[x+2] == s2[y+2] )
            {
                match = 1;
                break;
            } 
    }
    Easy stuff.

    Quzah.
    Hope is the first step on the road to disappointment.

  3. #3
    geek SilentStrike's Avatar
    Join Date
    Aug 2001
    Location
    NJ
    Posts
    1,141
    Wouldn't you want to take the strlen calls out of the loops?
    Prove you can code in C++ or C# at TopCoder, referrer rrenaud
    Read my livejournal

  4. #4
    *ClownPimp*
    Guest
    I hope this is helpful, and that you can understand whats happening.

    This is much faster (O(n)) than a nested loop for long sequences (i dont know how long a virus's sequence typically is)

    Code:
    #include <iostream.h>
    
    // just a fast way to get from 'A' -> 0, 'B' -> 1, etc.
    // instead of reserving 256 bytes you can reserve 'T'+1 (whatever
    // value that is) since the highest input would be 'T'
    char toint_tbl[256] = {0};                     
    
    // im assuming there are only four chemicals/whatever in DNA, 
    // i cant remeber if thats true
    const char toascii[4] = {'A', 'C', 'G', 'T'};
    
    inline int toint(char c)
    {
        return toint_tbl[c];
    }
    
    void matseq(char *seq1, char *seq2, char matching[][4][4])
    {   
        // a table of all three letter sequences in a particular DNA sequence.
        // if you want to add detection for four-letter sequences, just add
        // another dimension, and *a4 etc
        char seqtable[4][4][4]; 
        register char *a0, *a1, *a2;
    
        for (a0=seq1, a1=seq1+1, a2=seq1+2; *a2; ++a0, ++a1, ++a2)
            seqtable[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01;
    
        for (a0=seq2, a1=seq2+1, a2=seq2+2; *a2; ++a0, ++a1, ++a2)
            if (seqtable[toint(*a0)][toint(*a1)][toint(*a2)])
                matching[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01;
    }
    
    int main()
    {
        char *seq1 = "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACGTTTTTTTTTTT";
        char *seq2 = "TTTTTTTTTTTTTTTTTTTTCGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT";
        char matching[4][4][4] = {0x00};
    
        toint_tbl['A'] = 0;  // makes sure an input of 'A' gets a 0, 'C' gets a 1, etc...
        toint_tbl['C'] = 1;
        toint_tbl['G'] = 2;
        toint_tbl['T'] = 3;
        
        matseq(seq1, seq2, matching);
    
        // output the results
        int a, b, c;
        for (a = 0; a < 4; ++a)
          for (b = 0; b < 4; ++b)
            for (c = 0; c < 4; ++c)
              if (matching[a][b][c])
                cout << toascii[a] << toascii[b] << toascii[c] << endl;
        
        return 0;
    }

  5. #5
    "The Oldest Member Here" Xterria's Avatar
    Join Date
    Sep 2001
    Location
    Buffalo, NY
    Posts
    1,039
    *droooools...*

  6. #6
    elad
    Guest
    the sting.h or cstring header file (depending on your compiler) will have a strstr() function that returns a pointer to the first occurance of a given substring in a given search string. If the substring isn't found in the search string the function returns NULL. It is standard and portable.

    I'm pretty sure the string class from STL has a similar method.

  7. #7
    geek SilentStrike's Avatar
    Join Date
    Aug 2001
    Location
    NJ
    Posts
    1,141
    Nice algorithm, definetely appropriate for this problem, though it doesn't seem to scale well with respect to memory requirements. When the number of different position characters, (base), or the length of the matching substring, (power) are large, it could take an enormous amount of memory, as it takes base^power bytes to store the table. But it definetely is O(n) in running time, and 4^3 is only 64 bytes, so it's really a good choice for this problem.

    Here are the slight changes I needed to make to get it to compile warningless on gcc 3.2 with -Wall option.

    Code:
    #include <iostream>
    
    using namespace std;
    
    // just a fast way to get from 'A' -> 0, 'B' -> 1, etc.
    // instead of reserving 256 bytes you can reserve 'T'+1 (whatever
    // value that is) since the highest input would be 'T'
    char toint_tbl[256] = {0};
    
    // im assuming there are only four chemicals/whatever in DNA,
    // i cant remeber if thats true
    const char toascii_tbl[4] = {'A', 'C', 'G', 'T'};
    
    inline int toint(char c)
    {
        return toint_tbl[c];
    }
    
    void matseq(char *seq1, char *seq2, char matching[][4][4])
    {
        // a table of all three letter sequences in a particular DNA sequence.
        // if you want to add detection for four-letter sequences, just add
        // another dimension, and *a4 etc
        char seqtable[4][4][4];
        register char *a0, *a1, *a2;
    
        for (a0=seq1, a1=seq1+1, a2=seq1+2; *a2; ++a0, ++a1, ++a2)
            seqtable[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01;
    
        for (a0=seq2, a1=seq2+1, a2=seq2+2; *a2; ++a0, ++a1, ++a2)
            if (seqtable[toint(*a0)][toint(*a1)][toint(*a2)])
                matching[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01;
    }
    
    int main()
    {
        char *seq1 = " TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACGTTTTTTTTTTT";
        char *seq2 = " TTTTTTTTTTTTTTTTTTTTCGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT";
        char matching[4][4][4] = { { {0x00} } };
    
        toint_tbl['A'] = 0;  // makes sure an input of 'A' gets a 0, 'C' gets a 1, etc...
        toint_tbl['C'] = 1;
        toint_tbl['G'] = 2;
        toint_tbl['T'] = 3;
    
        matseq(seq1, seq2, matching);
    
        // output the results
        int a, b, c;
        for (a = 0; a < 4; ++a)
          for (b = 0; b < 4; ++b)
            for (c = 0; c < 4; ++c)
              if (matching[a][b][c])
                cout << toascii_tbl[a] << toascii_tbl[b] << toascii_tbl[c] << endl;
    
        return 0;
    }
    Prove you can code in C++ or C# at TopCoder, referrer rrenaud
    Read my livejournal

  8. #8
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    There are a few things here:

    register char *a0, *a1, *a2;

    The register keyword is pointless. There have been huge debates on to its usefullness, but trust me on this, there is no point in adding the keyword 'register'. You cannot, with the register keyword, force your compiler to use registers.

    strstr()

    There is no point really in using strstr here either. You will not get any speed benifit here, which is why I didn't bother using it.

    All strstr does is loop through the string, in the exact same way I did, and compare the substes. If it doesn't use a nested loop, I'd be really surprised.

    As such, mine is probably more effective since it likely has less comparisons per loop than strstr.

    Examine this: strstr will have to:
    1) check the current character in the 'to test' string for null.
    2) check the current character in the 'to test' string for a match with the current character in the 'to match' string.

    Additionally, you have to test null for each character. My loop avoids this all together.

    IIRC, the 'strlen' is only computed once anyway. If you're really worried about it. Make the results of the strlen be stored in a variable and use it instead.

    Quzah.
    Last edited by quzah; 08-20-2002 at 05:12 PM.
    Hope is the first step on the road to disappointment.

  9. #9
    Registered User
    Join Date
    Jan 2002
    Posts
    552
    god I hated that -Wall crap in the programming class I took last semester (we had to use it). It always flagged really annoying things that I and probably most people dont care about

  10. #10
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Originally posted by *ClownPimp*
    god I hated that -Wall crap in the programming class I took last semester (we had to use it). It always flagged really annoying things that I and probably most people dont care about
    Actually you should care about them all. They're there for a reason. If you have warnings, there is a reason. You should by habbit try and create warning free code.

    Quzah.
    Hope is the first step on the road to disappointment.

  11. #11
    Registered User
    Join Date
    Jan 2002
    Posts
    552
    >You cannot, with the register keyword, force your compiler to >use registers.

    maybe so, but at least this way i have a chance that it will use them

  12. #12
    Registered User
    Join Date
    Apr 2002
    Posts
    1,571
    Originally posted by quzah


    Actually you should care about them all. They're there for a reason. If you have warnings, there is a reason. You should by habbit try and create warning free code.

    Quzah.
    Yeah I always up Visual Studio's warning level to 4 instead of the usual 3. Warning free code is good code.
    "...the results are undefined, and we all know what "undefined" means: it means it works during development, it works during testing, and it blows up in your most important customers' faces." --Scott Meyers

  13. #13
    Registered User
    Join Date
    Jan 2002
    Posts
    552
    quzah: good lord, were you just repeatedly pressing refresh until someone responded

  14. #14
    Registered User
    Join Date
    Jan 2002
    Posts
    552
    >If you have warnings, there is a reason
    yeah, the reason is anal-retentive compiler designers (not to mention instructors)

    [edit]
    hrmph... now i have to check the 'level 4 warnings' option
    on VC
    [/edit]
    Last edited by *ClownPimp*; 08-20-2002 at 05:20 PM.

  15. #15
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Originally posted by *ClownPimp*
    >If you have warnings, there is a reason
    yeah, the reason is anal-retentive compiler designers (not to mention instructors)
    So I guess you don't want your compiler to warn on: comparisons between 'int' and a pointer...
    on: not all paths in a function returning something...
    on: ..

    Eh, what's the point. If you can't see why warnings are helpful, that's your problem. It's not my job to enlighten you.

    Quzah.
    Hope is the first step on the road to disappointment.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. algorithm hint needed. please.
    By manav in forum C Programming
    Replies: 27
    Last Post: 02-02-2008, 05:31 AM
  2. Algorithm needed!
    By Warlax in forum C++ Programming
    Replies: 6
    Last Post: 05-11-2006, 12:42 AM
  3. Binary Search Trees Part III
    By Prelude in forum A Brief History of Cprogramming.com
    Replies: 16
    Last Post: 10-02-2004, 03:00 PM
  4. Request for comments
    By Prelude in forum A Brief History of Cprogramming.com
    Replies: 15
    Last Post: 01-02-2004, 10:33 AM