# Algorithm needed

Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last
• 08-19-2002
sean
Algorithm needed
I'm making a quick program to help me with another project. I need a program that takes two files, which will both be the DNA sequences of two separate viruses (I'm trying to isolate certain genes in viruses - I got a great idea for my doctoral thesis!), and then it goes through and look for 3 or more consecutive matching codes. For example, if I had:

TGACCGATTATTG

and

GATTCCCATCGCA,

the program would signal the GAT. It would function slightly different from this, but I'd have to explain the specifics of how cells code DNA/RNA into proteins. So could someone point me in the right direction for an algorithm that would do this kind of thing, and I could just do the appropraite changes. I've searched aihorizon.com and none of them seemed to help. I could probably write one, but it owuld be incredibly slow, so some of the ideas like theirs for search algorithms might help. Thanks for any advice you might be able to give, I appreciate it.
• 08-19-2002
quzah
Nested loops.
Code:

```for( x = 0; x < strlen( s1 ) - 3; x++ ) {     for( y = 0; y < strlen( s2 ) - 3; y++         if( s1[x] == s2[y] && s1[x+1] == s2[y+1] && s1[x+2] == s2[y+2] )         {             match = 1;             break;         } }```
Easy stuff.

Quzah.
• 08-19-2002
SilentStrike
Wouldn't you want to take the strlen calls out of the loops?
• 08-20-2002
*ClownPimp*
I hope this is helpful, and that you can understand whats happening.

This is much faster (O(n)) than a nested loop for long sequences (i dont know how long a virus's sequence typically is)

Code:

```#include <iostream.h> // just a fast way to get from 'A' -> 0, 'B' -> 1, etc. // instead of reserving 256 bytes you can reserve 'T'+1 (whatever // value that is) since the highest input would be 'T' char toint_tbl[256] = {0};                    // im assuming there are only four chemicals/whatever in DNA, // i cant remeber if thats true const char toascii[4] = {'A', 'C', 'G', 'T'}; inline int toint(char c) {     return toint_tbl[c]; } void matseq(char *seq1, char *seq2, char matching[][4][4]) {      // a table of all three letter sequences in a particular DNA sequence.     // if you want to add detection for four-letter sequences, just add     // another dimension, and *a4 etc     char seqtable[4][4][4];     register char *a0, *a1, *a2;     for (a0=seq1, a1=seq1+1, a2=seq1+2; *a2; ++a0, ++a1, ++a2)         seqtable[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01;     for (a0=seq2, a1=seq2+1, a2=seq2+2; *a2; ++a0, ++a1, ++a2)         if (seqtable[toint(*a0)][toint(*a1)][toint(*a2)])             matching[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01; } int main() {     char *seq1 = "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACGTTTTTTTTTTT";     char *seq2 = "TTTTTTTTTTTTTTTTTTTTCGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT";     char matching[4][4][4] = {0x00};     toint_tbl['A'] = 0;  // makes sure an input of 'A' gets a 0, 'C' gets a 1, etc...     toint_tbl['C'] = 1;     toint_tbl['G'] = 2;     toint_tbl['T'] = 3;         matseq(seq1, seq2, matching);     // output the results     int a, b, c;     for (a = 0; a < 4; ++a)       for (b = 0; b < 4; ++b)         for (c = 0; c < 4; ++c)           if (matching[a][b][c])             cout << toascii[a] << toascii[b] << toascii[c] << endl;         return 0; }```
• 08-20-2002
Xterria
*droooools...*
• 08-20-2002
the sting.h or cstring header file (depending on your compiler) will have a strstr() function that returns a pointer to the first occurance of a given substring in a given search string. If the substring isn't found in the search string the function returns NULL. It is standard and portable.

I'm pretty sure the string class from STL has a similar method.
• 08-20-2002
SilentStrike
Nice algorithm, definetely appropriate for this problem, though it doesn't seem to scale well with respect to memory requirements. When the number of different position characters, (base), or the length of the matching substring, (power) are large, it could take an enormous amount of memory, as it takes base^power bytes to store the table. But it definetely is O(n) in running time, and 4^3 is only 64 bytes, so it's really a good choice for this problem.

Here are the slight changes I needed to make to get it to compile warningless on gcc 3.2 with -Wall option.

Code:

```#include <iostream> using namespace std; // just a fast way to get from 'A' -> 0, 'B' -> 1, etc. // instead of reserving 256 bytes you can reserve 'T'+1 (whatever // value that is) since the highest input would be 'T' char toint_tbl[256] = {0}; // im assuming there are only four chemicals/whatever in DNA, // i cant remeber if thats true const char toascii_tbl[4] = {'A', 'C', 'G', 'T'}; inline int toint(char c) {     return toint_tbl[c]; } void matseq(char *seq1, char *seq2, char matching[][4][4]) {     // a table of all three letter sequences in a particular DNA sequence.     // if you want to add detection for four-letter sequences, just add     // another dimension, and *a4 etc     char seqtable[4][4][4];     register char *a0, *a1, *a2;     for (a0=seq1, a1=seq1+1, a2=seq1+2; *a2; ++a0, ++a1, ++a2)         seqtable[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01;     for (a0=seq2, a1=seq2+1, a2=seq2+2; *a2; ++a0, ++a1, ++a2)         if (seqtable[toint(*a0)][toint(*a1)][toint(*a2)])             matching[toint(*a0)][toint(*a1)][toint(*a2)] = 0x01; } int main() {     char *seq1 = " TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTACGTTTTTTTTTTT";     char *seq2 = " TTTTTTTTTTTTTTTTTTTTCGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT";     char matching[4][4][4] = { { {0x00} } };     toint_tbl['A'] = 0;  // makes sure an input of 'A' gets a 0, 'C' gets a 1, etc...     toint_tbl['C'] = 1;     toint_tbl['G'] = 2;     toint_tbl['T'] = 3;     matseq(seq1, seq2, matching);     // output the results     int a, b, c;     for (a = 0; a < 4; ++a)       for (b = 0; b < 4; ++b)         for (c = 0; c < 4; ++c)           if (matching[a][b][c])             cout << toascii_tbl[a] << toascii_tbl[b] << toascii_tbl[c] << endl;     return 0; }```
• 08-20-2002
quzah
There are a few things here:

register char *a0, *a1, *a2;

The register keyword is pointless. There have been huge debates on to its usefullness, but trust me on this, there is no point in adding the keyword 'register'. You cannot, with the register keyword, force your compiler to use registers.

strstr()

There is no point really in using strstr here either. You will not get any speed benifit here, which is why I didn't bother using it.

All strstr does is loop through the string, in the exact same way I did, and compare the substes. If it doesn't use a nested loop, I'd be really surprised.

As such, mine is probably more effective since it likely has less comparisons per loop than strstr.

Examine this: strstr will have to:
1) check the current character in the 'to test' string for null.
2) check the current character in the 'to test' string for a match with the current character in the 'to match' string.

Additionally, you have to test null for each character. My loop avoids this all together.

IIRC, the 'strlen' is only computed once anyway. If you're really worried about it. Make the results of the strlen be stored in a variable and use it instead.

Quzah.
• 08-20-2002
*ClownPimp*
god I hated that -Wall crap in the programming class I took last semester (we had to use it). It always flagged really annoying things that I and probably most people dont care about
• 08-20-2002
quzah
Quote:

Originally posted by *ClownPimp*
god I hated that -Wall crap in the programming class I took last semester (we had to use it). It always flagged really annoying things that I and probably most people dont care about
Actually you should care about them all. They're there for a reason. If you have warnings, there is a reason. You should by habbit try and create warning free code.

Quzah.
• 08-20-2002
*ClownPimp*
>You cannot, with the register keyword, force your compiler to >use registers.

maybe so, but at least this way i have a chance that it will use them :)
• 08-20-2002
MrWizard
Quote:

Originally posted by quzah

Actually you should care about them all. They're there for a reason. If you have warnings, there is a reason. You should by habbit try and create warning free code.

Quzah.

Yeah I always up Visual Studio's warning level to 4 instead of the usual 3. Warning free code is good code.
• 08-20-2002
*ClownPimp*
quzah: good lord, were you just repeatedly pressing refresh until someone responded :)
• 08-20-2002
*ClownPimp*
>If you have warnings, there is a reason
yeah, the reason is anal-retentive compiler designers :) (not to mention instructors)

hrmph... now i have to check the 'level 4 warnings' option
on VC
[/edit]
• 08-20-2002
quzah
Quote:

Originally posted by *ClownPimp*
>If you have warnings, there is a reason
yeah, the reason is anal-retentive compiler designers :) (not to mention instructors)

So I guess you don't want your compiler to warn on: comparisons between 'int' and a pointer...
on: not all paths in a function returning something...
on: ..

Eh, what's the point. If you can't see why warnings are helpful, that's your problem. It's not my job to enlighten you.

Quzah.
Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last