
finding mismatches
hi gang, having problems with a program i am trying to write..
im going to try my best to explain this. I am terrible at explaining when i am nervous. ops:
In some programs i have written, i have been looking at repeats. Which has been pretty straight forward... But i have decided now to try and looking for mismatches in sequences.
A mismatch for example:
Say you have a sequence:
K=7
S1 = ATGCTATG
S2 = ATGCGAGC
As you can see, from the two sequences there are some characters that match,but some that dont..... ie.. T  G, GC .... this is called a mismatch....
I myself is still trying to get my head around this different way of thinking...
Here is an example of what a program of mismatches might output:
If you had a sequence: "AGTGTAGTAG"
Code:
AG  3 repeats found 1 mismatch //because AG and TG ...G and G in same position
GT  3 repeats found 0 mismatch
AGT  2 repeats found 1 mismatch //coz AGT and TGT GT are similar...
TA  2 repeats found 1 mismatch //coz GA are similar
AGTA  1 repeats found b] 2 mismatch [/b] //AGTG,TGTA , GT's are in position
I hope this makes it a little bit clearer what iam up against...
I have written a program before to find perfect repeats...basically it goes through each sequence and compares each string and outputs the number of times it has been repeated..
i can show my implementation if wanted,but im really stuck and some guidance would be really appreciated
kind regards
JoHnNY

You are looking for a way to calculate the number of repeated characters.. or squence of characters appear in a given string. This is how I would do it:
First.. I would calculate the total number of possiblities.. which is the, "total number of characters.. to the total number of characters power."
Code:
int total_possibilities = pow(strlen(input), strlen(input));
you can use this information to make dynamic arrays and use as a loop counter
The next step, would be to populate a string array with every single possible character combination. This algorithm might seem wicked hard at first.. but if you start out with like a 4 or 5 letter word.. and draw it out on paper.. you can come up with a general algortihm that will work pretty easily.
The next step, would be to create an array of counters the same size as your array of possible character combinations. You will use this array in a direct 1to1 correlation with your array of possible character combinations. (i.e. the 3rd element counter will match up with the 3rd element of your character possibility array.. so you can determine stuff like 'GT' was matched 10 times etc. )
The next step, would be to use nested for loops.. to run through the user input array.. and compare the user input.. against every posibility. If there is a match you can increment the corrosponding counter.
maybe something like this:
Code:
string test = "";
for(int i=0; i<strlen(input); i++)
{
test+=input[i]; //slowly build up a "test" string, one character at a time
for(int j=0; j<total_possibilities; j++)
{
if(test == possibilities[j])
++counters[j]; //notice the 1:1 correlation
}
}
This is probably nothing like your mismatch approach to solving this problem.. but I hope it offers a different perspective.

uh are you talking about edit distance
that means how many insertions or deletions are neccessairy to "convert" string 1 into string 2:
e.g.
ABTATBX
ACBFATB
one possibly "alignment" would be
Code:
ABTATBX
ACBFATB
thus in that case youd have 2 indels.
the goal is to find to minimum number of indels neccessairy
(indel = insertion/deletion(
you might lookup edit distance. of course you can use the brute force approach with exponential complexity (takes forever for long strings)  or use the special algorithms with polynomial complexity (takes long  but not forever)

To me the problem of comparing one string to another and looking for mismatches is a different problem from looking for similarities and calling that a mismatch.
In the former, assign one string to be the "gold standard" and just compare the char in the "pretender" string to the char in the gold standard one by one, counting those that don't match, and, if you want at what location they occur, etc. That type of a protocol should be pretty straightforward.
Looking for similarities seems to be more complicated. First, I think you will need to decide what constitutes a similarity. Once you have that written down in concise terms, then you can try to look at a single string for similarities.