Thread: finding mismatches

  1. #1
    Registered User
    Join Date
    Jan 2005
    Posts
    18

    finding mismatches

    hi gang, having problems with a program i am trying to write..

    im going to try my best to explain this. I am terrible at explaining when i am nervous. ops:

    In some programs i have written, i have been looking at repeats. Which has been pretty straight forward... But i have decided now to try and looking for mismatches in sequences.

    A mismatch for example:

    Say you have a sequence:

    K=7

    S1 = ATGCTATG
    S2 = ATGCGAGC

    As you can see, from the two sequences there are some characters that match,but some that dont..... ie.. T - G, G-C .... this is called a mismatch....

    I myself is still trying to get my head around this different way of thinking...


    Here is an example of what a program of mismatches might output:
    If you had a sequence: "AGTGTAGTAG"


    Code:
    AG - 3 repeats found   1 mismatch    //because AG and TG ...G and G in same position
    
    GT - 3 repeats found     0 mismatch
    
    AGT - 2 repeats found  1 mismatch    //coz AGT and TGT  GT are similar...
    
    TA - 2 repeats found   1 mismatch    //coz GA are similar
    
    AGTA - 1 repeats found b] 2 mismatch   [/b] //AGTG,TGTA , GT's are in position
    I hope this makes it a little bit clearer what iam up against...

    I have written a program before to find perfect repeats...basically it goes through each sequence and compares each string and outputs the number of times it has been repeated..

    i can show my implementation if wanted,but im really stuck and some guidance would be really appreciated

    kind regards
    JoHnNY

  2. #2
    VA National Guard The Brain's Avatar
    Join Date
    May 2004
    Location
    Manassas, VA USA
    Posts
    903
    You are looking for a way to calculate the number of repeated characters.. or squence of characters appear in a given string. This is how I would do it:

    First.. I would calculate the total number of possiblities.. which is the, "total number of characters.. to the total number of characters power."

    Code:
    int total_possibilities = pow(strlen(input), strlen(input));
    you can use this information to make dynamic arrays and use as a loop counter


    The next step, would be to populate a string array with every single possible character combination. This algorithm might seem wicked hard at first.. but if you start out with like a 4 or 5 letter word.. and draw it out on paper.. you can come up with a general algortihm that will work pretty easily.

    The next step, would be to create an array of counters the same size as your array of possible character combinations. You will use this array in a direct 1-to-1 correlation with your array of possible character combinations. (i.e. the 3rd element counter will match up with the 3rd element of your character possibility array.. so you can determine stuff like 'GT' was matched 10 times etc. )

    The next step, would be to use nested for loops.. to run through the user input array.. and compare the user input.. against every posibility. If there is a match you can increment the corrosponding counter.

    maybe something like this:
    Code:
    string test = "";
    for(int i=0; i<strlen(input); i++)
    { 
         test+=input[i];  //slowly build up a "test" string, one character at a time
    
              for(int j=0; j<total_possibilities; j++)
              {
                   if(test == possibilities[j])
                        ++counters[j];  //notice the 1:1 correlation  
              } 
    }
    This is probably nothing like your mismatch approach to solving this problem.. but I hope it offers a different perspective.
    Last edited by The Brain; 02-28-2005 at 09:28 AM.
    • "Problem Solving C++, The Object of Programming" -Walter Savitch
    • "Data Structures and Other Objects using C++" -Walter Savitch
    • "Assembly Language for Intel-Based Computers" -Kip Irvine
    • "Programming Windows, 5th edition" -Charles Petzold
    • "Visual C++ MFC Programming by Example" -John E. Swanke
    • "Network Programming Windows" -Jones/Ohlund
    • "Sams Teach Yourself Game Programming in 24 Hours" -Michael Morrison
    • "Mathmatics for 3D Game Programming & Computer Graphics" -Eric Lengyel

  3. #3
    Registered User
    Join Date
    Aug 2001
    Posts
    244
    uh are you talking about edit distance
    that means how many insertions or deletions are neccessairy to "convert" string 1 into string 2:

    e.g.
    ABTATBX
    ACBFATB

    one possibly "alignment" would be
    Code:
    A-BT-ATBX
    ACB-FATB-
    thus in that case youd have 2 indels.

    the goal is to find to minimum number of indels neccessairy
    (indel = insertion/deletion(

    you might lookup edit distance. of course you can use the brute force approach with exponential complexity (takes forever for long strings) - or use the special algorithms with polynomial complexity (takes long - but not forever)
    Last edited by Raven Arkadon; 02-28-2005 at 08:38 AM.
    signature under construction

  4. #4
    Registered User
    Join Date
    Mar 2002
    Posts
    1,595
    To me the problem of comparing one string to another and looking for mismatches is a different problem from looking for similarities and calling that a mismatch.

    In the former, assign one string to be the "gold standard" and just compare the char in the "pretender" string to the char in the gold standard one by one, counting those that don't match, and, if you want at what location they occur, etc. That type of a protocol should be pretty straightforward.

    Looking for similarities seems to be more complicated. First, I think you will need to decide what constitutes a similarity. Once you have that written down in concise terms, then you can try to look at a single string for similarities.
    You're only born perfect.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. tools for finding memory leaks
    By stanlvw in forum C++ Programming
    Replies: 4
    Last Post: 04-03-2009, 11:41 AM
  2. Finding primes
    By starripper in forum C++ Programming
    Replies: 19
    Last Post: 01-14-2006, 04:17 PM
  3. Path Finding Using Adjacency List.
    By Geolingo in forum C++ Programming
    Replies: 7
    Last Post: 05-16-2005, 02:34 PM
  4. Finding the smallest value......HARD!!!
    By AssistMe in forum C Programming
    Replies: 21
    Last Post: 03-10-2005, 06:23 AM
  5. MFC :: Finding Child Window of a CWnd* Object?
    By SyntaxBubble in forum Windows Programming
    Replies: 2
    Last Post: 09-06-2003, 09:06 AM