Large String Comparisions...
I'll be reading in a chain of 6 letters in two sequences: seq1 and seq2. The sequences could be like:
Code:
seq1: ABCDEFABCDEFABCDEF
seq2: ABBBCDEFBBCCABCDEFABCDEF
I want to find the exact matches between these sequences. The problem is that these sequences could be as large as 2000 letters in each sequence. Exact matches means that if you take A in seq1 then you have exact matches of A at positions 1, 13 and 19. I will organize this according to (x,y) pairs. So, I get for A:
- (1,1), (1,13) and (1, 19)
- (7,1), (7,13) and (7, 19)
- (13,1), (13,13) and (13, 19)
Could someone tell me what's the best way to do this comparision (it has to be fast and it has to work on really large sequences of letters as well). Also, what would be the best data structure to use in this instance.
Thanks!
Sidenote:
Here's the method I had in mind. If I have m letters in one sequence and n in the other, comparing them one-on-one is carrying out n x m computations. I could reduce that by taking the matrix approach. For the example above, these are the steps I follow:
- Create a class of letter A with some sort of a matrix data structure.
- Read seq1 letter by letter. When you read A, store the position of the letter in the row of the matrix.
- Now start to read seq2 and do the same as before.
- Do this for all the letters
As a result, this is what I will throw into my 'A' matrix in sequence:
Code:
row: 1, 7, 13
column: 1, 13, 19
If you look at the entries for the matrix you will get:
- (1,1), (1,13) and (1, 19)
- (7,1), (7,13) and (7, 19)
- (13,1), (13,13) and (13, 19)
So does anyone know how exactly to encode this information. As in which data structure should I use to populate the data I've shown above. I'd prefer a non-array approach.