Large String Comparisions...

Printable View

05-27-2005
alvifarooq

Large String Comparisions...

I'll be reading in a chain of 6 letters in two sequences: seq1 and seq2. The sequences could be like:

Code:

seq1: ABCDEFABCDEFABCDEF seq2: ABBBCDEFBBCCABCDEFABCDEF

I want to find the exact matches between these sequences. The problem is that these sequences could be as large as 2000 letters in each sequence. Exact matches means that if you take A in seq1 then you have exact matches of A at positions 1, 13 and 19. I will organize this according to (x,y) pairs. So, I get for A:

- (1,1), (1,13) and (1, 19)
- (7,1), (7,13) and (7, 19)
- (13,1), (13,13) and (13, 19)

Could someone tell me what's the best way to do this comparision (it has to be fast and it has to work on really large sequences of letters as well). Also, what would be the best data structure to use in this instance.

Thanks!

Sidenote:

Here's the method I had in mind. If I have m letters in one sequence and n in the other, comparing them one-on-one is carrying out n x m computations. I could reduce that by taking the matrix approach. For the example above, these are the steps I follow:

- Create a class of letter A with some sort of a matrix data structure.
- Read seq1 letter by letter. When you read A, store the position of the letter in the row of the matrix.
- Now start to read seq2 and do the same as before.
- Do this for all the letters

As a result, this is what I will throw into my 'A' matrix in sequence:

Code:

row: 1, 7, 13 column: 1, 13, 19

If you look at the entries for the matrix you will get:
- (1,1), (1,13) and (1, 19)
- (7,1), (7,13) and (7, 19)
- (13,1), (13,13) and (13, 19)

So does anyone know how exactly to encode this information. As in which data structure should I use to populate the data I've shown above. I'd prefer a non-array approach.
05-27-2005
alvifarooq

If anyone is interested, I'm trying to make a DOTPLOT. The matches I get from the two sequences would be coordinates on a graph and eventually I would want to find the coordinates based on straight lines. Hence, it's important that I store the matched coordinates in some vector or map. Please look at the attached picture.

The little blocks you see are all matches between the two sequences being compared. I can twist and turn to make my method work, but I thought there could be a way to use "string compare" in C++ to do this and for sequences as large as 2000 letters each.

Thanks!
05-27-2005
Salem

Maybe this, if you're after fast string matching
http://www.cs.utexas.edu/users/moore...ing-searching/
?
05-28-2005
alvifarooq

thanks...i'm looking at this right now...although i must point out that their algorithm knows the word they're looking for in the string...i'm not searching for words...only letter-letter matches in the fastest way to compute the dotplot...
05-29-2005
alvifarooq

is there a container to hold 3 types of data...i want to store cordinates, a char and a float in one container...something like:

(2, 5) A 3

I know I can make objects and store them in a map...but isn't there a way to just store these numbers without going the objects way?
05-30-2005
Iamien

This is about the only way i could think of to store values without using a struct or class, its messy imo.

Code:

#include <iostream> using namespace std; void getValues(void* arr[4], int &x, int &y, float &q,char &c) { // Cast the pointers into their coresponding datatypes and assign x = *(int*)arr[0]; y = *(int*)arr[1]; q = *(float*)arr[2]; c = *(char*)arr[3]; } void main() { // lets assume we know ahead of time that arr will be * int , * int, * char * float int x = 1 ,y = 2, x1, y1; float q = 2.3, q1; char c = 'A' ,c1; void* arr[4]; // will hold 4 pointers to different data types arr[0] = &x; arr[1] = &y; arr[2] = &q; arr[3] = &c; getValues(arr,x1, y1, q1, c1); x = 7; cout <<x1 <<endl; cout <<y1 <<endl; cout <<q1 <<endl; cout <<c1 <<endl; }

with this you could do something like, void * arr[ARR_SIZE][4]
and then store all your values for first piece of data in the first positions 4 pointers, the moves to second postion.
i dont know if this helps you at all but i hope it does

personaly, i'd use a class.
05-30-2005
alvifarooq

thanks...as in my first post, i wanted to do the computation straight-out because i didn't want a performance hit. This means that let's say I've read 'A' occuring in seq1 at 1, 3, and 4.

Now, when I read seq2 and I see that it (A) occurs at 4, 5, 6, and 10 then I don't have to make the combinations myself (meaning (1,4), (1,5), (1, 6), (1,10), (3,4)......)These are automatically updated to a matrix or a table or something. Since I can't find a suitable way to do this. I'm going for the combination method.
05-30-2005
Iamien

So you are doing a letter by letter comparison, <just want to see if i understand you>

assuming

ABAS

ABABAS

seq1
A = 1, 3
B = 2
S = 4

A 1,1 1,3 1,5
3,1 3,3 3,5

B 2,2 2,4
S 4,5

that would be the needed results?
05-30-2005
alvifarooq

exactly...like you say it.
So, reading the first sequence is simply a read. Reading the second sequence is where the match occurs. As soon as I read A in the second sequence (instead of me having to do it), the container should update itself accordingly knowing that this value is the y-coordinate and the x-coordinates are: A = 1, 3 from seq1.

I can do this myself actually. e.g. Store seq1 'A' numbers in a vector and then when I read seq2 and I encounter an A, then for all A's in the seq1 vector, pair it with the seq2 'A' number and put it in some other data structure. I don't find this very appealing!
05-30-2005
alvifarooq

if you notice I'm trying to cut down on the matching costs. In normal iterative procedures if I had n letters in seq1 and m letters in seq2, matching and populating would involve n x m comparisions. Using the methods above I try to cut it down significantly. If I can do the table trick I want solved then I'm maintaining my performance edge. If not, it takes a bit of the thunder away, cause I'll have to iterate through some seq1 letters anyway.
05-30-2005
alvifarooq

another thing Iamien. I don't object to using classes. I just thought representing coordinates as objects of some class and then processing them as objects instead of structs or simple numbers might load the system. I've given you small examples. In truth, the program will compare sequences of letters in the thousands! If you have a class-based implementation in mind which is memory-efficient and quick, then that'll be really good as well.