Thread: Large String Comparisions...

  1. #1
    Registered User
    Join Date
    Sep 2004
    Posts
    124

    Exclamation Large String Comparisions...

    I'll be reading in a chain of 6 letters in two sequences: seq1 and seq2. The sequences could be like:

    Code:
    seq1: ABCDEFABCDEFABCDEF
    seq2: ABBBCDEFBBCCABCDEFABCDEF
    I want to find the exact matches between these sequences. The problem is that these sequences could be as large as 2000 letters in each sequence. Exact matches means that if you take A in seq1 then you have exact matches of A at positions 1, 13 and 19. I will organize this according to (x,y) pairs. So, I get for A:

    - (1,1), (1,13) and (1, 19)
    - (7,1), (7,13) and (7, 19)
    - (13,1), (13,13) and (13, 19)

    Could someone tell me what's the best way to do this comparision (it has to be fast and it has to work on really large sequences of letters as well). Also, what would be the best data structure to use in this instance.

    Thanks!

    Sidenote:

    Here's the method I had in mind. If I have m letters in one sequence and n in the other, comparing them one-on-one is carrying out n x m computations. I could reduce that by taking the matrix approach. For the example above, these are the steps I follow:

    - Create a class of letter A with some sort of a matrix data structure.
    - Read seq1 letter by letter. When you read A, store the position of the letter in the row of the matrix.
    - Now start to read seq2 and do the same as before.
    - Do this for all the letters

    As a result, this is what I will throw into my 'A' matrix in sequence:
    Code:
    row: 1, 7, 13
    column: 1, 13, 19
    If you look at the entries for the matrix you will get:
    - (1,1), (1,13) and (1, 19)
    - (7,1), (7,13) and (7, 19)
    - (13,1), (13,13) and (13, 19)

    So does anyone know how exactly to encode this information. As in which data structure should I use to populate the data I've shown above. I'd prefer a non-array approach.

  2. #2
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    If anyone is interested, I'm trying to make a DOTPLOT. The matches I get from the two sequences would be coordinates on a graph and eventually I would want to find the coordinates based on straight lines. Hence, it's important that I store the matched coordinates in some vector or map. Please look at the attached picture.

    The little blocks you see are all matches between the two sequences being compared. I can twist and turn to make my method work, but I thought there could be a way to use "string compare" in C++ to do this and for sequences as large as 2000 letters each.

    Thanks!

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    Maybe this, if you're after fast string matching
    http://www.cs.utexas.edu/users/moore...ing-searching/
    ?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    thanks...i'm looking at this right now...although i must point out that their algorithm knows the word they're looking for in the string...i'm not searching for words...only letter-letter matches in the fastest way to compute the dotplot...

  5. #5
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    is there a container to hold 3 types of data...i want to store cordinates, a char and a float in one container...something like:

    (2, 5) A 3

    I know I can make objects and store them in a map...but isn't there a way to just store these numbers without going the objects way?

  6. #6
    Wen Resu
    Join Date
    May 2003
    Posts
    219
    This is about the only way i could think of to store values without using a struct or class, its messy imo.

    Code:
    #include <iostream>
    using namespace std;
    
    void getValues(void* arr[4], int &x, int &y, float &q,char &c) 
    {
    	// Cast the  pointers into their coresponding datatypes  and assign
    	 x = *(int*)arr[0];
    	 y = *(int*)arr[1];
    	 q = *(float*)arr[2];
    	 c = *(char*)arr[3];
    
    }
    
    void main()
    {
    	
    	// lets assume we know ahead of time that arr will be * int , * int, * char * float
    	int x = 1 ,y = 2, x1, y1;
    	float q = 2.3, q1;
    	char c = 'A' ,c1;
    
    	
    	
    	void* arr[4]; // will hold 4 pointers to different data types
    
    	arr[0] = &x;
    	arr[1] = &y;
    	arr[2] = &q;
    	arr[3] = &c;
    	getValues(arr,x1, y1, q1, c1);	
    	x = 7;
    
    	cout <<x1 <<endl;
    	cout <<y1 <<endl;
    	cout <<q1 <<endl;
    	cout <<c1 <<endl;
    
    
    		
    
    }
    with this you could do something like, void * arr[ARR_SIZE][4]
    and then store all your values for first piece of data in the first positions 4 pointers, the moves to second postion.
    i dont know if this helps you at all but i hope it does

    personaly, i'd use a class.
    Last edited by Iamien; 05-30-2005 at 03:08 AM.

  7. #7
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    thanks...as in my first post, i wanted to do the computation straight-out because i didn't want a performance hit. This means that let's say I've read 'A' occuring in seq1 at 1, 3, and 4.

    Now, when I read seq2 and I see that it (A) occurs at 4, 5, 6, and 10 then I don't have to make the combinations myself (meaning (1,4), (1,5), (1, 6), (1,10), (3,4)......)These are automatically updated to a matrix or a table or something. Since I can't find a suitable way to do this. I'm going for the combination method.

  8. #8
    Wen Resu
    Join Date
    May 2003
    Posts
    219
    So you are doing a letter by letter comparison, <just want to see if i understand you>

    assuming

    ABAS

    ABABAS

    seq1
    A = 1, 3
    B = 2
    S = 4

    A 1,1 1,3 1,5
    3,1 3,3 3,5

    B 2,2 2,4
    S 4,5

    that would be the needed results?

  9. #9
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    exactly...like you say it.
    So, reading the first sequence is simply a read. Reading the second sequence is where the match occurs. As soon as I read A in the second sequence (instead of me having to do it), the container should update itself accordingly knowing that this value is the y-coordinate and the x-coordinates are: A = 1, 3 from seq1.

    I can do this myself actually. e.g. Store seq1 'A' numbers in a vector and then when I read seq2 and I encounter an A, then for all A's in the seq1 vector, pair it with the seq2 'A' number and put it in some other data structure. I don't find this very appealing!

  10. #10
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    if you notice I'm trying to cut down on the matching costs. In normal iterative procedures if I had n letters in seq1 and m letters in seq2, matching and populating would involve n x m comparisions. Using the methods above I try to cut it down significantly. If I can do the table trick I want solved then I'm maintaining my performance edge. If not, it takes a bit of the thunder away, cause I'll have to iterate through some seq1 letters anyway.

  11. #11
    Registered User
    Join Date
    Sep 2004
    Posts
    124
    another thing Iamien. I don't object to using classes. I just thought representing coordinates as objects of some class and then processing them as objects instead of structs or simple numbers might load the system. I've given you small examples. In truth, the program will compare sequences of letters in the thousands! If you have a class-based implementation in mind which is memory-efficient and quick, then that'll be really good as well.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Problem collecting large string of data
    By gkoenig in forum C Programming
    Replies: 3
    Last Post: 03-01-2008, 05:34 PM
  2. Another overloading "<<" problem
    By alphaoide in forum C++ Programming
    Replies: 18
    Last Post: 09-30-2003, 10:32 AM
  3. Something is wrong with this menu...
    By DarkViper in forum Windows Programming
    Replies: 2
    Last Post: 12-14-2002, 11:06 PM
  4. Classes inheretance problem...
    By NANO in forum C++ Programming
    Replies: 12
    Last Post: 12-09-2002, 03:23 PM
  5. Warnings, warnings, warnings?
    By spentdome in forum C Programming
    Replies: 25
    Last Post: 05-27-2002, 06:49 PM