String Construction Algorithm Wanted...

**anirban** · 08-06-2007

Suppose I have a string like the following....

mynameisninjailiveinaustraliailikecricket

Now I break the string into overlapping substrings such as

Code:

mynameis      livein 
  nameisninjailive
              liveinaustralia
                    australialike
        ninjaili             likecrick
                       tralialike   cket
                  inaustra

Now my aim is to reconstruct the original string from the overlapping substrings. If not possible to reconstruct the original when multiple solutions exist, then the resultant string must be the shortest. May I get any algorithm to attack this problem please....

**anon** · 08-06-2007

What have you tried?

I think you'll need to match substrings of the fragment with the full string (you got so far). Go from longer to shorter (probably) to pick a larger overlap in case of multiple solutions.

Edit: This is not a programming challenge again? In this case the solution probably isn't as trivial as that...

But doesn't your graph give you some ideas? Finding a way how to align strings and to recurse until you've put together all the pieces...

**anirban** · 08-06-2007

No no why should I post programming challenges just without reason! The assignments which my college Sir has given me are too intricate for me to solve! That is why I am asking for your help! Moreover fortunately when I am searching online for algorithm luckily sometimes I get that these are international problems - now what should I do! Hide from you all the truth? I hope not!

Sorry can you please elaborate it please!!!!!!!

**Happy_Reaper** · 08-06-2007

1) Start with initial string.
2) Take next string, and diff with string, giving you the letters which are the same in both strings.
3) Fill in spaces of the resulting string appropriately.
4) Make resulting string intial string.
5) Repeat until no strings are left.

**anon** · 08-06-2007

I'm afraid the assignment is not as simple as that.

What if the two words don't have any parts in common? (You can't combine them because you'll have no idea what's the best place for the second string. You'd probably need to leave the second string aside for the time being.)

It might also be possible that this approach won't give you the best result. Suppose the first letter of a string == the last letter of another string. If you happily combined them, you might already miss a better solution where these two strings don't share any common characters.

I suppose a brute-force backtracking method might overcome the second problem. You might probably start with the longer strings first (these might have longer overlapping segments?) and leave a search branch early if the resulting string is already longer than the solution found so far. (Some variant of A*?) At least, this way you should be able to crack the problem, if not in an suboptimal way.

This assignment seems to belong to an advanced course on algorithms, so the OP should have better knowledge of available strategies than I do. Therefore ...

**manofsteel972** · 08-06-2007

If you are given the starting string, I think you could just create a result buffer of the same length, go through each of the broken up strings and find if it is a substring of the original. Keep track of which is the shortest. If the string is found then using the position where it was found copy it to the result buffer. Then once you are done going through all the strings, compare the original string to the one you created.

[Edit]

If there are multiple occurances of the substring, you would need to find them all? so you will need to search to the end of the string and find each occurance.

**Rashakil Fol** · 08-06-2007

[Edit: Ok people, he's not given the starting string -- if he merely had to reconstruct that, the problem would be trivial, and he isn't doing that, because he has to find the shortest string that could possibly be the starting string, which makes the starting string useless as it might as well be the concatenation of the substrings he has been given.]

[Edit 2: Consider this post to be a dumb solution.]

First, you need to identify the problem that you really want to solve.

Essentially, you have a finite set, S, of k elements, with a function, f : S \times S -> N (where N is the set of natural numbers >= 0). Of course f (x,y) is the overlap you can get from overlapping x and y with x on the left and y on the right -- with the stipulation that if y is a substring of x, f (x,y) = length(y). For example, f("heyoyo", "yoyofolomado") = 4, and f("abcde","bc") = 2. Then you want to find the sequence [s_1, ..., s_k], where s_1 .. s_k all distinct strings that comprise the set S, that maximizes the sum, f(s_1,s_2) + f(s_2,s_3) + ... + f(s_{k-1}, s_k).

That's your real problem, and it has nothing in particular to do with strings; you're finding the longest path that visits all the nodes of a graph.

**anirban** · 08-06-2007

Well ,great thanks to you!!! But if could please slightly elaborate the procedure a bit more elaborate I shall be greatly benefited! Also if I map the problem to a graph theory problem of finding longest path, what would be the nodes of the graph! Please do elaborate!

**Rashakil Fol** · 08-06-2007

Um, it's all there in the post above.

Edit:

Also, what I've described is not necessarily the best solution.

Edit2:

Definitely not. There's just so much structure to this problem.

Edit 3: Am I just being an idiot?

**Happy_Reaper** · 08-06-2007

I'm actually pretty sure, now that Rashakil Fol has outlined it that way, that the problem seems to be a TSP variant, which makes it NP-Complete, which basically says that there won't be an efficient, all-pupose solution (although you may approximate it rather well).

**anirban** · 08-06-2007

Then what should I do now.....

**Happy_Reaper** · 08-07-2007

I just said there was no efficient all-purpose solution. If you keep the string set small, you're likely never to see much of a problem from an efficiency perspective.

**Rashakil Fol** · 08-07-2007

I don't know. I get the feeling there might be an efficient solution, or at least a very slow growing exponential solution, and it's probably just a backtracking one that doesn't bother going through all the permutations of non-overlapping strings. The graph I described would be incredibly sparse, and it contains structure. For example, if f(x,y) > f(x,z) > 0, then if f(w,y) < f(x,y), then f(w,z) = max {0, f(w,y) - [f(x,y) - f(x,z)]}.

**anon** · 08-07-2007

To begin with you might throw out strings that are entirely contained in some other string. Show some initiative...

**Salem** · 08-07-2007

> Then what should I do now.....
Try it and find out - you're not going to really learn anything by just yapping about it or trying to manoeuvre someone else into doing it by continual begging.

Whether you succeed or fail doesn't matter, you'll still learn a lot from simply trying.

Q: How do I avoid making mistakes?
A: Mistakes are avoided through past experience.
Q: How do I gain experience?
A: Through lessons learnt from your past mistakes.

Thread: String Construction Algorithm Wanted...

Thread Tools

Search Thread

Display

String Construction Algorithm Wanted...

Similar Threads

Please check my C++

Binary Search Trees Part III

Linked List Help

creating class, and linking files

Warnings, warnings, warnings?