# Thread: repeats in a string

1. ## repeats in a string

I need help on how i am going to implement this.

i have a sequence of letters such as ATGCATA

and i want to find the biggest number of repeats which in this case woud be :AT

help would be really appreciative

thanks

2. Could you define the problem a little more? For instance, is a letter allowed to repeat itself? Are you looking for groups of 2, or groups of n? Are there only certain letters in the sequence?

3. thank you for the quick reply,

sorry for the lack of information, a letter is allowed to repeat itself. Im looking for the maximal repeat in a sequence.There are only 4 letters in the sequence.

ATGCATA : so the most repeated would be AT

ATGCGTGCG: the most repeated would be TGCG

I hope this makes sense

4. Would you allow substrings of 1? For instance, in ACGTCATGAGCTA, would A be the most repeated?

5. i have already done a program which does substrings of 1,but this program would be "n" substring.

In this case though ACGTCATGAGCTA , the maximal repeat is A. As there are no other repeats.

6. Well, break the program down into pieces. It seems the first step might be to break down the input string into a bunch of string groupings from 1 to 4 characters in length. For that you could use a set:

Code:
```set<string> strSet;
string data("ATGCATA");

// Find all unique/different 1-4 character groupings from input string.
for( string::iterator it = data.begin(); it != data.end(); ++it )
for( int i = 1; i <= 4 && i <= distance(it,data.end()); ++i )
strSet.insert(string(it,i));

// Output all the groupings to the console.
copy(strSet.begin(),strSet.end(),ostream_iterator<string>(cout,"\n"));```
Should output:

Code:
```A
AT
ATA
ATG
ATGC
C
CA
CAT
CATA
G
GC
GCA
GCAT
T
TA
TG
TGC
TGCA```
There, easy isn't it?

There is more to do of course. Now you would need to find and count how many of each of those above groupings there are back in the original input string and go from there but of course you can do that with your wonderfull coding skills correct? I leave that up to you!

7. hi,thanks for the help and time to write that sample code.but i have already done a program similar to that, basically calculating the number of occurences from the string..

how i would find a repeat of n length?

if the samplestring was big like:

ATGTACAATGTACA - and the biggest repeat was :ATGTACA

8. Wow hk_mp5kpdw, that was pretty awesome. I wish I knew STL a bit better, because that just cut the code I was using by 75%.

jodders, you can search for substrings of n length by changing the second for-loop: instead of looping for up to 4 characters, you want to loop to the length of the string. By using a map<string,int> instead of a set<string>, you can keep track of how many times you find each substring. Then it's just a matter of searching back through your map for the largest number of hits.

9. so how would you loop to the end of the string, instead of looping 4 characters.

everytime i try and compile, i get set undeclared, and strSet undeclared

10. Originally Posted by jodders
so how would you loop to the end of the string, instead of looping 4 characters.
Change "i <= 4" to "i <= data.length()".

Originally Posted by jodders
everytime i try and compile, i get set undeclared, and strSet undeclared
Have you included the appropriate headers <string>, and <set> (or <map> if using pianorain's suggestion) [also <algorithm> and possibly <iterator> if using the copy function]? How are you handling the namespace issue? All of these things: strings, sets, etc... are all in the std namespace.

11. Originally Posted by jodders
I need help on how i am going to implement this.

i have a sequence of letters such as ATGCATA

and i want to find the biggest number of repeats which in this case woud be :AT

help would be really appreciative

thanks
trying to identify genes or something?

12. What exactly are you trying to do?

Here's a algorithm that determines the size of the largest repeating substring in a given string. It doesn't identify what or where the duplicate substrings are, though I believe the techniques it uses could be used to do so.

For a string of size s, there will be z = y + 1 substrings of length n = s - y where y < s. That is; z = s - n + 1.

Code:
```string input = "ATGTACAATGTACA";
int s = input.length();
string substring;
int n; // length of substring
int max = 1; //length of largest duplicate substring in input.
int z; //number of substrings of length n expected
int begin; //where to start substring
int i, index; //
set<string> substrings;//container to hold all unique substrings

for(n = 2; n < s; ++n) //length of substring
{
for(begin = 0; begin <= s - n; ++begin) //where to start each substring
{
index = begin;

//generate substring of length n starting at index = begin
for(i = 0; i < n; ++i)
substring += input[index++];

//insert substring into substrings
substrings.insert(substring);
substring = "";
}
z = s - n + 1;
//if there are no duplicates, then size of substrings will be equal to z.
if(substrings.size() == z)
{
//maximum length of duplicated sequence in input string is n - 1;
max = n - 1;
break;
}
else
{
//duplicate found, so look as next larger n
substrings.clear(); //clear set of substrings
max = n;
}
}
cout << max << endl;```

13. hi guys,i am very very appreciative with the help you have given me.Iam still learning c++ by myself and have found your tips useful. I have found "hk_mp5kpdw " tips outstanding and also pianorain.

I think i will use pianorain's map<string,int> instead of a set<string> to keep track of the substring.

I have got hk_mp5kpdw snippet of code to work,and been messing about trying to use implement the map function inside it, but i get another compilation error. Sorry for my ignorance beforehand! my compiler just goes mad when i change the map<string,int> instead of a set<string>,. I have added the <map> header as told. Am i missing something else?

Once again,thanks for this help.It has been really useful

14. All other things being the same, the insert for the map<string,int> object should probably look like:

Code:
`variablename[string(it,i)]++;`
If using the map, you would probably find it easier to change the copy to a more basic for loop.

Post your code you are using (as much as you possibly can) and any specific errors you get.

Code:
```
int main()
{
int variable;

map<string,int> strSet;
string data("ATGCATAAAAAAAACATGATGACTAGATGACATACGATCGACTGACTGACTGACTGACTGACGAC");

for( string::iterator it = data.begin(); it != data.end(); ++it )
for( int i = 1; i <= data.length() && i <= distance(it,data.end()); ++i )
strSet.insert(variable[string(it,i)]++);

for(strSet.begin(),strSet.end(),ostream_iterator<string>(cout,"\n"));

}```
error:
my own problem is declaring the datatype for my variable. (could you explain wat im doing wrong so i wont make the same mistake in future )

parse error before ')' for loop line