I don't follow how you can encode a base with hundreds or thousands of letters, into just two bits.
KCfromNC stated that each base, not each sequence, can be encoded with two bits, and of course that is true since there are four possible bases.

KC, you seem to have a real insight into what Teiji is trying to do. I hope you'll stick around and join into this thread, frequently.

I don't follow how you can encode a base with hundreds or thousands of letters, into just two bits. Can you post a small example of that with say, 16 letters?
I agree that you can't encode a base with that many letters into 2 bits. I was saying that it would take 2 bits per letter (4 possible options per letter == 2 bits), and it looked like the longest sequence he was looking for was 15 letters = 30 bits max. I might have misunderstood that goal from the original post, but that's the question I was answering so hopefully it at least makes sense in that context.

Edit - should have read the next page before responding. Oh well. I'll atone for that by pointing out that if you use a trie, you're only worrying about the encoding of one letter at any particular time regardless of the total sequence length you're processing. Each level of the trie encodes just one letter so you only need 4 possible children for each node. You can add as much depth as you need to get the required sequence length, limited by memory of course.

3. Ok, I see what you meant. You were referring to base letters, and I was thinking base strings. I thought you had lost your mind, for a second there.

Aw! So encode the letters into bits in 32 bit numbers, then do the work with them, with bit operators and masks.

I'm not sure what the plan is for processing the strings from the input file.

4. I'd like to thank everyone for all of the help. But the length of this assignment is growing and growing each day (and I have several other assignments to do) and the codes get more complex as it progresses, which is beyond my knowledge of barely a semester learning C. Therefore, I'll stop here and polish up what I currently have.

You guys can keep on discussing this "project" if you like.

5. I probably don't understand the problem yet... but from the hints it sounds like the problem is to simply count the number of duplicate strings; strings being anywhere from 1 to 15 in length.

It would be insane to generate all those permutations ahead of time... as if that wasn't bad enough, comparing thousands of input strings to this database would be a nightmare.

"Base 15", as Teiji calls it for strings of 15 character length would be 1,073,741,824 permutations. It would take 16 gigabytes of memory just to store such an array. Now imagine trying to search that thing 1000s of times.

It doesn't matter that they only consist letters among the following: 'A', 'C', 'G', and 'T' although that restriction should be checked for for each input string.

Simply allocate some hash, or sorted list array of encountered strings and count them if there is a clash.

6. ** Welcome to the "I don't really understand this whole thing", club! **

As this assignment moves forward, Teiji may very well discover that she doesn't need these long permutations, after all.

Nonetheless, Teiji asked for help with this, and we helped, and that's all we could do atm.

7. This will print out every sequence for all bases between 1 and 15
I pretty much scraped my earlier code, didn't even use recursion
Code:
```#include <stdio.h>
#include <math.h>

void printDNA(int base);

int main(int argc, char* argv[])
{
int base;
for (base = 1; base < 16; base ++) {
printDNA(base);
}
system("pause");
return 0;

}

void printDNA(int base)
{
int pairs, i, tempPair;
int digits[16];
char DNA[4] = "ACGT";
for (pairs = 0; pairs < pow(4, base); pairs++) {
tempPair=pairs;
for (i = base; i > -1; i--) {
digits[i]=tempPair%4;
tempPair /= 4;
}
for (i = 1; i <= base; i++) {
printf("%c", DNA[digits[i]]);
}
printf("\n");
}
}```

8. Originally Posted by ಠ_ಠ
This will print out every sequence for all bases between 1 and 15
I pretty much scraped my earlier code, didn't even use recursion
Excellent!! Especially for a 1st year Software Engineering student. Recognizing it is a base-4 issue... encoding the "DNA sequence" as a multi digit number. 2-loops nested. Done.

If the inputted strings are to be searched against those arrays it can be accomplished using binary searches in each respective string-length category. A counter of occurrence associated with each entry of course.

One small bug:
line
Code:
`for (i = base; i > -1; i--) {`
should be
Code:
`for (i = base; i > 0; i--) {`
Otherwise it will always assign a zero into [0] of digits - which does no harm but is useless. The [0]th element is not read in the subsequent loop. Also I would haul the pow(4, base) outside of the for loop so that it doesn't get re-evaluated zillions of times.

9. Originally Posted by nonoob
Excellent!! Especially for a 1st year Software Engineering student. Recognizing it is a base-4 issue... encoding the "DNA sequence" as a multi digit number. 2-loops nested. Done.

If the inputted strings are to be searched against those arrays it can be accomplished using binary searches in each respective string-length category. A counter of occurrence associated with each entry of course.
thanks, I had to ask my professor how to get C to use base 4 and he told me about the modulo operator and to divide by the base

10. Since the base is a nice power of 2, you could change

Code:
`tempPair /= 4`
to
Code:
`tempPair >>= 2`
Code:
`pow(4, base)`
can be simplified (maybe sped up is more accurate) to be
Code:
`1 << (base << 1)`