# Generating DNA sequence

This is a discussion on Generating DNA sequence within the C Programming forums, part of the General Programming Boards category; Originally Posted by Maz I would take a different approach. I would first calculate the amount of space the sequence ...

1. Originally Posted by Maz
I would take a different approach.

I would first calculate the amount of space the sequence requires - with base 1, its 4, otherwise 4^(base-1), right?

Then I would calculate the width of "one subsecuence". Eg with base 1, it is 1 (a or c or g or t). with base 2, it is 2 (aa or ac or ...)

And finally amount of the "subsecuences". For example with allocated space/width of sequence.

Then I would start filling up the space.

first I'd write a<empty for rest of subsec>c<empty for rest of subsec>g<empty for rest of subsec>t<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>....
then a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>a<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec>c<empty for rest of subsec> ...

just increasing the amount of repeated writes of same letter for each time - untill the whole area was covered. All of changing variables can be calculated from base. You can build all the loops in function, and then just decide how many loops you go through before breaking out depending on the base.
I don't understand the part in bold. Are you using an array? Can you give an example of how the codes would look like?

2. Maz, you should know that there is a special DNA sequence compression algorithm, made up just to store the info we're processing here.

I read about it briefly on Wikipedia when I was looking for info on generating these sequences. I would recommend that for the compression part.

But first, we need to generate it!

@lolguy: Base here, refers to DNA base #'s, not our binary, decimal or hex bases we usually work with.

So let me see if I understand our sequence, correctly:

The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?

For the "tail"group, the sequence is always to add the ACGT, and that sequence never varies, right?

No, Teiji, we don't need to add another for loop for every base number. We're just flushing out the algorithm here.

See I took physiology before Watson & Crick, and before microscopes were invented, so genetics was pretty much pea plants and Mendelson, iirc.

The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?
I believe base 5 begins with AAAAA

4. Sounds inefficient for our DNA, but good for us.

AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?

This is getting easier by the minute.

Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.

Sounds inefficient for our DNA, but good for us.

AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?

This is getting easier by the minute.

Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.
No, you got it wrong.

Simple concept but hard to explain. Only 1 letter change at a time starting from the right. Once that letter complete its change (got all the way to T), then it "reset" and the letter left of it change.

I'll bold the letter that change and reset

Base 1: A, C, G, T (total of 4 occurrences)
Base 2: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT (total of 4^2 or 16 occurrences)
.....
Base 5: AAAAA, AAAAC, AAAAG, AAAAT, AAACA, AAACC, AAACG, AAACT, AAAGA, AAAGC, AAAGG, AAAGT, AAATA, AAATC, AAATG, AAATT, AACAA, AACAC, AACAG, AACAT, etc until the last one TTTTT (total of 4^5 or 1024 occurrences)

And I need to store them.

6. maybe use base 4 ( 0, 1, 2, 3 = A, C, G, T) math and add one to it to get the next value

then it would just be a problem of limiting the number of bits you are using (1 base4-bit for base 1, 2 for base 2 so that 0 gives you the appropriate number of 'A's)
which would also control when to stop (when it rolls over)

7. Originally Posted by Teiji
No, you got it wrong.
I got it right, now.

This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.

I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.

I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.

What do you intend doing with all this data? Seems a bit much for a class assignment.

Code:
```/* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to
*/

#include <stdio.h>

void showIt(int base);

int main(void)  {
int i, base;

base = 5;
showIt(base);

printf("\n\n\t\t\t      press enter when ready ");
i = getchar(); i++;
return 0;
}
void showIt(int base) {
int a,b, c, d, e, i;
unsigned long perm_num = 0;
unsigned long perm_mill = 0;
char dna[] = {"ACGT"};

printf("\n");

if(base) {
for(a = 0; a < 4; a++)  {
if(base == 1) {
printf("%c ", dna[a]);
perm_num++;
}
if(base > 1) {
for(b = 0; b < 4; b++) {
if(base == 2) {
printf("%c%c  ", dna[a], dna[b]);
perm_num++;
}
if(base > 2) {
for(c = 0; c < 4; c++)  {
if(base == 3) {
printf("%c%c%c  ", dna[a], dna[b], dna[c]);
perm_num++;
}
if(base > 3) {
for(d = 0; d < 4; d++) {
if(base == 4) {
printf("%c%c%c%c ", dna[a], dna[b], dna[c], dna[d]);  // dna[t]);
perm_num++;
}
if(base > 4) {
for(e = 0; e < 4; e++) {
if(base == 5) {
printf("%c%c%c%c%c   ", dna[a], dna[b], dna[c], dna[d], dna[e]);
perm_num++;
}
if(perm_num % 100 == 0) i = getchar();
}
}
//sprinkle this around where the permutations become > 1,000,000
if(perm_num > 1000000) {
perm_num = 0;
perm_mill++;
}
}
}

}
}
}
}
}
}
if(perm_mill)
printf("Permutations found: Millions: %lu %lu", perm_mill, perm_num);
else
printf("\n Permutations found: %lu ", perm_num);
}```
After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.

I got it right, now.

This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.

I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.

I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.

What do you intend doing with all this data? Seems a bit much for a class assignment.

*CODE*

After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.

perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.
do you think you could take a crack at my solution (if i made any sense)? I have no idea how to actually code it and I think it would be a useful thing to know

***EDIT***
never mind, I think I got it
**********

9. You're on the right track, imo, because you need 4 (or 4-1), since we have 4 letters in the set to be permuted. Your masking of bits would correspond via bit testing, to my if(base...) statements, in my program, if I get your drift right.

Which is nice, but not much different, really. We'd be checking the bits, or values, instead of checking the array of base letters, through the index variables. I don't see how that gets us out of the nested 64 (eventually), for loops, however.

What would be elegant, would be to use either of our approaches, but make it work recursively. That would be brilliant, but I wouldn't try it under any kind of time pressure.

Teiji faces a lot of work still. Not only does the program have to be expanded to handle all 64 bases, but then there's the problem of output to a file(s), and maybe writing up that DNA compression and de-compression program.

If there is an elegant way to handle this, I'd sure like to see it.

EDIT: ***Could you post it?***

10. just finished, does up to base 3
i'm sure it can be done better

Code:
```#include <stdio.h>

typedef struct {
unsigned int DNA : 2;
}mytype;

int main(int argc, char* argv[])
{
mytype one;
mytype two;
mytype three;
mytype *foo[3];
char bar[4] = "ACGT";
foo[0]=&one;
foo[1]=&two;
foo[2]=&three;
one.DNA = two.DNA = three.DNA = 0;

do{

do {
do {
do {

printf("%c%c%c\n", bar[foo[0]->DNA], bar[foo[1]->DNA], bar[foo[2]->DNA]);
foo[2]->DNA++;

}while(foo[2]->DNA);
foo[1]->DNA++;

}while(foo[1]->DNA);
foo[0]->DNA++;

}while(foo[0]->DNA);

}while(foo[0]->DNA && foo[1]->DNA && foo[2]->DNA);

return 0;
}```

11. That looks very clean, d_d! Far more concise than mine, also.

That looks very clean, d_d! Far more concise than mine, also.
ಠ_ಠ

now to figure out how do do it recursively...

13. Originally Posted by ಠ_ಠ
ಠ_ಠ

now to figure out how do do it recursively...
Sandbagger!

I see how my program could be substantially cleaned up and shortened, as well.

I believe I'll leave that as "an exercise for the student", though. I want to dig into that maze puzzle posted up on the forum.

I will be checking back on your next version of the program of course, and I want to hear what Teiji is going to be doing with all this data!

14. here's an almost working recursive solution, too sleepy to finish today
Code:
```#include <stdio.h>

typedef struct {
unsigned int DNA : 2;
}mytype;

void printDNA(mytype *foo[], int base, int flag);

int main(int argc, char* argv[])
{
mytype one;
mytype two;
mytype three;
mytype *foo[3];
foo[0]=&one;
foo[1]=&two;
foo[2]=&three;
one.DNA = two.DNA = three.DNA = 0;

printDNA(foo, 3, 0);
system("pause");
return 0;

}

void printDNA(mytype *foo[], int base, int flag)
{
int temp1, temp2;
char bar[4] = "ACGT";

for (temp1 = 0; temp1 < base; temp1++) {
printf("%c",bar[foo[temp1]->DNA]);
}
printf("\n");

for (temp2 = base - 1; temp2 > -1; temp2--) {
if (foo[temp2]->DNA == 3 && temp2 != 0) {
foo[temp2-1]->DNA++;
}else if (foo[temp2]->DNA == 3 && temp2 == 0 && flag == 0) {
flag = 1;
}else if (foo[temp2]->DNA == 3 && foo[temp2+1]->DNA == 3 && temp2 == 0 && flag == 1) {
flag = 2;
}
}
foo[base-1]->DNA++;
if (flag != 2) {
printDNA(foo, base, flag);
}
}```

15. As far as I understand, you simply want to print all permutations of fixed-length strings consisting of the letters a, c, g and t.

Here's a solution:

Code:
```#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

char *dnaconv(unsigned long num, char *str, size_t size)
{
char *p;

if(size < 2)
return NULL;

memset(str, 'a', size - 1);
p = &str[size - 1];
*p = '\0';

do {
*--p = "acgt"[num % 4];
num /= 4;
} while(num != 0);

return str;
}

int main(int argc, char *argv[])
{
int base, i, num;
char *s;

if(argc != 2) {
fprintf(stderr, "USAGE: %s <base>\n", argv[0]);
return EXIT_FAILURE;
}

base = atoi(argv[argc-1]);

if((s = malloc(base + 1)) == NULL) {
fprintf(stderr, "%s: out of memory\n", argv[0]);
return EXIT_FAILURE;
}

num = pow(4, base);

for(i = 0; i < num; i++) {
printf("%s\n", dnaconv(i, s, base + 1));
}

free(s);
return 0;
}```
Compile with -lm, call with the desired base as the only argument.

dnaconv() is the interesting function. It converts a number to numeral base 4 and uses the letters a, c, g, t instead of the traditional digits. The memset() is merely for padding. main() then simply counts from 0 to 4^base, converts these numbers to numeral base 4 and prints them.

Why did nobody listen to ಠ_ಠ's advice?

Greets,
Philip

Page 2 of 4 First 1234 Last