Maz, you should know that there is a special DNA sequence compression algorithm, made up just to store the info we're processing here.
I read about it briefly on Wikipedia when I was looking for info on generating these sequences. I would recommend that for the compression part.
But first, we need to generate it!
@lolguy: Base here, refers to DNA base #'s, not our binary, decimal or hex bases we usually work with.
So let me see if I understand our sequence, correctly:
The base number indicates the number of letters in our "head" group. So base 1 begins with 'A', and base 5 begins with ACGTA?
For the "tail"group, the sequence is always to add the ACGT, and that sequence never varies, right?
No, Teiji, we don't need to add another for loop for every base number. We're just flushing out the algorithm here.
See I took physiology before Watson & Crick, and before microscopes were invented, so genetics was pretty much pea plants and Mendelson, iirc.
Sounds inefficient for our DNA, but good for us.
AAAAA....CCCCC...GGGGG...TTTTT, is what we need for base 5, then? Just ordered permutations of that ACGT set?
This is getting easier by the minute.
Teiji, can you just generate these sequences on the fly (on demand), or do you have to store them in files? I would definitely favor the on demand approach, if possible.
Last edited by Adak; 04-05-2009 at 05:52 PM.
No, you got it wrong.
Simple concept but hard to explain. Only 1 letter change at a time starting from the right. Once that letter complete its change (got all the way to T), then it "reset" and the letter left of it change.
I'll bold the letter that change and reset
Base 1: A, C, G, T (total of 4 occurrences)
Base 2: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT (total of 4^2 or 16 occurrences)
.....
Base 5: AAAAA, AAAAC, AAAAG, AAAAT, AAACA, AAACC, AAACG, AAACT, AAAGA, AAAGC, AAAGG, AAAGT, AAATA, AAATC, AAATG, AAATT, AACAA, AACAC, AACAG, AACAT, etc until the last one TTTTT (total of 4^5 or 1024 occurrences)
And I need to store them.
maybe use base 4 ( 0, 1, 2, 3 = A, C, G, T) math and add one to it to get the next value
then it would just be a problem of limiting the number of bits you are using (1 base4-bit for base 1, 2 for base 2 so that 0 gives you the appropriate number of 'A's)
which would also control when to stop (when it rolls over)
Last edited by ಠ_ಠ; 04-05-2009 at 10:59 PM.
╔╗╔══╦╗
║║║╔╗║║
║╚╣╚╝║╚╗
╚═╩══╩═╝
I got it right, now.
This is not an elegant program, but it shows how I would generate the base sequences. It's a fast runner if you don't have it printing up to the screen.
I recommend putting this data into a file, and then using the DNA compression algorithm mentioned in Wikipedia to compress it, after it's generated. Zip will usually make DNA sequences *larger* when they try to compress this data, according to Wikipedia's article on it.
I would also break the files up into descriptive names so the files are just slightly less in size, than a comfortable number of sectors, on your HD. You'll waste a lot of disk space, otherwise. Having the data all in one big file is risky. In any case, keep a back up.
What do you intend doing with all this data? Seems a bit much for a class assignment.
After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.Code:/* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to 5. Adak, April, 2009 */ #include <stdio.h> void showIt(int base); int main(void) { int i, base; base = 5; showIt(base); printf("\n\n\t\t\t press enter when ready "); i = getchar(); i++; return 0; } void showIt(int base) { int a,b, c, d, e, i; unsigned long perm_num = 0; unsigned long perm_mill = 0; char dna[] = {"ACGT"}; printf("\n"); if(base) { for(a = 0; a < 4; a++) { if(base == 1) { printf("%c ", dna[a]); perm_num++; } if(base > 1) { for(b = 0; b < 4; b++) { if(base == 2) { printf("%c%c ", dna[a], dna[b]); perm_num++; } if(base > 2) { for(c = 0; c < 4; c++) { if(base == 3) { printf("%c%c%c ", dna[a], dna[b], dna[c]); perm_num++; } if(base > 3) { for(d = 0; d < 4; d++) { if(base == 4) { printf("%c%c%c%c ", dna[a], dna[b], dna[c], dna[d]); // dna[t]); perm_num++; } if(base > 4) { for(e = 0; e < 4; e++) { if(base == 5) { printf("%c%c%c%c%c ", dna[a], dna[b], dna[c], dna[d], dna[e]); perm_num++; } if(perm_num % 100 == 0) i = getchar(); } } //sprinkle this around where the permutations become > 1,000,000 if(perm_num > 1000000) { perm_num = 0; perm_mill++; } } } } } } } } } if(perm_mill) printf("Permutations found: Millions: %lu %lu", perm_mill, perm_num); else printf("\n Permutations found: %lu ", perm_num); }
perm_num and perm_mill, I would make global variables, and change them to long, long, if your compiler supports them.
Last edited by Adak; 04-06-2009 at 02:23 AM.
Last edited by ಠ_ಠ; 04-05-2009 at 11:41 PM.
╔╗╔══╦╗
║║║╔╗║║
║╚╣╚╝║╚╗
╚═╩══╩═╝
You're on the right track, imo, because you need 4 (or 4-1), since we have 4 letters in the set to be permuted. Your masking of bits would correspond via bit testing, to my if(base...) statements, in my program, if I get your drift right.
Which is nice, but not much different, really. We'd be checking the bits, or values, instead of checking the array of base letters, through the index variables. I don't see how that gets us out of the nested 64 (eventually), for loops, however.
What would be elegant, would be to use either of our approaches, but make it work recursively. That would be brilliant, but I wouldn't try it under any kind of time pressure.
Teiji faces a lot of work still. Not only does the program have to be expanded to handle all 64 bases, but then there's the problem of output to a file(s), and maybe writing up that DNA compression and de-compression program.
If there is an elegant way to handle this, I'd sure like to see it.
EDIT: ***Could you post it?***
just finished, does up to base 3
i'm sure it can be done better
Code:#include <stdio.h> typedef struct { unsigned int DNA : 2; }mytype; int main(int argc, char* argv[]) { mytype one; mytype two; mytype three; mytype *foo[3]; char bar[4] = "ACGT"; foo[0]=&one; foo[1]=&two; foo[2]=&three; one.DNA = two.DNA = three.DNA = 0; do{ do { do { do { printf("%c%c%c\n", bar[foo[0]->DNA], bar[foo[1]->DNA], bar[foo[2]->DNA]); foo[2]->DNA++; }while(foo[2]->DNA); foo[1]->DNA++; }while(foo[1]->DNA); foo[0]->DNA++; }while(foo[0]->DNA); }while(foo[0]->DNA && foo[1]->DNA && foo[2]->DNA); return 0; }
Last edited by ಠ_ಠ; 04-06-2009 at 12:10 AM.
╔╗╔══╦╗
║║║╔╗║║
║╚╣╚╝║╚╗
╚═╩══╩═╝
That looks very clean, d_d! Far more concise than mine, also.
Sandbagger!
I see how my program could be substantially cleaned up and shortened, as well.
I believe I'll leave that as "an exercise for the student", though. I want to dig into that maze puzzle posted up on the forum.
I will be checking back on your next version of the program of course, and I want to hear what Teiji is going to be doing with all this data!
here's an almost working recursive solution, too sleepy to finish today
Code:#include <stdio.h> typedef struct { unsigned int DNA : 2; }mytype; void printDNA(mytype *foo[], int base, int flag); int main(int argc, char* argv[]) { mytype one; mytype two; mytype three; mytype *foo[3]; foo[0]=&one; foo[1]=&two; foo[2]=&three; one.DNA = two.DNA = three.DNA = 0; printDNA(foo, 3, 0); system("pause"); return 0; } void printDNA(mytype *foo[], int base, int flag) { int temp1, temp2; char bar[4] = "ACGT"; for (temp1 = 0; temp1 < base; temp1++) { printf("%c",bar[foo[temp1]->DNA]); } printf("\n"); for (temp2 = base - 1; temp2 > -1; temp2--) { if (foo[temp2]->DNA == 3 && temp2 != 0) { foo[temp2-1]->DNA++; }else if (foo[temp2]->DNA == 3 && temp2 == 0 && flag == 0) { flag = 1; }else if (foo[temp2]->DNA == 3 && foo[temp2+1]->DNA == 3 && temp2 == 0 && flag == 1) { flag = 2; } } foo[base-1]->DNA++; if (flag != 2) { printDNA(foo, base, flag); } }
╔╗╔══╦╗
║║║╔╗║║
║╚╣╚╝║╚╗
╚═╩══╩═╝
As far as I understand, you simply want to print all permutations of fixed-length strings consisting of the letters a, c, g and t.
Here's a solution:
Compile with -lm, call with the desired base as the only argument.Code:#include <stdio.h> #include <stdlib.h> #include <string.h> #include <math.h> char *dnaconv(unsigned long num, char *str, size_t size) { char *p; if(size < 2) return NULL; memset(str, 'a', size - 1); p = &str[size - 1]; *p = '\0'; do { *--p = "acgt"[num % 4]; num /= 4; } while(num != 0); return str; } int main(int argc, char *argv[]) { int base, i, num; char *s; if(argc != 2) { fprintf(stderr, "USAGE: %s <base>\n", argv[0]); return EXIT_FAILURE; } base = atoi(argv[argc-1]); if((s = malloc(base + 1)) == NULL) { fprintf(stderr, "%s: out of memory\n", argv[0]); return EXIT_FAILURE; } num = pow(4, base); for(i = 0; i < num; i++) { printf("%s\n", dnaconv(i, s, base + 1)); } free(s); return 0; }
dnaconv() is the interesting function. It converts a number to numeral base 4 and uses the letters a, c, g, t instead of the traditional digits. The memset() is merely for padding. main() then simply counts from 0 to 4^base, converts these numbers to numeral base 4 and prints them.
Why did nobody listen to ಠ_ಠ's advice?
Greets,
Philip
All things begin as source code.
Source code begins with an empty file.
-- Tao Te Chip