Listen?
Are you joking, Phillip? I'm singing in his choir!
That doesn't mean I can program as well as the two of you.
Listen?
Are you joking, Phillip? I'm singing in his choir!
That doesn't mean I can program as well as the two of you.
I'm just a 1st year Software Engineering student, I didn't even know how to do what I suggested
╔╗╔══╦╗
║║║╔╗║║
║╚╣╚╝║╚╗
╚═╩══╩═╝
I've taken one class in BASIC, and one class in C. That was when dinosaurs ruled the earth, I think.
Adak, oh it's a pretty long assignment. I have to read an input file containing thousands of DNA strands and count the value of each case (how many times it occur for base 1, base 2, base 3, etc.), get the min and max value of the case, and write them (the value as well as the case) to an output file.
The reason why I ask for help on generating the sequence is because I would need to compare all the sequence produced to the strands of DNA in the input file.
Btw Adak, in your program, how would you save it into an array to reference later instead of printing to stdout?
Anyway, thanks for all of you guys help.
Last edited by Teiji; 04-06-2009 at 12:53 PM.
I would use Snufist's or d_d's program, although I haven't checked them out yet.
For my program, right below each print statement, put in your assignment statement into the array, and rem out the print line.
Won't the array be moot due to the huge size of the strings?
You're going to use a char array to save space, with two dimensions? That lets each string be in it's own row.
Perhaps have the base number of the string to correspond with the index number of the char array. That's how I would do it.
You'll want a small auxillary unsigned long long int array of size 4 to count up your sums.
If you know the specifics of your letters array, and want some help getting the letters into it right, let me know.
This is what I did and it works.
But if there is a better way, please point me out.Code:/* DNA.c generates the base numbers, from the set of (A,C,G,T), from 1 to 5. Adak, April, 2009 */ #include <stdio.h> #include <string.h> char base1[4][1+1]; // +1 for the string termination char char base2[16][2+1]; char base3[64][3+1]; char base4[256][4+1]; char base5[1024][5+1]; void showIt(int base); void printBase(int base); int main(void) { int base; base = 1; showIt(base); printBase(base); base = 2; showIt(base); printBase(base); base = 3; showIt(base); printBase(base); base = 4; showIt(base); printBase(base); base = 5; showIt(base); printBase(base); return 0; } void showIt(int base) { int a, b, c, d, e, i; unsigned long perm_num = 0; unsigned long perm_mill = 0; char dna[] = {"ACGT"}; char tempSequence[base+1]; // +1 for the string termination char printf("\n"); int k = 0; if(base) { for(a = 0; a < 4; a++) { if(base == 1) { sprintf(tempSequence, "%c\0", dna[a]); strcpy(base1[k],tempSequence); k++; perm_num++; } if(base > 1) { for(b = 0; b < 4; b++) { if(base == 2) { sprintf(tempSequence, "%c%c\0", dna[a], dna[b]); strcpy(base2[k],tempSequence); k++; perm_num++; } if(base > 2) { for(c = 0; c < 4; c++) { if(base == 3) { sprintf(tempSequence, "%c%c%c\0", dna[a], dna[b], dna[c]); strcpy(base3[k],tempSequence); k++; perm_num++; } if(base > 3) { for(d = 0; d < 4; d++) { if(base == 4) { sprintf(tempSequence, "%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d]); // dna[t]); strcpy(base4[k],tempSequence); k++; perm_num++; } if(base > 4) { for(e = 0; e < 4; e++) { if(base == 5) { sprintf(tempSequence, "%c%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d], dna[e]); strcpy(base5[k],tempSequence); k++; perm_num++; } } } //sprinkle this around where the permutations become > 1,000,000 if(perm_num > 1000000) { perm_num = 0; perm_mill++; } } } } } } } } } if(perm_mill) printf("Permutations found: Millions: %lu %lu \n", perm_mill, perm_num); else printf("\n Permutations found: %lu \n", perm_num); } void printBase(int base) { // TEST PRINT BASE 1-5 int i; switch(base) { case 1: for (i = 0; i < 4; i++) printf("base1[%d] = %s \n", i, base1[i]); break; case 2: for (i = 0; i < 16; i++) printf("base2[%d] = %s \n", i, base2[i]); break; case 3: for (i = 0; i < 64; i++) printf("base3[%d] = %s \n", i, base3[i]); break; case 4: for (i = 0; i < 256; i++) printf("base4[%d] = %s \n", i, base4[i]); break; case 5: for (i = 0; i < 1024; i++) printf("base5[%d] = %s \n", i, base5[i]); break; default: break; } }
Also, I want to do thru base 15, so where would I insert another function like you said?
After about 6 nested loops, I'd have this function call the next function, and repeat it with the very same local variables. Pretty much copy and paste, with just the base numbers references, direct or indirect, needing to be changed.Should I insert it at the xxx?Code:if(base > 4) { for(e = 0; e < 4; e++) { if(base == 5) { sprintf(tempSequence, "%c%c%c%c%c\0", dna[a], dna[b], dna[c], dna[d], dna[e]); strcpy(base5[k],tempSequence); k++; perm_num++; } xxx } }
Last edited by Teiji; 04-06-2009 at 04:09 PM.
You won't insert another function, you just call the 2nd function, and that would be the right spot.
Before you add another function, STOP!
Get one function working *exactly* as you want it, because later, when you have more functions, it's tedious to make code corrections!
I haven't learn about malloc yet, but that sounds like a nice function. Since I feel tired of creating a 2d array for each base.I suggest using one 2D char array[][]. and have malloc set aside the memory for each
row - which is great because each row will be a completely different length.
This reminds me of a guitar player's app I wrote years ago in AMOS on the Amiga. One feature was that it would calculate every possible fingering of a chord given a certain number of strings with certain notes upon a certain range of frets. I ended up writing a nested loop for each number of strings: a 6 nested loop for 6 strings, 5 for 5 and so on. The innermost loop constructed text strings of digit combinations based on the current state of the loops.
What a monstrosity, though it worked. Yet at the time I remember wishing dearly that someone would explain to me how to do it generically for n strings without having to construct a separate loop structure for each n.
This is how to use calloc, which is just like malloc, except it sets the memory being allocated to 0, for you. You have to tell it what base number system you're using, atm.
(that's what the 10 is in the calloc parameters).
The only decent way I see to use this program, is to write out the base letters being generated, to a file, and then read them back in from the file, into the allocated array.Code://add this include file: #include <stdlib.h> //for calloc int main(void) { int i, base; unsigned long int j; char **bases; if((bases = calloc(15 * sizeof(char*), 10)) == NULL) { //pointers to char printf("bases 1d memory allocation failed - aborting\n"); return 1; } printf("\n\n\n"); for(i = 0, j = 1; i < 6; i++) { // j *= 4; if((bases[i] = calloc(j+1 * 1, 10)) == NULL) { //sizeof(char) is always 1 printf("bases 2d memory allocation failed - aborting\n"); return 1; } bases[i][j] = '\0'; //set the end of string char printf("%lu \n", j); /* for debug only: if(bases[i] == NULL) break; */ } //for freeing up the memory when it's not going to be used anymore: for(i = 0; i < 15; i++) //free the 2nd dimension free(bases[i]); free(bases); //then the 1st dimension
I knew you would not be able to put these huge bases into an array, so I didn't structure the program for that.
Even base 15 will have 1,073,nnn,nnn char's in it. < eek! >
Since you're stopping at base 15, the output can all fit into one file.
Last edited by Adak; 04-06-2009 at 10:41 PM.
This is the latest version. It puts the data into a file named "bases.txt", and then loads them all up into the bases 2D array. Finally, it prints out the bases array.
This runs ok - the variable "total" in loadIt(), will have to be reduced by nesting another for loop in loadIt() where it's used. Otherwise, it may overflow.Code:/* DNA.c generates the base letters, from the set of (a,c,g,t), from 1 to 5. Adak, April, 2009 Status: improving */ #include <stdio.h> #include <stdlib.h> //for calloc #include <math.h> //for pow() void writeIt(FILE *pfile, int base); void loadIt(FILE *pfile, int base, char **bases); unsigned long perm_num = 0; unsigned long perm_mill = 0; char dna[] = {"ACGT"}; int main(void) { FILE *pfile; int i, base, gar; unsigned long int j; char **bases; if((bases = calloc(15 * sizeof(char*), 10)) == NULL) { //pointers to char printf("bases 1d memory allocation failed - aborting\n"); return 1; } printf("\n\n\n"); for(i = 0, j = 1; i < 6; i++) { // j *= 4; if((bases[i] = calloc(j+1 * 1, 10)) == NULL) { //sizeof(char) is always 1 printf("bases 2d memory allocation failed - aborting\n"); return 1; } bases[i][j] = '\0'; //set the end of string char printf("%lu \n", j); if(bases[i] == NULL) break; } //at i = 15, string length will be: 1,073,741,824 char's! if((pfile = fopen("bases.txt", "wt")) == NULL) { printf("\nError opening bases.txt file - terminating\n"); exit(1); } for(base = 1; base < 6; base++) writeIt(pfile, base); fclose(pfile); pfile = fopen("bases.txt", "rt"); for(base = 1; base < 4; base++) loadIt(pfile, base, bases); printf("\n\n"); for(i = 0; i < 4; i++) { printf("\nBase %d: %s\n", i, bases[i]); gar++; } for(i = 0; i < 15; i++) //free the 2nd dimension free(bases[i]); free(bases); //then the 1st dimension fclose(pfile); if(perm_mill) printf("\n\nPermutations found: Millions: %lu %lu", perm_mill, perm_num); else printf("\n\nPermutations found: %lu ", perm_num); printf("\n\n\t\t\t press enter when ready "); i = getchar(); i++; return 0; } void loadIt(FILE *pfile, int base, char **bases) { unsigned long total = base * pow(4,base); unsigned int i; for(i = 0; i < total; i+=base) { fscanf(pfile, "%s", &bases[base][i]); //odd, but it orksw :) //gar++; //for debug only } } void writeIt(FILE *pfile, int base) { int a, b, c, d, e, gar; printf("\n"); for(a = 0; a < 4; a++) { if(base == 1) { //printf("%c ", dna[a]); fprintf(pfile, "%c\n", dna[a]); perm_num++; } if(base > 1) { for(b = 0; b < 4; b++) { if(base == 2) { //printf("%c%c ", dna[a], dna[b]); fprintf(pfile, "%c%c\n", dna[a], dna[b]); perm_num++; } if(base > 2) { for(c = 0; c < 4; c++) { if(base == 3) { //printf("%c%c%c ", dna[a], dna[b], dna[c]); fprintf(pfile, "%c%c%c\n", dna[a], dna[b], dna[c]); perm_num++; } if(base > 3) { for(d = 0; d < 4; d++) { if(base == 4) { //printf("%c%c%c%c ", dna[a], dna[b], dna[c], dna[d]); // dna[t]); fprintf(pfile, "%c%c%c%c\n", dna[a], dna[b], dna[c], dna[d]); perm_num++; } if(base > 4) { for(e = 0; e < 4; e++) { if(base == 5) { //printf("%c%c%c%c%c ", dna[a], dna[b], dna[c], dna[d], dna[e]); fprintf(pfile, "%c%c%c%c%c\n", dna[a],dna[b],dna[c],dna[d],dna[e]); perm_num++; } // if(perm_num % 220 == 0) { gar = getchar(); gar++; } }//end of for }//end of if //sprinkle this around where the permutations become > 1,000,000 if(perm_num > 1000000) { perm_num = 0; perm_mill++; } } } } } } } } }
Does it run ok for you?
Maybe I'm missing something or there's more detail in the assignment, but I'm not seeing a need to generate every possible combination of all the bases from 1-15 here. What you want is to go though each input file character by character. For each character, grab the N characters needed to make up the base you're looking at and increment a counter for that particular case. Unless you're making a list of valid and invalid sequences to filter against while you're collecting stats, I don't see any reason to know beforehand what you might see in the input file. The input file itself will tell you that (aside from some basic validation that there's nothing aside from ACGT in the data). Hopefully that makes sense.
You could do most of this with 2 functions similar to the one posted by Snafuist. You'd need something to convert a string of a certain length into a number, and another function to convert that number back into the string. Each base listed can be encoded using 2 bits, so you could store each case up to base 16 in a 32-bit integer. You'd have a hash table for each case which stored the counts for everything you've seen so far.
dna_string_to_number() is the reverse of Snafuist's code. It takes a string and a case number and returns an integer, 2 bits per input character.Code:read contents of input file into file_buf const char *cp; int case; for (case = 1; case <= 15; case++) { for (cp = file_buf; *(cp + case - 1); cp ++) { unsigned int encoded_sequence = dna_string_to_number(cp, case); hash_increment(hash_map[case], encoded_sequence); } }
hash_increment will increment the count for the key encoded_sequence in the hash table hash_map. Basically, you'll be doing hash_map[case][encoded_sequence] += 1. The complication here is that the range of encoded_sequence is 4^max_base = lots of entries for the larger cases. Unless you've got a huge file with lots of different sequences, most of these entries will be 0, so it's more efficient to store them as a list of <key, count> pairs instead of just using a huge array. This could be an array holding a bunch of <key, count> pairs that gets dynamically resized as needed, or it could be a linked list or tree.
When you're done, hash_map[case] holds the count of each sequence encountered for that particular sequence length. Get the keys from the hash for a list of sequences seen. For each key, use a dna_number_to_string() function to print out the base, and the count for that key is the number of times you've seen it. You can keep track of max and min here as well and print that out once you've finished the loop, or you can do several passes over the keys of the hash - whatever works best for you.
Depending on file size you might also need to read the file in chunks instead of all at once. That will complicate things a bit but not too much.
Come to think of it, this might be the perfect application for a Trie. You'd be using one single Trie for all of the cases, and the depth of a node in the Trie would tell you the length of the sequence you were looking at. This would eliminate the need to have anything other than a base 1 covert to/from number function.
KC, you seem to have a real insight into what Teiji is trying to do. I hope you'll stick around and join into this thread, frequently.
I don't follow how you can encode a base with hundreds or thousands of letters, into just two bits. Can you post a small example of that with say, 16 letters?