bioinformatics processing file assoc

**zcrself** · 08-12-2009

Now there are two files:
file1:
cigar::50 HWI-EAS-249_35:6:1:6:1154#0/2 76 32 - chromosome06 8365857 8365901 +
....
....
....

file2:
@HWI-EAS-249_35:6:1:6:1154#0/2
GGGGGGCTGAGAAGGTTGAGACAAGTAAGGTATTTCTACGTGATACTAGTGTTATTTCTCCTTACTCGCTCCTTCT
+
BBBBBC@CC@B?@C@;A>:<58@><4;>>@*8@AB;B;6>7>3=66?315<8@8@@?/>47?;88@%%%%%%%%%%
....
....
....
....

Each of these two files has more than 3,000,000 lines. I want to produce output with the following structure:

HWI-EAS-249_35:6:1:6:1154#0/2 76 32 - chromosome06 8365857 8365901 + GGGGGGCTGAGAAGGTTGAGACAAGTAAGGTATTTCTACGTGATACTAGTGTTATTTCTCCTTACTCGCTCCTTCT
.....
......
.....

I want to search for sequences by keyword.

My idea is to build a hash table to store the data so that, given a keyword, I can find its sequence quickly.

Is there a better method for this job?

The keyword is the second column of every line of file1.
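
To make the keyword concrete, here is a minimal C sketch of pulling the key out of a line from each file. It assumes fields are separated by spaces or tabs, that the file1 key is the second column, and that a file2 record header is the same key prefixed with `@`. The function names `get_field`, `file1_key`, and `file2_key` are made up for this example.

```c
#include <string.h>

/* Copy the whitespace-delimited field number `field` (0-based) of `line`
 * into `out` (capacity `cap`). Returns 0 on success, -1 if the field is
 * missing or does not fit. */
int get_field(const char *line, int field, char *out, size_t cap)
{
    const char *p = line;
    for (;;) {
        p += strspn(p, " \t\r\n");          /* skip separators */
        if (*p == '\0')
            return -1;                      /* ran out of fields */
        size_t len = strcspn(p, " \t\r\n"); /* length of this field */
        if (field == 0) {
            if (len >= cap)
                return -1;
            memcpy(out, p, len);
            out[len] = '\0';
            return 0;
        }
        p += len;
        field--;
    }
}

/* Key of a file1 line: the second column (assumption from the post). */
int file1_key(const char *line, char *out, size_t cap)
{
    return get_field(line, 1, out, cap);
}

/* Key of a file2 header line: the text after the leading '@'. */
int file2_key(const char *line, char *out, size_t cap)
{
    if (line[0] != '@')
        return -1;
    return get_field(line + 1, 0, out, cap);
}
```

Note that a quality line in file2 can itself begin with `@`, so a reader built on this would probably need to count lines (four per record, going by the sample) rather than test every line for a leading `@`.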

**Dino** · 08-12-2009

Doesn't seem too difficult.

Besides the keyword starting in col 2 for every record in file 2 (I think you meant file 2, but you said file 1), what else do we know about the record layouts?

What have you coded so far?

**zcrself** · 08-17-2009

For example:

Code:

file1:
1 ss tt
2 ss ss
3 e f

file2:
1 8
2 7
3 55

I want to create file3 as follows:
1 ss tt 8
2 ss ss 7
3 e f 55
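
If both files really do list the same keys in the same order, as in this small example, the combination can be done in one pass without any lookup structure. The following is only a sketch under that assumption; `MAX_LINE`, `key_len`, and `paste_by_key` are names invented here, and lines longer than `MAX_LINE` are not handled.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024  /* assumed upper bound on line length */

/* Length of the first whitespace-delimited column of `s`. */
size_t key_len(const char *s)
{
    return strcspn(s, " \t\r\n");
}

/* Read f1 and f2 in lockstep and write each file1 line followed by the
 * non-key part of the matching file2 line. Assumes both files contain the
 * same keys in the same order. Returns the number of lines written, or -1
 * on the first key mismatch. */
long paste_by_key(FILE *f1, FILE *f2, FILE *out)
{
    char a[MAX_LINE], b[MAX_LINE];
    long n = 0;

    while (fgets(a, sizeof a, f1) && fgets(b, sizeof b, f2)) {
        size_t ka = key_len(a), kb = key_len(b);
        if (ka != kb || memcmp(a, b, ka) != 0)
            return -1;                   /* files are out of step */

        a[strcspn(a, "\r\n")] = '\0';    /* strip line endings */
        b[strcspn(b, "\r\n")] = '\0';

        const char *rest = b + kb;
        rest += strspn(rest, " \t");     /* skip separator after the key */
        fprintf(out, "%s %s\n", a, rest);
        n++;
    }
    return n;
}
```

If the order is not guaranteed, this approach does not apply as written and some keyed lookup is needed instead.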

**Dino** · 08-17-2009

I think we understand you have 2 files and you want to read a record (or more?) from each file, combine them into a single long record and write that to a 3rd file.

What are the rules for parsing? Column positions? How many lines need to be read from each file to make it the full output line?

What have you coded so far?

I'm pulling teeth here.

**zcrself** · 08-17-2009

The first column of both files is the keyword!
I want to combine the two files according to the first column.

Each file has nearly 400000000 lines.

I have not written any code for this job yet.

Code:

file1:
HWI-EAS-249_35:6:1:6:1153#0/2 76 32 - chromosome01 8365857 8365911 +
HWI-EAS-249_35:6:1:6:1154#0/2 76 32 - chromosome03 8365837 8365901 +
HWI-EAS-249_35:6:1:6:1155#0/2 76 32 - chromosome06 8365351 8365902 +
.......

file2:

HWI-EAS-249_35:6:1:6:1153#0/2   TTTTTCCCTAGGGAATACCGTCATAAAATTCATGTCAGCCTGTCTCAACTATCAAGATAA
HWI-EAS-249_35:6:1:6:1154#0/2   CCAAGTGCCTCTGTTAGTATTTTGGTGAATTGACTTATTTATAACTCATATTTCAAAGTT
HWI-EAS-249_35:6:1:6:1155#0/2   AAACCCAGTAAATTAAACGGTTCAATCATTCTATTGTGGTTCTTTTCGTTATGCTGTACA
.........



The first column is the keyword.

**sean** · 08-17-2009

Well the topics you'll want to learn about are string parsing, and data structures. It's hard for us to know what you mean by keyword, as all the files you've posted are completely different in structure, and there seem to be no comprehensible 'columns'. We can't write the program for you - all we can do is tell you to read the data in from a file, break it into a structure, and then do whatever you want with the data. Without more details, we really can't help you with anything. Why don't you try writing part of the program, and if you run into specific problems, you can post what you have and your question.

**zcrself** · 08-17-2009

My question is which data structure is best: a hash table, structs, or something else?

Code:

if "the first column  of the file2" == "the first column of the file1",combine them

**sean** · 08-18-2009

If a certain value only appears once in each column, then the simplest method is just 2 arrays of strings. If a value in one column may have multiple corresponding values in the other column, then a hash table would be better (remember you'll have to implement it yourself in C, though). If both columns have values that may correspond to many values in the other column, you can still use arrays of strings, but search efficiency would be low. In that case what I would do is assign unique "index" numbers to all possible values in each column, and maintain arrays listing the corresponding pairs, much like one would in a relational database.
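
Since C has no built-in hash table, a hand-rolled one might look roughly like the sketch below. It uses separate chaining and the well-known djb2 string hash; both are choices made for this example, not something prescribed in the thread, and every name in it (`struct table`, `table_new`, `table_put`, `table_get`, `table_free`) is hypothetical. It also assumes keys are unique, so `table_put` does not look for an existing entry.

```c
#include <stdlib.h>
#include <string.h>

/* One key/value pair in a bucket's linked list. */
struct entry {
    char *key;
    char *value;
    struct entry *next;
};

struct table {
    struct entry **buckets;
    size_t nbuckets;
};

/* djb2 string hash: a common simple choice, not tuned for this data. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Portable stand-in for POSIX strdup. */
static char *copy_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p)
        memcpy(p, s, n);
    return p;
}

struct table *table_new(size_t nbuckets)
{
    if (nbuckets == 0)
        return NULL;
    struct table *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    if (!t->buckets) {
        free(t);
        return NULL;
    }
    t->nbuckets = nbuckets;
    return t;
}

/* Insert a pair; returns 0 on success, -1 on allocation failure. */
int table_put(struct table *t, const char *key, const char *value)
{
    struct entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->key = copy_str(key);
    e->value = copy_str(value);
    if (!e->key || !e->value) {
        free(e->key);
        free(e->value);
        free(e);
        return -1;
    }
    size_t i = hash_str(key) % t->nbuckets;
    e->next = t->buckets[i];          /* push onto the front of the chain */
    t->buckets[i] = e;
    return 0;
}

/* Return the value stored for key, or NULL if the key is absent. */
const char *table_get(const struct table *t, const char *key)
{
    size_t i = hash_str(key) % t->nbuckets;
    for (struct entry *e = t->buckets[i]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}

void table_free(struct table *t)
{
    for (size_t i = 0; i < t->nbuckets; i++) {
        struct entry *e = t->buckets[i];
        while (e) {
            struct entry *next = e->next;
            free(e->key);
            free(e->value);
            free(e);
            e = next;
        }
    }
    free(t->buckets);
    free(t);
}
```

The bucket count is fixed at creation here. With hundreds of millions of keys it would have to be chosen large up front, and memory use would likely be the real constraint.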

**zcrself** · 08-18-2009

I see!

A given value appears only once in each column!
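
With unique keys, one way to read the "arrays of strings" suggestion is to keep the records of one file in an array, sort it by key, and binary-search it for each key from the other file. The sketch below uses the standard `qsort` and `bsearch`; the `struct record` layout and the function names are assumptions made for illustration, and it presumes the whole array fits in memory.

```c
#include <stdlib.h>
#include <string.h>

/* One record: the key (first column) and the rest of the line. */
struct record {
    const char *key;
    const char *value;
};

/* Comparison callback shared by qsort and bsearch: orders by key. */
int record_cmp(const void *a, const void *b)
{
    const struct record *ra = a, *rb = b;
    return strcmp(ra->key, rb->key);
}

/* Sort records by key so that they can be binary-searched. */
void records_sort(struct record *recs, size_t n)
{
    qsort(recs, n, sizeof *recs, record_cmp);
}

/* Look up key in a sorted array; returns its value or NULL if absent. */
const char *records_find(const struct record *recs, size_t n, const char *key)
{
    struct record probe = { key, NULL };
    const struct record *hit =
        bsearch(&probe, recs, n, sizeof *recs, record_cmp);
    return hit ? hit->value : NULL;
}
```

Each lookup then costs O(log n) string comparisons instead of a linear scan.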

**lizz** · 09-28-2010

Hello everyone. I have some views on bioinformatics: it is actually an application of biotechnology. It is in very great demand. Many pharmaceutical companies require candidates in this field.

**Adak** · 09-28-2010

@lizz: Welcome to the forum, Lizz!

@zcrself:

That's a LOT of data - are you sure you don't want a professionally designed database for all that data?

Why doesn't MySQL interest you?

**laserlight** · 09-28-2010

Originally Posted by lizz

Hello everyone. I have some views on bioinformatics: it is actually an application of biotechnology. It is in very great demand. Many pharmaceutical companies require candidates in this field.

Great to hear that. However, you have added nothing of value to a thread in which the last post was made a year ago.

*thread closed*
