Thread: bioinformatics processing file assoc

  1. #1
    Registered User
    Join Date
    Aug 2009
    Posts
    168

    bioinformatics processing file assoc

    Now there are two files:
    file1:
    cigar::50 HWI-EAS-249_35:6:1:6:1154#0/2 76 32 - chromosome06 8365857 8365901 +
    ....
    ....
    ....

    file2:
    @HWI-EAS-249_35:6:1:6:1154#0/2
    GGGGGGCTGAGAAGGTTGAGACAAGTAAGGTATTTCTACGTGATACTAGT GTTATTTCTCCTTACTCGCTCCTTCT
    +
    BBBBBC@CC@B?@C@;A>:<58@><4;>>@*8@AB;B;6>7>3=66?315 <8@8@@?/>47?;88@%%%%%%%%%%
    ....
    ....
    ....
    ....

    Each of these two files has more than 3,000,000 lines. I want to produce output with the following structure:

    HWI-EAS-249_35:6:1:6:1154#0/2 76 32 - chromosome06 8365857 8365901 + GGGGGGCTGAGAAGGTTGAGACAAGTAAGGTATTTCTACGTGATACTAGT GTTATTTCTCCTTACTCGCTCCTTCT
    .....
    ......
    .....

    so that I can look up a sequence by its keyword.

    My idea is to build a hash table to store the data, so that given a keyword I can find its sequence quickly.

    Is there a better way to do this job?

    The keyword is the second column of every line in file1.
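
    For the file2 layout shown above (an id line starting with @, the bases, a + line, then the quality string), reading one record at a time could look roughly like the sketch below. It assumes every record is exactly four lines and that each line fits in the buffers; the function name read_fastq_record and the buffer sizes are made up for illustration and are not from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Remove a trailing newline (and carriage return) in place. */
static void chomp(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}

/*
 * Read one 4-line record (@id, sequence, +, quality) from `in`.
 * Copies the id (without the leading '@') into `id` and the bases into `seq`.
 * Returns 1 on success, 0 at end of file, -1 on a malformed record.
 * Assumption: lines fit in the buffers; an over-long line makes the
 * record look malformed.
 */
int read_fastq_record(FILE *in, char *id, size_t id_cap,
                      char *seq, size_t seq_cap)
{
    char header[512], plus[512], qual[512];

    assert(in != NULL && id != NULL && seq != NULL && seq_cap > 0);

    if (fgets(header, sizeof header, in) == NULL)
        return 0;                       /* clean end of file */
    if (fgets(seq, (int)seq_cap, in) == NULL ||
        fgets(plus, sizeof plus, in) == NULL ||
        fgets(qual, sizeof qual, in) == NULL)
        return -1;                      /* truncated record */

    chomp(header);
    chomp(seq);
    if (header[0] != '@' || plus[0] != '+')
        return -1;                      /* not the expected layout */
    if (strlen(header + 1) >= id_cap)
        return -1;                      /* id would not fit */

    strcpy(id, header + 1);             /* drop the leading '@' */
    return 1;
}
```

    Calling this in a loop until it returns 0 yields one (id, sequence) pair per record, which is the form the later posts in this thread work with.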

  2. #2
    Jack of many languages Dino
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Doesn't seem too difficult.

    Besides the keyword starting in col 2 for every record in file 2 (I think you meant file 2, but you said file 1), what else do we know about the record layouts?

    What have you coded so far?
    Mainframe assembler programmer by trade. C coder when I can.

  3. #3
    Registered User
    Join Date
    Aug 2009
    Posts
    168
    Quote Originally Posted by Dino View Post
    Doesn't seem too difficult.

    Besides the keyword starting in col 2 for every record in file 2 (I think you meant file 2, but you said file 1), what else do we know about the record layouts?

    What have you coded so far?
    For example:
    Code:
    file1:
    1 ss tt
    2 ss ss
    3 e f
    
    file2:
    1 8
    2 7
    3 55
    
    I want to create file3 as follows:
    1 ss tt 8
    2 ss ss 7
    3 e f 55

  4. #4
    Jack of many languages Dino
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    I think we understand you have 2 files and you want to read a record (or more?) from each file, combine them into a single long record and write that to a 3rd file.

    What are the rules for parsing? Column positions? How many lines need to be read from each file to make it the full output line?

    What have you coded so far?

    I'm pulling teeth here.
    Mainframe assembler programmer by trade. C coder when I can.

  5. #5
    Registered User
    Join Date
    Aug 2009
    Posts
    168
    Quote Originally Posted by Dino View Post
    I think we understand you have 2 files and you want to read a record (or more?) from each file, combine them into a single long record and write that to a 3rd file.

    What are the rules for parsing? Column positions? How many lines need to be read from each file to make it the full output line?

    What have you coded so far?

    I'm pulling teeth here.
    The first column of both files is the keyword.
    I want to combine the two files by matching on the first column.

    Each file has nearly 400,000,000 lines.

    I have not written any code for this yet.

    Code:
    file1:
    HWI-EAS-249_35:6:1:6:1153#0/2 76 32 - chromosome01 8365857 8365911 +
    HWI-EAS-249_35:6:1:6:1154#0/2 76 32 - chromosome03 8365837 8365901 +
    HWI-EAS-249_35:6:1:6:1155#0/2 76 32 - chromosome06 8365351 8365902 +
    .......
    
    file2:
    
    HWI-EAS-249_35:6:1:6:1153#0/2   TTTTTCCCTAGGGAATACCGTCATAAAATTCATGTCAGCCTGTCTCAACTATCAAGATAA
    HWI-EAS-249_35:6:1:6:1154#0/2   CCAAGTGCCTCTGTTAGTATTTTGGTGAATTGACTTATTTATAACTCATATTTCAAAGTT
    HWI-EAS-249_35:6:1:6:1155#0/2   AAACCCAGTAAATTAAACGGTTCAATCATTCTATTGTGGTTCTTTTCGTTATGCTGTACA
    .........
    
    
    
    The first column is the keyword.
    Last edited by zcrself; 08-17-2009 at 10:16 PM.

  6. #6
    Registered User
    Join Date
    Sep 2001
    Posts
    4,912
    Well the topics you'll want to learn about are string parsing, and data structures. It's hard for us to know what you mean by keyword, as all the files you've posted are completely different in structure, and there seem to be no comprehensible 'columns'. We can't write the program for you - all we can do is tell you to read the data in from a file, break it into a structure, and then do whatever you want with the data. Without more details, we really can't help you with anything. Why don't you try writing part of the program, and if you run into specific problems, you can post what you have and your question.

  7. #7
    Registered User
    Join Date
    Aug 2009
    Posts
    168
    Quote Originally Posted by sean View Post
    Well the topics you'll want to learn about are string parsing, and data structures. It's hard for us to know what you mean by keyword, as all the files you've posted are completely different in structure, and there seem to be no comprehensible 'columns'. We can't write the program for you - all we can do is tell you to read the data in from a file, break it into a structure, and then do whatever you want with the data. Without more details, we really can't help you with anything. Why don't you try writing part of the program, and if you run into specific problems, you can post what you have and your question.
    My question is which data structure is best: a hash table, structs, or something else?

    Code:
    if "the first column of file2" == "the first column of file1", combine the two lines
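
    Applied line by line, the comparison above might look like the sketch below. It assumes both files list the same keys in the same order (as the samples earlier in the thread happen to do), that the key is everything up to the first space or tab, and that no line is longer than the buffer. The name merge_in_lockstep and the buffer size are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 1024   /* illustrative; longer lines are not handled */

/*
 * Merge two files that are already in the same key order.
 * file1 line: "<key> <other fields...>"; file2 line: "<key> <sequence>".
 * Writes "<file1 line> <sequence>" to `out` for each pair of lines.
 * Returns the number of merged lines, or -1 if the keys ever disagree
 * or one file ends before the other.
 */
long merge_in_lockstep(FILE *f1, FILE *f2, FILE *out)
{
    char a[LINE_MAX_LEN], b[LINE_MAX_LEN];
    long merged = 0;

    assert(f1 != NULL && f2 != NULL && out != NULL);

    for (;;) {
        char *ra = fgets(a, sizeof a, f1);
        char *rb = fgets(b, sizeof b, f2);
        size_t key_a, key_b;
        const char *seq;

        if (ra == NULL && rb == NULL)
            return merged;              /* both files ended together */
        if (ra == NULL || rb == NULL)
            return -1;                  /* one file is shorter */

        a[strcspn(a, "\r\n")] = '\0';
        b[strcspn(b, "\r\n")] = '\0';

        key_a = strcspn(a, " \t");      /* length of the key in file1 */
        key_b = strcspn(b, " \t");      /* length of the key in file2 */
        if (key_a != key_b || strncmp(a, b, key_a) != 0)
            return -1;                  /* files are not in the same order */

        /* Skip the whitespace between the key and the sequence. */
        seq = b + key_b + strspn(b + key_b, " \t");
        fprintf(out, "%s %s\n", a, seq);
        merged++;
    }
}
```

    This keeps only two lines in memory at a time, which matters at hundreds of millions of lines. If the files are not in the same order, a lookup structure or an external sort would be needed first.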

  8. #8
    Registered User
    Join Date
    Sep 2001
    Posts
    4,912
    If a certain value only appears once in each column, then the simplest method is just 2 arrays of strings. If a value in one column may have multiple corresponding values in the other column, then a hashtable would be better (remember you'll have to implement it yourself in C, though). If both columns have values that may correspond to many values in the other column, you can still do arrays of strings, but search efficiency would be low. In that case what I would do is assign unique "index" numbers to all possible values in each column, and maintain arrays listing the corresponding pairs, much like one would in a relational database.
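
    Since C has no built-in hashtable, here is a minimal sketch of one with separate chaining, mapping a string key to a string value. Everything in it (the struct names, table_create, table_put, table_get, table_free, and the choice of the 32-bit FNV-1a hash) is an assumption made for illustration. For simplicity it stores one value per key; a one-to-many mapping would keep a list of values in each entry instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One key/value pair; colliding keys are chained in a linked list. */
struct entry {
    char *key;
    char *value;
    struct entry *next;
};

struct table {
    struct entry **buckets;
    size_t nbuckets;
};

/* 32-bit FNV-1a string hash. */
static uint32_t hash_key(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Duplicate a string with malloc; returns NULL on allocation failure. */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

struct table *table_create(size_t nbuckets)
{
    struct table *t;
    assert(nbuckets > 0);
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    if (t->buckets == NULL) {
        free(t);
        return NULL;
    }
    t->nbuckets = nbuckets;
    return t;
}

/* Store copies of key and value. Returns 0 on success, -1 on failure.
   Does not check whether the key is already present. */
int table_put(struct table *t, const char *key, const char *value)
{
    size_t i = hash_key(key) % t->nbuckets;
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = copy_string(key);
    e->value = copy_string(value);
    if (e->key == NULL || e->value == NULL) {
        free(e->key);
        free(e->value);
        free(e);
        return -1;
    }
    e->next = t->buckets[i];            /* prepend to the bucket's chain */
    t->buckets[i] = e;
    return 0;
}

/* Return the value stored for key, or NULL if the key is absent. */
const char *table_get(const struct table *t, const char *key)
{
    const struct entry *e = t->buckets[hash_key(key) % t->nbuckets];
    for (; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}

void table_free(struct table *t)
{
    size_t i;
    for (i = 0; i < t->nbuckets; i++) {
        struct entry *e = t->buckets[i];
        while (e != NULL) {
            struct entry *next = e->next;
            free(e->key);
            free(e->value);
            free(e);
            e = next;
        }
    }
    free(t->buckets);
    free(t);
}
```

    Typical use would be to call table_put once per line of one file, then table_get for each key read from the other. The bucket count should be on the order of the number of keys, otherwise the chains get long and lookups slow down.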

  9. #9
    Registered User
    Join Date
    Aug 2009
    Posts
    168
    Quote Originally Posted by sean View Post
    If a certain value only appears once in each column, then the simplest method is just 2 arrays of strings. If a value in one column may have multiple corresponding values in the other column, then a hashtable would be better (remember you'll have to implement it yourself in C, though). If both columns have values that may correspond to many values in the other column, you can still do arrays of strings, but search efficiency would be low. In that case what I would do is assign unique "index" numbers to all possible values in each column, and maintain arrays listing the corresponding pairs, much like one would in a relational database.
    I see!

    Each value appears only once in each column.
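
    With unique keys, the arrays-of-strings approach can be kept fast by sorting once and then using binary search. That refinement is not something proposed in the thread; it is sketched here with the standard qsort and bsearch functions. The struct pair layout and the names sort_pairs and find_sequence are hypothetical, and the sketch assumes all pairs fit in memory, which may not hold for hundreds of millions of lines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One row of file2: the key and its sequence. */
struct pair {
    const char *key;
    const char *seq;
};

/* Comparison callback shared by qsort and bsearch: order pairs by key. */
static int compare_pairs(const void *a, const void *b)
{
    const struct pair *pa = a;
    const struct pair *pb = b;
    return strcmp(pa->key, pb->key);
}

/* Sort the array once so that every later lookup is a binary search. */
void sort_pairs(struct pair *pairs, size_t n)
{
    assert(pairs != NULL);
    qsort(pairs, n, sizeof *pairs, compare_pairs);
}

/* Return the sequence stored for key, or NULL if the key is absent.
   The array must already be sorted with sort_pairs. */
const char *find_sequence(const struct pair *pairs, size_t n, const char *key)
{
    struct pair probe;
    const struct pair *hit;

    probe.key = key;
    probe.seq = NULL;
    hit = bsearch(&probe, pairs, n, sizeof *pairs, compare_pairs);
    return hit != NULL ? hit->seq : NULL;
}
```

    Each lookup then costs about log2(n) string comparisons instead of a scan over the whole array.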

  10. #10
    Registered User
    Join Date
    Sep 2010
    Posts
    1

    introduction

    Hello everyone. I have some views on bioinformatics: it is actually an application of biotechnology, and it is in very great demand. Many pharmaceutical companies require candidates for this field.

  11. #11
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    @lizz: Welcome to the forum, Lizz!

    @zcrself:

    That's a LOT of data - are you sure you don't want a professionally designed database for all that data?

    Why doesn't MySQL interest you?

  12. #12
    C++ Witch laserlight
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by lizz
    Hello everyone. I have some views on bioinformatics: it is actually an application of biotechnology, and it is in very great demand. Many pharmaceutical companies require candidates for this field.
    Great to hear that. However, you have added nothing of value to a thread in which the last post was made a year ago.

    *thread closed*
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way
