from perl to C

**prettydainty** · 01-04-2009

Dear all,
I am a newbie to C, and I very much want to learn it. I know the theory part of C, but I'm not so good when it comes to writing real-world programs in C. I'm good at Perl, and the program below works fine on a small file, but not on a 10 GB file. So I felt that writing the program in C would help. Please help me solve the problem, and also to learn C. Here is my problem.

I am comparing two files. I take the left and right lengths from file 2, and extract that many numbers from the beginning (left value) and end (right value) of each string of numbers in file 1, matching records by their unique ID, which starts with '>'.

Code:

file1:

>AAAT3R length=110
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 38 40 40 39 40 
40 40 40 40 40 40 40 40 40 38 38 37 39 36 36 40 36 35 35 35 38 40 
35 35 33 35 35 35 40 40 40 40 37 37 38 38 38 40 40 40 40 40 40 40 
40 40 40 40 40 40 40 40 40 37 36 36 31 22 22 22 20 20 20 20 20 14
>AAA2OJ length=70
18 18 18 21 35 35 35 32 32 32 33 35 38 39 37 37 39 39 39 39 39 40 
40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 
40 39 39 37 35 35 39 37 37 37 37 37 37 37 37 37 37 33 32 32 30 20 
17 17 17 0

file2:

>AAAT3R_left length=6
TACATA
>AAAT3R_right length=62
ACTACTGATTTGATTATCTTTGATCTCTGTCGAACTAACTATATCTTAGTATGATCTTTAAT
>AAA2OJ_left length=14
TTTTGGACTATCTG
>AAA2OJ_right length=14
AGGCTGTTCTTTTN

result file(expected)

>AAAT3R_left length=6
40 40 40 40 40 40
>AAAT3R_right length=62
40 40 40 38 38 37 39 36 36 40 36 35 35 35 38 40 35 35 33 35 35 35 40 40
40 40 37 37 38 38 38 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 37
36 36 31 22 22 22 20 20 20 20 20 14
>AAA2OJ_left length=14
18 18 18 21 35 35 35 32 32 32 33 35 38 39
>AAA2OJ_right length=14
37 37 37 37 37 33 32 32 30 20 17 17 17 0

This is the code I have written so far to get the desired output.

#!/usr/bin/perl -w
use strict;

our ($File1, $File2) = qw/file1 file2/;
open File1 or die "$File1: $!\n";
open File2 or die "$File2: $!\n";

my ($key, %results);
while (<File1>){
    next if /^\s*$/;
    chomp;
    if (/^>\s*(\S+)/){
        $key = $1;
    }
    else {
        $results{$key} = [ split ];
    }
}
close File1;

my ($len, $side, $str);
while (<File2>){
    next if /^\s*$/;
    if (/^>([^_]+)_(left|right).*?(\d+)\s*$/){
        print;
        $str = $1;
        $side = $2;
        $len = $3;
    }
    else {
        my @list;
        @list = @{$results{$str}};
        if ($side eq 'left'){
            die "$str is too short for a left slice of $len!\n"
            unless @list >= $len;
            print "@list[0..$len-1]\n";
        }
        else {
            die "$str is too short for a right slice of $len!\n"
            unless @list >= $len;
            print "@list[@list-$len..$#list]\n";
        }
    }
}
close File2;

Please help me do this in C.

**Adak** · 01-04-2009

Originally Posted by prettydainty

Dear all,
I am a newbie to C, and I very much want to learn it. I know the theory part of C, but I'm not so good when it
comes to writing real-world programs in C. I'm good at Perl, and the program below works fine on a small file,
but not on a 10 GB file. So I felt that writing the program in C would help. Please help me solve the problem,
and also to learn C. Here is my problem.

I am comparing two files. I take the left and right lengths from file 2, and extract that many
numbers from the beginning (left value) and end (right value) of each string of numbers in file 1,
matching records by their unique ID, which starts with '>'.

Code:

file1:

>AAAT3R length=110
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
38 38 38 38 40 40 39 40 40 40 40 40 40 40 40 40 40 38 38 37 39 36 36 40 36 35 35 35 38 40 35 35 33 35 35 35
40 40 40 40 37 37 38 38 38 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 37 36 36 31 22 22 22 20 20 20 20
20 14
>AAA2OJ length=70
18 18 18 21 35 35 35 32 32 32 33 35 38 39 37 37 39 39 39 39 39 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 40 39 39 37 35 35 39 37 37 37 37 37 37 37 37 37 37 33 32 32 30 20 17 17 17 0

file2:

>AAAT3R_left length=6
TACATA
>AAAT3R_right length=62
ACTACTGATTTGATTATCTTTGATCTCTGTCGAACTAACTATATCTTAGTATGATCTTTAAT
>AAA2OJ_left length=14
TTTTGGACTATCTG
>AAA2OJ_right length=14
AGGCTGTTCTTTTN

result file(expected)

>AAAT3R_left length=6
40 40 40 40 40 40
>AAAT3R_right length=62
40 40 40 38 38 37 39 36 36 40 36 35 35 35 38 40 35 35 33 35 35 35 40 40 40 40 37 37 38 38 38 40 40 40
40 40 40 40 40 40 40 40 40 40 40 40 40 37 36 36 31 22 22 22 20 20 20 20 20 14
>AAA2OJ_left length=14
18 18 18 21 35 35 35 32 32 32 33 35 38 39
>AAA2OJ_right length=14
37 37 37 37 37 33 32 32 30 20 17 17 17 0

This is the code I have written so far to get the desired output.

#!/usr/bin/perl -w
use strict;

our ($File1, $File2) = qw/file1 file2/;
open File1 or die "$File1: $!\n";
open File2 or die "$File2: $!\n";

my ($key, %results);
while (<File1>){
    next if /^\s*$/;
    chomp;
    if (/^>\s*(\S+)/){
        $key = $1;
    }
    else {
        $results{$key} = [ split ];
    }
}
close File1;

my ($len, $side, $str);
while (<File2>){
    next if /^\s*$/;
    if (/^>([^_]+)_(left|right).*?(\d+)\s*$/){
        print;
        $str = $1;
        $side = $2;
        $len = $3;
    }
    else {
        my @list;
        @list = @{$results{$str}};
        if ($side eq 'left'){
            die "$str is too short for a left slice of $len!\n"
            unless @list >= $len;
            print "@list[0..$len-1]\n";
        }
        else {
            die "$str is too short for a right slice of $len!\n"
            unless @list >= $len;
            print "@list[@list-$len..$#list]\n";
        }
    }
}
close File2;

Please help me do this in C.

Well, I don't know stink about Perl, so it might be quite helpful if you could elaborate on your selection criteria
for the numbers you extract from the two files. I'm not clear on that, at least.

I also strongly suggest you edit your post, as I have done here to your post, so that
the program no longer "breaks the forum tables" (page width).

It is very annoying to have to constantly scroll to Timbuktu and back, to read your post!

And welcome to the forum!

**prettydainty** · 01-04-2009

Dear all,
I am comparing two files. I take the left and right lengths
from file 2, and extract that many numbers from the beginning
(left value) and end (right value) of each string of numbers
in file 1, matching records by their unique ID, which starts with '>'.

- The records in both files are matched by their unique ID, which starts with '>'.
- The numbers have to be split according to the length given in file2:
from the beginning for a left entry, from the end for a right entry.

Code:

file1:

>AAAT3R length=110
40 40 40 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 40 40 40 40 40 40 38 38 38 
38 40 40 39 40 40 40 40 40 40 40 40 40 
40 38 38 37 39 36 36 40 36 35 35 35 38 
40 35 35 33 35 35 35 40 40 40 40 37 37 
38 38 38 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 40 40 37 36 36 31 22 22 22 
20 20 20 20 20 14
>AAA2OJ length=70
18 18 18 21 35 35 35 32 32 32 33 35 38 
39 37 37 39 39 39 39 39 40 40 39 39 39 
40 40 40 40 40 40 40 40 40 40 40 40 40 
40 40 40 40 40 40 39 39 37 35 35 39 37 
37 37 37 37 37 37 37 37 37 33 32 32 30 
20 17 17 17 0

file2:

>AAAT3R_left length=6
TACATA
>AAAT3R_right length=62
ACTACTGATTTGATTATCTTTGATCTCTGTC
GAACTAACTATATCTTAGTATGATCTTTAAT
>AAA2OJ_left length=14
TTTTGGACTATCTG
>AAA2OJ_right length=14
AGGCTGTTCTTTTN

result file(expected)

>AAAT3R_left length=6
40 40 40 40 40 40
>AAAT3R_right length=62
40 40 40 38 38 37 39 36 36 40 36 35 35 35 38 40
35 35 33 35 35 35 40 40 40 40 37 37 38 38 38 40
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 37
36 36 31 22 22 22 20 20 20 20 20 14 
>AAA2OJ_left length=14
18 18 18 21 35 35 35 32 32 32 33 35 38 39
>AAA2OJ_right length=14
37 37 37 37 37 33 32 32 30 20 17 17 17 0

Please help me do this in C++.

**Adak** · 01-05-2009

I understand what you want, but now you say you need it done in C++, and I don't program in C++.

Why don't you post this in the C++ forum, (which is also quite active), on this same website? Just click on "General Programming Boards" way up at the top of this page (in red), and the C++ forum will be on the top of the page you are sent to.

To avoid cross-posting (surely the forum admin won't like that), you can ask to have this thread moved. (That seems to have been done once already, oddly enough.)

Good luck!

**Salem** · 01-05-2009

Dunno Adak, first post says C, second post says C++.
All rather vague to make a call.

IMO, prettydainty should read the "how to optimise perl" information which is available in the books / on the web.

In particular, benchmark this first:

Code:

#!/usr/bin/perl -w
use strict;

our ($File1, $File2) = qw/file1 file2/;
open File1 or die "$File1: $!\n";
open File2 or die "$File2: $!\n";

my $c1 = 0;
my $c2 = 0;
while (<File1>){
  $c1++;
}
close File1;

while (<File2>){
  $c2++;
}
close File2;

print "$c1 $c2\n";

Just reading the file will take time.
Assuming you implement the algorithm in zero time, you're never going to get any better than this.

What's more, reading the same in C with

Code:

while ( fgets( buff, sizeof buff, fp ) ) {
  c1++;
}

isn't likely to be a whole lot better. Perl after all is all about reading files, so you can be pretty sure they've nailed the performance of that part.
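For anyone who wants to run that comparison, here is a minimal sketch of the C side of the benchmark as a complete function. The name count_lines and the 4096-byte buffer are just assumptions for illustration. Unlike the bare loop above, it counts lines rather than fgets() calls, so a line longer than the buffer is still counted once and the result matches what the Perl loop reports.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count the lines in an already-open stream using fgets().
 * A line longer than the buffer arrives in several pieces, so only
 * pieces ending in '\n' (plus a final unterminated piece) are counted. */
unsigned long count_lines(FILE *fp)
{
    char buff[4096];            /* hypothetical buffer size */
    unsigned long lines = 0;
    int partial = 0;            /* 1 while inside a line with no '\n' seen yet */

    assert(fp != NULL);
    while (fgets(buff, sizeof buff, fp)) {
        size_t n = strlen(buff);
        if (n > 0 && buff[n - 1] == '\n') {
            lines++;
            partial = 0;
        } else {
            partial = 1;
        }
    }
    if (partial)
        lines++;                /* last line had no trailing newline */
    return lines;
}
```

Open each file with fopen(), pass the stream to count_lines(), and time the run the same way as the Perl script.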

If you find yourself in the situation where the benchmark takes, say, 20 seconds, and all your code makes it 25 seconds, then you're pretty much stuffed in terms of making it any quicker. 80% of the time is in something you can do nothing about (basic file I/O). C won't read the file much quicker, and even assuming an ideal 50% saving on the 5 seconds of processing, you would only get the total down to about 22.5 seconds; 20 seconds is the floor. What you're never going to get to is, say, 5 seconds.

Then there's the whole issue of how to deal with regexes and hashes in C (which has neither). You can get a PCRE library, but you would need a different approach to your results hash, or a lot of new C code.
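For what it's worth, the file2 header lines in this thread are regular enough that the standard string functions can stand in for the regex. The following is only a sketch under that assumption: parse_header is a made-up name, and it expects exactly the shape shown in the samples (">ID_left length=N" or ">ID_right length=N"), which is stricter than the Perl pattern. It does nothing about the hash lookup, which still needs its own solution.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse a header such as ">AAAT3R_left length=6" without a regex.
 * On success, copy the ID into id, set *is_left and *len, and return 1.
 * Return 0 (leaving the outputs untouched) if the line has another shape. */
int parse_header(const char *line, char *id, size_t idsz, int *is_left, long *len)
{
    const char *us, *eq;
    char *end;
    size_t n;
    int left;
    long value;

    assert(line && id && is_left && len);
    if (line[0] != '>')
        return 0;
    us = strchr(line + 1, '_');         /* the ID ends at the first underscore */
    if (us == NULL)
        return 0;
    n = (size_t)(us - (line + 1));
    if (n == 0 || n >= idsz)            /* empty ID, or too long for the buffer */
        return 0;

    if (strncmp(us + 1, "left", 4) == 0)
        left = 1;
    else if (strncmp(us + 1, "right", 5) == 0)
        left = 0;
    else
        return 0;

    eq = strstr(us, "length=");         /* assumes the "length=" label is present */
    if (eq == NULL)
        return 0;
    value = strtol(eq + 7, &end, 10);
    if (end == eq + 7 || value < 0)     /* no digits, or a negative length */
        return 0;

    memcpy(id, line + 1, n);
    id[n] = '\0';
    *is_left = left;
    *len = value;
    return 1;
}
```

A return value of 0 means the line is not a header, which corresponds to the else branch of the Perl loop.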

In short, this isn't the kind of exercise I would suggest you start learning C with.

**MK27** · 01-05-2009

Some things to know about C if you know perl:

- The perl interpreter is written in C.
- A perl reference (e.g., $ref=\@someray;) is really a C pointer, though there is more to know about pointers than references. You can get away with programming perl without using references, but you must understand pointers to use C.
- Strings in C are accessed as arrays of characters, so in your example, rather than looking for the ID with if ($_ =~ /^>/), you would use if (someray[0]=='>').
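To make the last two points concrete, here is a rough sketch of how the Perl slices @list[0..$len-1] and @list[@list-$len..$#list] might look with C arrays and pointers. The names is_header, parse_values and slice are invented for this example, and the caller is assumed to supply an int array big enough for one record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A header line is one whose first character is '>'. */
int is_header(const char *line)
{
    assert(line != NULL);
    return line[0] == '>';
}

/* Append the whitespace-separated integers in text to out[count..max-1].
 * Returns the new count. Parsing stops at the first thing that is not a
 * number, or when out is full. Call once per data line of a record. */
size_t parse_values(const char *text, int *out, size_t count, size_t max)
{
    char *end;

    assert(text && out);
    while (count < max) {
        long v = strtol(text, &end, 10);
        if (end == text)        /* nothing consumed: no more numbers */
            break;
        out[count++] = (int)v;
        text = end;
    }
    return count;
}

/* Return a pointer to the first (left != 0) or last (left == 0) len values
 * of vals, or NULL if the array holds fewer than len values. Nothing is
 * copied; the result points into the caller's array. */
const int *slice(const int *vals, size_t count, size_t len, int left)
{
    assert(vals != NULL);
    if (len > count)
        return NULL;
    return left ? vals : vals + (count - len);
}
```

The pointer returned by slice() can be indexed from 0 to len-1 just like an array, which is what makes it the C counterpart of a Perl array slice here.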
