Thread: strtok: line split || cannot unhinge previous split

  1. #1
    Registered User
    Join Date
    Oct 2013
    Posts
    87

    strtok: line split || cannot unhinge previous split

    Hi there,

    Please bear with me. I have a tab-separated file with multiple columns that I would like to split and then perform some operations on.
    When I read the file and split each line on tabs using strtok, the splitting works fine, but the first token of the previous line shows up in front of the next line's output.


    Code:
    #define _POSIX_C_SOURCE 200809L // for getline()
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        // gcc -Wpedantic -Wextra -Wall *.c -o parse

        FILE *fp = fopen("test_cuts.txt", "r");
        if (fp == NULL)
            exit(EXIT_FAILURE);

        char *line = NULL;
        size_t len = 0;
        while ((getline(&line, &len, fp)) != -1)
        {
            char *token = NULL;
            token = strtok(line, "\t");
            int count_split = 1;
            while (token != NULL && ++count_split <= 8)
            {
                printf("split %s\n", token); //printing each token
                token = strtok(NULL, "\t");
            }
            // using printf() in all tests for consistency
            printf("%s", line);
        }
        fclose(fp);

        if (line)
            free(line);

        return 0;
    }
    My input file is:


    Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene
    21 10000766 10000766 T C intergenic LINC01667,BAGE dist=179705;dist=412754 . .
    21 10001677 10001677 G T intergenic LINC01667,BAGE dist=180616;dist=411843 . .
    21 10002110 10002110 G A intergenic LINC01667,BAGE dist=181049;dist=411410 . .
    21 10003899 10003899 T C intergenic LINC01667,BAGE dist=182838;dist=409621 . .
    21 10005858 10005858 T C intergenic LINC01667,BAGE dist=184797;dist=407662 . .
    21 10005978 10005978 G T intergenic LINC01667,BAGE dist=184917;dist=407542 . .
    21 10006274 10006274 G T intergenic LINC01667,BAGE dist=185213;dist=407246 . .
    21 10010233 10010233 T C intergenic LINC01667,BAGE dist=189172;dist=403287 . .
    21 10014276 10014276 T C intergenic LINC01667,BAGE dist=193215;dist=399244 . .

    When I run the code, I get unwanted text in front of some lines:


    split Chr
    split Start
    split End
    split Ref
    split Alt
    split Func.refGene
    split Gene.refGene
    Chrsplit 21
    split 10000766
    split 10000766
    split T
    split C
    split intergenic
    split LINC01667,BAGE
    21split 21
    split 10001677
    split 10001677
    split G
    split T
    split intergenic
    split LINC01667,BAGE
    21split 21
    I do not want the stray prefixes, as in
    Chrsplit 21
    21split 21

    I want only "split 21".
    Where is this basic code going wrong?

  2. #2
    Registered User
    Join Date
    May 2010
    Posts
    4,595
    Is there a reason you're parsing that first line? It might be easier to just read the first line and then discard it.

    Also are you sure your file has tab characters within it? The "file" you posted doesn't contain any tabs.

    Also what data are you trying to extract from the file?

  3. #3
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,905
    Code:
            // using printf() in all tests for consistency
            printf("%s", line);
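    It is printing what is in "line". Once strtok() has replaced the first tab with '\0', that is only the first token, and nothing prints a newline after it.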
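    To make that concrete, here is a minimal sketch of what strtok() does to the buffer it splits. The helper name count_tab_fields is made up for illustration and is not part of the original program.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: tokenize buf on tabs, in place, and return the
 * number of tokens found. strtok() overwrites the delimiter that ends
 * each token with '\0', so buf is changed by this call. */
static int count_tab_fields(char *buf)
{
    int n = 0;
    for (char *tok = strtok(buf, "\t"); tok != NULL; tok = strtok(NULL, "\t"))
        n++;
    return n;
}
```

    For a buffer holding "21\t10000766\tT\n", the call returns 3, and afterwards the buffer reads as just "21" when printed with %s. If the whole line is still needed after splitting, copy it before the first strtok() call.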

  4. #4
    Registered User
    Join Date
    Feb 2019
    Posts
    851
    Just a warning... When you do:
    Code:
    line = NULL, len = 0;
    while ( getline( &line, &len, fp ) != -1 )
    {
      ...
    }
    You are asking getline to allocate enough space to hold the first line, but if the next line is bigger you can get a segmentation fault or buffer overrun, because line won't be NULL and len won't be zero, so a new allocation won't be made. At the end of your loop you should free "line" and set line back to NULL and len back to zero.

  5. #5
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    38,621
    That's not how getline works.
    getline(3): delimited string input - Linux man page
    Alternatively, before calling getline(), *lineptr can contain a pointer to a malloc(3)-allocated buffer *n bytes in size. If the buffer is not large enough to hold the line, getline() resizes it with realloc(3), updating *lineptr and *n as necessary.

    In either case, on a successful call, *lineptr and *n will be updated to reflect the buffer address and allocated size respectively.
    You just keep calling it in a loop, and if there is a longer line somewhere in the input, getline will realloc as necessary.
    But it will keep hold of the (now longer) buffer.

    The test for NULL before calling free() however is a waste of time.
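
    A minimal sketch of that pattern, assuming a POSIX system where getline() is available. The helper name longest_line is invented for this example.

```c
#define _POSIX_C_SOURCE 200809L /* exposes getline() on POSIX systems */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical helper: read every line of fp through a single getline()
 * buffer and return the length of the longest line (including its
 * newline), or 0 if the stream is empty. */
static size_t longest_line(FILE *fp)
{
    char *line = NULL; /* getline() allocates this on the first call */
    size_t cap = 0;    /* buffer capacity, maintained by getline() */
    size_t longest = 0;
    ssize_t n;

    while ((n = getline(&line, &cap, fp)) != -1) {
        /* No free() or reset inside the loop: when a longer line
         * arrives, getline() reallocs the buffer and updates cap. */
        if ((size_t)n > longest)
            longest = (size_t)n;
    }
    free(line); /* free(NULL) is a no-op, so no NULL test is needed */
    return longest;
}
```

    The buffer is freed once, after the loop, whether or not any line was read.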
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Oct 2013
    Posts
    87
    Originally posted by jimblumberg:
    Is there a reason you're parsing that first line? It might be easier to just read the first line and then discard it.

    Also are you sure your file has tab characters within it? The "file" you posted doesn't contain any tabs.

    Also what data are you trying to extract from the file?
    Thank you for your reply.
    I have no reason to parse the first line (the header). The posted data may not show tabs, but the file does contain them. Attached is a small file with the first few columns.

    I have several goals. I would like to split the 7th column, which contains the genes (LINC01667,BAGE), on commas.
    Each row holds one SNP (genomic variant), whose name needs to be constructed as:

    chr21:10000766:T:C
    chr21:10001677:G:T

    SNP/variant names are built from columns 1, 2, 4 and 5, with "chr" added in front of column 1.

    Now, within each gene, there can be multiple SNPs. For example, the BAGE gene has 21:10000766 and 21:10001677.
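
    One way this could be sketched, assuming POSIX strtok_r() is available. The names parse_row and MAX_GENES are made up for illustration. strtok_r() keeps its position in a variable owned by the caller, so the comma split of column 7 cannot disturb the tab split of the row. Like strtok(), it skips empty fields, which is fine here only because empty columns are written as ".".

```c
#define _POSIX_C_SOURCE 200809L /* exposes strtok_r() */
#include <stdio.h>
#include <string.h>

#define MAX_GENES 16 /* assumed upper bound on genes per row */

/* Hypothetical helper: split one tab-separated row in place.
 * Writes "chr<col1>:<col2>:<col4>:<col5>" into snp and stores pointers
 * to the comma-separated genes of column 7 in genes[].
 * Returns the number of genes, or -1 if the row has fewer than 7
 * columns or the SNP name does not fit in snp. */
static int parse_row(char *line, char *snp, size_t snp_size,
                     char *genes[MAX_GENES])
{
    char *cols[7];
    char *save_tab = NULL;
    int ncols = 0;

    line[strcspn(line, "\r\n")] = '\0'; /* drop the line terminator, if any */

    /* Outer split: collect the first 7 tab-separated columns. */
    for (char *tok = strtok_r(line, "\t", &save_tab);
         tok != NULL && ncols < 7;
         tok = strtok_r(NULL, "\t", &save_tab))
        cols[ncols++] = tok;
    if (ncols < 7)
        return -1;

    int len = snprintf(snp, snp_size, "chr%s:%s:%s:%s",
                       cols[0], cols[1], cols[3], cols[4]);
    if (len < 0 || (size_t)len >= snp_size)
        return -1;

    /* Inner split: column 7 holds one or more genes separated by commas. */
    char *save_comma = NULL;
    int ngenes = 0;
    for (char *g = strtok_r(cols[6], ",", &save_comma);
         g != NULL && ngenes < MAX_GENES;
         g = strtok_r(NULL, ",", &save_comma))
        genes[ngenes++] = g;
    return ngenes;
}
```

    For the first data row of the posted file this yields the name chr21:10000766:T:C and the two genes LINC01667 and BAGE. The gene pointers point into the line buffer, so they must be copied if they are to outlive the next getline() call.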

  7. #7
    Registered User
    Join Date
    Oct 2013
    Posts
    87
    Originally posted by Click_here:
    Code:
            // using printf() in all tests for consistency
            printf("%s", line);
    It is printing what is in "line"
    Thank you.

    Actually, I wasn't printing a newline character, which is why the output looked so poor.

    Code:
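    void remove_trailingspaces(char *newlines)
    {
        // cut the string at the first '\r' or '\n' (needs <string.h>);
        // safe for an empty string or a last line with no terminator
        newlines[strcspn(newlines, "\r\n")] = '\0';
    }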
    
    
        while ((getline(&line, &len, fp)) != -1)
        {
    
    
            remove_trailingspaces(line);
            char *token = NULL;
            token = strtok(line, "\t");
            int count_split = 1;
            while (token != NULL && ++count_split <= 8)
            {
         
                printf("split %s\n", token); //printing each token
                token = strtok(NULL, "\t");
            }
            // strtok() has cut line at the first tab, so this prints only the first token
            printf("line is %s\n", line);
        }

  8. #8
    Registered User
    Join Date
    Oct 2013
    Posts
    87
    Originally posted by Salem:
    That's not how getline works.
    getline(3): delimited string input - Linux man page

    You just keep calling it in a loop, and if there is a longer line somewhere in the input, getline will realloc as necessary.
    But it will keep hold of the (now longer) buffer.

    The test for NULL before calling free() however is a waste of time.
    This is the first time I have used getline.

    If I write the code as follows, I have to declare the line as a char array of a fixed length. The number of columns in the input file is fixed, but the number of characters in each column varies.

    I run into this problem a lot with genetics data. Is there a way to improve on this?


    Code:
    
    #define MAX_LEN 500
    
    
        FILE *fptr_first; // use this for opening files first
        int line_number = 0; // declaration assumed, not shown in the original snippet; 0 is the header

        char read_line[MAX_LEN];
        while (fgets(read_line, MAX_LEN, fptr_first))
        {
            if (line_number > 0) // make sure the line excluded header
            {
                if (strcmp(read_line, "\n") != 0)
                {
                    remove_trailingspaces(read_line);
                    get_gene_name(read_line);
                    add_node(&start_one, read_line); //only if not new line
                }
            }
            line_number++;
        }
        fclose(fptr_first);
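
    For comparison, a rough sketch of the same loop built on getline(), so no MAX_LEN is needed. It assumes a POSIX system. The names for_each_data_line and line_handler are invented here, and the callback stands in for calls such as get_gene_name() and add_node(), whose signatures are not shown in the thread.

```c
#define _POSIX_C_SOURCE 200809L /* exposes getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical callback type: receives one data line without its terminator. */
typedef void (*line_handler)(char *line, void *ctx);

/* Read fp line by line with no fixed maximum length, skipping the header
 * (first line) and empty lines. Calls handle (if not NULL) for every
 * remaining line and returns how many such lines were seen. */
static size_t for_each_data_line(FILE *fp, line_handler handle, void *ctx)
{
    char *line = NULL;
    size_t cap = 0;
    size_t line_number = 0;
    size_t handled = 0;

    while (getline(&line, &cap, fp) != -1) {
        if (line_number++ == 0)
            continue;                       /* header line */
        line[strcspn(line, "\r\n")] = '\0'; /* strip the line terminator */
        if (line[0] == '\0')
            continue;                       /* blank line */
        if (handle != NULL)
            handle(line, ctx);
        handled++;
    }
    free(line);
    return handled;
}
```

    The line passed to the callback lives in getline()'s buffer and is overwritten on the next iteration, so anything stored in a list needs its own copy.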

  9. #9
    Registered User
    Join Date
    Feb 2019
    Posts
    851
    Originally posted by Salem:
    That's not how getline works.
    getline(3): delimited string input - Linux man page

    You just keep calling it in a loop, and if there is a longer line somewhere in the input, getline will realloc as necessary.
    But it will keep hold of the (now longer) buffer.
    Hummmm.... interesting. Thanks.
