Thread: Regex does not work, but it should.

  1. #1
    Registered User
    Join Date
    Feb 2008
    Posts
    25

    Regex does not work, but it should.

    I'm trying to do some regex matching against a file. It works sometimes, but not always. I have reason to believe that it has to do with the locale.

    Code:
    $ file test.txt
    test.txt: ASCII text
    $ file original_file.txt
    original_file.txt: ASCII English text, with CRLF line terminators
    It matches anything that is in test.txt, which is info I've copied straight out of original_file.txt with ctrl-c ctrl-v.

    How do I get original_file.txt to not be "ASCII English text, with CRLF line terminators"?
    Keep in mind that the files are huge, and there are many of them, so copying and pasting every line is not really practical. We're talking about 70 GB of data.

    Thank you for your input!

  2. #2
    Registered User
    Join Date
    Mar 2009
    Posts
    399
    Originally posted by n1mda:
    How do I get original_file.txt to not be "ASCII English text, with CRLF line terminators"?
    What encoding do you want it to be, then? If the file utility reports it as ASCII with CRLF line terminators, then that is most likely the case. Your only two options are to open the file and save it with the desired encoding, or to change the locale used within your program to ASCII (which it probably already is, since that's the default).

    Either way it's hard to make any assertions since you haven't even posted the code you claim to be correct.

  3. #3
    Registered User
    Join Date
    Feb 2008
    Posts
    25
    I would want it to be ASCII text only, without the CRLF terminators. I'm not sure about the difference, since the file that works contains newlines as well.

    I've managed to change it by running the following command:
    Code:
    $ perl -pi -e 's/\r$//g' original_file.txt
    The output reads:
    Code:
    $ file original_file.txt
    original_file.txt: ASCII English text
    To reverse it, I believe you can use the unix2dos command.

    However, it would be fun to know if there is a way to programmatically find the file type and change it, without running external commands.
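One way this could be done without external commands is sketched below in Python (the thread never says which language the original program is written in, so the language is an assumption, and the names `has_crlf`, `crlf_to_lf`, and `convert_in_place` are made up for illustration). The sketch reads the file in fixed-size chunks so that memory use stays flat even for very large files, and it holds back a trailing CR so that a CRLF pair split across two chunks is still converted. Note that `has_crlf` only probes the start of the file, so it is a heuristic, not a guarantee.

```python
import os
import tempfile


def has_crlf(path, probe_size=1 << 16):
    """Return True if the first probe_size bytes contain a CRLF pair."""
    with open(path, "rb") as f:
        return b"\r\n" in f.read(probe_size)


def crlf_to_lf(src, dst, chunk_size=1 << 20):
    """Copy src to dst, replacing every CRLF with LF, one chunk at a time."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        carry = b""
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                fout.write(carry)  # a lone CR at end of file is kept as-is
                break
            chunk = carry + chunk
            # Hold back a trailing CR: its LF may start the next chunk.
            if chunk.endswith(b"\r"):
                chunk, carry = chunk[:-1], b"\r"
            else:
                carry = b""
            fout.write(chunk.replace(b"\r\n", b"\n"))


def convert_in_place(path):
    """Rewrite path with LF endings via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        crlf_to_lf(path, tmp)
        os.replace(tmp, path)  # swap in the converted copy
    except BaseException:
        os.remove(tmp)
        raise
```

Calling `convert_in_place("original_file.txt")` only when `has_crlf("original_file.txt")` is true avoids rewriting files that are already in the desired form. The temporary file means the conversion needs roughly one extra file's worth of free disk space at a time, and the rewritten file does not keep the original's permissions in this sketch.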

  4. #4
    Registered User
    Join Date
    Mar 2009
    Posts
    399
    CR and LF are actual ASCII characters. Depending on what type of newlines you want, you can of course open the file and search for '\r' and '\n' characters and then do the desired operation (i.e. removing the one you don't want).

    Newline - Wikipedia, the free encyclopedia
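As a rough illustration of searching for '\r' and '\n' directly, the Python sketch below counts how many of each line-ending style appear in a block of bytes. The function name `count_line_endings` is hypothetical, and Python is used only for the example, since the original program's language is not given in the thread.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone LF, and lone CR sequences in a bytes object."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    names = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}
    # \r\n must come first in the alternation so a pair is not counted twice.
    for match in re.finditer(rb"\r\n|\n|\r", data):
        counts[names[match.group()]] += 1
    return counts


print(count_line_endings(b"one\r\ntwo\nthree\rfour\r\n"))
# prints {'CRLF': 2, 'LF': 1, 'CR': 1}
```

A file that reports only CRLF counts is a DOS-style file, only LF counts is a Unix-style file, and a mix suggests the file was edited on more than one platform.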

  5. #5
    Registered User
    Join Date
    Jan 2010
    Posts
    19
    Since CR and LF are ASCII characters (Carriage Return and Line Feed, numbers 13 and 10 in the ASCII table), you'd just have to modify your program to handle those characters as well. Or, write a routine that strips the CR and LF from each line you read from the file.
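Both suggestions can be sketched in Python, assuming the failing pattern is anchored with `$` (the original poster's regex was never shown, so the pattern here is invented for illustration). A leftover CR sits between the last visible character and the end of the line, so an end-of-line anchor no longer lines up. Either strip the terminator before matching, or let the pattern accept an optional `\r`.

```python
import re

# Lines as they would look when read without newline translation
# (for example in binary mode, or in text mode with newline="").
line_lf = "error 42\n"
line_crlf = "error 42\r\n"

strict = re.compile(r"^error \d+$")       # hypothetical pattern, breaks on CRLF
tolerant = re.compile(r"^error \d+\r?$")  # also accepts a trailing CR

# The strict pattern matches the LF line but not the CRLF line,
# because "\r" sits between the digits and the end of the line.
lf_matches = strict.search(line_lf) is not None
crlf_matches = strict.search(line_crlf) is not None

# Fix 1: strip CR and LF from the line before matching.
stripped_matches = strict.search(line_crlf.rstrip("\r\n")) is not None

# Fix 2: make the pattern itself tolerate the CR.
tolerant_matches = tolerant.search(line_crlf) is not None

print(lf_matches, crlf_matches, stripped_matches, tolerant_matches)
# prints True False True True
```

Stripping is usually the simpler choice when every pattern in the program is affected; the tolerant pattern is handy when the input has to be left untouched.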

  6. #6
    Registered User jeffcobb's Avatar
    Join Date
    Dec 2009
    Location
    Henderson, NV
    Posts
    875
    dos2unix <input_file>?
