Thread: String extraction from text file

  1. #1
    Registered User
    Join Date
    Jul 2011
    Posts
    4

    String extraction from text file

    From the serial data below I have been able to extract all the fields. The only problem is the address field: how should I recognize it, given that its length is not fixed and the spacing between words is not uniform?

    "DOM Name email Phone address city state ZIP A_Name A_Phone SSN I_P L_Am P_Am I_Am CCN Rem"(Format of data)

    " 10/2/2007 Laur McColu [email protected] 2725584962 201 Buckhorn Rd Sulphur OK 28575 James Rutherford 7696798518 3997290963 851272 841303 1283 10995 12345678901234 Excelent "(data)

  2. #2
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Is there a way you can re-export the data in a field-delimited way (e.g. comma-separated)? Or is it already tab-delimited and that's why the "space between each word is not uniform"? Whatever you're trying to show as far as multiple spacing goes, it's not coming through. Try wrapping the file contents in code tags.
    If you understand what you're doing, you're not learning anything.

  3. #3
    Registered User
    Join Date
    Jul 2011
    Posts
    4
    That is what I am asking. If I were able to export the data in a field-delimited way, I could extract the fields. As for the spaces, THEY ARE NOT UNIFORM. Even if the file contents are wrapped in code tags, I have already said that I know how to extract the other fields; my problem is the address field, because its length is not uniform...
    I want to know how to recognize the address field; I know how to extract the other fields.

  4. #4
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    Search for the term "Buckhorn Rd" in your string. Address found.

    (If that was not helpful, then maybe you need to give more information than a single line of data. We cannot find a pattern with only a single line.)
    hth
    -nv

    She was so Blonde, she spent 20 minutes looking at the orange juice can because it said "Concentrate."

    When in doubt, read the FAQ.
    Then ask a smart question.

  5. #5
    Registered User
    Join Date
    Jul 2011
    Posts
    4
    @nvoigt VERY FUNNY. The data is just an example.

  6. #6
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    Well... my solution is also just an example. If your data is not fixed length and has no field delimiter you need to have either a table of all streets to find out where the street ends or you need a table of all cities to find out where the city starts.

    What does the file description say about field length or delimiter? Somebody must have written the file. Maybe you can ask him.
    hth
    -nv


  7. #7
    Registered User
    Join Date
    Jul 2011
    Posts
    4
    I had this in mind. But there are more than 10000 cities in the US, and I am worried that the processing time will climb steeply because of all the comparisons.

  8. #8
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    I would have said that 10k is a bit low; Germany has 16372 recognized cities according to the postal service. Some Google-Fu turned up numbers between 19k and 43k for the US. Neither should be a problem. Put all of them into a hash table or even a simple std::map and you're fine. Performance is not the issue. However, if this is data entered by real human beings, you are bound to have 10-20% spelling mistakes. Those won't match. Also note that a city name is not unique. It could also be part of a street name. Even numbers can be part of the street name. I'm pretty sure there will be a "4th of July Street" somewhere in the States. Texts like "Street", "Road", "Bd", "Blv" normally mark the end of a street name. But then, they may not. Separating street and city name is a ton of code and reference data in itself, even if the input is 100% correct. It gets really messy if the input was entered by users.

    Do yourself a favor and get a file description that enables you to separate those fields without "intelligent" program logic. Someone put headers in there. This guy wrote the file. Have him put delimiters or fixed fields in there. From the point of view of someone who writes address recognition software for European countries, take the following advice: bribe the file creator. Hire someone charming to convince him. Hire a hitman to have him removed from his job. Do all three. It will still be cheaper than writing software that separates street and city.
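    A minimal sketch of the hash-table idea, assuming the phone number and the state/ZIP pair have already been peeled off so that only the "street + city" words remain. The names tokenize, join and split_street_city are made up for illustration, and the city set would have to be loaded from real reference data. It does nothing about misspellings or city names that also appear inside street names.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Split on runs of whitespace, so uneven spacing between words does not matter.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string word; in >> word;) tokens.push_back(word);
    return tokens;
}

// Join tokens[first, last) with single spaces.
std::string join(const std::vector<std::string>& tokens,
                 std::size_t first, std::size_t last) {
    assert(first <= last && last <= tokens.size());
    std::string out;
    for (std::size_t i = first; i < last; ++i) {
        if (!out.empty()) out += ' ';
        out += tokens[i];
    }
    return out;
}

// Returns {street, city}. The city is empty when no trailing word
// sequence is found in the (hypothetical) set of known city names.
std::pair<std::string, std::string> split_street_city(
        const std::string& street_and_city,
        const std::unordered_set<std::string>& cities) {
    const std::vector<std::string> tokens = tokenize(street_and_city);
    // Try the longest suffix first so "Oklahoma City" wins over "City".
    // Starting at 1 leaves at least one word for the street.
    for (std::size_t start = 1; start < tokens.size(); ++start) {
        const std::string candidate = join(tokens, start, tokens.size());
        if (cities.count(candidate) != 0) {
            return {join(tokens, 0, start), candidate};
        }
    }
    return {join(tokens, 0, tokens.size()), std::string()};
}
```

    Each lookup in the set is a single hash probe on average, so even tens of thousands of city names cost only a handful of probes per record.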
    hth
    -nv


  9. #9
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Originally Posted by ashish_wangi:
    I had this in mind. But there are more than 10000 cities in the US, and I am worried that the processing time will climb steeply because of all the comparisons.
    The central postal services in the US certainly should offer a freely downloadable database of postal codes, along with street addresses. Normally this should come as a series of CSV files you can then import into a database. You will, however, still be stuck with the problem that whoever gave you that data didn't bother placing delimiters.

    Do as nvoigt suggests and pressure the person behind that. You cannot properly parse this information. Here are the rules, written in stone:

    1. If fields are fixed length, there's no need for field delimiters.
    2. If fields have variable length, you require field delimiters.

    Otherwise there's no chance you can safely parse this data and guarantee results. And since you were given (2.), you must request either delimiters, or that data be transformed to (1.). And if the latter, that the person writes the format specification noting each field length.
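    For illustration, here is a rough sketch of what parsing looks like under each of the two rules. The function names, the field widths and the '|' delimiter are all assumptions; the real values would have to come from the format specification.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Rule 1: fixed-length fields are cut purely by position.
// 'widths' is a hypothetical list of field lengths from the file spec.
std::vector<std::string> slice_fixed(const std::string& line,
                                     const std::vector<std::size_t>& widths) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    for (std::size_t w : widths) {
        assert(w > 0);  // a zero-width field would be a spec error
        // Lines shorter than the spec yield empty fields instead of throwing.
        std::string field = pos < line.size() ? line.substr(pos, w) : std::string();
        field.erase(field.find_last_not_of(' ') + 1);  // drop right padding
        fields.push_back(field);
        pos += w;
    }
    return fields;
}

// Rule 2: variable-length fields are cut at an agreed delimiter.
// Note: std::getline drops a trailing empty field ("a|b|" gives two fields).
std::vector<std::string> split_delimited(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    for (std::string field; std::getline(in, field, delim);) {
        fields.push_back(field);
    }
    return fields;
}
```

    In both cases the address field can contain any number of words and spaces, because its boundaries are defined by the format and not guessed from the content.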
    Last edited by Mario F.; 07-08-2011 at 06:16 AM.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    " 10/2/2007 Laur McColu [email protected] 2725584962 201 Buckhorn Rd Sulphur OK 28575 James Rutherford 7696798518 3997290963 851272 841303 1283 10995 12345678901234 Excelent "(data)
    The end of the address seems to be a 2-letter state abbreviation followed by a ZIP code.
    It's going to be pretty easy to verify whether the pair is at least a plausible state + ZIP code.
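    A small sketch of that check, assuming the record is split on whitespace and that ZIP codes are always exactly five digits. It only tests the shape of the two tokens (two uppercase letters, then five digits); a stricter version would also compare the first token against the list of real state abbreviations. All function names here are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split on runs of whitespace, so uneven spacing does not matter.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string word; in >> word;) tokens.push_back(word);
    return tokens;
}

// Two uppercase letters, e.g. "OK". Shape check only.
bool is_state_shaped(const std::string& s) {
    return s.size() == 2 &&
           std::isupper(static_cast<unsigned char>(s[0])) != 0 &&
           std::isupper(static_cast<unsigned char>(s[1])) != 0;
}

// Exactly five digits, e.g. "28575". Shape check only.
bool is_zip_shaped(const std::string& s) {
    return s.size() == 5 &&
           std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Index of the first token that looks like a state and is followed by
// something that looks like a ZIP, or tokens.size() if there is none.
std::size_t find_state_zip(const std::vector<std::string>& tokens) {
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        if (is_state_shaped(tokens[i]) && is_zip_shaped(tokens[i + 1])) return i;
    }
    return tokens.size();
}

// Usage sketch on part of the sample record from the first post.
void example_usage() {
    const std::vector<std::string> t = tokenize(
        "2725584962 201 Buckhorn Rd Sulphur OK 28575 James Rutherford");
    const std::size_t i = find_state_zip(t);
    assert(i == 5 && t[i] == "OK" && t[i + 1] == "28575");
}
```

    Everything between the phone number and the returned index is then street plus city, which still leaves the street/city split discussed earlier in the thread.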
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  11. #11
    Registered User
    Join Date
    Jul 2011
    Posts
    2
    Use a regex to get the data between the phone number (9 or more digits) and the ZIP code (5 digits).
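    One way this could look with std::regex, as a sketch only. It assumes the first run of 9 or more digits in the record is the phone number and that the address block ends with a 2-letter uppercase state and a 5-digit ZIP; the struct and function names are made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct AddressBlock {
    bool found = false;
    std::string street_and_city;
    std::string state;
    std::string zip;
};

AddressBlock extract_address(const std::string& record) {
    // phone (9+ digits), lazily matched street+city, state, ZIP.
    static const std::regex pattern(
        R"((\d{9,})\s+(.+?)\s+([A-Z]{2})\s+(\d{5})\b)");
    AddressBlock out;
    std::smatch m;
    if (std::regex_search(record, m, pattern)) {
        out.found = true;
        out.street_and_city = m[2].str();
        out.state = m[3].str();
        out.zip = m[4].str();
    }
    return out;
}

// Usage sketch on part of the sample record from the first post.
void example_usage() {
    const AddressBlock a = extract_address(
        "Laur McColu 2725584962 201 Buckhorn Rd Sulphur OK 28575 "
        "James Rutherford 7696798518");
    assert(a.found);
    assert(a.street_and_city == "201 Buckhorn Rd Sulphur");
    assert(a.state == "OK" && a.zip == "28575");
}
```

    The lazy `.+?` stops at the first state/ZIP-shaped pair after the phone number, so a street name that itself contains two capital letters followed by a 5-digit number would cut the match short.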
