Thread: How to extract characters from pdf/excel/word/other

  1. #1
    Registered User
    Join Date
    Dec 2016
    Posts
    62

    How to extract characters from pdf/excel/word/other

    Hi guys!
    I appreciate everything you have done to help me so far. I am now also learning "C" for desktop operating systems. Currently I know "C" for microcontrollers.

    Here is my question. I tried to find a way to extract the characters from these files, but they seem to be very well hidden. Logically the characters should be the "ASCI" bytes from 0 to 127, but printing them gives me garbage. I am not interested in the modern methods for doing this; I need to know how to identify the bytes and print them on the console.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    27,560
    Easier said than done for these formats, but: interpret the file according to the format and thereby identify what is text.

    If the option is available to you, it would probably be easier to export to a text-based format (e.g., Excel spreadsheet to CSV) and then parse that.

    Quote Originally Posted by ArakelTheDragon
    Logically the characters should be the "ASCI" bytes from 0 to 127
    You meant ASCII, but Unicode encodings have been around for ages.
    Last edited by laserlight; 03-19-2019 at 02:11 PM.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Dec 2016
    Posts
    62
    Yes, I did mean ASCII. What if I try extracting Unicode? I know it's hard, but I don't see why. I don't want to export; I know it can be done in "C", maybe there is some well-hidden function. If not ASCII or Unicode, the characters still have to be identified somehow: are they 8 bit, 16 bit, 32 bit? Maybe there is something before and after them?
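
    A rough sketch of why "8 bit, 16 bit, 32 bit" has no single answer for Unicode: in UTF-8, one common Unicode encoding, a character occupies one to four bytes, and the first byte says how many. The helper names `utf8_seq_len` and `utf8_count` are hypothetical, made up for this illustration. The check covers only the lead-byte and continuation-byte patterns, so some invalid input (certain overlong forms, surrogates) still passes; treat it as an illustration, not a validator.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte b,
 * or 0 if b cannot start a sequence (continuation or invalid byte). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                 /* 0xxxxxxx: plain ASCII   */
    if (b >= 0xC2 && b <= 0xDF) return 2;   /* 110xxxxx                */
    if (b >= 0xE0 && b <= 0xEF) return 3;   /* 1110xxxx                */
    if (b >= 0xF0 && b <= 0xF4) return 4;   /* 11110xxx                */
    return 0;
}

/* Count the characters (code points) in buf, or return -1 if a
 * sequence is truncated or has a bad continuation byte. */
long utf8_count(const unsigned char *buf, size_t len)
{
    long count = 0;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        int n = utf8_seq_len(buf[i]);
        if (n == 0 || (size_t)n > len - i)
            return -1;
        for (int k = 1; k < n; k++) {
            /* Every continuation byte must look like 10xxxxxx. */
            if ((buf[i + k] & 0xC0) != 0x80)
                return -1;
        }
        i += (size_t)n;
        count++;
    }
    return count;
}
```

    With this sketch, the six bytes 41 C3 A9 E2 82 AC (hex) count as three characters: one of 1 byte, one of 2 bytes, and one of 3 bytes.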

  4. #4
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    713
    Modern Excel, Word, and other Office files (e.g., .xlsx, .docx) are simply ZIP archives, and you can extract them with anything capable of extracting a ZIP file. The main content is stored in one of the XML files inside the archive. You just have to sift through that XML file to find what you need.

    PDF is a completely different format. It is not a ZIP archive, but its content streams are usually compressed individually (commonly with Flate, i.e. zlib/deflate), so you'll need to find some way to decompress them and extract the text information you're looking for. Some PDFs contain images of text or the vector outlines of the characters, and not the actual text that they represent, so you're out of luck with those files unless you can run them through some sort of OCR process, which can get quite complicated and advanced.
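
    A minimal sketch of telling these two containers apart by their leading bytes rather than by file extension, assuming the usual signatures: a ZIP local file header begins with the bytes 'P' 'K' 0x03 0x04, and a PDF begins with "%PDF-". The names `file_kind`, `sniff_kind`, and `sniff_file` are hypothetical, made up for this illustration. It only identifies the container; it does not decompress or parse anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification used only in this sketch. */
enum file_kind { KIND_UNKNOWN, KIND_ZIP, KIND_PDF };

/* Classify a buffer holding the first bytes of a file. */
enum file_kind sniff_kind(const unsigned char *buf, size_t len)
{
    static const unsigned char zip_sig[] = { 'P', 'K', 0x03, 0x04 };
    static const unsigned char pdf_sig[] = { '%', 'P', 'D', 'F', '-' };

    assert(buf != NULL || len == 0);

    if (len >= sizeof zip_sig && memcmp(buf, zip_sig, sizeof zip_sig) == 0)
        return KIND_ZIP;
    if (len >= sizeof pdf_sig && memcmp(buf, pdf_sig, sizeof pdf_sig) == 0)
        return KIND_PDF;
    return KIND_UNKNOWN;
}

/* Read the first few bytes of the file at path and classify them.
 * Returns KIND_UNKNOWN if the file cannot be opened. */
enum file_kind sniff_file(const char *path)
{
    unsigned char head[8];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* "rb": binary mode matters on Windows */

    if (fp == NULL)
        return KIND_UNKNOWN;
    n = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return sniff_kind(head, n);
}
```

    A normal .xlsx or .docx file would be expected to come back as `KIND_ZIP`. An empty ZIP archive or a legacy .xls/.doc file has a different signature and falls through to `KIND_UNKNOWN` in this sketch.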

  5. #5
    Registered User
    Join Date
    Dec 2016
    Posts
    62
    To start dividing up the work, I can check the file extension to decide which procedure to run next. Is there a way to check the encoding as well? By encoding I mean the way the bytes are coded.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    27,560
    Assume ASCII: you spot a byte with the value of 65. Is that an 'A' or is it part of an integer that means something else? Without reference to the format, you simply cannot know. It is absolutely impossible to tell given just the byte. So yes, you take reference from "something before and after them", i.e., you need to interpret the file according to the format.
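
    To illustrate the point, here is a small sketch in which the same four bytes are read once as text and once as a little-endian 32-bit integer. Both readings are equally valid; only the file format says which one applies. The function names `is_printable_ascii`, `read_u32_le`, and `show_both` are hypothetical, chosen for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* True if b is a printable 7-bit ASCII character (space through '~'). */
int is_printable_ascii(unsigned char b)
{
    return b >= 0x20 && b <= 0x7E;
}

/* Interpret the four bytes at p as a little-endian unsigned 32-bit integer. */
uint32_t read_u32_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Print both interpretations of the same four bytes. */
void show_both(const unsigned char *p)
{
    uint32_t v = read_u32_le(p);

    printf("as text: ");
    for (int i = 0; i < 4; i++)
        putchar(is_printable_ascii(p[i]) ? p[i] : '.');
    printf("\nas u32 (little-endian): %lu\n", (unsigned long)v);
}
```

    Given the bytes 65, 0, 0, 0, the text reading is an 'A' followed by three unprintable bytes, while the integer reading is simply the number 65.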

  7. #7
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,859
    Maybe just start with one target and build from there.

    I haven't tried it, but maybe you should have a look at libxlsxwriter, or at least look at its code.
    Fact - Beethoven wrote his first symphony in C

  8. #8
    Registered User
    Join Date
    Dec 2016
    Posts
    62
    I will start with Excel, which brings me back to the ".zip" file format.

    There must be a way to convert PDF, Excel and so on to "txt". Maybe a CMD command run with "system()"?

  9. #9
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    27,560
    Some background might be good: is your target platform Windows, and are you doing this for yourself only?
