Thread: Extract text (English & Chinese) from PDF

  1. #1
    Registered User
    Join Date
    Mar 2011
    Posts
    1

    Extract text (English & Chinese) from PDF

    I need to extract text from a PDF using C on Linux.
    I find each "stream" and "endstream" pair, pass the bytes between them to zlib, and decompress some of the contents. From the decompressed output I can work out the pattern and extract the English characters correctly. The problem is the Chinese characters: I have tried many charsets, but I don't know which one the file uses.
    So I need help with this, such as details of the PDF format or the character set it uses.
    Thank you!!
    Please forgive my terrible English.
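
The "find stream and endstream" step described above can be sketched roughly as follows. This is only a minimal illustration, not the poster's actual code: `find_bytes` and `find_pdf_stream` are hypothetical helper names, and the sketch assumes the whole file is already in memory. It searches with `memcmp` rather than `strstr` because compressed data can contain NUL bytes. A more robust parser would read `/Length` from the stream dictionary instead of scanning for `endstream`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Binary-safe substring search. strstr() is unsuitable here because
   compressed stream data can contain NUL bytes. */
static const unsigned char *find_bytes(const unsigned char *hay, size_t hay_len,
                                       const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || hay_len < n)
        return NULL;
    for (size_t i = 0; i + n <= hay_len; i++) {
        if (memcmp(hay + i, needle, n) == 0)
            return hay + i;
    }
    return NULL;
}

/* Hypothetical helper: locate the first stream body in buf.
   Returns 1 and sets *data / *data_len to the bytes between the
   "stream" keyword (plus its line ending) and "endstream"; returns 0
   if no complete stream is found. The span may still include the
   end-of-line marker that precedes "endstream". */
int find_pdf_stream(const unsigned char *buf, size_t len,
                    const unsigned char **data, size_t *data_len)
{
    assert(buf && data && data_len);
    const unsigned char *end = buf + len;
    const unsigned char *s = find_bytes(buf, len, "stream");
    if (s == NULL)
        return 0;
    s += strlen("stream");
    /* The keyword is followed by CRLF or LF before the data starts. */
    if (s < end && *s == '\r') s++;
    if (s < end && *s == '\n') s++;
    const unsigned char *e = find_bytes(s, (size_t)(end - s), "endstream");
    if (e == NULL)
        return 0;
    *data = s;
    *data_len = (size_t)(e - s);
    return 1;
}
```

When the stream dictionary names `/FlateDecode`, the returned span is what gets handed to zlib's `inflate`. To find the next stream, call the function again starting just past the `endstream` keyword that follows the returned span.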

  2. #2
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    PDF (Portable Document Format)


    Plus there are many third-party tools for extracting content from PDF files.

    Google is your friend.
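
On the charset question from post #1: one possible reason no charset seems to fit is that the bytes in the content stream may not be text in any charset at all. With composite fonts that use the `Identity-H` encoding, they are 2-byte character codes, and the font's `/ToUnicode` CMap (its `bfchar` and `bfrange` entries) maps each code to UTF-16BE. Whether that applies to a given file is an assumption; fonts that use a predefined CMap behave differently. The sketch below only covers the last step, turning UTF-16BE bytes taken from such a mapping into UTF-8 for printing on Linux. The names `utf8_encode` and `utf16be_to_utf8` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* lone surrogate */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert UTF-16BE bytes to a NUL-terminated UTF-8 string in out.
   Returns the UTF-8 length in bytes, or 0 on malformed input or if
   out_cap is too small (an empty input also yields 0). */
size_t utf16be_to_utf8(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    assert(in && out);
    size_t o = 0;
    if (out_cap == 0 || in_len % 2 != 0)
        return 0;
    for (size_t i = 0; i < in_len; i += 2) {
        uint32_t cp = ((uint32_t)in[i] << 8) | in[i + 1];
        if (cp >= 0xD800 && cp <= 0xDBFF) { /* high surrogate: need a pair */
            if (i + 3 >= in_len)
                return 0;
            uint32_t lo = ((uint32_t)in[i + 2] << 8) | in[i + 3];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return 0;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;
        }
        unsigned char tmp[4];
        size_t n = utf8_encode(cp, tmp);
        if (n == 0 || o + n >= out_cap)     /* keep room for the NUL */
            return 0;
        memcpy(out + o, tmp, n);
        o += n;
    }
    out[o] = '\0';
    return o;
}
```

For example, the four bytes `4E 2D 65 87` (U+4E2D U+6587, "中文") come out as the six UTF-8 bytes `E4 B8 AD E6 96 87`.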


