Extract text (English & Chinese) from PDF

**wonderway** · 03-31-2011

I need to extract text from a PDF using C on Linux.
I find "stream" and "endstream", send the bytes between them to zlib, and decompress some of the contents. I can then analyse the pattern and extract English characters correctly, but I don't know how to handle the Chinese characters. I have tried many charsets, but I don't know which one the file uses.
So I need your help with this, such as details of the PDF format, the character set, and so on.
Thank you!!
Please forgive my terrible English.
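
For readers following along, the "find stream and endstream" step described above might look roughly like the sketch below. The function names `find_bytes` and `pdf_next_stream` are hypothetical and are not taken from the original post. The sketch assumes the whole file is already in memory and that the first occurrence of the word `stream` is the keyword itself. A robust parser would read the `/Length` entry of the stream dictionary instead of scanning for `endstream`.

```c
#include <stddef.h>
#include <string.h>

/* Binary-safe search: first occurrence of the C string `needle` in a buffer.
 * (Hypothetical helper; strstr() is unsuitable because PDF data may contain NUL bytes.) */
static const unsigned char *find_bytes(const unsigned char *hay, size_t hay_len,
                                       const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || hay_len < n)
        return NULL;
    for (size_t i = 0; i + n <= hay_len; i++) {
        if (memcmp(hay + i, needle, n) == 0)
            return hay + i;
    }
    return NULL;
}

/* Locate the raw data of the first stream in `buf`.
 * On success returns 1 and sets *start / *len to the bytes between the
 * "stream" and "endstream" keywords; returns 0 if no complete stream is found.
 * Assumption: one end-of-line marker follows "stream" and one precedes
 * "endstream", and both are excluded from the result. */
int pdf_next_stream(const unsigned char *buf, size_t buf_len,
                    const unsigned char **start, size_t *len)
{
    const unsigned char *end = buf + buf_len;
    const unsigned char *s = find_bytes(buf, buf_len, "stream");
    if (s == NULL)
        return 0;
    s += strlen("stream");

    /* Skip the CRLF or LF that follows the keyword. */
    if (s < end && *s == '\r')
        s++;
    if (s < end && *s == '\n')
        s++;

    const unsigned char *e = find_bytes(s, (size_t)(end - s), "endstream");
    if (e == NULL)
        return 0;

    /* Drop the end-of-line marker that precedes "endstream". */
    if (e > s && e[-1] == '\n')
        e--;
    if (e > s && e[-1] == '\r')
        e--;

    *start = s;
    *len = (size_t)(e - s);
    return 1;
}
```

If the stream dictionary names `/FlateDecode` as its filter, the returned bytes are what would be handed to zlib (for example `uncompress()` or `inflate()`). Streams using other filters need different decoders.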

**~~CommonTater~~** · 03-31-2011

PDF (Portable Document Format)

Plus there are many third-party tools for extracting content from PDF.

Google is your friend.
