Class PdfTextExtractor

java.lang.Object
com.lowagie.text.pdf.parser.PdfTextExtractor

@Deprecated public class PdfTextExtractor extends Object
Deprecated.
Extracts text from a PDF file.
Since:
2.1.4
  • Constructor Details

    • PdfTextExtractor

      public PdfTextExtractor(PdfReader reader)
      Deprecated.
      Creates a new text extractor object, using a TextAssembler as the render listener.
      Parameters:
      reader - the reader with the PDF
    • PdfTextExtractor

      public PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)
      Deprecated.
      Creates a new text extractor object, using a TextAssembler as the render listener.
      Parameters:
      reader - the reader with the PDF
      usePdfMarkupElements - whether to use higher-level tags for PDF markup entities
    • PdfTextExtractor

      public PdfTextExtractor(PdfReader reader, TextAssembler renderListener)
      Deprecated.
      Creates a new text extractor object that uses the supplied render listener.
      Parameters:
      reader - the reader with the PDF
      renderListener - the render listener that will be used to analyze renderText operations and provide resultant text
  • Method Details

    • getTextFromPage

      public String getTextFromPage(int page) throws IOException
      Deprecated.
      Gets the text from a page.
      Parameters:
      page - the 1-based page number
      Returns:
      a String with the content as plain text (without PDF syntax)
      Throws:
      IOException - on error
    • getTextFromPage

      public String getTextFromPage(int page, boolean useContainerMarkup) throws IOException
      Deprecated.
      Gets the text from a page.
      Parameters:
      page - the number of the page to extract text from
      useContainerMarkup - whether to insert tags for PDF markup container elements (not really HTML at the moment)
      Returns:
      the extracted text, with tags as requested
      Throws:
      IOException - on error
    • processContent

      public void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)
      Deprecated.
      Processes PDF syntax.
      Parameters:
      contentBytes - the bytes of a content stream
      resources - the resources that come with the content stream
      handler - interprets events caused by recognition of operations in a content stream.
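
Because getTextFromPage(int page) takes a 1-based page number, a loop over a whole document would normally run from 1 to the page count inclusive. The sketch below illustrates that loop without depending on the library itself: PageTextSource, PageTextDemo and extractAllPages are hypothetical names introduced only for this example, and PageTextSource is a stand-in that merely mirrors the signature of getTextFromPage(int page).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageTextDemo {

    /**
     * Hypothetical stand-in with the same shape as
     * PdfTextExtractor.getTextFromPage(int page) throws IOException.
     */
    @FunctionalInterface
    interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /**
     * Collects the text of pages 1..pageCount (inclusive), in order.
     * Page numbers are 1-based, so the loop starts at 1, not 0.
     */
    static List<String> extractAllPages(PageTextSource source, int pageCount)
            throws IOException {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            pages.add(source.getTextFromPage(page));
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // A fake three-page "document" used in place of a real PDF.
        String[] fake = {"first", "second", "third"};

        // Page p (1-based) maps to array index p - 1.
        List<String> text = extractAllPages(p -> fake[p - 1], fake.length);

        System.out.println(text); // prints [first, second, third]
    }
}
```

With the real class, a method reference such as extractor::getTextFromPage could presumably be passed as the source, with the page count taken from the PdfReader (for example via getNumberOfPages()). That wiring is an assumption here and is not shown on this page.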