Class PDFText2HTML


  • public class PDFText2HTML
    extends PDFTextStripper
    Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
    • Constructor Detail

      • PDFText2HTML

        public PDFText2HTML()
                     throws java.io.IOException
        Creates a new PDFText2HTML instance.
        Throws:
        java.io.IOException - If there is an error during initialization.
    • Method Detail

      • writeHeader

        @Deprecated
        protected void writeHeader()
                            throws java.io.IOException
        Deprecated.
        Write the header to the output document. Now also writes the tag defining the character encoding.
        Throws:
        java.io.IOException - If there is a problem writing out the header to the document.
      • startDocument

        protected void startDocument(PDDocument document)
                              throws java.io.IOException
        Description copied from class: PDFTextStripper
        This method is available for subclasses of this class. It will be called before processing of the document starts.
        Overrides:
        startDocument in class PDFTextStripper
        Parameters:
        document - The PDF document that is being processed.
        Throws:
        java.io.IOException - If an IO error occurs.
      • endDocument

        public void endDocument(PDDocument document)
                         throws java.io.IOException
        This method is available for subclasses of this class. It will be called after processing of the document finishes.
        Overrides:
        endDocument in class PDFTextStripper
        Parameters:
        document - The PDF document that is being processed.
        Throws:
        java.io.IOException - If an IO error occurs.
      • getTitle

        protected java.lang.String getTitle()
        This method will attempt to guess the title of the document using either the document properties or the first lines of text.
        Returns:
        The guessed title of the document.
      • startArticle

        protected void startArticle(boolean isLTR)
                             throws java.io.IOException
        Write out the article separator (div tag) with proper text direction information.
        Overrides:
        startArticle in class PDFTextStripper
        Parameters:
        isLTR - true if direction of text is left to right
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • endArticle

        protected void endArticle()
                           throws java.io.IOException
        Write out the closing article separator (the end of the div opened by startArticle).
        Overrides:
        endArticle in class PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing to the stream.
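
        The pairing of startArticle and endArticle can be pictured with a small stand-alone sketch. The class ArticleTagSketch and the exact markup it returns are assumptions made for illustration; this page only states that the separator is a div tag carrying text direction information.

```java
// Hypothetical sketch, not the library source: shows how an article
// separator might carry text direction information.
final class ArticleTagSketch {

    private ArticleTagSketch() {
    }

    // Assumption: left-to-right text needs no attribute, while
    // right-to-left text is marked with a dir attribute on the div.
    static String startArticle(boolean isLTR) {
        return isLTR ? "<div>" : "<div dir=\"RTL\">";
    }

    // The closing separator is the same for both directions.
    static String endArticle() {
        return "</div>";
    }

    public static void main(String[] args) {
        // Wrap a right-to-left article body in its separators.
        System.out.println(startArticle(false) + "..." + endArticle());
    }
}
```

        In the real class these strings are written to the output stream rather than returned; returning them here only keeps the sketch easy to run on its own.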
      • writeString

        protected void writeString(java.lang.String text,
                                   java.util.List<TextPosition> textPositions)
                            throws java.io.IOException
        Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
        Overrides:
        writeString in class PDFTextStripper
        Parameters:
        text - The text to write to the stream.
        textPositions - the corresponding text positions
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • writeString

        protected void writeString(java.lang.String chars)
                            throws java.io.IOException
        Write a string to the output stream and escape some HTML characters.
        Overrides:
        writeString in class PDFTextStripper
        Parameters:
        chars - String to be written to the stream
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • writeParagraphEnd

        protected void writeParagraphEnd()
                                  throws java.io.IOException
        Writes the paragraph end "</p>" to the output and clears the font state.
        Overrides:
        writeParagraphEnd in class PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • escape

        private static java.lang.String escape(java.lang.String chars)
        Escape some HTML characters.
        Parameters:
        chars - String to be escaped
        Returns:
        The escaped string.
      • appendEscaped

        private static void appendEscaped(java.lang.StringBuilder builder,
                                          char character)
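
        The relationship between escape and appendEscaped can be sketched as follows. This is a minimal stand-alone illustration, not the library source: the class name HtmlEscapeSketch is hypothetical, and the set of escaped characters (the four markup characters, plus anything outside printable ASCII as a numeric character reference) is an assumption, since this page only says that "some HTML characters" are escaped.

```java
// Hypothetical sketch; method names mirror the Javadoc above,
// but the escaping rules below are assumptions for illustration.
final class HtmlEscapeSketch {

    private HtmlEscapeSketch() {
    }

    // Escape a whole string by escaping each character in turn.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    // Assumption: markup characters become named entities, and anything
    // outside printable ASCII becomes a numeric character reference.
    static void appendEscaped(StringBuilder builder, char character) {
        if (character < 32 || character > 126) {
            builder.append("&#").append((int) character).append(';');
            return;
        }
        switch (character) {
            case '"':
                builder.append("&quot;");
                break;
            case '&':
                builder.append("&amp;");
                break;
            case '<':
                builder.append("&lt;");
                break;
            case '>':
                builder.append("&gt;");
                break;
            default:
                builder.append(character);
        }
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp; &quot;c&quot;
        System.out.println(escape("a < b & \"c\""));
    }
}
```

        Appending into a shared StringBuilder one character at a time avoids creating an intermediate string per character, which is presumably why the per-character helper takes the builder as a parameter.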