Class PDFTextStripperByArea


  • public class PDFTextStripperByArea
    extends PDFTextStripper
    Extracts text from one or more named rectangular regions of a PDF page.
    • Field Detail

      • regions

        private final java.util.List<java.lang.String> regions
      • regionArea

        private final java.util.Map<java.lang.String, java.awt.geom.Rectangle2D> regionArea
      • regionCharacterList

        private final java.util.Map<java.lang.String, java.util.ArrayList<java.util.List<TextPosition>>> regionCharacterList
      • regionText

        private final java.util.Map<java.lang.String, java.io.StringWriter> regionText
    • Constructor Detail

      • PDFTextStripperByArea

        public PDFTextStripperByArea()
                              throws java.io.IOException
        Constructor.
        Throws:
        java.io.IOException - If there is an error loading properties.
    • Method Detail

      • setShouldSeparateByBeads

        public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
        This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
        Overrides:
        setShouldSeparateByBeads in class PDFTextStripper
        Parameters:
        aShouldSeparateByBeads - The new grouping of beads.
      • addRegion

        public void addRegion(java.lang.String regionName,
                              java.awt.geom.Rectangle2D rect)
        Add a new region to group text by.
        Parameters:
        regionName - The name of the region.
        rect - The rectangle from which to retrieve text. Its y-coordinates are Java coordinates (y == 0 is the top of the page), not PDF coordinates (y == 0 is the bottom of the page).
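        Because rect uses a top-left origin, a rectangle measured in PDF coordinates has to be flipped vertically before it is passed in. The following is a minimal sketch of that conversion, assuming an unrotated page whose origin is at (0, 0). The class RegionCoordinates and the method toJavaCoordinates are hypothetical helpers for illustration and are not part of PDFBox; the page height of 792 points (US Letter) is only an example value.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class RegionCoordinates {
    private RegionCoordinates() {}

    /**
     * Converts a rectangle given in PDF coordinates (y == 0 at the bottom)
     * to a rectangle with y == 0 at the top, for a page of the given height.
     * Assumes an unrotated page whose origin is at (0, 0).
     */
    static Rectangle2D toJavaCoordinates(Rectangle2D pdfRect, double pageHeight) {
        // The top edge in PDF space is y + height; measure it from the page top instead.
        double top = pageHeight - (pdfRect.getY() + pdfRect.getHeight());
        return new Rectangle2D.Double(pdfRect.getX(), top, pdfRect.getWidth(), pdfRect.getHeight());
    }

    public static void main(String[] args) {
        // A 200 x 50 box whose lower edge sits 700 points above the page bottom.
        Rectangle2D pdfRect = new Rectangle2D.Double(50, 700, 200, 50);
        Rectangle2D javaRect = toJavaCoordinates(pdfRect, 792);
        System.out.println(javaRect.getY()); // prints 42.0
    }
}
```

        Only the y-coordinate changes; x, width, and height carry over as they are.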
      • removeRegion

        public void removeRegion(java.lang.String regionName)
        Remove a region from the set used to group text. If the region does not exist, this method does nothing.
        Parameters:
        regionName - The name of the region to delete.
      • getRegions

        public java.util.List<java.lang.String> getRegions()
        Get the list of regions that have been set up.
        Returns:
        A list of java.lang.String objects containing the region names.
      • getTextForRegion

        public java.lang.String getTextForRegion(java.lang.String regionName)
        Get the text for the region. This should be called after extractRegions().
        Parameters:
        regionName - The name of the region to get the text from.
        Returns:
        The text that was identified in that region.
      • extractRegions

        public void extractRegions(PDPage page)
                            throws java.io.IOException
        Process the page to extract the region text.
        Parameters:
        page - The page to extract the regions from.
        Throws:
        java.io.IOException - If there is an error while extracting text.
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Overrides:
        processTextPosition in class PDFTextStripper
        Parameters:
        text - The text to process.
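        One way to picture how text ends up grouped by region is a containment test of each character position against every registered rectangle. The sketch below illustrates that idea with plain JDK types, under the assumption that a position belongs to every region whose rectangle contains it. The class RegionLookup and its methods are hypothetical and are not taken from the PDFBox source.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of region lookup, not the PDFBox implementation.
final class RegionLookup {
    // Region name -> rectangle, kept in insertion order.
    private final Map<String, Rectangle2D> regionArea = new LinkedHashMap<>();

    void addRegion(String regionName, Rectangle2D rect) {
        regionArea.put(regionName, rect);
    }

    /** Returns the names of all regions whose rectangle contains the point (x, y). */
    List<String> regionsContaining(double x, double y) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Rectangle2D> entry : regionArea.entrySet()) {
            // Rectangle2D.contains includes the top and left edges
            // and excludes the bottom and right edges.
            if (entry.getValue().contains(x, y)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

        With a "header" region covering y in [0, 100) and a "body" region covering y in [100, 700), a position at y == 100 falls in "body" only, because Rectangle2D.contains excludes the bottom edge of a rectangle.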
      • writePage

        protected void writePage()
                          throws java.io.IOException
        This will print the processed page text to the output stream.
        Overrides:
        writePage in class PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing the text.