Package org.apache.pdfbox.text
Class PDFTextStripperByArea
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.LegacyPDFStreamEngine
      - org.apache.pdfbox.text.PDFTextStripper
        - org.apache.pdfbox.text.PDFTextStripperByArea
-
public class PDFTextStripperByArea extends PDFTextStripper
Extracts text from specified rectangular regions of a PDF page.
-
-
Field Summary
Modifier and Type, Field:
- private java.util.Map<java.lang.String,java.awt.geom.Rectangle2D> regionArea
- private java.util.Map<java.lang.String,java.util.ArrayList<java.util.List<TextPosition>>> regionCharacterList
- private java.util.List<java.lang.String> regions
- private java.util.Map<java.lang.String,java.io.StringWriter> regionText
-
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
-
Constructor Summary
Constructor and Description:
- PDFTextStripperByArea(): Constructor.
-
Method Summary
Modifier and Type, Method, and Description:
- void addRegion(java.lang.String regionName, java.awt.geom.Rectangle2D rect): Add a new region to group text by.
- void extractRegions(PDPage page): Process the page to extract the region text.
- java.util.List<java.lang.String> getRegions(): Get the list of regions that have been set up.
- java.lang.String getTextForRegion(java.lang.String regionName): Get the text for the region. This should be called after extractRegions().
- protected void processTextPosition(TextPosition text): This will process a TextPosition object and add the text to the list of characters on a page.
- void removeRegion(java.lang.String regionName): Delete a region to group text by.
- void setShouldSeparateByBeads(boolean aShouldSeparateByBeads): This method does nothing in this derived class, because beads and regions are incompatible.
- protected void writePage(): This will print the processed page text to the output stream.
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, showGlyph
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Detail
-
regions
private final java.util.List<java.lang.String> regions
-
regionArea
private final java.util.Map<java.lang.String,java.awt.geom.Rectangle2D> regionArea
-
regionCharacterList
private final java.util.Map<java.lang.String,java.util.ArrayList<java.util.List<TextPosition>>> regionCharacterList
-
regionText
private final java.util.Map<java.lang.String,java.io.StringWriter> regionText
-
-
Method Detail
-
setShouldSeparateByBeads
public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
- Overrides:
setShouldSeparateByBeads in class PDFTextStripper
- Parameters:
aShouldSeparateByBeads - The new grouping of beads.
-
addRegion
public void addRegion(java.lang.String regionName, java.awt.geom.Rectangle2D rect)
Add a new region to group text by.
- Parameters:
regionName - The name of the region.
rect - The rectangular area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is the top of the page), not PDF coordinates (y == 0 is the bottom of the page).
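Because rect uses a top-left origin, a rectangle measured in PDF coordinates has to be flipped vertically before it is passed to addRegion. The following is a minimal sketch of that conversion using only java.awt.geom. The class RegionCoordinates and the method toTopLeftOrigin are hypothetical names, not part of PDFBox, and the sketch assumes an unrotated page whose visible area starts at (0, 0) and whose height in points is known to the caller.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for illustration; not part of PDFBox. */
public final class RegionCoordinates {
    private RegionCoordinates() {}

    /**
     * Converts a rectangle given in PDF coordinates (y == 0 at the bottom,
     * y grows upward) into the top-left-origin form described for addRegion.
     * Assumption: the page is unrotated and its lower-left corner is (0, 0).
     *
     * @param x          left edge, identical in both coordinate systems
     * @param lowerY     lower edge of the rectangle in PDF coordinates
     * @param width      rectangle width
     * @param height     rectangle height
     * @param pageHeight page height in the same units (points)
     */
    public static Rectangle2D toTopLeftOrigin(double x, double lowerY,
                                              double width, double height,
                                              double pageHeight) {
        // The upper edge in PDF coordinates is lowerY + height; measured
        // from the top of the page, that edge sits at pageHeight minus it.
        double topY = pageHeight - (lowerY + height);
        return new Rectangle2D.Double(x, topY, width, height);
    }

    public static void main(String[] args) {
        // A 200 x 50 box whose lower edge is 692 units above the bottom
        // of a 792-unit-tall page.
        Rectangle2D rect = toTopLeftOrigin(72, 692, 200, 50, 792);
        System.out.println(rect.getY()); // prints 50.0
    }
}
```

The resulting rectangle can then be passed as the rect argument of addRegion.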
-
removeRegion
public void removeRegion(java.lang.String regionName)
Delete a region to group text by. If the region does not exist, this method does nothing.
- Parameters:
regionName - The name of the region to delete.
-
getRegions
public java.util.List<java.lang.String> getRegions()
Get the list of regions that have been set up.
- Returns:
A list of java.lang.String objects that identify the regions by name.
-
getTextForRegion
public java.lang.String getTextForRegion(java.lang.String regionName)
Get the text for the region. This should be called after extractRegions().
- Parameters:
regionName - The name of the region to get the text from.
- Returns:
The text that was identified in that region.
-
extractRegions
public void extractRegions(PDPage page) throws java.io.IOException
Process the page to extract the region text.
- Parameters:
page - The page to extract the regions from.
- Throws:
java.io.IOException - If there is an error while extracting text.
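The methods above are meant to be called in a fixed order: register regions with addRegion, run extractRegions on a page, then read the results with getTextForRegion. The sketch below shows that order. The class RegionTextExample, the method textInRegion, and the region name "header" are hypothetical. Obtaining the PDPage is left to the caller, since how a document is loaded depends on the PDFBox version in use.

```java
import java.awt.geom.Rectangle2D;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

/** Hypothetical wrapper for illustration; requires PDFBox on the classpath. */
public final class RegionTextExample {
    private RegionTextExample() {}

    /**
     * Returns the text found inside {@code area} on {@code page}.
     * The rectangle uses Java coordinates (y == 0 is the top of the page).
     */
    public static String textInRegion(PDPage page, Rectangle2D area)
            throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();

        // Optional: setSortByPosition is inherited from PDFTextStripper.
        stripper.setSortByPosition(true);

        // 1. Register the region under a name chosen by the caller.
        stripper.addRegion("header", area);

        // 2. Process the page; this fills the text for every registered region.
        stripper.extractRegions(page);

        // 3. Read the result. Calling this before extractRegions would not
        //    return the page's text.
        return stripper.getTextForRegion("header");
    }
}
```

Several regions can be registered on the same stripper before the single call to extractRegions, and getRegions() lists the names that are currently registered.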
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Overrides:
processTextPosition in class PDFTextStripper
- Parameters:
text - The text to process.
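As a conceptual illustration of grouping text by region, the sketch below routes a point to every named rectangle that contains it. It is a simplified stand-in, not PDFBox source code. It assumes, without confirmation from this page, that a piece of text belongs to a region when its (x, y) position lies inside the region's rectangle. The class RegionRouter and the method regionsContaining are hypothetical names.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for illustration; not PDFBox source code. */
public final class RegionRouter {
    // Region name -> rectangle, kept in insertion order.
    private final Map<String, Rectangle2D> regionArea = new LinkedHashMap<>();

    public void addRegion(String regionName, Rectangle2D rect) {
        regionArea.put(regionName, rect);
    }

    /** Returns the names of all regions whose rectangle contains (x, y). */
    public List<String> regionsContaining(double x, double y) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Rectangle2D> entry : regionArea.entrySet()) {
            // Rectangle2D.contains treats the right and bottom edges as outside.
            if (entry.getValue().contains(x, y)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Under this assumption, a position that falls inside two overlapping regions is reported for both, and a position outside every region is reported for none.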
-
writePage
protected void writePage() throws java.io.IOException
This will print the processed page text to the output stream.
- Overrides:
writePage in class PDFTextStripper
- Throws:
java.io.IOException - If there is an error writing the text.
-
-