class PDF::Reader
The Reader class serves as an entry point for parsing a PDF file.
PDF is a page-based file format. There is some data associated with the document (metadata, bookmarks, etc.), but all visible content is stored under a Page object.
In most use cases for extracting and examining the contents of a PDF, it makes sense to traverse the information using page-based iteration.
In addition to the documentation here, check out the PDF::Reader::Page class.
File Metadata
  reader = PDF::Reader.new("somefile.pdf")
  puts reader.pdf_version
  puts reader.info
  puts reader.metadata
  puts reader.page_count
Iterating over page content
  reader = PDF::Reader.new("somefile.pdf")
  reader.pages.each do |page|
    puts page.fonts
    puts page.images
    puts page.text
  end
Extracting all text
  reader = PDF::Reader.new("somefile.pdf")
  reader.pages.map(&:text)
Extracting content from a single page
  reader = PDF::Reader.new("somefile.pdf")
  page = reader.page(1)
  puts page.fonts
  puts page.images
  puts page.text
Low-level callbacks (à la current version of PDF::Reader)
  reader = PDF::Reader.new("somefile.pdf")
  page = reader.page(1)
  # receiver is an object you supply; its methods are invoked as callbacks
  page.walk(receiver)
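The receiver passed to walk is not defined in the snippet above. As a minimal sketch, assuming the walker invokes a show_text callback for each run of text (the class name TextCollector is hypothetical and not part of the library), a receiver might look like the following. The callbacks are invoked by hand here so the sketch runs without a PDF on disk:

```ruby
# Hypothetical receiver: collects every string handed to the show_text callback.
class TextCollector
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  # Assumed callback name; expected to be called once per text-showing operator.
  def show_text(string)
    @chunks << string
  end
end

receiver = TextCollector.new

# With a real file this would be driven by: reader.page(1).walk(receiver)
# Here the callbacks are simulated by calling them directly.
receiver.show_text("Hello")
receiver.show_text("World")

puts receiver.chunks.join(" ")
```

A receiver only needs to define the callbacks it cares about; this sketch ignores everything except text.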
Encrypted Files
Depending on the algorithm, it may be possible to parse an encrypted file. For standard PDF encryption you’ll need the :password option:
  reader = PDF::Reader.new("somefile.pdf", :password => "apples")
Copyright © 2010 James Healy (jimmy@deefa.com)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Copyright © 2008 James Healy (jimmy@deefa.com)
Copyright © 2006 Peter J Jones (pjones@pmade.com)
Copyright © 2011 James Healy (jimmy@deefa.com)
Attributes
objects [R]
Low-level hash-like access to all objects in the underlying PDF.
Public Class Methods
Source
  # File lib/pdf/reader.rb, line 118
  def initialize(input, opts = {})
    @cache = PDF::Reader::ObjectCache.new
    opts.merge!(:cache => @cache)
    @objects = PDF::Reader::ObjectHash.new(input, opts)
  end
Creates a new document reader for the provided PDF.
input can be an IO-ish object (StringIO, File, etc.) containing a PDF, or a filename.
  reader = PDF::Reader.new("somefile.pdf")

  File.open("somefile.pdf", "rb") do |file|
    reader = PDF::Reader.new(file)
  end
If the source file is encrypted you can provide a password for decrypting:
  reader = PDF::Reader.new("somefile.pdf", :password => "apples")
Using this method directly is supported, but it’s more common to use PDF::Reader.open.
Source
  # File lib/pdf/reader.rb, line 174
  def self.open(input, opts = {}, &block)
    yield PDF::Reader.new(input, opts)
  end
Syntactic sugar for opening a PDF file, and the most common approach. Accepts the same arguments as new().
  PDF::Reader.open("somefile.pdf") do |reader|
    puts reader.pdf_version
  end
or
  PDF::Reader.open("somefile.pdf", :password => "apples") do |reader|
    puts reader.pdf_version
  end
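Because open simply yields a new reader, the value of the block becomes the return value of the call. The sketch below mirrors that shape with a stand-in class (FakeReader is hypothetical and only imitates the source shown above) so that it runs without a PDF file:

```ruby
# Stand-in class (hypothetical) that imitates the shape of PDF::Reader.open.
class FakeReader
  attr_reader :input, :opts

  def initialize(input, opts = {})
    @input = input
    @opts = opts
  end

  # Same shape as the source above: build an instance and yield it.
  # Whatever the block returns is returned from open.
  def self.open(input, opts = {})
    yield new(input, opts)
  end
end

summary = FakeReader.open("somefile.pdf", :password => "apples") do |reader|
  "#{reader.input} opened with options #{reader.opts.keys.inspect}"
end

puts summary
```

The source shown above does not close anything after the block returns, so when you pass your own File object it is safest to manage it with the block form of File.open.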
Public Instance Methods
Source
  # File lib/pdf/reader.rb, line 126
  def info
    dict = @objects.deref_hash(@objects.trailer[:Info]) || {}
    doc_strings_to_utf8(dict)
  end
Returns a Hash with some basic information about the PDF file.
Source
  # File lib/pdf/reader.rb, line 134
  def metadata
    stream = @objects.deref_stream(root[:Metadata])
    if stream.nil?
      nil
    else
      xml = stream.unfiltered_data
      xml.force_encoding("utf-8")
      xml
    end
  end
Returns a String with extra XML metadata provided by the author of the PDF file. Not always present; returns nil when the file has no metadata stream.
Source
  # File lib/pdf/reader.rb, line 216
  def page(num)
    num = num.to_i
    if num < 1 || num > self.page_count
      raise InvalidPageError, "Valid pages are 1 .. #{self.page_count}"
    end
    PDF::Reader::Page.new(@objects, num, :cache => @cache)
  end
Returns a single PDF::Reader::Page for the specified page. Use this instead of the pages method when you need to access just a single page.
  reader = PDF::Reader.new("somefile.pdf")
  page = reader.page(10)
  puts page.text
See the docs for PDF::Reader::Page to read more about the methods available on each page.
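Page numbers are 1-based, and the argument is coerced with to_i before the range check, so a non-numeric string becomes 0 and is rejected. A minimal standalone sketch of that check follows; the names PageOutOfRange and checked_page_number are hypothetical stand-ins used only for illustration:

```ruby
# Hypothetical stand-in for the library's InvalidPageError.
class PageOutOfRange < ArgumentError; end

# Coerce to an Integer, then insist on 1 <= num <= page_count.
def checked_page_number(num, page_count)
  num = num.to_i
  if num < 1 || num > page_count
    raise PageOutOfRange, "Valid pages are 1 .. #{page_count}"
  end
  num
end

puts checked_page_number("3", 10)

begin
  checked_page_number("abc", 10) # "abc".to_i is 0, which is out of range
rescue PageOutOfRange => e
  puts "rejected: #{e.message}"
end
```

Callers that take page numbers from user input may want to rescue the error rather than validate the range themselves, since the valid range depends on the document.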
Source
  # File lib/pdf/reader.rb, line 147
  def page_count
    pages = @objects.deref_hash(root[:Pages])
    unless pages.kind_of?(::Hash)
      raise MalformedPDFError, "Pages structure is missing #{pages.class}"
    end
    @page_count ||= @objects.deref_integer(pages[:Count]) || 0
  end
The number of pages in this PDF.
Source
  # File lib/pdf/reader.rb, line 192
  def pages
    return [] if page_count <= 0

    (1..self.page_count).map do |num|
      begin
        PDF::Reader::Page.new(@objects, num, :cache => @cache)
      rescue InvalidPageError
        raise MalformedPDFError, "Missing data for page: #{num}"
      end
    end
  end
Returns an array of PDF::Reader::Page objects, one for each page in the source PDF.
  reader = PDF::Reader.new("somefile.pdf")
  reader.pages.each do |page|
    puts page.fonts
    puts page.rectangles
    puts page.text
  end
See the docs for PDF::Reader::Page to read more about the methods available on each page.
Source
  # File lib/pdf/reader.rb, line 157
  def pdf_version
    @objects.pdf_version
  end
The PDF version this file uses.
Private Instance Methods
Source
  # File lib/pdf/reader.rb, line 228
  def doc_strings_to_utf8(obj)
    case obj
    when ::Hash then
      {}.tap { |new_hash|
        obj.each do |key, value|
          new_hash[key] = doc_strings_to_utf8(value)
        end
      }
    when Array then
      obj.map { |item| doc_strings_to_utf8(item) }
    when String then
      if has_utf16_bom?(obj)
        utf16_to_utf8(obj)
      else
        pdfdoc_to_utf8(obj)
      end
    else
      obj
    end
  end
recursively convert strings from outside a content stream into UTF-8
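The recursion can be tried in isolation. The sketch below keeps the same Hash / Array / String dispatch but takes the per-string conversion as a block, so it does not depend on the library's encoding helpers; deep_convert_strings is a hypothetical name, and upcase merely stands in for a real transcoding step:

```ruby
# Walk nested Hashes and Arrays, replacing every String leaf with the block's result.
# Non-string leaves (Integers, Symbols, nil, ...) pass through unchanged.
def deep_convert_strings(obj, &convert)
  case obj
  when Hash
    obj.each_with_object({}) do |(key, value), out|
      out[key] = deep_convert_strings(value, &convert)
    end
  when Array
    obj.map { |item| deep_convert_strings(item, &convert) }
  when String
    convert.call(obj)
  else
    obj
  end
end

info = { :Title => "report", :Keywords => ["a", "b"], :Pages => 3 }
result = deep_convert_strings(info) { |s| s.upcase }
p result
```

Like the original, this builds new containers rather than mutating the input, and hash keys are left untouched.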
Source
  # File lib/pdf/reader.rb, line 249
  def has_utf16_bom?(str)
    first_bytes = str[0,2]

    return false if first_bytes.nil?

    first_bytes.unpack("C*") == [254, 255]
  end
Source
  # File lib/pdf/reader.rb, line 274
  def root
    @root ||= @objects.deref_hash(@objects.trailer[:Root]) || {}
  end
Source
  # File lib/pdf/reader.rb, line 267
  def utf16_to_utf8(obj)
    str = obj[2, obj.size].to_s
    str = str.unpack("n*").pack("U*")
    str.force_encoding("utf-8")
    str
  end
one day we’ll all run on a 1.9 compatible VM and I can just do this with String#encode
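The comment above anticipates String#encode. A minimal sketch of that approach, assuming the input is a binary string that starts with the UTF-16BE byte order mark (the method names utf16_bom? and utf16be_to_utf8 are hypothetical, not part of the library):

```ruby
# True when the string starts with the UTF-16BE byte order mark (0xFE 0xFF).
def utf16_bom?(str)
  str.byteslice(0, 2).bytes == [0xFE, 0xFF]
end

# Drop the two BOM bytes, label the remainder as UTF-16BE, and transcode to UTF-8.
def utf16be_to_utf8(str)
  body = str.byteslice(2..-1) || ""
  body.dup.force_encoding("UTF-16BE").encode("UTF-8")
end

raw = "\xFE\xFF\x00H\x00i".b # BOM followed by "Hi" in UTF-16BE
puts utf16be_to_utf8(raw) if utf16_bom?(raw)
```

Unlike the unpack("n*").pack("U*") approach, which maps each 16-bit unit independently, String#encode also combines surrogate pairs into a single character.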