Pdf Reader Tutorial



Have a look at the sample file. In this tutorial we will learn simple methods to open it, navigate its pages, and extract images and text.

Prerequisites¶

Before we start, let's make sure that you have the pdfreader distribution installed. In the Python shell, importing pdfreader should run without raising an exception.
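As a slightly more defensive variant of that check, here is a small sketch that reports a missing package instead of raising. It uses only the standard library; the helper name has_pdfreader is illustrative and not part of pdfreader.

```python
import importlib.util


def has_pdfreader() -> bool:
    """Return True if the pdfreader package can be found by this interpreter."""
    return importlib.util.find_spec("pdfreader") is not None


if has_pdfreader():
    print("pdfreader is installed")
else:
    print("pdfreader is missing; install it with: pip install pdfreader")
```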

Tutorial

How to start¶

Note: If you need to extract texts/images or other content from PDF you can skip these chapters and go directly to How to start extracting PDF content.

The first step when working with pdfreader is to create a PDFDocument instance from a binary file. Doing so is easy.
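A minimal sketch of that step, assuming PDFDocument is importable from the top-level pdfreader package and accepts an open binary file. The helper name open_document is hypothetical. It returns the file object together with the document so that the caller can keep the file open for as long as the document is in use.

```python
def open_document(path):
    """Open `path` in binary mode and wrap it in a PDFDocument.

    Returns (fd, doc). The caller keeps `fd` open while using `doc`
    and closes it afterwards.
    """
    fd = open(path, "rb")  # PDF files must be opened in binary mode
    try:
        # Imported here so the helper can be defined even where
        # pdfreader is not installed yet.
        from pdfreader import PDFDocument

        return fd, PDFDocument(fd)
    except Exception:
        fd.close()  # do not leak the handle if anything goes wrong
        raise
```

Typical use would be fd, doc = open_document("example.pdf") with a path of your own, followed by fd.close() once you are finished with doc.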

Because pdfreader implements lazy PDF reading (it never reads more than you ask from the file), it's important to keep the file open while you are working with the document. Make sure you don't close it until you're done.

It is also possible to use a binary file-like object to create an instance, for example:
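For instance, the whole file can be read into an in-memory io.BytesIO stream first. This is a sketch under the assumption that PDFDocument accepts any seekable binary stream; both helper names are made up for illustration.

```python
from io import BytesIO


def load_into_memory(path):
    """Read a whole file into a seekable in-memory binary stream."""
    with open(path, "rb") as f:
        return BytesIO(f.read())


def document_from_stream(stream):
    """Build a PDFDocument from a binary file-like object."""
    from pdfreader import PDFDocument  # lazy import; requires pdfreader

    return PDFDocument(stream)
```

Since the stream lives in memory, the original file can be closed right away. The price is holding the entire PDF in RAM.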

Let's check the PDF version of the document and its metadata.
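A short sketch of that check, assuming the document exposes the version as doc.header.version and the document information as doc.metadata. The describe helper is illustrative only.

```python
def describe(doc):
    """Return the PDF version and metadata of a PDFDocument-like object."""
    return {
        "version": doc.header.version,  # a string such as '1.6'
        "metadata": doc.metadata,       # document info: creator, dates, ...
    }
```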

Now we can move on to the document catalog and walk through the pages.

How to access Document Catalog¶

The Catalog (aka Document Root) contains all you need to know to start working with the document: metadata, a reference to the pages tree, layout, outlines, etc.

For the full list of document root attributes see the PDF-1.7 specification, section 7.7.2.
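As a sketch, assuming the catalog is available as doc.root and that its entries can be read as attributes named after the PDF dictionary keys (Type, Pages), a few of them can be collected like this. The catalog_summary helper is hypothetical.

```python
def catalog_summary(doc):
    """Collect a few entries of the document catalog."""
    root = doc.root
    return {
        "type": root.Type,               # 'Catalog' for a document root
        "page_count": root.Pages.Count,  # Count entry of the pages tree root
    }
```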

How to browse document pages¶

There is a generator, pages(), to browse the pages one by one. It yields Page instances.

You may also read all the pages at once.

Now we know how many pages there are!
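Both ways of browsing can be sketched with two tiny helpers that rely only on pages() being a generator. The helper names are illustrative.

```python
def first_page(doc):
    """Return the first page without reading the rest of the document."""
    return next(doc.pages())


def read_all_pages(doc):
    """Read every page into a list; len() of the result is the page count."""
    return list(doc.pages())
```

Reading everything at once is convenient for small files. For large documents the generator keeps memory usage low.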

You may wish to get a specific page if your document contains hundreds or thousands of them. Doing this is just a little bit trickier: to get the 6th page you need to walk through the previous five.

Don't forget that PDF viewers start page numbering from 1, whereas Python lists start their indexes from 0.
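One way to do this is itertools.islice, which skips the earlier pages without keeping them around. The sketch below takes a 1-based page number, as PDF viewers do, and converts it to the 0-based offset. The get_page helper is hypothetical.

```python
from itertools import islice


def get_page(doc, number):
    """Return page `number`, counting from 1 as PDF viewers do."""
    if number < 1:
        raise ValueError("page numbers start at 1")
    try:
        # islice counts from 0, so page 6 is the slice [5, 6)
        return next(islice(doc.pages(), number - 1, number))
    except StopIteration:
        raise IndexError(f"document has fewer than {number} pages") from None
```

If all pages were already read into a list, the same page is simply all_pages[number - 1].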

Now we can access all page attributes:

It's possible to access the parent Pages Tree Node for the page, which is a PageTreeNode instance, and all its kids.

Our example contains only one Pages Tree Node. That is not always the case.
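A sketch of reading page attributes and the parent node, assuming they are exposed as attributes named after the PDF dictionary keys (MediaBox, Parent, Type, Count, Kids). The page_info helper is illustrative.

```python
def page_info(page):
    """Collect a few attributes of a page and of its parent tree node."""
    parent = page.Parent
    return {
        "media_box": page.MediaBox,     # page size, e.g. [0, 0, 612, 792]
        "parent_type": parent.Type,     # 'Pages' for a Pages Tree Node
        "declared_count": parent.Count, # leaf pages below this node
        "direct_kids": len(parent.Kids),
    }
```

When the tree has a single node, declared_count and direct_kids are equal. With nested nodes, Count covers all descendant pages while Kids lists only the immediate children.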

For the complete list of Page and Pages attributes see the PDF-1.7 specification, sections 7.7.3.2-7.7.3.3.

How to start extracting PDF content¶

It's possible to extract raw data with a PDFDocument instance, but it just represents the raw document structure. It can't interpret PDF content operators, which is why working with it directly might be hard.

Fortunately there is SimplePDFViewer, which understands a lot. It is a simple PDF interpreter which can 'display' (whatever this means) a page on a SimpleCanvas.

Document metadata is also accessible through a SimplePDFViewer instance.

The viewer instance gets the content you see in your Adobe Acrobat Reader. SimplePDFViewer provides you with a SimpleCanvas object for every page. This object contains the page content: images, forms, texts.

You can walk through all the document's pages and extract data from each page's canvas.
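A sketch of such a walk, assuming that iterating over a SimplePDFViewer yields one rendered canvas per page and that the canvas exposes images, inline_images, forms, strings and text_content. The values are copied in case the viewer reuses the canvas between pages. The helper name is hypothetical.

```python
def collect_page_content(viewer):
    """Walk every page and gather what each canvas exposes."""
    pages = []
    for canvas in viewer:  # one canvas per page
        pages.append({
            "images": dict(canvas.images),                # XObject images by name
            "inline_images": list(canvas.inline_images),  # BI/ID/EI images
            "forms": dict(canvas.forms),                  # XObject forms by name
            "strings": list(canvas.strings),              # decoded plain strings
            "text_content": canvas.text_content,          # PDF markdown
        })
    return pages
```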

You can also navigate to a specific page with navigate() and then call render().
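The two calls fit naturally into one small helper. This is a sketch that assumes navigate() takes a 1-based page number and that the rendered content ends up on viewer.canvas; the render_page name is illustrative.

```python
def render_page(viewer, number):
    """Jump to page `number` (1-based), render it and return its canvas."""
    viewer.navigate(number)
    viewer.render()  # the canvas stays empty until the page is rendered
    return viewer.canvas
```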

The viewer extracts:
  • page images (XObject)
  • page inline images (BI/ID/EI operators)
  • page forms (XObject)
  • decoded page strings (PDF encodings & CMap support)
  • human (and robot) readable page markdown - original PDF commands containing decoded strings.

Extracting Page Images¶

There are 2 kinds of images in PDF documents:
  • XObject images
  • inline images

Each kind is represented by its own class (Image and InlineImage).

Let's extract some pictures now! They are accessible through the canvas attribute. Have a look at page 8 of the sample document. It contains a fax message, which is available on the inline_images list.

This would be nothing if you can't see the image itself :-) Now let's convert it to a Pillow/PIL Image object and save it!
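Here is a sketch of both steps, assuming inline images expose Filter, Width and Height and that images offer a to_Pillow() conversion (which needs Pillow installed). The helper names and the output path are made up for illustration.

```python
def describe_inline_images(canvas):
    """Return (Filter, Width, Height) for each inline image on a canvas."""
    return [(img.Filter, img.Width, img.Height) for img in canvas.inline_images]


def save_image(image, path):
    """Convert a pdfreader image to a Pillow image and save it.

    Pillow picks the output format from the file extension of `path`.
    """
    pil_image = image.to_Pillow()
    pil_image.save(path)
    return path
```

After rendering page 8, something like save_image(viewer.canvas.inline_images[0], "fax-from-p8.png") would write the fax to disk.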

Voila! Enjoy opening it in your favorite editor!

Check the complete list of Image (sec. 8.9.5) and InlineImage (sec. 8.9.7) attributes.

Extracting texts¶

Getting texts from a page is super easy. They are available on the strings and text_content attributes.

Let's go to the previous page (#7) and extract some data.

Remember, when you navigate to another page the viewer resets the canvas.

Let's render the page and see the texts.
  • Decoded plain text strings are on strings (in pieces, in the order they come on the page)
  • Decoded strings with PDF markdown are on text_content

As you can see, every character comes as an individual string in the page content stream here, which is unusual.
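When the pieces are that small, joining them gives the readable text back. A minimal sketch, relying only on canvas.strings being a list of str; the plain_text helper is illustrative.

```python
def plain_text(canvas, separator=""):
    """Join the decoded string pieces of a rendered canvas."""
    return separator.join(canvas.strings)
```

An empty separator suits pages that emit one character per string. For pages that emit whole words or lines, a space or a newline may read better.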

Let's go to the very first page.

PDF markdown is also available.

And the strings are decoded properly. Have a look at the file.

pdfreader takes care of decoding binary streams, character encodings, CMap, fonts, etc. So in the end you have human-readable content sources and markdown.

Hyperlinks and annotations¶

Let's have a look at the sample file.

It contains several hyperlinks. Let's extract them!

Unlike HTML, PDF links are rectangular parts of the viewing area; they are neither text properties nor attributes. That's why you can't find linked URLs in the text content.

Links can be found in Page annotations (see 12.5 Annotations), which help the user interact with the document.

Annotations for the current page are accessible through annotations(). The sample document has 3 annotations.

There are different types of annotations. Hyperlinks have Subtype of Link. We're ready to extract URLs:
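A sketch of the extraction, assuming each Link annotation carries its action under A and the target under A.URI, following the PDF annotation dictionary keys. Whether annotations is a property or a method is treated as unknown here, so the code accepts both. The helper names are hypothetical.

```python
def current_annotations(viewer):
    """Return the current page's annotations as a list.

    Accepts `annotations` as either a property or a method, because
    its exact form is an assumption in this sketch.
    """
    annots = viewer.annotations
    if callable(annots):
        annots = annots()
    return list(annots or [])


def extract_links(viewer):
    """Collect the URI of every Link annotation on the current page."""
    links = []
    for annot in current_annotations(viewer):
        if getattr(annot, "Subtype", None) != "Link":
            continue  # sticky notes, highlights and so on
        action = getattr(annot, "A", None)
        uri = getattr(action, "URI", None) if action is not None else None
        if uri is not None:  # links inside the document have no URI
            links.append(uri)
    return links
```

The URIs are likely to come back as bytes rather than str, so a decode step may be needed before using them as text.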

Encrypted and password-protected PDF files¶

What if your file is protected by a password? Not a big deal! pdfreader supports encrypted and password-protected files! Just specify the password when you create a PDFDocument or SimplePDFViewer.

Let's see how this works with an encrypted, password-protected sample file. The password is qwerty.

The same goes for PDFDocument.

What if the password is wrong? It throws an exception.

The same applies to SimplePDFViewer.
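The password handling for both classes can be sketched with one helper. It assumes that PDFDocument and SimplePDFViewer accept a password keyword argument and that a wrong password surfaces as ValueError. The exception type in particular is an assumption, so check it against your installed version. The helper name is hypothetical.

```python
def open_with_password(factory, fd, password):
    """Try to open `fd` with `password`; return None if it is rejected.

    `factory` is the class to instantiate: PDFDocument or SimplePDFViewer.
    """
    try:
        return factory(fd, password=password)
    except ValueError:
        # Assumption: an incorrect password raises ValueError.
        return None
```

With the sample file, open_with_password(SimplePDFViewer, fd, "qwerty") would return a viewer, and any other password would give None.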

Note: Did you know that the PDF format supports encrypted files protected by the default empty password? Even though the password is empty, such files are still encrypted. Fortunately, pdfreader detects and decrypts such files automatically; there is nothing special to do!




