
Make scanned PDFs searchable in SharePoint & Teams


I came across this scenario when one of my customers asked for it, and I think it might be useful for others as well. It’s weird to me that this is not built in by default, because SharePoint has an OCR service that processes images. That service does not seem to work for PDFs that consist of scanned images, though.

I think there are a lot of scenarios where this can be useful, for example scanned invoices dropped in SharePoint. You would like them to be searchable by invoice number without having to tag every document.

In this blog post I’ll be using Power Automate. You’ll need the following things to get the flow up and running:

  • Computer Vision API (cognitive service from Azure)
  • A Power Automate per user plan to use the HTTP connector

First we should set up the Computer Vision API in Azure. Go to the Azure portal, create a resource and search for Computer Vision.

Notice the pricing tier dropdown. I’m using Free (F0) for testing purposes, and I think it might also be enough for certain production scenarios. But if you’re unsure whether your PDFs will have more than 2 pages in a single file, you should go with Standard (S1). For more info about the pricing tiers, see the Computer Vision pricing page in Azure.

That’s it in Azure for now. We will need the API key of this service later in the Power Automate flow. If you would like to skip the individual steps, the full flow is shown at the end of this post.

In Power Automate we start with a ‘When a file is created or modified’ trigger. We will add a trigger condition to make sure the flow only runs for PDFs.

Note: If your scenario only requires the ‘When a file is created’ trigger, you can use that as well (it will not fire when a file is replaced). In my case I also want to catch file updates, which might of course apply to your scenario as well.

Go to the settings of this trigger and add the following trigger condition. Type the quotes as straight single quotes.
@endsWith(triggerOutputs()?['body/{FilenameWithExtension}'], '.pdf')
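
To make the effect of the trigger condition concrete, here is a small Python sketch of the same check. It is only an illustration: the function name is made up for this example, and it assumes the comparison ignores case, which is how I understand the endsWith expression function to behave.

```python
def is_pdf(filename_with_extension: str) -> bool:
    """Mirror the trigger condition: true only for names ending in .pdf."""
    # Lower-casing the name models a case-insensitive comparison (assumption).
    return filename_with_extension.lower().endswith(".pdf")


print(is_pdf("invoice-2020-001.pdf"))
print(is_pdf("SCAN.PDF"))
print(is_pdf("photo.png"))
```

Files for which this check is false never start the flow, so they do not count as flow runs.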

After this I initialize 4 variables, which will make the later actions a lot easier.

  • PDFText => This will temporarily store our full text result from the PDF
  • RequestId => This will store our request id, returned from the Computer Vision API. We will need this to poll for the result.
  • RequestStatus => Makes it easy to check for the status of the polling request result.
  • ResponseBody => Will store the response body of the latest successful polling request.

After this comes the most important part: calling the Computer Vision API through the HTTP connector. This might look weird, because there are standard Computer Vision OCR actions in Power Automate.

Make sure you get your API key for the Computer Vision API from Azure. You should pass it in the ‘ocp-apim-subscription-key’ header.

Extract the apim-request-id from the headers of the response with the following expression.
outputs('POST_-Computer_Vision_API_3.0-_Read')['headers']['apim-request-id']
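
For readers who want to see the HTTP call outside Power Automate, the sketch below builds the same kind of request in Python and reads the request id from a set of response headers. It is a rough illustration, not the flow itself: the endpoint host, the key placeholder and both function names are assumptions, and the path assumes the v3.0 Read operation that the action name refers to. The request is only built, not sent.

```python
import urllib.request


def build_read_request(endpoint: str, api_key: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST that submits a PDF to the Read operation."""
    # Path of the v3.0 Read operation (assumed from the action name in the flow).
    url = endpoint.rstrip("/") + "/vision/v3.0/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        # The PDF is sent as raw bytes in the request body.
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")


def extract_request_id(response_headers: dict) -> str:
    """Find apim-request-id regardless of how the header name is capitalised."""
    for name, value in response_headers.items():
        if name.lower() == "apim-request-id":
            return value
    raise KeyError("apim-request-id header not found")


# Hypothetical endpoint, key and file content, for illustration only.
request = build_read_request(
    "https://example.cognitiveservices.azure.com", "<your-api-key>", b"%PDF-1.4 ..."
)
print(request.get_method(), request.full_url)
print(extract_request_id({"apim-request-id": "hypothetical-id"}))
```

The call is asynchronous: the response to this POST does not contain the text yet, only the id needed to ask for the result later.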

Note: The reason I’m not using the standard Computer Vision connector is that its endpoints are still on V2. Those APIs only accept images (no PDFs), so we would have to convert our file before sending it to a V2 endpoint. More importantly, the V2 APIs give bad OCR results: even simple, clearly readable characters are not recognized correctly, at least in my experience.

Now that we have posted our OCR request to the Computer Vision API, we have the request id (apim-request-id). We can start a ‘Do until’ loop that polls until the result is ready.

You can get the RequestStatus from the response body of the HTTP call.
body('GET_-Computer_Vision_API_3.0-_Read')['status']
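
The polling loop can be sketched in Python as follows. This is a simplified stand-in for the ‘Do until’ loop, with assumptions spelled out: the function name, the attempt limit and the delay are made up for the example, the status values are the ones I understand the v3.0 Read operation to return (notStarted, running, failed, succeeded), and the real GET call is replaced by simulated responses so the sketch runs without an Azure resource.

```python
import time
from typing import Callable


def poll_until_finished(
    fetch_result: Callable[[], dict],
    max_attempts: int = 30,
    delay_seconds: float = 2.0,
) -> dict:
    """Call fetch_result until the status is 'succeeded' or 'failed'."""
    for attempt in range(max_attempts):
        body = fetch_result()
        status = body.get("status")
        if status == "succeeded":
            return body  # this is what the flow stores in ResponseBody
        if status == "failed":
            raise RuntimeError("OCR request failed")
        # Still 'notStarted' or 'running': wait before asking again.
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    raise TimeoutError("OCR result was not ready in time")


# Simulated responses stand in for the real GET call.
responses = iter([
    {"status": "notStarted"},
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"readResults": []}},
])
result = poll_until_finished(lambda: next(responses), delay_seconds=0)
print(result["status"])
```

Stopping on a failed status and capping the number of attempts keeps the loop from running until the flow itself times out.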

Finally, after this ‘Do until’ loop completes, our ResponseBody contains the final OCR result. We can extract all text lines into the PDFText variable with the expressions below.

The Pages variable can be initialized with the following value.
variables('ResponseBody')?['analyzeResult']?['readResults']

The TextLines can be set with the following value inside the foreach page.
items('Foreach_Page')?['lines']

The PDFText can be appended with the following string. You’ll notice there is a line break inside the concat function. Paste the text into the expression exactly like this, so that each text line ends up on its own line in the PDFText string.
concat(items('Foreach_TextLine')?['text'], '
')
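
The two nested loops above can be summarized in a few lines of Python. This is a sketch of the same logic, assuming the response shape used in the expressions (analyzeResult, readResults, lines, text); the function name and the sample data are invented for the illustration.

```python
def extract_pdf_text(response_body: dict) -> str:
    """Join every recognized text line, one per line, like the PDFText variable."""
    # Pages: variables('ResponseBody')?['analyzeResult']?['readResults']
    pages = (response_body.get("analyzeResult") or {}).get("readResults") or []
    pdf_text = ""
    for page in pages:
        # TextLines: items('Foreach_Page')?['lines']
        for line in page.get("lines") or []:
            # Same as concat(text, newline) appended to PDFText.
            pdf_text += line.get("text", "") + "\n"
    return pdf_text


# Invented sample in the shape of a Read result.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {"page": 1, "lines": [{"text": "Invoice 2020-001"}, {"text": "Total: 100.00"}]},
            {"page": 2, "lines": [{"text": "Page two"}]},
        ]
    },
}
print(extract_pdf_text(sample))
```

The `or []` fallbacks play the role of the `?` operators in the flow expressions: a missing property yields an empty result instead of an error.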

As a last step, we should store the generated data into the SharePoint item.

There is a default SharePoint column called ‘MediaServiceOCR’, but it seems we cannot update this column with more than 255 characters. SharePoint itself seems to be able to do this in the backend, but unfortunately we can’t.

So we should push the PDFText variable into a multi-line text field. We will need our own column to store this data: create a multi-line text column and set its value in the update action.

Note: If you are using the ‘created or modified’ trigger, the flow triggers itself in this scenario, because it updates the item. The good thing is that this will not loop forever, as the modified trigger does not fire when no metadata has changed. So the flow will run 2 times for each change to the PDF file. This is not critical, but it could be optimized by comparing the vti_x005f_streamhash property from the getFileByServerRelativeUrl()/Properties endpoint before each run.

As a final result, your flow should look something like this.

Download an exported sample of this flow here.

Done! Enjoy your searchable PDFs in SharePoint. You can now search for any keyword in a PDF to find the file you are looking for. For example, you can find a scanned invoice by the invoice number printed on it.

As Teams uses SharePoint for its file storage, you’ll be able to use the Teams search bar to find your file by its content as well.
