Google Drive: how to convert a series of images to text without dying in the attempt

Recently I found myself in a difficult situation when I received a document of more than 10 pages as a PDF made up entirely of images, which I needed to convert to text. On opening it, the first thing I noticed was that each page had a watermark at the bottom saying “Scanned by CamScanner”; that was the only thing in text format. The rest was almost entirely a series of images, each the size of a letter-size sheet.

But that was not the only problem. Most contracts and legal documents are printed in a tabular layout with a series of bars running from side to side, a header, and a series of stamps (in different orientations) and logos around the edges, all of which generate noise when converting images to text. To make matters worse, I didn’t have any software on my personal computer to perform this conversion.

So, without hesitation, I set myself the task of finding an option that was neither difficult nor costly for converting these images to text and eliminating the noise around the edges. After a little investigating, I made a pleasant discovery:

Google Drive can convert image files into text!

In a matter of minutes I was uploading the document to Google Drive, and from there it only took two steps:

  • right-click the file in question,
  • then select the option “Open with > Google Docs”.
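
For anyone who would rather script these two steps than click through them, here is a minimal sketch using only the Python standard library. It is not what I did; it assumes the Drive API v3 multipart upload endpoint, where asking for the Google Docs MIME type as the target triggers the conversion and `ocrLanguage` is an optional language hint. The function name `build_ocr_upload_request`, the file name, the token, and the `"es"` hint are all hypothetical placeholders, and the function only builds the request without sending it.

```python
import json
import urllib.parse
import urllib.request
import uuid

UPLOAD_URL = "https://www.googleapis.com/upload/drive/v3/files"
GOOGLE_DOC_MIME = "application/vnd.google-apps.document"


def build_ocr_upload_request(pdf_bytes, name, access_token, ocr_language=None):
    """Build (but do not send) an upload that asks Drive to convert a PDF to a Google Doc.

    access_token is assumed to be a valid OAuth 2.0 token with Drive access;
    obtaining one is outside the scope of this sketch.
    """
    boundary = uuid.uuid4().hex
    sep = b"--" + boundary.encode("ascii")
    # Requesting the Google Docs MIME type as the target is what triggers conversion.
    metadata = json.dumps({"name": name, "mimeType": GOOGLE_DOC_MIME}).encode("utf-8")
    body = b"".join([
        sep + b"\r\n",
        b"Content-Type: application/json; charset=UTF-8\r\n\r\n",
        metadata + b"\r\n",
        sep + b"\r\n",
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes + b"\r\n",
        sep + b"--\r\n",
    ])
    params = {"uploadType": "multipart"}
    if ocr_language:
        # Optional hint for the OCR step (ISO 639-1 code).
        params["ocrLanguage"] = ocr_language
    url = UPLOAD_URL + "?" + urllib.parse.urlencode(params)
    headers = {
        "Authorization": "Bearer " + access_token,
        "Content-Type": "multipart/related; boundary=" + boundary,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Placeholder bytes and token: this only prints the URL, nothing is uploaded.
    req = build_ocr_upload_request(b"%PDF-1.4 placeholder", "contract", "TOKEN", "es")
    print(req.get_method(), req.full_url)
```

Actually sending the request would mean passing it to `urllib.request.urlopen` with a real token. The same limits on resolution, orientation and file size should be expected to apply as with the right-click route.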

I was already seeing results, although they were not the ones I expected! “Murphy’s law” (a piece of folk wisdom that states, roughly, “if something can go wrong, it probably will”) came on the scene with all its power, leaving me with a document that contained the letter-size images I mentioned at the beginning; all that had been converted to text was the aforementioned line: “Scanned by CamScanner”.

But my disappointment was short-lived and, in less than “the twinkling of an eye”, I found myself exploring the conditions needed to improve the quality. After reading a little more about this feature of Google Drive, I realized that there were a number of important points:

  • resolution: the text should be at least 10 pixels high,
  • orientation: documents must be placed with the correct side up,
  • languages, fonts and character sets: Google Drive detects the language of the document,
  • image quality: crisp images with even lighting and clear contrast work best (blurry photos reduce the quality of the text),
  • file size: max 2 MB for PDF image files.
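
Two of the points above are easy to check before uploading: the 2 MB size limit and the 10-pixel minimum text height. The following is a rough sketch, assuming that “2 MB” means 2 × 1024 × 1024 bytes and that the height of printed text can be approximated from its font size in points and the scan resolution in DPI (1 point = 1/72 inch). The function names and the example values are hypothetical, and the point size overstates the height of lowercase letters, so treat the result as an optimistic estimate.

```python
import os

MAX_FILE_BYTES = 2 * 1024 * 1024  # assumed reading of the 2 MB limit
MIN_TEXT_HEIGHT_PX = 10           # minimum text height from the list above


def text_height_px(font_size_pt, scan_dpi):
    """Approximate pixel height of text set at font_size_pt when scanned at scan_dpi."""
    return font_size_pt / 72 * scan_dpi  # 1 point = 1/72 inch


def preflight(path, font_size_pt, scan_dpi):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_FILE_BYTES}-byte limit")
    height = text_height_px(font_size_pt, scan_dpi)
    if height < MIN_TEXT_HEIGHT_PX:
        problems.append(
            f"text is about {height:.1f} px high, under {MIN_TEXT_HEIGHT_PX} px"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical values: 12 pt body text scanned at 150 DPI.
    print(round(text_height_px(12, 150), 1))
```

By this estimate, 12 pt text scanned at 150 DPI comes out at roughly 25 pixels, comfortably above the minimum, while 6 pt small print scanned at 72 DPI would fall below it.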

Because of this, I decided to crop the images, using an application called CamScanner, to eliminate all that noise in the header and the stamps around the edges, leaving only the body of the document. The process took me a few minutes (arranging each page as I went), and then I converted the final result to a PDF file. After that, I uploaded this PDF to Google Drive and ran the OCR process again (to convert the images to text).
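
I did the cropping by hand in the app, but the same idea can be sketched in code: trim a fixed fraction from each edge of the page so that only the body remains. The helper below is a hypothetical illustration, and its default margins are invented values that would need tuning for each document. It returns the box as (left, upper, right, lower), the order that an imaging library such as Pillow expects for its `Image.crop` method; using Pillow here is an assumption, not part of my process.

```python
def body_crop_box(width, height, top=0.15, bottom=0.10, left=0.08, right=0.08):
    """Return (left, upper, right, lower) in pixels after trimming each edge.

    Each margin is the fraction of the page to remove from that edge.
    The defaults are illustrative guesses, not measured values.
    """
    for frac in (top, bottom, left, right):
        if not 0 <= frac < 1:
            raise ValueError("each margin must be a fraction in [0, 1)")
    if top + bottom >= 1 or left + right >= 1:
        raise ValueError("margins leave nothing to keep")
    x0 = round(width * left)
    y0 = round(height * top)
    x1 = width - round(width * right)
    y1 = height - round(height * bottom)
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    # A 1000 x 2000 pixel page, trimming 10% top and bottom and 5% on each side.
    print(body_crop_box(1000, 2000, top=0.1, bottom=0.1, left=0.05, right=0.05))
```

Fixed margins only work when every page has the header and stamps in roughly the same place; pages with stamps at odd positions still need a manual pass.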

What happened in the end?

The end result was not perfect, but I can say that it saved me at least 85% of the work of transcribing the document. I only had to fix some characters and words that the system did not recognize properly because of the low resolution of the images in the original PDF.
