September 26, 2022 by Simeon Hermann
How to transform a photo-scanned document into a professionally scanned document
As a junior image processing developer for the mobile document management app Docutain, I am eager to tell you more about how we create authentic document scans from simple photo-scanned documents. But first, let us step back a bit to explain the general issue.
For a long time, classic document scanners were the only real option if you wanted to digitize documents for purposes such as digital archiving or paperless bureaucratic processes. Nowadays, it has become more and more common to use smartphone cameras as mobile scanners to digitize documents. While the use of handy, mobile cameras as scanners, especially in private settings, has some obvious advantages over separate, stationary scanners, the resulting scans exhibit some optical weaknesses.
The reason for this is that the imaging conditions for simple photographs are largely uncontrolled. Flatbed scanners, by contrast, ensure that the document lies flat and is evenly illuminated, and thus provide largely controlled imaging conditions. The uncontrolled environment while casually photo-scanning with your phone can result in a wide variety of undesirable effects, which on the one hand can make digital post-processing such as text recognition (OCR) more difficult, and on the other hand can impair the readability and aesthetics of the document photos for the human viewer.
Those effects can be divided into geometric and photometric distortions. Geometric distortions primarily result from the fact that a document was not photographed from a perpendicular perspective and does not lie completely flat on a surface. You may know this from folded letters that must be held flat manually. But this will not be the focus now. Photometric distortions, on the other hand, manifest themselves as illumination artifacts such as brightness gradients, shading, shadows or color casts. They might result from various light sources that are not aimed directly at the document, but also from objects occluding the incoming light. A really common scenario is the smartphone and photographer occluding the ceiling light and thereby casting a more or less pronounced shadow onto the document photo. Color casts especially occur with white paper, as it reflects the light's whole visible spectrum. Usually, you have to deal with a warm yellowish or a cold bluish lighting mood. You can see some of those typical effects in the images below:
Various photometric effects such as color casts (left), phone shadows (middle) or darkening shade (right)
With Docutain it is possible to remove those distortions. To handle geometric distortions, the photos are cropped to the documents themselves and then corrected for perspective. The included filters can then be applied to correct the photometric distortions. The result should be a photo without any illumination artifacts, showing only the document's surface properties. Just like in the filtered photos below, which are taken directly from Docutain:
illumination corrected versions of the previous document photos
To remove those artifacts, illumination correction must be applied to the document image. The basic idea is to find out which structures in the photo can be attributed to the illumination and which ones to the actual content of the document. According to the Retinex theory, the image formation model can simply be expressed by
N = R * I.
The original illuminated image is N, which consists of the reflectance R, basically the content of the document, and the illumination term I. The most common basic approach is to estimate the illumination image by removing the reflectance structures. Once you know the illumination image, you can remove the illumination influence from the input image by simply dividing N by I. This might sound quite simple, but the mentioned estimation of the illumination often is not that easy.
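To make the model a bit more tangible, here is a tiny, purely illustrative NumPy sketch (not code from Docutain) that fabricates a smooth illumination gradient, multiplies it onto a synthetic document and recovers the reflectance again by dividing N by I:

```python
import numpy as np

# Synthetic reflectance R: almost white "paper" with one dark printed block
h, w = 200, 300
R = np.full((h, w), 0.95)
R[80:120, 50:250] = 0.1

# Smooth horizontal brightness gradient as illumination I
I = np.tile(np.linspace(0.4, 1.0, w), (h, 1))

# Image formation model: observed photo N = R * I
N = R * I

# Illumination correction by division (epsilon guards against division by zero)
R_recovered = N / np.maximum(I, 1e-6)
print(np.allclose(R, R_recovered))  # True
```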
Check out our Docutain SDK
Integrate high quality document scanning, text recognition and data extraction into your apps. If you would like to learn more about the Docutain SDK, contact us anytime via SDK@Docutain.com.
When it comes to documents, there are luckily a lot of assumptions that we can make about the content of the original document. This makes the illumination correction a lot easier than for arbitrary subjects such as outdoor photos. One key assumption is that the document's content is printed on uniformly colored paper. Mostly it is pure white, but basically it can be any brighter color. So if we somehow know the original paper color, we can turn the task of illumination estimation into a question of background estimation. We just remove the printed content of the document, i.e. the foreground, from the photo to obtain an illuminated background image. Since the original background is uniform, all structures in the background image can be traced back to the illumination. Applying the image formation model to the background, we now have the input background image N_BG as well as the (homogeneous) original background color M, i.e. the paper material color. With this, the illumination image I can be calculated. It basically describes, in the form of a pixelwise gain factor, how strongly the document was distorted by the influence of the illumination. Since the illumination is independent of whether it was derived from the original or the background image, we can use it for the illumination correction of the original image as well. The formula for that is
R = (M / N_BG) * N,
where M is the original paper material color, N_BG the illuminated background image, and N the input image. The result is R, which shows only the content of the document including the given uniform background color. This is exactly what we strive for.
In summary, the illumination estimation is practically equivalent to a background estimation if there is a given homogeneous background or paper color.
separation of the original image (left) into reflectance (middle) and illumination (right)
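To make the formula concrete, the following is a minimal sketch (in Python with NumPy, not our actual implementation) of how such a correction could be applied once an illuminated background image and a paper color are given; the function name and the assumption that images are float arrays in [0, 1] are purely for illustration:

```python
import numpy as np

def correct_illumination(N, N_bg, M):
    """Apply R = (M / N_BG) * N pixelwise.

    N    -- observed document photo, float array in [0, 1], shape (h, w) or (h, w, 3)
    N_bg -- estimated illuminated background image, same shape as N
    M    -- assumed original paper color, scalar or per-channel values in [0, 1]
    """
    gain = M / np.maximum(N_bg, 1e-6)   # pixelwise gain factor derived from the background
    return np.clip(gain * N, 0.0, 1.0)  # corrected reflectance image R
```

The clipping at the end simply keeps over-corrected pixels within the valid value range.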
Nevertheless, at this point it is still open how we can determine a background image. In order to remove the foreground from the input image using suitable image processing methods, we need additional assumptions about the influence of the illumination, the document's content and the background itself. There is quite a range of assumptions about document photos. The problem is, however, that hardly any of them are always correct. With increasing complexity of the document's content as well as of the illumination situation, their correctness becomes more and more uncertain.
When it comes to plain text documents with black text on white paper, we have quite specific expectations of the document photo. Usually, text is rather small and can be detected by high intensity gradients. The illumination component, on the other hand, mostly consists of large-scale and smoothly varying structures. Just think of a typical brightness gradient or shading that is almost unavoidable when taking photos, especially of bright objects. Typical image processing methods to remove the text from such a document photo without losing information about the illumination component are low-pass filters, rank filters or morphological filters. Low-pass filters keep the smooth gradient along the whole document but filter out high frequency components such as text. Rank filters such as the median filter can be used based on the assumption that the foreground is rather small. Then, the foreground pixels make up only a small portion of the pixels in a neighborhood and are replaced by background pixels. Of course, the kernel size must be adjusted to the typical pixel size of the text. Furthermore, a percentile filter with a high percentile or even a morphological dilation both exploit the fact that the foreground is significantly darker than the background and, thus, can be eliminated more effectively by choosing a rather bright pixel value from the neighborhood.
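To give a rough idea of how such filters look in practice, here is a small OpenCV sketch (a simplified illustration, not Docutain's implementation) that estimates the background of a black-text-on-white-paper photo with a median filter and, alternatively, a morphological dilation; the file name and the kernel sizes are assumptions that would have to be tuned to the actual text size:

```python
import cv2

# Load a document photo (hypothetical file name) and convert it to grayscale
img = cv2.imread("document_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Rank filter: a median filter removes small dark text as long as it covers
# only a minority of each neighborhood; the kernel must exceed the stroke width
bg_median = cv2.medianBlur(gray, 21)

# Morphological filter: dilation lets the bright background "overwrite" the
# darker foreground pixels
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
bg_dilated = cv2.dilate(gray, kernel)
```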
But it is not always as easy as that, for two reasons: On the one hand, the illumination artifacts to be removed are not always smooth and large-scale. Common exceptions are strong shadows with distinct edges or small shaded areas resulting from wrinkles in the paper. On the other hand, documents cannot always be reduced to text alone. They might contain any kind of illustration as well, and the possibilities of their look, shape or size are virtually unlimited.
In these cases, our simple foreground removal methods would have the following effects: During illumination estimation, some illumination artifacts like the ones just mentioned may be falsely removed, while larger printed content such as illustrations may remain in the estimated illumination. Conversely, for the illumination correction this means that those illumination artifacts will not be removed, while the correction tries to remove the illustrations and thereby impairs them.
So if we do not want to focus on plain text documents only, the most common approach is segmentation and interpolation. If we manage to identify the foreground elements, we can simply mask them and estimate the hypothetical background in the masked areas via interpolation. Accordingly, the two main tasks are the segmentation of background and foreground, and a suitable interpolation method.
The segmentation approaches are mostly based, in one way or another, on the assumption that foreground elements are separated from the background by gradients. But to what extent this actually applies is uncertain. Due to the sheer variety of illumination artifacts and, even more so, of different looking illustrations, the assumption cannot always be guaranteed. In the worst case, an illustration merges smoothly into the background. However, since we assume that the two segments are separated, the approaches either detect foreground elements directly, for example by edge detection, or first detect a contiguous background, for example by region growing.
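As a minimal sketch of the first variant, foreground elements could be detected directly via their edges and then dilated so that the resulting mask safely covers the printed content; the Canny thresholds and the kernel size below are assumptions, not tuned values from our filters:

```python
import cv2

img = cv2.imread("document_photo.jpg")           # hypothetical input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect the gradients separating foreground from background
edges = cv2.Canny(gray, 50, 150)

# Grow the edges so the mask covers the printed content, not just its outline
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
foreground_mask = cv2.dilate(edges, kernel)      # 255 where content is suspected
background_mask = cv2.bitwise_not(foreground_mask)
```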
When it comes to interpolation techniques, ordinary interpolation methods such as bilinear interpolation can be a legitimate option. However, one must keep in mind that in our scenario the masked regions can be very unevenly distributed. The problem is also called scattered data interpolation, since there might be a lot of known points in an unprinted area but no data at all in a large illustrated area. Therefore, specialized methods such as natural neighbor interpolation might work more effectively. More generally, the interpolation can be done by inpainting methods: the printed areas are treated as damaged or missing parts of our background image. There are several techniques for that, for example one based on the Fast Marching Method, described by Alexandru Telea in 2004. Usually, the masked regions are gradually filled by weighted averages of the surrounding known pixels.
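OpenCV ships an implementation of exactly this Telea method, so a sketch of the inpainting step could look like the following; the file names and the inpainting radius are illustrative assumptions:

```python
import cv2

img = cv2.imread("document_photo.jpg")                                     # input photo N
foreground_mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)  # 255 = printed content

# Treat the masked (printed) regions as damaged parts of the background image
# and fill them with Telea's Fast-Marching-based inpainting (radius in pixels)
background = cv2.inpaint(img, foreground_mask, 5, cv2.INPAINT_TELEA)
cv2.imwrite("background_estimate.png", background)
```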
Now that we have derived a full background image, we only need the original uniform paper color to generate a multiplicative shading map. We can manually set a fixed color such as the widespread pure white, or estimate it from the known background pixels. In the latter case, it is of course difficult to remove color casts, but the illumination inhomogeneity of the document will still be corrected.
Finally, with the estimated illuminated background image N_BG and original paper color M, we can multiply the shading map with the original input photo N and obtain the illumination corrected document photo. This is done with the previously mentioned formula
R = (M / N_BG) * N.
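Putting the last steps together, a simplified sketch could estimate the paper color from the pixels the mask marks as background and then apply the shading map; the file names and the 90th percentile are illustrative choices rather than Docutain's actual parameters:

```python
import cv2
import numpy as np

img = cv2.imread("document_photo.jpg").astype(np.float32)                  # input photo N
background = cv2.imread("background_estimate.png").astype(np.float32)      # N_BG from inpainting
foreground_mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)  # 255 = printed content

# Estimate the paper color M per channel from the known background pixels;
# a high percentile is robust against residual dark pixels. Alternatively,
# set M = (255, 255, 255) for pure white, which also removes color casts.
bg_pixels = img[foreground_mask == 0]        # shape (n, 3)
M = np.percentile(bg_pixels, 90, axis=0)

# Shading map of pixelwise gain factors and final correction R = (M / N_BG) * N
shading_map = M / np.maximum(background, 1.0)
corrected = np.clip(shading_map * img, 0, 255).astype(np.uint8)
cv2.imwrite("corrected.jpg", corrected)
```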
As you might have realized, the biggest challenge is the distinction between illumination-based structures and content-based structures. It is an ill-posed, underdetermined problem and thus, as things stand today, it seems almost impossible to achieve flawless results for all scenarios all the time. Nevertheless, we at Docutain are very satisfied with the results of our filters, which are based on these principles of illumination correction for document photos.
Below you can see an example of how one of our new filters outperformed the filters of the competing apps Adobe Scan and Microsoft Lens. You can clearly see how our filter manages to remove strong color casts such as the bluish area in the lower third of the letter, which is something that Adobe Scan in particular struggles with. On the other hand, the filter also succeeds at preserving illustrated areas such as the black region at the bottom of the document. The default Microsoft Lens document filter erases almost every piece of content bigger than usual text, which seems to relate to the previously mentioned downside of focusing on plain text documents.
comparison of a photo-scanned document (left) filtered with Docutain, Adobe Scan and Microsoft Lens (from left to right)
You are invited to try out the scan function of Docutain on your own documents! Just download Docutain for free from the Google Play Store or the Apple App Store.
By the way, our scan functionality, along with a few more features such as data extraction based on the detected text, is also offered as a separate software development kit which you can use to add these functionalities to your own apps.
Check out our Docutain SDK
Integrate high quality document scanning, text recognition and data extraction into your apps. If you would like to learn more about the Docutain SDK, have a look at our Developer Documentation or contact us anytime via SDK@Docutain.com.