Aug 13, 2021

Project: Scanned Document Organizer

Software developed to automate the organization of scanned documents for EDM (Electronic Document Management)

The software is available for download on the Releases page of the Scanned-Document-Organizer repository.

The main objective of my internship was to develop an Electronic Document Management (EDM) system for the personal documents of public servants at SEFAZ/SE. Printed documents are digitized and must then be organized in a standardized way and classified by type before their information is fed into the file server and database.

I developed this project to automate the organization and classification of the digitized documents. I named it Scanned Document Organizer, and the software is still used to organize documents.

Organization

During scanning, a black sheet of paper is placed between documents. The program detects each black sheet and uses it as the boundary of the document being organized.

The software takes as input the document pages in image format, including blank pages (the backs of sheets) and black separator pages. It returns the pages separated into documents and organized in directories.
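The splitting step can be sketched as follows. This is only an illustration of the idea, not the program's actual code: the class name DocumentSplitter, the PageKind enum and the choice to simply skip blank pages are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentSplitter {
    // Hypothetical page classification, as a detector might report it.
    public enum PageKind { CONTENT, BLANK, BLACK }

    /**
     * Groups page indices into documents. BLACK pages act as separators,
     * BLANK pages are skipped, and CONTENT pages join the current document.
     */
    public static List<List<Integer>> split(List<PageKind> pages) {
        List<List<Integer>> documents = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            switch (pages.get(i)) {
                case BLACK:
                    // Close the current document; consecutive black pages
                    // (front and back of the separator) create no empty document.
                    if (!current.isEmpty()) {
                        documents.add(current);
                        current = new ArrayList<>();
                    }
                    break;
                case BLANK:
                    break; // assumption: blank pages are left out of the document
                default:
                    current.add(i);
            }
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }
}
```

Each inner list holds the positions of the pages that belong to one document, in scan order.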

At the end, the directory of digitized files contains one directory per document, named with the date, protocol and/or internal communication, which are numbers found in the documents. Each directory holds the document's pages in image format, named with the document type plus a timestamp used for ordering.
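As a rough sketch of this naming scheme: the exact format the program uses is not described here, so the underscore separator, the ISO date, the .jpg extension and the class name DocumentNaming are all assumptions chosen for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class DocumentNaming {
    /** Builds a directory name from whichever identifiers are available. */
    public static String directoryName(LocalDate date, String protocol, String internalComm) {
        List<String> parts = new ArrayList<>();
        if (date != null) {
            parts.add(date.toString()); // ISO format, e.g. 2021-08-13
        }
        if (protocol != null && !protocol.trim().isEmpty()) {
            parts.add(sanitize(protocol));
        }
        if (internalComm != null && !internalComm.trim().isEmpty()) {
            parts.add(sanitize(internalComm));
        }
        return String.join("_", parts);
    }

    /** Builds a page file name from the document type and a timestamp used for ordering. */
    public static String pageFileName(String documentType, long timestampMillis) {
        return sanitize(documentType) + "_" + timestampMillis + ".jpg";
    }

    // Replaces characters that are not allowed in file names on common systems.
    private static String sanitize(String value) {
        return value.trim().replaceAll("[\\\\/:*?\"<>|]", "-");
    }
}
```

Sanitizing matters because identifiers such as a number followed by a slash and a year would otherwise be read as nested paths.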

Features

This video shows the program in use:

Features were added as needs emerged during use. One example is the option to convert PDF pages to JPG, added to meet a need that arose.

Besides this function, the software offers features such as:

Auto detection of date, protocol and document type
Navigation between pages
Display of the document's total page count and the current page number
Link to the directory and link to view the page in the default OS program
Zoom on pages
Document types organized to facilitate choice
Classification of Portarias by page numbers
Intelligent formatting of date, protocol and internal communication
Quick view of the front and back of the black sheet
Keyboard shortcuts to speed up work
Option to see the extracted text
Image rotation
Option to send all remaining pages to the Files directory
Button to manually flag a white sheet, used when the detector fails to identify it

Construction

The project was built in Java, more specifically with JavaFX. I chose this platform for visual reasons and for the ease of developing the GUI.

This let me focus on the functional parts. One of them is the black and white detector, which identifies white pages and separates them into a specific directory, and identifies the black sheets that delimit the documents.
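One simple way to build such a detector is to measure the fraction of dark pixels on a page. The sketch below assumes that approach; the thresholds, the class name SheetDetector and the SheetKind enum are illustrative guesses, not values taken from the project.

```java
import java.awt.image.BufferedImage;

public class SheetDetector {
    public enum SheetKind { WHITE, BLACK, CONTENT }

    // Assumed thresholds; real scans would need tuning for noise and borders.
    private static final int DARK_LUMA = 128;           // below this, a pixel counts as dark
    private static final double BLACK_MIN_DARK = 0.90;  // at least 90% dark -> black sheet
    private static final double WHITE_MAX_DARK = 0.005; // at most 0.5% dark -> white sheet

    public static SheetKind classify(BufferedImage page) {
        long dark = 0;
        long total = (long) page.getWidth() * page.getHeight();
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (0 to 255).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                if (luma < DARK_LUMA) {
                    dark++;
                }
            }
        }
        double darkFraction = (double) dark / total;
        if (darkFraction >= BLACK_MIN_DARK) {
            return SheetKind.BLACK;
        }
        if (darkFraction <= WHITE_MAX_DARK) {
            return SheetKind.WHITE;
        }
        return SheetKind.CONTENT;
    }
}
```

Counting dark pixels rather than averaging brightness helps here, because a page with a few lines of text is still mostly white and would look almost identical to a blank page on average.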

As shown in the video, when a document is organized, the next document often appears with its fields already filled in. This happens because the first pages of each document are downscaled and sent to tess4j, a Java library for Tesseract OCR. Tesseract returns the text extracted from each page, and regular expressions then detect and select the relevant information, such as date, protocol and document type.
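The regular-expression step might look something like the sketch below. In the real program the text would come from tess4j (for example through its Tesseract.doOCR method); here a plain string stands in for the OCR output so the example runs without the library. The patterns, the class name FieldExtractor and the assumed formats (dd/mm/yyyy dates, a protocol written as a number, a slash and a year after the word "protocolo") are guesses for illustration only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtractor {
    // Assumed date format: day/month/year with a four-digit year.
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})/(\\d{4})\\b");

    // Assumed protocol format: the word "protocolo", up to 10 non-digit
    // characters (such as " No "), then number/year.
    private static final Pattern PROTOCOL =
            Pattern.compile("(?i)protocolo\\D{0,10}(\\d+/\\d{4})");

    /** Returns the first date-like string found in the OCR text, if any. */
    public static Optional<String> findDate(String ocrText) {
        Matcher m = DATE.matcher(ocrText);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    /** Returns the first protocol number found in the OCR text, if any. */
    public static Optional<String> findProtocol(String ocrText) {
        Matcher m = PROTOCOL.matcher(ocrText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning an Optional makes the "nothing detected" case explicit, which fits a GUI that leaves a field empty for the user to fill in.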

Capturing a date, protocol or document type is not that simple, however. A single page may contain several dates, including dates with the month written out in full, so the algorithm was designed to pick the best date. The same care was applied to the other fields to obtain the best possible information.
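What counts as the "best" date is not spelled out here, so the following sketch makes a loud assumption: it collects every valid date on the page, in numeric form or with the month written out in Portuguese, and picks the most recent one. The class name BestDate and the two patterns are hypothetical.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BestDate {
    // Portuguese month names; \u00e7 is the letter c with cedilla.
    private static final List<String> MONTHS = List.of(
            "janeiro", "fevereiro", "mar\u00e7o", "abril", "maio", "junho",
            "julho", "agosto", "setembro", "outubro", "novembro", "dezembro");

    private static final Pattern NUMERIC =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})/(\\d{4})\\b");
    private static final Pattern WRITTEN = Pattern.compile(
            "\\b(\\d{1,2}) de (\\p{L}+) de (\\d{4})\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Assumption: the "best" date is the most recent valid date in the text. */
    public static Optional<LocalDate> find(String ocrText) {
        List<LocalDate> candidates = new ArrayList<>();
        Matcher numeric = NUMERIC.matcher(ocrText);
        while (numeric.find()) {
            add(candidates, numeric.group(3), Integer.parseInt(numeric.group(2)), numeric.group(1));
        }
        Matcher written = WRITTEN.matcher(ocrText);
        while (written.find()) {
            int month = MONTHS.indexOf(written.group(2).toLowerCase(Locale.ROOT)) + 1;
            if (month > 0) {
                add(candidates, written.group(3), month, written.group(1));
            }
        }
        return candidates.stream().max(LocalDate::compareTo);
    }

    // Adds the date if it is a real calendar date; silently skips e.g. 31/02.
    private static void add(List<LocalDate> out, String year, int month, String day) {
        try {
            out.add(LocalDate.of(Integer.parseInt(year), month, Integer.parseInt(day)));
        } catch (DateTimeException ignored) {
            // OCR noise often produces impossible dates; they are not candidates.
        }
    }
}
```

Other rules, such as preferring the date closest to a keyword or the first date on the page, would slot into the same structure by changing only the final selection.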

To keep text extraction and content selection from slowing down the work, this process runs in a separate thread, in the background, so it is imperceptible to the end user.
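A minimal sketch of this background processing, using only the standard library: the class name BackgroundExtractor and the extractor function are hypothetical stand-ins for the OCR and regex steps, and the program itself may use a different mechanism. In a JavaFX application, the result would then be handed back to the UI thread, for example with Platform.runLater.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class BackgroundExtractor implements AutoCloseable {
    // A single daemon worker thread, so extraction never blocks the caller
    // and never keeps the application alive on exit.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "ocr-worker");
        thread.setDaemon(true);
        return thread;
    });

    // Hypothetical extraction step: maps a page path to its extracted text.
    private final Function<String, String> extractor;

    public BackgroundExtractor(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    /** Starts extraction on the worker thread and returns immediately. */
    public CompletableFuture<String> extractAsync(String pagePath) {
        return CompletableFuture.supplyAsync(() -> extractor.apply(pagePath), worker);
    }

    @Override
    public void close() {
        worker.shutdown();
    }
}
```

A caller can start extraction for the next document while the user is still working on the current one, then attach a callback with thenAccept to fill in the fields once the text is ready.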

It is interesting to see these technologies combined in the service of automation. The source code for this project is in the repository: Scanned-Document-Organizer.