Extracting data from bills that are scans

As of October 31st 2019, we have extended our OCR functionality. See the section "Automatic OCR for problem bills" below for more information.

Firstly, note that bills and receipts are different things, and Lightyear treats them differently. For details on how to process receipts (and the data that Lightyear extracts from receipts), please refer to this article.

For bills, Lightyear uses several data-extraction methods, depending on the file type and format that has been emailed or uploaded.
The vast majority of bills emailed or uploaded to Lightyear are system-generated PDFs. A very small percentage are image files, which generally have extensions such as jpg, jpeg, tif and png. A very small number of PDF files are actually image files wrapped up as a PDF, and an even smaller percentage of system-generated PDFs have scrambled metadata.
For these image files (jpg, jpeg, tif, png), image-based PDFs and PDFs with scrambled metadata, Lightyear uses Optical Character Recognition (OCR) to extract the data.

The OCR capabilities in Lightyear allow us to create maps for you to use, which extract data from these bills in much the same way as we do from system-generated PDFs.

Uploading A Scanned Document or Image File

You can upload image files into Lightyear via the Upload icon on the top ribbon of the Lightyear screen, or email them to your unique Lightyear email address as per the usual process.

When the document arrives in Lightyear, Lightyear automatically detects whether the file is a system-generated PDF or whether OCR is needed to extract the data. If the file requires OCR, Lightyear displays a placeholder image in Panel 2 while the data is being extracted. OCR typically takes between 15 and 60 seconds, but can take up to 2 minutes. You will need to manually refresh the image panel; we will be introducing automatic page refresh in the near future, which will update Panel 2 with the bill data once it is available.

   

The following file types are accepted for OCR: PNG, TIF, TIFF, PDF (scans) and JPEG. You cannot upload (or email in) any other file type.
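
If you batch-scan documents, it can help to check file extensions before uploading. The following is a minimal sketch in Python, not part of Lightyear itself: the function name is_accepted_for_ocr is hypothetical, and including ".jpg" alongside ".jpeg" is an assumption based on the jpg files mentioned earlier in this article.

```python
from pathlib import Path

# Extensions taken from the accepted-types list above.
# ".jpg" is an assumption (the list names JPEG; jpg files are mentioned earlier).
OCR_EXTENSIONS = {".png", ".tif", ".tiff", ".pdf", ".jpeg", ".jpg"}


def is_accepted_for_ocr(filename: str) -> bool:
    """Return True if the file extension is one of the accepted OCR types.

    This only inspects the extension, compared case-insensitively.
    It does not check that a PDF is actually a scan.
    """
    return Path(filename).suffix.lower() in OCR_EXTENSIONS
```

For example, is_accepted_for_ocr("invoice.TIF") passes the check, while is_accepted_for_ocr("invoice.docx") does not.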

Once OCR has completed, you will see a scanner icon in Panel 1, which indicates that the document has been OCR'd.



Once you see the green scanner icon, you can treat the bill, credit note or statement much as you would any system-generated PDF.

Mapping a bill using OCR

From your point of view, the mapping process will be the same. You'll locate the bill in your Processing tab, and either search for and apply an existing map, or request a new map be created. However, there are a couple of important points to consider:
  • OCR will automatically run on any bill or file in which we detect no metadata, which is basically all image files. A very, very small percentage of image files do contain some metadata, and you may find that these bills remain in your Processing tab. Please do ask for a map to be created, and we will try some magic behind the scenes to see if we can create rules to handle them for you, now and in future.
  • Please note: although we can do magic, we're not wizards. Or magicians. For OCR to work, the quality of the image must be good. Scans that have caught in the document feeder (and are skewed or squished) are impossible to extract data from. If your staff have ticked off arriving stock with markers and hidden text, we'll obviously struggle. Creases and folds create vertical and horizontal lines when scanned, which may obstruct the data we are looking to extract. The same goes for staples and watermarks.
  • OCR works best at 150 DPI or above; 300 DPI is good too. Anything above 300 DPI will create an image file that might be too large to upload to Lightyear. Anything below 150 DPI might not be of good enough quality for us to extract data from.
  • The maximum file size for an OCR bill is 5 MB, as opposed to the standard 10 MB, so please consider that when setting your DPI level.
  • When an OCR bill is received in the Processing tab, it can take some time for the image to load. This is because the bill needs to be processed by our OCR engine before returning to the Processing tab.
  • When sending OCR bills into Lightyear, please make sure that each bill is its own separate attachment. If you have a 2-page bill, scan it as 2 pages within 1 file. The rule is: if you have 5 bills, you need to send in 5 separate files. You can, of course, send those 5 files into Lightyear in the one email, but it must be 1 file for 1 bill.
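
The size and DPI guidance above can be turned into a simple check to run on a scan before sending it in. This is only an illustrative sketch in Python, not a Lightyear tool: the function name scan_warnings is hypothetical, the DPI is assumed to be known from your scanner settings, and treating "5 MB" as 5 × 1024 × 1024 bytes is an assumption.

```python
from typing import List

MAX_OCR_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 * 1024 * 1024 bytes
MIN_DPI = 150  # below this, quality may be too low for extraction
MAX_DPI = 300  # above this, the file may become too large to upload


def scan_warnings(size_bytes: int, dpi: int) -> List[str]:
    """Return a list of warnings for a scanned bill; empty if none apply."""
    warnings = []
    if size_bytes > MAX_OCR_BYTES:
        warnings.append("file is larger than 5 MB")
    if dpi < MIN_DPI:
        warnings.append("resolution is below 150 DPI")
    elif dpi > MAX_DPI:
        warnings.append("resolution is above 300 DPI; file may be too large")
    return warnings
```

In practice, size_bytes could come from os.path.getsize on the scanned file. A 2 MB scan at 200 DPI produces no warnings, while a 6 MB scan at 100 DPI produces two.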


Automatic OCR for problem bills

As of October 31st 2019, we're making some improvements to what we can put through OCR to ensure the best data extraction possible. Until now, OCR was only available for scanned images and system-generated PDF bills that contain no metadata.

As of today, Lightyear staff can identify bills that can't be mapped through our normal means but may be mappable through OCR. This means we can set rules that automatically run OCR on bills that otherwise may not be mappable. As a user, you don't need to do anything on your end: send in your bills for mapping as normal, and we'll apply the rules as and when we see a benefit to running a bill through OCR.


What's coming next?

During our Beta, OCR has come on in leaps and bounds, and massive improvements have been made. Once we're happy with the new rules we're putting in place to automatically run OCR on problematic bills, we will look to start rolling out the functionality to our entire user base. Watch this space.

As always, your feedback is valued, so please reach out to us with suggestions and your experiences using OCR.



      
