PaidRight, Google Cloud and enterprise record keeping
Business records are never perfect. There are bound to be errors or inconsistencies, which can create issues when businesses are paying their employees. These errors also cause issues for a company like PaidRight when conducting an external pay compliance review.
In our previous blog post we spoke about some of the challenges of enterprise record keeping and data quality that we have seen in payroll data. We discussed key challenges around the completeness, consistency and volume of enterprise data, which showed how some basic issues with data quality can pose challenges for Australian businesses.
So how do we overcome these challenges?
At PaidRight we rely on the scalability of Google Cloud Platform (GCP), particularly BigQuery and Cloud Storage. Both are serverless, auto-scaling services that meet our large data storage and processing needs, and GCP allows us to securely handle the large enterprise datasets we receive from customers.
We have developed a custom ETL system in Python. It gives us a simple, expressive and flexible API, and lets us apply test-driven development and the coding discipline that goes with it in our data engineering practice.
Overcoming incomplete datasets
One of the main challenges we discussed was the completeness of payroll data. Roster and timesheet records can be missing because of simple mistakes, like employees not punching in and out of breaks, and those gaps are hard to see.
Our Python-based ETL framework allows us to build and unit test our data cleaning logic at a very detailed level. This gives us confidence that the data is cleaned and updated where required to meet the needs of our modelling and analysis.
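As an illustration only, here is a minimal sketch of what a small, unit-testable completeness check could look like. It is not PaidRight's actual framework: the `Shift` record, its field names and the `flag_incomplete_breaks` helper are all hypothetical, and the rule used (a break with only one of its two punches recorded is incomplete) is an assumption chosen to keep the example simple.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Shift:
    """Hypothetical timesheet record; field names are assumptions."""
    employee_id: str
    clock_in: datetime
    clock_out: datetime
    break_start: Optional[datetime] = None
    break_end: Optional[datetime] = None


def has_incomplete_break(shift: Shift) -> bool:
    """Return True when exactly one of the two break punches is recorded."""
    return (shift.break_start is None) != (shift.break_end is None)


def flag_incomplete_breaks(shifts: List[Shift]) -> List[Shift]:
    """Return the shifts that need review because a break punch is missing."""
    return [s for s in shifts if has_incomplete_break(s)]


def test_flag_incomplete_breaks() -> None:
    """A small unit test of the kind that can run on every change."""
    complete = Shift(
        "E1",
        datetime(2021, 3, 5, 9, 0),
        datetime(2021, 3, 5, 17, 0),
        break_start=datetime(2021, 3, 5, 12, 0),
        break_end=datetime(2021, 3, 5, 12, 30),
    )
    missing_end = Shift(
        "E2",
        datetime(2021, 3, 5, 9, 0),
        datetime(2021, 3, 5, 17, 0),
        break_start=datetime(2021, 3, 5, 12, 0),
    )
    assert flag_incomplete_breaks([complete, missing_end]) == [missing_end]
```

Keeping each check as a small pure function like this is what makes testing at a detailed level practical: the rule can be exercised with a handful of hand-built records, without touching any real client data.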
Overcoming inconsistent datasets
The second challenge that we discussed was the consistency of data. Inconsistencies can arise when different payroll systems are used over time, and the most basic example is differing date formats. We see considerable variety in the systems of record used for business data in the pay domain.
This variety needs to be conformed to a common data model. Our data ingestion process does this using a package approach. Datasets are usually provided by our clients as flat files, and each file is ingested by a processing package that transforms the source data into a consistent format in terms of data types, attributes and granularity.
These now-consistent datasets are then loaded into Google BigQuery, where they are available to our pay compliance models and advanced analytics. Each data processing package in our framework has a comprehensive suite of unit tests that ensures changes are managed safely.
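To make the idea of conforming source data more concrete, the sketch below shows one way a processing step might normalise dates and data types from a flat-file row. It is an assumption-laden example rather than our production code: the list of date formats, the source column names (`emp_id`, `date`, `hours`) and the target field names are all hypothetical.

```python
from datetime import date, datetime
from typing import Dict

# Hypothetical set of date formats seen across different source systems.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date(raw: str) -> date:
    """Try each known format in turn and return the first successful parse."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    # Fail loudly rather than guess at an unknown format.
    raise ValueError(f"Unrecognised date format: {raw!r}")


def conform_row(row: Dict[str, str]) -> Dict[str, object]:
    """Map one source row onto a (hypothetical) common data model."""
    return {
        "employee_id": row["emp_id"].strip(),
        "shift_date": parse_date(row["date"]),
        "hours_worked": float(row["hours"]),
    }
```

For example, `conform_row({"emp_id": " 1001 ", "date": "05/03/2021", "hours": "7.5"})` yields a record with a trimmed employee ID, a `date` object for 5 March 2021 and a numeric hours value. Raising an error on an unrecognised format, instead of silently guessing, is a deliberate choice in this sketch: ambiguous dates are better surfaced for review than loaded incorrectly.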
Dealing with large datasets
One of the biggest challenges that companies face is the volume of data that makes up their business records. The amount of data collected grows quickly when data points have to be recorded for every employee, for every shift of every day.
Our approach to handling these volumes has been to progressively adopt cloud-scale services within Google Cloud Platform. The data processing architecture uses Compute clusters running on Google Cloud.
This scalable data ingestion framework works well when loading large data volumes with complex transformation requirements to Google BigQuery, which can then support a wide variety of data analytical workloads.
To manage this infrastructure we are progressively moving towards an Infrastructure-as-Code approach, where system configuration, security setup and so on are managed using artefacts that are stored in git repositories and deployed through Cloud Build, Google’s serverless CI/CD platform.
This approach enables us to manage our infrastructure using DevOps practices very similar to our source code management.
It’s expected that business records will have some missing or incorrect data, but when conducting an external pay review that data has to be located and cleaned before it enters the PaidRight platform for modelling and analysis.
Our use of the Google Cloud Platform allows us to achieve this in an efficient and secure way so we can effectively provide pay compliance insights such as possible employee underpayments or overpayments to our customers.
In the final blog of this series we will delve into the current trends shaping the future of enterprise data processing, and how the ability to scale and standardise supports accuracy and adds business value.