Are you someone that process the same data frequently, like weekly or monthly but lack the resources to purchases fancy software to do so. Read on if you’re on the same boat
How many of us find ourselves in this situation whereby:
- You are working on a large set of data and processing same set of data monthly
- Lot of data cleansing
- Your IT department do not have time to entertain you
- Your department/company do not have money to invest in software for data cleansing activities
- Your department doesn’t have enough money for MS Flow or MS Data Platform
- Or , you could write a python script for such data cleansing activities, but no one understand it but yourself thus rendering it useless once you moved on
WELCOME TO THIS FREE GUIDE
If you tick most of the boxes, welcome to the club of “DIY NON IT Business Analyst” and let me guide you through the following: Prepare a FOC data pipeline, Easy to maintain and Understandable by most (hopefully the new generation are good with PowerQuery)
If these are your current workflow and data volume, the effort is worthwhile
- Working on huge data sets [more than 500k]
- Desperate needs to automate your data cleansing activities
- Feed those data to various analytics needs or data models
Source : Identifying the source of the main data you are using and standard the download format. In terms of the fields and file type, excel,csv or text as these might derail your future steps once you start automating the process.
Pipe it into folder — In this case, I have created a folder “3.PeopleTrackingMovement” and dump all the monthly data into the folder. For the next step, i created a new excel workbook / “Pipe”, “3.PeopleTrackingMovment”, and created a Query to read Folder 3.PeopleTrackingMovement”. Since the file format have been standardize, the query was created once and will clean the data and append it together on a monthly basis.
Bring It together — Creating “Mini Database”, after creating the various “Pipe” from 1–4. In this example, “Consolidation” was created to handle more data wangling from  “1.AllTraining” & “3.PeopleTrackingMovment”.
Within  “3.PeopleTrackingMovment” is a huge data volume, more than 1mil rows of data so is better to create another file to handle the load.
Some planning is needed here, as you need to know how this file is going to be used for a bigger visualization so at a later stage you don’t do much data waggling.
For this use case, we need to know throughout the years, who have done what training. After much review, these 2 data set isn’t used in other “pipe”
Big data models — At this stage, is either PowerBI or Excel Report, whereby you pull 1 or 2 “Mini Database” and a few “Pipe” to create your final report.
Bringing all the smaller data models to build your reporting or visualization is one of the final steps. As we tend to build many types of reports or visualization, it is easier to prepared smaller data models, read the data models, and build it from there. This way, you will not overload or performance issue on the visualization or reporting.
This is a free and fuss free method to prepare your data for constant uses. Especially if your department doesn’t have a dedicated IT personal or money to fund for specialise ETL software.
If you’re experiencing similar issue preparing data, feel free to drop me a message.