Six Important Data Preparation Steps for Machine Learning

Data preparation is an integral part of designing enterprise software systems that use machine learning and AI. Enterprise-scale businesses and government organisations often deal with terabytes or petabytes of data. They need not only to manage the complexity of that data, but also to use it in the right context at the right time to make better decisions. Data preparation is the key step in cleansing data so that machine learning can make sense of the information.
The data needs to be formatted in a specific way for it to be leveraged by ML algorithms. The quality of the datasets is paramount to providing pertinent insights for the organisation. When dealing with large volumes of unstructured datasets, there could be issues with missing values, obsolete data, invalid formats, outliers, etc. So, for any algorithm to produce relevant, useful and contextual predictions, data preparation is a must. If data is not cleansed and validated properly, it can affect the accuracy of the system and even provide misleading insights. Here’s a look at the pivotal steps for good data preparation to build more accurate systems.
1. Defining the Problem
The first step in data preparation is defining the context in which the data will be used. This requires clarity about the key issues or problems that need to be addressed. For example, an organisation that is focused on improving its turnaround time for product development will need to analyse the project implementation steps.
Breaking down the project schedule helps identify the parts that have no dependencies and can therefore be taken up in parallel. The model can then suggest relevant and contextual tasks to the team involved in the execution. The impact of each task on the project, service delivery and quality can be assessed by mapping the relevant data.
The focus needs to be well defined in terms of the outcomes an organisation wants to achieve. In the above case, it could be improving product development time by 30% and quality by 30%. The steps involved are then mapped as data inputs for the algorithm to suggest improvement measures. By focusing on the problem and KPIs, the objectives of the system are clear. It can simplify considerations about the types of data to gather for analysis.
The intended purpose and key outcomes drive the design of the machine learning model. Once the problem is well formulated, it is easier to map relevant data. The problem could be defined using some of these steps:
i) Gather data from the relevant domain or case in point.
ii) Let the data analysts and subject matter experts weigh in on the system.
iii) Select the right variables to be used as inputs and outputs for a predictive model for your problem.
iv) Review the data that is collected.
v) Summarise the data using statistical methods.
vi) Visualise the collected data using plots and charts before building predictive models.
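As a rough illustration of the "summarise" step, the sketch below computes a few descriptive statistics with Python's standard `statistics` module. The `task_days` values and the `summarise` helper are hypothetical and only stand in for whatever data the problem definition calls for.

```python
import statistics

# Hypothetical sample: days taken to complete each development task.
task_days = [4, 6, 5, 7, 30, 5, 6]

def summarise(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

summary = summarise(task_days)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean that sits well above the median, as it does for this sample, is often a first hint that a few unusually large values deserve a closer look.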
2. Data Collection & Discovery

The process of transforming raw data into actionable data sets for algorithms and analysts requires consolidation of data. There could be many sources for business data, structured or unstructured. These could be endpoint data, existing enterprise systems, customer data, marketing data, accounting and financial data etc.
Data preparation requires mapping all the data sources as well as identifying the relevant data sets. How well the model produces practical insights depends on these data sets. Adding too much irrelevant information adversely affects the accuracy of the model.
To start with, a list of key performance indicators or questions that need to be answered is analysed. The relevant data sources are then mapped, integrated and made accessible for analysis.
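One simple way to picture the integration of sources is merging records that share a common key. The sketch below assumes two hypothetical sources (`crm_records` and `billing_records`) that both carry a `customer_id` field. Real systems would typically read from databases or files rather than in-memory lists.

```python
# Hypothetical records from two source systems, keyed by customer_id.
crm_records = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
]
billing_records = [
    {"customer_id": 1, "total_spend": 1200.0},
    {"customer_id": 3, "total_spend": 310.5},
]

def consolidate(key, *sources):
    """Merge records that share the same key value into one dict per key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    # Return one consolidated record per key, in key order.
    return [merged[k] for k in sorted(merged)]

customers = consolidate("customer_id", crm_records, billing_records)
for row in customers:
    print(row)
```

Records that appear in only one source keep whatever fields they have, which makes the gaps between sources visible for the cleansing step that follows.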
3. Data Cleansing
Data cleansing helps to streamline information for analysis. The validation techniques for data cleansing can be used to identify and eliminate inconsistencies, aberrations, outliers, invalid formats, incomplete data etc. Once the data is cleansed, it can provide accurate answers upon analysis.
There are tools that can help organisations to clean up their data and validate it before using it for machine learning. Good quality data is the backbone of an accurate machine learning model. Data preparation involves cleaning up the data, validating data formats, checking for missing values, and addressing anything else that can affect data analysis.
Data cleansing also involves proactively looking for outliers or one-time events in data sets. For example, ML models may pick up a correlation between online sales and lockdowns. The idea is to understand the causal relations inherent in the data, but eliminate outliers that can affect the accuracy of the system. Open source tools like OpenRefine may be used for standardising your organisational data.
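The sketch below shows one common way to handle missing values and flag outliers, using the interquartile range (IQR) rule. The `daily_sales` figures, the helper names and the multiplier `k=1.5` are illustrative assumptions, not a prescribed method. Outliers are set aside for review rather than silently deleted, since a one-time event may still be worth understanding.

```python
import statistics

# Hypothetical daily online sales; None marks a missing reading.
daily_sales = [120, 135, None, 128, 900, 131, 125, None, 140, 122]

def drop_missing(values):
    """Remove entries that have no recorded value."""
    return [v for v in values if v is not None]

def split_outliers(values, k=1.5):
    """Separate values outside [Q1 - k*IQR, Q3 + k*IQR] from the rest."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

cleaned = drop_missing(daily_sales)
kept, outliers = split_outliers(cleaned)
print("kept:", kept)
print("flagged for review:", outliers)
```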
4. Data Format & Standardization

After the data set has been cleansed, it needs to be formatted and standardised. This step involves resolving issues like multiple date formats and inconsistent data types, and removing irrelevant information, duplication, redundancy and multiple sources of truth.
After data is cleansed and formatted, some data variables may not be needed for the analysis and hence they can be deleted. Data preparation requires deletion of noise and unwanted information for building a robust automation system.
The cleansing and formatting process should have a consistent and repeatable workflow. The organisation can then use it to maintain the consistency of data in future iterations too. New data is continually added to the model in real time by following the same steps. For example, marketing data could be added every month based on relevant keyword searches on the internet.
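A repeatable formatting step might look like the following sketch, which converts several date layouts to ISO 8601 and drops duplicates that only become visible after normalisation. The list of formats in `KNOWN_FORMATS` and the sample rows are assumptions. In particular, reading `05/03/2021` as day/month/year is a choice that has to be confirmed for each source.

```python
from datetime import datetime

# Hypothetical set of date layouts seen across source systems.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def standardise(rows):
    """Normalise dates and drop exact duplicates, preserving order."""
    seen, result = set(), []
    for order_id, raw_date in rows:
        record = (order_id, to_iso_date(raw_date))
        if record not in seen:
            seen.add(record)
            result.append(record)
    return result

raw_rows = [
    ("A1", "2021-03-05"),
    ("A1", "05/03/2021"),  # same order, different date layout
    ("A2", "20210307"),
    ("A3", "not a date"),  # unparseable, kept with a None date for review
]
clean_rows = standardise(raw_rows)
for row in clean_rows:
    print(row)
```

Because the same function can be run on every new batch, monthly additions go through exactly the same rules as the original load.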
5. Data Quality
Do you trust the quality of your data? Erroneous data can lead to disastrous consequences. When the data is not reliable, it can create more problems than it solves. Take, for example, an online retailer that needs to dynamically price the items on its portal. Any inaccuracy in pricing may affect sales as well as the reputation of the retailer.
Low quality data is a deterrent to the design of a good machine learning model. Even with the best algorithms and models, the system could produce ordinary results when data quality is poor. But what makes good quality data? The answers may vary across industries and companies. Industries like pharmaceuticals and healthcare need far more stringent data quality standards than industries like consumer goods.
One example is the Data Quality Assessment Framework adopted by the IMF, which covers the following dimensions:
Integrity: Statistics are collected, processed, and disseminated based on the principle of objectivity.
Methodological soundness: Statistics are created using internationally accepted guidelines, standards, or good practices.
Accuracy and reliability: Source data used to compile statistics are timely, obtained from comprehensive data collection programs that consider country-specific conditions.
Serviceability: Statistics are consistent within the dataset, over time, and with major datasets, and are revised on a regular basis. Periodicity and timeliness of statistics follow internationally accepted dissemination standards.
Accessibility: Data and metadata are presented in an understandable way, and statistics are up-to-date and easily available. Users can get timely and knowledgeable assistance.
Some important questions to ask regarding the quality of your data:
Is the data reliable, and does it represent real-time information?
Is the data obtained from the right source?
Is the data missing or omitting something important?
Is the data representing sufficient information for you to make a decision?
Is the data representing the relationships between key variables accurately?
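Some of these questions can be turned into automated checks. The sketch below, based on the retailer pricing example, measures how complete a field is and lists records with implausible prices. The `products` records, the field names and the "price must be positive" rule are hypothetical. Real quality rules depend on the industry and the data.

```python
# Hypothetical product records for a retailer's pricing feed.
products = [
    {"sku": "P1", "price": 19.99, "currency": "USD"},
    {"sku": "P2", "price": None, "currency": "USD"},
    {"sku": "P3", "price": -5.00, "currency": "USD"},
    {"sku": "P4", "price": 42.50, "currency": ""},
]

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def invalid_prices(rows):
    """SKUs whose price is present but not a positive number."""
    return [r["sku"] for r in rows
            if r.get("price") is not None and r["price"] <= 0]

report = {
    "price_completeness": completeness(products, "price"),
    "currency_completeness": completeness(products, "currency"),
    "invalid_price_skus": invalid_prices(products),
}
print(report)
```

Checks like these address only the "missing" and "sufficient" questions. Whether the data comes from the right source or reflects real relationships still needs review by people who know the domain.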
6. Feature Engineering & Selection

Feature engineering deals with adding or modifying input attributes to improve the model’s output. This is the last stage in data preparation for building a machine learning model.
Feature selection identifies the most important or relevant input variables for the model. Feature engineering involves deriving new variables from the available dataset by adjusting and reworking existing variables, so that models can uncover useful insights and causal relationships. The variables or predictors are tweaked to ensure better predictive performance of the system.
This experimental approach explores different variables from the available data sets to make predictive insights. Some variables may look promising, but may not deliver the right results because they prolong model training, cause overfitting or carry little weight in the predictive accuracy of the model.
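To make the idea concrete, the sketch below derives a new variable (`avg_item_price`) from two raw columns and ranks all candidate features by the absolute Pearson correlation with a target. The `orders` data is a toy set constructed so that the derived feature is informative, and the field names are hypothetical. Correlation is only one simple screening heuristic and does not by itself establish a causal relationship.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical order data: two raw columns plus the value to predict.
orders = [
    {"items": 1, "total": 10.0, "target": 1.0},
    {"items": 4, "total": 20.0, "target": 0.5},
    {"items": 2, "total": 40.0, "target": 2.0},
    {"items": 5, "total": 75.0, "target": 1.5},
    {"items": 3, "total": 90.0, "target": 3.0},
]

# Feature engineering: derive a new variable from existing ones.
for row in orders:
    row["avg_item_price"] = row["total"] / row["items"]

# Feature selection: rank candidates by strength of correlation with the target.
target = [row["target"] for row in orders]
candidates = ["items", "total", "avg_item_price"]
ranked = sorted(
    ((name, pearson([row[name] for row in orders], target)) for name in candidates),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy data neither raw column alone tracks the target as closely as their ratio does, which is the kind of relationship feature engineering is meant to surface.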
Many features may need to be evaluated and weighed before converging to the right model. Good data preparation delivers high-quality and trusted data for improving the predictive behaviours and accuracy of the enterprise software.
Kreyon Systems provides enterprise software implementation for clients with end to end data lifecycle management. Our expertise is leveraged by governments and corporates for managing their data. If you have any queries, please reach out to us.
