Data can be defined as useful information or facts. In today’s world, data is the backbone of technology, and most advancements and new features depend on it.
Data powers modern technologies such as machine learning, artificial intelligence, data processing and analytics.
Why clean data?
Data that we get from sources such as surveys and social media platforms has certain inconsistencies.
Some of the data might be missing, while some may be incorrect.
If we feed this data to our algorithms as it is, results such as classifications can be inaccurate and we might not reach the expected outcome.
This may have little effect on small problems, but in fields that depend heavily on data, such as the health sector, unclean data can cause great harm and damage reputations.
Therefore, it is very important to clean data.
How to clean Data?
Data cleaning is a technique for removing noisy data and correcting the inconsistencies present in data.
It involves identifying or removing outliers, filling in missing data, resolving inconsistencies and smoothing noise in the data.
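As a rough illustration of filling missing data and removing outliers, here is a minimal pandas sketch. The readings are made-up values, and both the median fill and the 1.5 * IQR rule are assumptions chosen for the example; other strategies may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value and one obvious outlier.
readings = pd.Series([10.5, 12.0, np.nan, 11.0, 13.0, 95.0, 12.0])

# Fill the missing value with the median, which is less sensitive
# to outliers than the mean.
filled = readings.fillna(readings.median())

# Flag outliers with the 1.5 * IQR rule (a common convention, not the only one).
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
is_outlier = (filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)

# Keep only the values that were not flagged.
cleaned = filled[~is_outlier]
print(cleaned.tolist())
```

Here the value 95.0 is flagged and dropped, while the filled-in median stays in the series.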
Steps:
- Locate and identify individual data elements in source systems and isolate these items in target files.
- Use data algorithms to correct individual data elements.
- Use standard and business rules to transform the data into a consistent form.
- Eliminate duplication by searching for and matching duplicate records.
- Analyze the relationships between matched records and merge them into one record.
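The last three steps can be sketched in pandas as follows. The records, the column names and the normalization rule (trim whitespace, title case) are hypothetical and only show the idea of standardizing, matching and merging.

```python
import pandas as pd

# Hypothetical customer records; rows 0 and 2 describe the same customer.
records = pd.DataFrame({
    "name": ["Sam Abc", "Badri", " sam abc "],
    "contact": ["011123", "123456", None],
})

# Standard rule: trim whitespace and use title case so equal names compare equal.
records["name"] = records["name"].str.strip().str.title()

# Match records on the normalized name and merge each group into one record,
# keeping the first non-missing value found in every column.
merged = records.groupby("name", as_index=False).first()
print(merged)
```

Matching on a single normalized column is the simplest possible rule. Real data usually needs several columns or fuzzy matching to decide that two records are the same.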
Examples:
This is a simple example of data cleaning.
Before cleaning:
CustId  CustomerFName  CustomerLName  Contact
001     Sam            abc            011123
002     Badri          none           123456
003     Michael        na             098765
004     Issac          newton         gravity falls,9999
005     Albert         einstein       relative,009945
After cleaning:
CustId  CustomerName     Contact
001     Sam Abc          011123
002     Badri            123456
003     Michael          098765
004     Issac Newton     9999
005     Albert Einstein  009945
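One possible way to get from the "before" table to the "after" table with pandas is sketched below. It assumes that "none" and "na" stand for a missing last name and that the contact number is the run of digits at the end of the Contact field. Both are guesses read off the example, not rules stated anywhere.

```python
import pandas as pd

# The "before" table, with every value kept as a string so leading zeros survive.
before = pd.DataFrame({
    "CustId": ["001", "002", "003", "004", "005"],
    "CustomerFName": ["Sam", "Badri", "Michael", "Issac", "Albert"],
    "CustomerLName": ["abc", "none", "na", "newton", "einstein"],
    "Contact": ["011123", "123456", "098765",
                "gravity falls,9999", "relative,009945"],
})

# Assumption: "none" and "na" are placeholders for a missing last name.
last = before["CustomerLName"].replace({"none": "", "na": ""})

after = pd.DataFrame({
    "CustId": before["CustId"],
    # Join first and last name, capitalize the last name, trim the trailing space.
    "CustomerName": (before["CustomerFName"] + " " + last.str.capitalize()).str.strip(),
    # Assumption: the contact number is the run of digits at the end of the field.
    "Contact": before["Contact"].str.extract(r"(\d+)$", expand=False),
})
print(after)
```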
Data Cleaning using Python:
Python is a powerful language for data processing. It has a set of libraries that can modify or clean our data quickly and easily. We are going to look at two of them, NumPy and pandas.
First import libraries:
>>import numpy as np
>>import pandas as pd
To drop columns:
>>ds=pd.read_csv("datasetname.csv")
>>ds.head()
>>drop_list=["column1","column2"]  # placeholder names of the columns to be dropped
>>ds.drop(drop_list,inplace=True,axis=1)
or
>>ds.drop(columns=drop_list,inplace=True)
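For a version that runs without a CSV file, here is a small sketch on an in-memory DataFrame. The column names are made up for the illustration.

```python
import pandas as pd

# Hypothetical dataset built in memory so the example runs without a CSV file.
ds = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "notes": ["x", "y", "z"],
    "temp": [0, 0, 0],
})

drop_list = ["notes", "temp"]

# Without inplace=True, drop() returns a new DataFrame and leaves ds untouched.
trimmed = ds.drop(columns=drop_list)

# errors="ignore" skips names that are not present instead of raising KeyError.
trimmed_again = trimmed.drop(columns=drop_list, errors="ignore")

print(list(trimmed.columns))
```

Assigning the result to a new name instead of using inplace=True keeps the original data available in case the wrong columns were dropped.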
To change index:
>>ds=ds.set_index("index")  # set_index returns a new DataFrame, so assign it back
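A short sketch of what changing the index gives you, using a hypothetical cust_id column as the new index:

```python
import pandas as pd

# Hypothetical data; cust_id is assumed to identify each row uniquely.
ds = pd.DataFrame({
    "cust_id": ["001", "002", "003"],
    "name": ["Sam", "Badri", "Michael"],
})

# Check uniqueness first, since a duplicated index makes lookups ambiguous.
ids_are_unique = ds["cust_id"].is_unique

# set_index returns a new DataFrame; assign it back to keep the change.
ds = ds.set_index("cust_id")

# Rows can now be selected by label with .loc.
name_002 = ds.loc["002", "name"]
print(name_002)
```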
Remove extra spaces or tidy data:
>>spc=ds["column name"].str.extract(regex,expand=False)  # regex must contain a capture group
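As an illustration, the sketch below trims whitespace with str.strip and then pulls a four-digit year out of messy text with str.extract. The column name, the sample values and the pattern are assumptions made for the example.

```python
import pandas as pd

# Hypothetical messy publication dates.
ds = pd.DataFrame({"published": ["  1879 ", "[1897?]", "1905, c1904", "n.d."]})

# Remove leading and trailing whitespace.
stripped = ds["published"].str.strip()

# Extract the first run of four digits; the parentheses form the capture group.
# Rows without a match become NaN.
regex = r"(\d{4})"
year = stripped.str.extract(regex, expand=False)
print(year.tolist())
```

With expand=False and a single capture group, str.extract returns a Series instead of a DataFrame, which is convenient when assigning the result back to one column.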
Combining str methods with numpy to clean data:
>>np.where(condition,value_if_true,value_if_false)
>>ds["columnname"]=np.where(condition,value_if_true,value_if_false)  # condition is a boolean test on the column
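A concrete sketch of this pattern: a str method builds the condition, and np.where picks a value for each row. The place column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with inconsistent spellings of the same place.
ds = pd.DataFrame({"place": ["London ; Virtue", "Oxford", "london: Blackie", "Paris"]})

# Build a boolean condition with a str method (case-insensitive match).
is_london = ds["place"].str.contains("london", case=False)

# Where the condition holds use "London", otherwise keep the original value.
ds["place"] = np.where(is_london, "London", ds["place"])
print(ds["place"].tolist())
```

Calls to np.where can be nested in the else position to handle more than one condition, although that gets hard to read beyond two or three levels.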
Renaming columns:
>>new_names={"original":"changed"}  # maps each old column name to its new name
>>ds.rename(columns=new_names,inplace=True)
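A small runnable sketch of renaming, with made-up old and new column names:

```python
import pandas as pd

# Hypothetical frame with the column names from the earlier example table.
ds = pd.DataFrame({"CustId": ["001"], "CustomerFName": ["Sam"]})

# Map each old column name to its new name.
new_names = {"CustId": "cust_id", "CustomerFName": "first_name"}

# Without inplace=True, rename() returns a new DataFrame.
renamed = ds.rename(columns=new_names)
print(list(renamed.columns))
```

By default, keys in the mapping that do not match any column are ignored, so a typo in an old name fails silently. Checking the resulting column list is a cheap safeguard.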