Python has a wide range of applications, and data analysis and manipulation is one of them.
Using the Pandas library, we can analyze and manipulate data in Python to uncover useful patterns.
What is Pandas?
Pandas is a Python library which is used to work with sequential and tabular data. It has functionality to manage, analyze and manipulate data in a simple and efficient way.
We can think of its data structures as analogous to database tables or spreadsheets.
Pandas is built on top of the NumPy library. Its two primary data structures are the Series (1D data) and the DataFrame (2D data). It can work with homogeneous as well as heterogeneous data.
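To make the two data structures concrete, here is a minimal sketch. The item names and prices are made-up illustrative values, not data from any real source.

```python
import pandas as pd

# A Series is a one-dimensional labeled array (values are illustrative).
prices = pd.Series([10.5, 20.0, 7.25], name='price')

# A DataFrame is two-dimensional; each of its columns is itself a Series.
items = pd.DataFrame({'item': ['A', 'B', 'C'],
                      'price': [10.5, 20.0, 7.25]})

print(prices.ndim)           # number of dimensions of the Series
print(items.ndim)            # number of dimensions of the DataFrame
print(type(items['price']))  # selecting one column gives back a Series
```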
Features of Pandas
- Time-series manipulation tools
- Works with missing data (NaN)
- Works with different data file formats (xls, db, csv, psv, hdf5, etc.)
- ETL tools (Extract, Transform and Load)
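As a rough illustration of the first two features, the sketch below builds a small time-indexed Series with one missing value, counts and fills the gap, and then totals the values per month. The readings and dates are hypothetical, and filling with 0.0 is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; the second value is missing (NaN).
dates = pd.date_range('2021-12-30', periods=4, freq='D')
readings = pd.Series([1.0, np.nan, 3.0, 4.0], index=dates)

# Missing-data handling: count the gaps, then fill them with 0.0.
missing_count = int(readings.isna().sum())
filled = readings.fillna(0.0)

# Time-series manipulation: total per calendar month ('MS' = month start).
monthly_total = filled.resample('MS').sum()

print(missing_count)
print(monthly_total)
```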
What is a DataFrame?
A Pandas DataFrame is a heterogeneous 2D object: the data within each column share one type, but different columns can have different types. Its rows are labeled with an index.
In simple words, a DataFrame is like a table in a database.
The index can be implicit, starting at 0, or we can supply our own (explicit) index. An index can even consist of dates and times.
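The points above about per-column types and implicit versus explicit indexes can be sketched as follows. The column names, values and the labels 'x', 'y', 'z' are illustrative assumptions.

```python
import pandas as pd

# No index given: pandas assigns an implicit index 0, 1, 2.
implicit = pd.DataFrame({'Item': ['A', 'B', 'C']})
print(list(implicit.index))

# Explicit index: we supply our own row labels.
explicit = pd.DataFrame(
    {'Item': ['A', 'B', 'C'], 'Qty': [1, 2, 3], 'Price': [9.5, 3.0, 4.25]},
    index=['x', 'y', 'z'],
)

# Each column has its own data type (text, integer, float).
print(explicit.dtypes)

# Look up a row by its explicit label.
print(explicit.loc['y'])
```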
Let us now work with DataFrames.
Creating an empty DataFrame
import pandas as pd
df1 = pd.DataFrame()
print(df1)
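Creating an empty DataFrame with a defined structure
import pandas as pd
# columns only: no rows yet
df1 = pd.DataFrame(columns=['Sr. no', 'Item', 'Desc'])
print(df1)
# columns plus an index 1..9: every cell is NaN until filled
df2 = pd.DataFrame(columns=['Sr. no', 'Item', 'Desc'], index=range(1, 10))
print(df2)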
Creating a DataFrame from NumPy arrays
import numpy as np
import pandas as pd
# each NumPy array becomes one column
arr = {'Sr. no': np.array([1, 2, 3, 4]),
       'Items': np.array(['A', 'B', 'C', 'D'])}
df1 = pd.DataFrame(arr)
print(df1)
Creating a DataFrame from a dictionary
import pandas as pd
dict1 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
# wrapping the dictionary in a list gives one row; the keys become column labels
df1 = pd.DataFrame([dict1])
print(df1)
Creating a DataFrame with a datetime index
import pandas as pd
arr = {'Sr. no': [1, 2, 3, 4],
       'Items': ['A', 'B', 'C', 'D']}
# one timestamp per row
indx = pd.DatetimeIndex(['2021-12-30', '2021-12-31', '2022-01-01', '2022-01-02'])
df1 = pd.DataFrame(arr, index=indx)
print(df1)
Viewing a DataFrame
import pandas as pd
arr = {'Sr. no': [1, 2, 3, 4],
       'Items': ['A', 'B', 'C', 'D']}
indx = pd.DatetimeIndex(['2021-12-30', '2021-12-31', '2022-01-01', '2022-01-02'])
df1 = pd.DataFrame(arr, index=indx)
print(df1)  # print the whole DataFrame
# get first two rows
print(df1.head(2))
# get last two rows
print(df1.tail(2))
# get DataFrame's index
print(df1.index)
# get DataFrame's columns
print(df1.columns)
# get DataFrame's values as a NumPy array
print(df1.values)
Importing Data in Pandas
Pandas can read data from several formats; the most common include CSV, PSV, XLS, JSON, SQL and HDF5.
We will look at a few examples of importing data from different formats, along with some customizations.
import pandas as pd
# read a text file that uses a single blank space as the separator
df1 = pd.read_csv('file1.csv', sep=' ')
# read only columns 0 to 3 and only the first 100 rows
df1 = pd.read_csv('file1.csv', usecols=[0, 1, 2, 3], nrows=100)
# read the sheet named 'User' from an Excel file
df1 = pd.read_excel('file1.xls', sheet_name='User')
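The calls above need files on disk. To try the same options without one, here is a small sketch that reads from in-memory text through io.StringIO. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Illustrative comma-separated data held in memory instead of a file.
csv_text = (
    "id,item,qty,price,note\n"
    "1,A,10,2.5,x\n"
    "2,B,20,3.0,y\n"
    "3,C,30,4.5,z\n"
)

# Keep only columns 0 to 3 and read only the first 2 data rows.
subset = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 2, 3], nrows=2)
print(subset)

# Illustrative space-separated data, read with sep=' '.
space_text = "id item\n1 A\n2 B\n"
spaced = pd.read_csv(io.StringIO(space_text), sep=' ')
print(spaced)
```

Replacing io.StringIO(...) with a file path gives the same behavior for a real file.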
More on the Pandas DataFrame in an upcoming post.
Hope it helps!
Happy Learning 🙂