I plan to write one article per week on Data Engineering from a beginner's perspective; the first one was Python for Frontend Engineers. Do not expect a proven pathway to becoming a Data Engineer, because I do not have a strategy at the moment. I am just following my gut feeling to become proficient sooner rather than later. So consider these blog posts my personal notes, which may or may not be helpful to others.
So, Pandas is a Python library for data processing and analysis: it provides functions and methods to manipulate large datasets efficiently.
I am still learning pandas and will continue to explore its features. The Pandas documentation is pretty self-explanatory, so this article only gives a glimpse of its powers, like a trailer.
This is how you can import pandas and start using it right away.
Importing Pandas
import pandas as pd
# Everything pandas offers is now available as pd.<name>, e.g. pd.Series or pd.DataFrame
Pandas has two primary data structures.
Series
The Series data structure represents a one-dimensional array with labels, which makes it behave much like a Python dictionary: every value is reachable through its label. The values can be of any type supported by Python, such as strings, numbers, or booleans.
Creating Series
sons_of_pandu = {
'son1': 'Yudhishthira',
'son2': 'Bhima',
'son3': 'Arjuna',
'son4': 'Nakula',
'son5': 'Sahadeva'
}
pandavas_series = pd.Series(sons_of_pandu)
print(pandavas_series)
# Prints following in Jupyter Notebook
# son1 Yudhishthira
# son2 Bhima
# son3 Arjuna
# son4 Nakula
# son5 Sahadeva
# dtype: object
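A Series does not have to come from a dictionary. As a minimal sketch (the variable names below are made up for illustration), the same kind of labelled data can be built from a plain list plus an explicit index, and then read back by label:

```python
import pandas as pd

# Illustrative data: names from the article, variable names are hypothetical
names = ["Yudhishthira", "Bhima", "Arjuna"]
labels = ["son1", "son2", "son3"]

# Values come from the list, labels from the index argument
series_from_list = pd.Series(names, index=labels)

# Label-based lookup, much like reading a dictionary key
second_son = series_from_list["son2"]

# Labels and values can be inspected separately
label_list = list(series_from_list.index)
value_list = series_from_list.tolist()
```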
Changing Indices of Series
Sometimes we prefer more meaningful index labels. Here we change the index of the series to the Pandavas' progenitors.
pandavas_series.index = ["Yama", "Vayu", "Indra", "Ashwini Kumara Nasatya", "Ashwini Kumara Darsa"]
print(pandavas_series)
# Prints following in Jupyter Notebook
# Yama Yudhishthira
# Vayu Bhima
# Indra Arjuna
# Ashwini Kumara Nasatya Nakula
# Ashwini Kumara Darsa Sahadeva
# dtype: object
Slicing Series
Slicing is really handy when glancing at a large dataset. We can also slice the series for an exploratory view as follows.
pandavas_series[0:2] # Prints first and second rows excluding the third
pandavas_series[1:] # Prints all rows except the first one
pandavas_series[-2:] # Prints the last two rows only
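The slices above rely on positions. A small sketch of the two explicit accessors, assuming the re-indexed series from the previous step: `.iloc` slices by position and excludes the end, while `.loc` slices by label and includes both ends.

```python
import pandas as pd

pandavas = pd.Series(
    ["Yudhishthira", "Bhima", "Arjuna", "Nakula", "Sahadeva"],
    index=["Yama", "Vayu", "Indra", "Ashwini Kumara Nasatya", "Ashwini Kumara Darsa"],
)

# Position-based: rows at positions 0 and 1, position 2 is excluded
first_two = pandavas.iloc[0:2]

# Label-based: both the start and the end label are included
yama_to_indra = pandavas.loc["Yama":"Indra"]

# Negative positions count from the end
last_two = pandavas.iloc[-2:]
```

Being explicit about `.iloc` versus `.loc` avoids surprises when the index itself contains integers.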
Appending Series
It is very common to combine different datasets in Pandas, and appending one Series to another is a tool you cannot ignore. Note that the Series.append method was removed in pandas 2.0; pd.concat is its replacement.
kauravas = ["Duryodhan", "Dushasana", "Vikarna", "Yuyutsu", "Jalsandh", "Sam", "Sudushil", "Bheembal", "Subahu", "Sahishnu", "Yekkundi", "Durdhar", "Durmukh", "Bindoo", "Krup", "Chitra", "Durmad", "Dushchar", "Sattva", "Chitraksha", "Urnanabhi", "Chitrabahoo", "Sulochan", "Sushabh", "Chitravarma", "Asasen", "Mahabahu", "Samdukkha", "Mochan", "Sumami", "Vibasu", "Vikar", "Chitrasharasan", "Pramah", "Somvar", "Man", "Satyasandh", "Vivas", "Upchitra", "Chitrakuntal", "Bheembahu", "Sund", "Valaki", "Upyoddha", "Balavardha", "Durvighna", "Bheemkarmi", "Upanand", "Anasindhu", "Somkirti", "Kudpad", "Ashtabahu", "Ghor", "Roudrakarma", "Veerbahoo", "Kananaa", "Kudasi", "Deerghbahu", "Adityaketoo", "Pratham", "Prayaami", "Veeryanad", "Deerghtaal", "Vikatbahoo", "Drudhrath", "Durmashan", "Ugrashrava", "Ugra", "Amay", "Kudbheree", "Bheemrathee", "Avataap", "Nandak", "Upanandak", "Chalsandhi", "Broohak", "Suvaat", "Nagdit", "Vind", "Anuvind", "Arajeev", "Budhkshetra", "Droodhhasta", "Ugraheet", "Kavachee", "Kathkoond", "Aniket", "Kundi", "Durodhar", "Shathasta", "Shubhkarma", "Saprapta", "Dupranit", "Bahudhami", "Yuyutsoo", "Dhanurdhar", "Senanee", "Veer", "Pramathee", "Droodhsandhee", "Dushala"]
kauravas_series = pd.Series(kauravas)
# Series.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([pandavas_series, kauravas_series])
# Prints following in Jupyter Notebook
# Yama Yudhishthira
# Vayu Bhima
# Indra Arjuna
# Ashwini Kumara Nasatya Nakula
# Ashwini Kumara Darsa Sahadeva
# 0 Duryodhan
# 1 Dushasana
# ...
# Length: 106, dtype: object
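Combining two Series keeps each one's original labels, which is why the output above mixes names and numbers in the index. A short sketch with shortened, illustrative data shows the default behaviour of `pd.concat` next to the `ignore_index=True` option, which renumbers the result from zero:

```python
import pandas as pd

# Shortened versions of the two series, for illustration only
pandavas = pd.Series(["Yudhishthira", "Bhima"], index=["Yama", "Vayu"])
kauravas = pd.Series(["Duryodhan", "Dushasana"])

# Default: original labels are kept, so the index is mixed
combined = pd.concat([pandavas, kauravas])

# ignore_index=True discards the old labels and renumbers 0..n-1
renumbered = pd.concat([pandavas, kauravas], ignore_index=True)

# pd.concat returns a new Series; the inputs are left unchanged
original_length = len(pandavas)
```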
Dropping from Series
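Pass an index label to drop that row from the series. Note that drop returns a new Series and leaves the original unchanged.
# The combined series above was never assigned back,
# so pandavas_series still holds only the five Pandavas
pandavas_series.drop('Yama') # Prints following in Jupyter Notebook
# Vayu Bhima
# Indra Arjuna
# Ashwini Kumara Nasatya Nakula
# Ashwini Kumara Darsa Sahadeva
# dtype: object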
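Because `drop` hands back a new Series, the result has to be assigned to a name if it is needed later. A minimal sketch on a shortened, illustrative series, which also shows that a list of labels drops several rows at once:

```python
import pandas as pd

pandavas = pd.Series(
    ["Yudhishthira", "Bhima", "Arjuna"],
    index=["Yama", "Vayu", "Indra"],
)

# Drop a single label; the result is a new Series
without_yama = pandavas.drop("Yama")

# Drop several labels by passing a list
only_bhima = pandavas.drop(["Yama", "Indra"])

# The original series still has all three rows
original_length = len(pandavas)
```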
Dataframes
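The DataFrame data structure represents a two-dimensional table with labelled rows and columns. A convenient way to build one is from a Python list of dictionaries, where each dictionary becomes a row.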
Creating Dataframe
sons_of_pandu = [{
'name': 'Yudhishthira',
'progenitor': "Yama"
}, {
'name': 'Bhima',
'progenitor': "Vayu"
}, {
'name': 'Arjuna',
'progenitor': "Indra"
}, {
'name': 'Nakula',
'progenitor': "Ashwini Kumara Nasatya"
}, {
'name': 'Sahadeva',
'progenitor': "Ashwini Kumara Darsa"
}]
df_pandavas = pd.DataFrame(sons_of_pandu)
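Once a DataFrame exists, a few attributes give a quick overview of it. The sketch below uses a shortened version of the list of dictionaries above (the variable names are illustrative) and shows that the dictionary keys become column names and that a single column comes back as a Series:

```python
import pandas as pd

# Two of the records from above, for illustration
records = [
    {"name": "Yudhishthira", "progenitor": "Yama"},
    {"name": "Bhima", "progenitor": "Vayu"},
]
df = pd.DataFrame(records)

# (number of rows, number of columns)
dimensions = df.shape

# Dictionary keys become the column labels
column_names = list(df.columns)

# Selecting one column with a single pair of brackets returns a Series
name_column = df["name"]
```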
Head’ing DataFrame
df_pandavas.head() # returns first 5 rows
df_pandavas.head(3) # returns first 3 rows
Tail’ing DataFrame
df_pandavas.tail() # returns last 5 rows
df_pandavas.tail(3) # returns last 3 rows
Sorting DataFrame
df_pandavas.sort_values(by="name")
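Like most pandas methods, `sort_values` returns a new DataFrame rather than reordering the existing one. A small sketch on illustrative data, including the `ascending=False` option for reverse order:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Yudhishthira", "Bhima", "Arjuna"]})

# Alphabetical order; the original row labels travel with the rows
ascending = df.sort_values(by="name")

# Reverse alphabetical order
descending = df.sort_values(by="name", ascending=False)

# reset_index(drop=True) renumbers the sorted rows from zero
renumbered = ascending.reset_index(drop=True)
```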
Slicing DataFrame
df_pandavas[0:2] # Prints first and second rows excluding the third
df_pandavas[1:] # Prints all rows except the first one
df_pandavas[-2:] # Prints the last two rows only
df_pandavas[["name"]] # Prints all rows with "name" column only
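The same selections can be written with the explicit `.iloc` and `.loc` accessors, and rows can also be picked by a condition. This is a sketch assuming the DataFrame built earlier; the filter on the progenitor column is just an illustrative condition:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yudhishthira", "Bhima", "Arjuna", "Nakula", "Sahadeva"],
    "progenitor": ["Yama", "Vayu", "Indra",
                   "Ashwini Kumara Nasatya", "Ashwini Kumara Darsa"],
})

# Position-based row slice: rows 0 and 1
first_two = df.iloc[0:2]

# All rows, one column, selected by label (returns a Series)
names_only = df.loc[:, "name"]

# Boolean filter: keep rows whose progenitor starts with "Ashwini"
ashwini_sons = df[df["progenitor"].str.startswith("Ashwini")]
```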
Copying DataFrame
df_pandavas_in_alternate_dimension = df_pandavas.copy()
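The point of `copy()` is that the new DataFrame has its own data, so changes to it do not leak back into the original. A minimal sketch with illustrative data and a placeholder value:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Arjuna", "Bhima"]})

# copy() makes a deep copy by default
df_copy = df.copy()

# Change a value in the copy only ("changed" is a placeholder value)
df_copy.loc[0, "name"] = "changed"

original_value = df.loc[0, "name"]
copied_value = df_copy.loc[0, "name"]
```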
Wrap up
That’s it. There is more to Pandas than mere slicing/merging/copying/sorting. You can also read and write CSV and Excel files with very little code. Head over to the Pandas documentation for more information.
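As a parting sketch of the CSV support mentioned above, here is a round trip through `to_csv` and `read_csv`. It writes to an in-memory buffer instead of a real file so it runs anywhere; passing a file path such as a hypothetical "pandavas.csv" works the same way.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "name": ["Arjuna", "Bhima"],
    "progenitor": ["Indra", "Vayu"],
})

# Write the DataFrame as CSV text; index=False leaves out the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
```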