Kung Fu “Pandas”


I plan to write one article per week on Data Engineering from a beginner's perspective; the first one was Python for Frontend Engineers. Do not expect a proven pathway to becoming a Data Engineer, because I do not have a strategy at the moment. I am just following my gut feeling to become proficient sooner rather than later. So consider these blog posts my personal notes, which may or may not be helpful to others.

Pandas is a data processing library for Python that helps with data analysis: it provides functions and methods for manipulating large datasets efficiently.

I am still learning pandas and will continue to explore its features. The Pandas documentation is pretty self-explanatory, so this article only gives a glimpse of its powers, like a movie trailer.

This is how you can import pandas and start using it right away.

Importing Pandas

import pandas as pd

# Everything pandas offers is now available under the pd namespace,
# for example pd.Series and pd.DataFrame

Pandas has two main data structures: Series and DataFrame.

Series

The Series data structure represents a one-dimensional array with labels, much like a Python dictionary maps keys to values. The values can be of any type supported by Python, such as strings, numbers or booleans.

Creating Series

sons_of_pandu = {
  'son1': 'Yudhishthira',
  'son2': 'Bhima',
  'son3': 'Arjuna',
  'son4': 'Nakula',
  'son5': 'Sahadeva'
}
pandavas_series = pd.Series(sons_of_pandu)
print(pandavas_series)
# Prints following in Jupyter Notebook
# son1   Yudhishthira
# son2   Bhima
# son3   Arjuna
# son4   Nakula
# son5   Sahadeva
# dtype: object
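
Once the series exists, the dictionary keys act as its index. Here is a small sketch of my own showing how a value can be looked up by label or by position; the variable names third_by_label, third_by_position and labels are just for illustration.

```python
import pandas as pd

# Same data as in the example above
sons_of_pandu = {
    'son1': 'Yudhishthira',
    'son2': 'Bhima',
    'son3': 'Arjuna',
    'son4': 'Nakula',
    'son5': 'Sahadeva',
}
pandavas_series = pd.Series(sons_of_pandu)

third_by_label = pandavas_series.loc['son3']   # look up by index label
third_by_position = pandavas_series.iloc[2]    # look up by 0-based position
labels = list(pandavas_series.index)           # the dictionary keys became the index
```

Both lookups return the same value here; .loc works with labels and .iloc with positions.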

Changing Indices of Series

Sometimes we prefer different index labels. Here we change the index of the series to the Pandavas’ progenitors.

pandavas_series.index = ["Yama", "Vayu", "Indra", "Ashwini Kumara Nasatya", "Ashwini Kumara Darsa"]
print(pandavas_series)
# Prints following in Jupyter Notebook
# Yama                   Yudhishthira
# Vayu                   Bhima
# Indra                  Arjuna
# Ashwini Kumara Nasatya Nakula
# Ashwini Kumara Darsa   Sahadeva
# dtype: object

Slicing Series

Slicing is really handy when glancing at a large dataset. We can also slice the series for an exploratory view as follows.

pandavas_series[0:2] # Prints first and second rows excluding the third
pandavas_series[1:]  # Prints all rows except the first one
pandavas_series[-2:] # Prints the last two rows only
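
The plain square-bracket slices above go by position. As a hedged sketch, the same idea can be written more explicitly with .iloc (positions) and .loc (labels); note that a label slice includes its end label, which surprised me at first. The names first_two and yama_to_indra are my own.

```python
import pandas as pd

pandavas_series = pd.Series(
    ['Yudhishthira', 'Bhima', 'Arjuna', 'Nakula', 'Sahadeva'],
    index=['Yama', 'Vayu', 'Indra',
           'Ashwini Kumara Nasatya', 'Ashwini Kumara Darsa'],
)

# Positional slice: positions 0 and 1, the end position is excluded
first_two = pandavas_series.iloc[0:2]

# Label slice: from 'Yama' up to and including 'Indra'
yama_to_indra = pandavas_series.loc['Yama':'Indra']
```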

Appending Series

It is very common to deal with several datasets in Pandas, and appending one series to another is a complement you cannot ignore.

kauravas = ["Duryodhan", "Dushasana", "Vikarna", "Yuyutsu", "Jalsandh", "Sam", "Sudushil", "Bheembal", "Subahu", "Sahishnu", "Yekkundi", "Durdhar", "Durmukh", "Bindoo", "Krup", "Chitra", "Durmad", "Dushchar", "Sattva", "Chitraksha", "Urnanabhi", "Chitrabahoo", "Sulochan", "Sushabh", "Chitravarma", "Asasen", "Mahabahu", "Samdukkha", "Mochan", "Sumami", "Vibasu", "Vikar", "Chitrasharasan", "Pramah", "Somvar", "Man", "Satyasandh", "Vivas", "Upchitra", "Chitrakuntal", "Bheembahu", "Sund", "Valaki", "Upyoddha", "Balavardha", "Durvighna", "Bheemkarmi", "Upanand", "Anasindhu", "Somkirti", "Kudpad", "Ashtabahu", "Ghor", "Roudrakarma", "Veerbahoo", "Kananaa", "Kudasi", "Deerghbahu", "Adityaketoo", "Pratham", "Prayaami", "Veeryanad", "Deerghtaal", "Vikatbahoo", "Drudhrath", "Durmashan", "Ugrashrava", "Ugra", "Amay", "Kudbheree", "Bheemrathee", "Avataap", "Nandak", "Upanandak", "Chalsandhi", "Broohak", "Suvaat", "Nagdit", "Vind", "Anuvind", "Arajeev", "Budhkshetra", "Droodhhasta", "Ugraheet", "Kavachee", "Kathkoond", "Aniket", "Kundi", "Durodhar", "Shathasta", "Shubhkarma", "Saprapta", "Dupranit", "Bahudhami", "Yuyutsoo", "Dhanurdhar", "Senanee", "Veer", "Pramathee", "Droodhsandhee", "Dushala"]
kauravas_series = pd.Series(kauravas)
# Series.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([pandavas_series, kauravas_series]) # Prints following in Jupyter Notebook
# Yama                   Yudhishthira
# Vayu                   Bhima
# Indra                  Arjuna
# Ashwini Kumara Nasatya Nakula
# Ashwini Kumara Darsa   Sahadeva
# 0                      Duryodhan
# 1                      Dushasana
# ...
# Length: 106, dtype: object
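
On recent pandas versions (2.0 and later) Series.append no longer exists, so pd.concat is the way to join series. The sketch below uses two shortened series of my own (not the full lists) to show that the result keeps both indexes by default, and that ignore_index=True renumbers the rows instead.

```python
import pandas as pd

# Shortened stand-ins for the two series, for illustration only
pandavas = pd.Series(['Yudhishthira', 'Bhima'], index=['Yama', 'Vayu'])
kauravas = pd.Series(['Duryodhan', 'Dushasana'])  # default index 0, 1

# Default: the original labels of both series are kept
kept_labels = pd.concat([pandavas, kauravas])

# ignore_index=True throws the labels away and numbers rows 0..n-1
renumbered = pd.concat([pandavas, kauravas], ignore_index=True)
```

Neither call changes pandavas or kauravas; each returns a new series.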

Dropping from Series

Pass an index label to drop that row from the series. Note that drop returns a new series and leaves the original untouched.

# The combined series above was never assigned back, so pandavas_series still has 5 rows
pandavas_series.drop('Yama') # Prints following in Jupyter Notebook
# Vayu                   Bhima
# Indra                  Arjuna
# Ashwini Kumara Nasatya Nakula
# Ashwini Kumara Darsa   Sahadeva
# dtype: object
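
To make the behaviour of drop concrete, here is a minimal sketch, assuming the five-row series with progenitor labels from earlier. It shows that the original series is not modified and that a list of labels drops several rows at once. The names without_yama and without_twins are mine.

```python
import pandas as pd

pandavas_series = pd.Series(
    ['Yudhishthira', 'Bhima', 'Arjuna', 'Nakula', 'Sahadeva'],
    index=['Yama', 'Vayu', 'Indra',
           'Ashwini Kumara Nasatya', 'Ashwini Kumara Darsa'],
)

# drop returns a new series; pandavas_series itself keeps all 5 rows
without_yama = pandavas_series.drop('Yama')

# A list of labels drops several rows in one call
without_twins = pandavas_series.drop(
    ['Ashwini Kumara Nasatya', 'Ashwini Kumara Darsa']
)
```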

DataFrames

The DataFrame data structure represents a two-dimensional table with labeled rows and columns. One simple way to build it is from a Python list of dictionaries, one dictionary per row.

Creating DataFrame

sons_of_pandu = [{
  'name': 'Yudhishthira',
  'progenitor': "Yama"
}, {
  'name': 'Bhima',
  'progenitor': "Vayu"
}, {
  'name': 'Arjuna',
  'progenitor': "Indra"
}, {
  'name': 'Nakula',
  'progenitor': "Ashwini Kumara Nasatya"
}, {
  'name': 'Sahadeva',
  'progenitor': "Ashwini Kumara Darsa"
}]
df_pandavas = pd.DataFrame(sons_of_pandu)
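
After creating a DataFrame I like to check what I actually got. A small sketch, using the same data in a more compact layout: the dictionary keys become column names, each dictionary becomes a row, and a single column comes back as a Series. The names shape, columns and name_column are just for illustration.

```python
import pandas as pd

sons_of_pandu = [
    {'name': 'Yudhishthira', 'progenitor': 'Yama'},
    {'name': 'Bhima', 'progenitor': 'Vayu'},
    {'name': 'Arjuna', 'progenitor': 'Indra'},
    {'name': 'Nakula', 'progenitor': 'Ashwini Kumara Nasatya'},
    {'name': 'Sahadeva', 'progenitor': 'Ashwini Kumara Darsa'},
]
df_pandavas = pd.DataFrame(sons_of_pandu)

shape = df_pandavas.shape             # (number of rows, number of columns)
columns = list(df_pandavas.columns)   # dictionary keys became column names
name_column = df_pandavas['name']     # one column is a Series
```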

Head’ing DataFrame

df_pandavas.head()  # returns first 5 rows
df_pandavas.head(3) # returns first 3 rows

Tail’ing DataFrame

df_pandavas.tail()  # returns last 5 rows
df_pandavas.tail(3) # returns last 3 rows

Sorting DataFrame

df_pandavas.sort_values(by="name")
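
sort_values sorts in ascending order by default and returns a new DataFrame. As a hedged sketch of my own, the ascending parameter reverses the order, and the original DataFrame keeps its row order.

```python
import pandas as pd

df_pandavas = pd.DataFrame([
    {'name': 'Yudhishthira', 'progenitor': 'Yama'},
    {'name': 'Bhima', 'progenitor': 'Vayu'},
    {'name': 'Arjuna', 'progenitor': 'Indra'},
    {'name': 'Nakula', 'progenitor': 'Ashwini Kumara Nasatya'},
    {'name': 'Sahadeva', 'progenitor': 'Ashwini Kumara Darsa'},
])

# Ascending (default) and descending sort by the "name" column
sorted_asc = df_pandavas.sort_values(by='name')
sorted_desc = df_pandavas.sort_values(by='name', ascending=False)

asc_names = list(sorted_asc['name'])
desc_names = list(sorted_desc['name'])
```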

Slicing DataFrame

df_pandavas[0:2] # Prints first and second rows excluding the third
df_pandavas[1:]  # Prints all rows except the first one
df_pandavas[-2:] # Prints the last two rows only
df_pandavas[["name"]] # Prints all rows with "name" column only
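
Besides slicing by position, rows can also be picked by a condition. The following is a minimal sketch under my own assumptions (the names twins and arjuna_progenitor are hypothetical): a boolean mask selects matching rows, and .loc combines a row condition with a column name.

```python
import pandas as pd

df_pandavas = pd.DataFrame([
    {'name': 'Yudhishthira', 'progenitor': 'Yama'},
    {'name': 'Bhima', 'progenitor': 'Vayu'},
    {'name': 'Arjuna', 'progenitor': 'Indra'},
    {'name': 'Nakula', 'progenitor': 'Ashwini Kumara Nasatya'},
    {'name': 'Sahadeva', 'progenitor': 'Ashwini Kumara Darsa'},
])

# Explicit positional slice: first two rows
first_two = df_pandavas.iloc[0:2]

# Boolean mask: rows whose progenitor starts with "Ashwini"
twins = df_pandavas[df_pandavas['progenitor'].str.startswith('Ashwini')]

# Row condition plus column name, then take the single matching value
arjuna_progenitor = df_pandavas.loc[
    df_pandavas['name'] == 'Arjuna', 'progenitor'
].iloc[0]
```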

Copying DataFrame

df_pandavas_in_alternate_dimension = df_pandavas.copy()
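
Why copy at all? A plain assignment only creates a second name for the same DataFrame, while copy() gives an independent one. Here is a small sketch of the difference; df_alias, df_copy and the replacement name 'Dharmaraja' are my own choices for illustration.

```python
import pandas as pd

df_pandavas = pd.DataFrame([
    {'name': 'Yudhishthira', 'progenitor': 'Yama'},
    {'name': 'Bhima', 'progenitor': 'Vayu'},
])

df_alias = df_pandavas         # same object under a second name
df_copy = df_pandavas.copy()   # independent copy of the data

# Change the first row of the original
df_pandavas.loc[0, 'name'] = 'Dharmaraja'

alias_name = df_alias.loc[0, 'name']   # sees the change
copy_name = df_copy.loc[0, 'name']     # still the old value
```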

Wrap up

That’s it. There is more to Pandas than mere slicing/merging/copying/sorting. You can easily read and write CSV and Excel files in Pandas. Head over to the Pandas documentation for more information.
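
As a teaser for the CSV part, here is a hedged sketch of a round trip. To keep it self-contained I write to an in-memory io.StringIO buffer instead of a real file; with a file you would pass a path such as 'pandavas.csv' (a made-up name) to the same to_csv and read_csv functions.

```python
import io

import pandas as pd

df_pandavas = pd.DataFrame([
    {'name': 'Yudhishthira', 'progenitor': 'Yama'},
    {'name': 'Bhima', 'progenitor': 'Vayu'},
    {'name': 'Arjuna', 'progenitor': 'Indra'},
])

# Write the DataFrame as CSV text; index=False leaves out the row numbers
buffer = io.StringIO()
df_pandavas.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
df_loaded = pd.read_csv(buffer)
```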

If you found this article useful in any way, feel free to donate and receive one of my dilettante paintings as a token of appreciation for your donation.