Python Pandas Functions for Data Scientists [Top 10 Functions]


In this tutorial, we will learn about Python Pandas functions for Data Scientists. We will cover the top 10 functions that every Data Scientist reaches for. Data analysis is one of the fastest-growing fields in the data world, and Data Engineers, Data Scientists, and Data Analysts all rely on Python’s Pandas library for most of their data operations and for converting raw data into a usable format. We will explore the power of the Pandas library by looking at some of its most important methods.

 

What is Pandas?

Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures like Series (1-dimensional) and DataFrame (2-dimensional) that are designed to handle and manipulate structured data seamlessly. We will cover the 10 most popular and commonly used Pandas functions in the upcoming sections. So let’s get started.
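As a quick, standalone illustration of these two structures (the values here are only for illustration and are not part of the dataset used later), a Series and a DataFrame can be built directly from a Python list and dictionary:

import pandas as pd

# A Series is a labelled 1-dimensional array
employees = pd.Series([3498, 4952, 5287], name='Number of employees')

# A DataFrame is a 2-dimensional table of labelled columns
df = pd.DataFrame({
    'Name': ['Ferrell LLC', 'Hester Ltd', 'Mayer Group'],
    'Founded': [1990, 1971, 1991],
})

print(employees)
print(df)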

 

Python Pandas Functions for Data Scientists [Top 10 Functions]

Also read: Python Pandas Compare Two Dataframes [Solved]

The Pandas methods discussed in the rest of this tutorial are fundamental to data manipulation and analysis. They offer powerful tools for reading, processing, and transforming tabular data efficiently. Let us look at each method one by one.

1. read_csv()

In Pandas, the read_csv() method is used to read data from a CSV file and create a DataFrame. It allows you to load data into a Pandas DataFrame from a comma-separated values (CSV) file. We will use a common code example throughout this tutorial to demonstrate all of the methods.

In the example below, we fetch the dataset from a URL using the requests module. The requests.get() method fetches the data from the URL and stores the result in the response variable. We then check the response code using response.status_code. If the response code indicates success (200), we read the response body as text, wrap it in a StringIO object, and load it into a DataFrame using the pd.read_csv() method. Create a file and save the code below.

Syntax

read_csv(<filepath>, sep=',', delimiter=None, header='infer', names=None)

 

read_csv() method

import pandas as pd
import requests
from io import StringIO

url = "https://github.com/datablist/sample-csv-files/raw/main/files/organizations/organizations-100.csv"

response = requests.get(url)
response_status = response.status_code

if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    print(df)

else:
    print(f"Failed to fetch data from given url. Status code: {response_status}")
OUTPUT
  Index Organization Id  Name ...                    Founded Industry                    Number of employees
0 1     FAB0d41d5b5d22c  Ferrell LLC ...             1990    Plastics                     3498
1 2     6A7EdDEA9FaDC52  Mckinney, Riley and Day ... 2015    Glass / Ceramics / Concrete  4952
2 3     0bFED1ADAE4bcC1  Hester Ltd ...              1971    Public Safety                5287
3 4     2bFC1Be8a4ce42f  Holder-Sellers ...          2004    Automotive                   921
4 5     9eE8A6a4Eb96C24  Mayer Group ...             1991    Transportation               7870
.. ... ... ... ... ... ... ...
95 96   0a0bfFbBbB8eC7c  Holmes Group ...            1975    Photography                  2988
96 97   BA6Cd9Dae2Efd62  Good Ltd ...                1971    Consumer Services            4292
97 98   E7df80C60Abd7f9  Clements-Espinoza ...       1991    Broadcast Media              236
98 99   AFc285dbE2fEd24  Mendez Inc ...              1993    Education Management         339
99 100  e9eB5A60Cef8354  Watkins-Kaiser ...          2009    Financial Services           2785

[100 rows x 9 columns]
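As a side note, pd.read_csv() also accepts a file path or a URL directly, so the requests call above is only one way to fetch the data. A minimal sketch, assuming the file or URL is reachable:

import pandas as pd

# read_csv() can read straight from a URL ...
df = pd.read_csv("https://github.com/datablist/sample-csv-files/raw/main/files/organizations/organizations-100.csv")

# ... or from a local file, optionally with an explicit separator and header row
# df = pd.read_csv("organizations-100.csv", sep=',', header=0)

print(df.shape)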

 

2. head()

In Pandas, the head() method is used to display the first n rows of a DataFrame. It is useful for quickly checking the structure and content of the DataFrame. In the example below, we fetch the first 10 rows of the dataset. Make the modification below to the previous example file.

Syntax

head(n)

 

head() method

if response_status == 200:

    data = response.text
    organization_data = StringIO(data)

    df = pd.read_csv(organization_data)
    first10 = df.head(10)
    print(first10)
OUTPUT
  Index Organization Id  Name ...                    Founded Industry                      Number of employees
0 1     FAB0d41d5b5d22c  Ferrell LLC ...             1990    Plastics                      3498
1 2     6A7EdDEA9FaDC52  Mckinney, Riley and Day ... 2015    Glass / Ceramics / Concrete   4952
2 3     0bFED1ADAE4bcC1  Hester Ltd ...              1971    Public Safety                 5287
3 4     2bFC1Be8a4ce42f  Holder-Sellers ...          2004    Automotive                    921
4 5     9eE8A6a4Eb96C24  Mayer Group ...             1991    Transportation                7870
5 6     cC757116fe1C085  Henry-Thompson ...          1992    Primary / Secondary Education 4914
6 7     219233e8aFF1BC3  Hansen-Everett ...          2018    Publishing Industry           7832
7 8     ccc93DCF81a31CD  Mcintosh-Mora ...           1970    Import / Export               4389
8 9     0B4F93aA06ED03e  Carr Inc ...                1996    Plastics                      8167
9 10    738b5aDe6B1C6A5  Gaines Inc ...              1997    Outsourcing / Offshoring      9698
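If no argument is passed, head() returns the first 5 rows by default, and a negative n returns every row except the last n. A small sketch using the same df as above:

# With no argument, head() returns the first 5 rows
print(df.head())

# A negative n drops the last n rows instead (here 100 - 95 = 5 rows remain)
print(df.head(-95))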

 

3. describe()

In Pandas, the describe() method generates descriptive statistics of the DataFrame, including measures of central tendency, dispersion, and the shape of the distribution. We fetch the statistical summary of the dataset using describe(). Make the code change below in the previous example file.

Syntax

describe()

 

describe() method

if response_status == 200:

    data = response.text
    organization_data = StringIO(data)

    df = pd.read_csv(organization_data)
    print(df.describe())  # Use pandas describe() function to display summary statistics
OUTPUT
            Index      Founded  Number of employees
count  100.000000   100.000000           100.000000
mean    50.500000  1995.410000          4964.860000
std     29.011492    15.744228          2850.859799
min      1.000000  1970.000000           236.000000
25%     25.750000  1983.500000          2741.250000
50%     50.500000  1995.000000          4941.500000
75%     75.250000  2010.250000          7558.000000
max    100.000000  2021.000000          9995.000000
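By default, describe() summarizes only the numeric columns. Passing include='object' (or include='all') summarizes the text columns as well, reporting count, unique, top, and freq. A short sketch on the same df:

# Summary of the text (object) columns only
print(df.describe(include='object'))

# Summary of every column, numeric and non-numeric
print(df.describe(include='all'))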

 

4. drop()

In Pandas, the drop() method is used to remove specified labels from rows or columns of a DataFrame. It returns a new DataFrame with the specified labels dropped. We use drop() to remove the ‘Founded’ and ‘Number of employees’ columns from the dataset. Make the code changes below in the previous example file.

Syntax

drop(labels=None, axis=0, index=None, columns=None, inplace=False)

 

drop() method

if response_status == 200:

    data = response.text
    organization_data = StringIO(data)

    df = pd.read_csv(organization_data)
    print(f"These are the columns in Dataset = {df.columns}")

    dropColumn = df.drop(['Founded', 'Number of employees'], axis=1)
    print(f"Dataset after deleting requested Columns:\n {dropColumn}")
OUTPUT
These are the columns in Dataset = Index(['Index', 'Organization Id', 'Name', 'Website', 'Country', 'Description',
'Founded', 'Industry', 'Number of employees'],
dtype='object')
Dataset after deleting requested Columns:
   Index Organization Id ... Description                                     Industry
0  1     FAB0d41d5b5d22c ... Horizontal empowering knowledgebase             Plastics
1  2     6A7EdDEA9FaDC52 ... User-centric system-worthy leverage             Glass / Ceramics / Concrete
2  3     0bFED1ADAE4bcC1 ... Switchable scalable moratorium                  Public Safety
3  4     2bFC1Be8a4ce42f ... De-engineered systemic artificial intelligence  Automotive
4  5     9eE8A6a4Eb96C24 ... Synchronized needs-based challenge              Transportation
.. ...   ...             ... ...                                             ...
95 96    0a0bfFbBbB8eC7c ... Right-sized zero tolerance focus group          Photography
96 97    BA6Cd9Dae2Efd62 ... Reverse-engineered composite moratorium         Consumer Services
97 98    E7df80C60Abd7f9 ... Progressive modular hub                         Broadcast Media
98 99    AFc285dbE2fEd24 ... User-friendly exuding migration                 Education Management
99 100   e9eB5A60Cef8354 ... Synergistic background access                   Financial Services

[100 rows x 7 columns]
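The same result can also be written with the columns= keyword instead of axis=1, and rows can be dropped by their index labels with index=. A small sketch on the same df:

# Drop columns by name (equivalent to axis=1)
dropColumn = df.drop(columns=['Founded', 'Number of employees'])

# Drop the rows with index labels 0 and 1
dropRows = df.drop(index=[0, 1])
print(dropRows.shape)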

 

5. loc[]

In Pandas, the loc[] method is used for label-based indexing. It is used to access a group of rows and columns by labels or a Boolean array. We fetch the record at index 2 using df.loc[2]. To print the value of ‘Industry’ for the record at index 2, we use df.loc[2, 'Industry']. Make the code changes below in the previous example file.

Syntax

loc[row_indexer, column_indexer]

 

loc[] method

if response_status == 200:
    data = response.text
    organization_data = StringIO(data)

    df = pd.read_csv(organization_data)
    label = df.loc[2]

    print(f"Fetching record at index 2: {label}")
    label = df.loc[2, 'Industry']

    print(f"\nFetching type of Industry for record at index 2: {label}")
OUTPUT
Fetching record at index 2: Index 3
Organization Id 0bFED1ADAE4bcC1
Name Hester Ltd
Website http://sullivan-reed.com/
Country China
Description Switchable scalable moratorium
Founded 1971
Industry Public Safety
Number of employees 5287
Name: 2, dtype: object

Fetching type of Industry for record at index 2: Public Safety
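loc[] also accepts Boolean arrays, which makes it handy for filtering rows by a condition while selecting specific columns. A sketch on the same df (china_orgs is just an illustrative variable name):

# All organizations based in China, showing only the Name and Industry columns
china_orgs = df.loc[df['Country'] == 'China', ['Name', 'Industry']]
print(china_orgs)

# Label slicing with loc[] is inclusive of the end label (rows 2, 3 and 4)
print(df.loc[2:4, 'Name'])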

 

6. iloc[]

In Pandas, the iloc[] method is used for integer-location based indexing. It is used to access a group of rows and columns by integer positions. We fetch the record at position 2 using df.iloc[2]. To fetch the value at row 2, column 4 (the ‘Country’ column), we use df.iloc[2, 4]. Make the code changes below in the previous example file.

Syntax

iloc[row_indexer, column_indexer]

 

iloc[] method

if response_status == 200:

    data = response.text
    organization_data = StringIO(data)

    df = pd.read_csv(organization_data)

    label = df.iloc[2]
    print(f"Fetching record at index 2: {label}")

    label = df.iloc[2, 4]
    print(f"\nFetching value at row 2, column 4 (Country): {label}")
OUTPUT
Fetching record at index 2: Index 3
Organization Id 0bFED1ADAE4bcC1
Name Hester Ltd
Website http://sullivan-reed.com/
Country China
Description Switchable scalable moratorium
Founded 1971
Industry Public Safety
Number of employees 5287
Name: 2, dtype: object

Fetching value at row 2, column 4 (Country): China
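Unlike loc[], iloc[] works purely with integer positions and excludes the end of a slice, just like standard Python slicing. A small sketch on the same df:

# First three rows and first four columns by position
print(df.iloc[0:3, 0:4])

# The last row of the DataFrame
print(df.iloc[-1])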

 

7. groupby()

In Pandas, the groupby() method is used to split the data into groups based on some criteria. It is often followed by an aggregation function that performs a computation on each group. We use groupby() to group the dataset by the ‘Name’ and ‘Number of employees’ columns and count the rows in each group. Make the modification below to the previous example file.

Syntax

groupby(by=None, axis=0)

 

groupby() method

if response_status == 200:

    data = response.text
    organization_data = StringIO(data)

    df = pd.read_csv(organization_data)

    grouped_data = df.groupby(['Name', 'Number of employees']).size().reset_index(name='Count')
    print(grouped_data)
OUTPUT
  Name                           Number of employees Count
0 Arroyo Inc                     9067                 1
1 Ayala LLC                      7664                 1
2 Baker, Mccann and Macdonald    1638                 1
3 Bartlett-Arroyo                3987                 1
4 Beasley, Greene and Mahoney    869                  1
.. ...                           ...                 ...
95 Walls LLC                     1678                 1
96 Walton-Barnett                1746                 1
97 Watkins-Kaiser                2785                 1
98 Weiss and Sons                5984                 1
99 Wilkinson, Charles and Arroyo 602                  1

[100 rows x 3 columns]
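Because ‘Name’ is effectively unique in this dataset, every group above has a count of 1. In practice, groupby() is usually paired with an aggregation over a key that repeats, for example averaging employees per industry. A sketch on the same df (avg_by_industry is just an illustrative variable name):

# Average number of employees per industry, largest first
avg_by_industry = (
    df.groupby('Industry')['Number of employees']
      .mean()
      .sort_values(ascending=False)
)
print(avg_by_industry.head())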

 

8. merge()

In Pandas, the merge() method is used to combine two or more DataFrames based on a common column or index. It performs database-like join operations. We use a different example to demonstrate merge(): we declare two datasets, dataset1 and dataset2, and merge them using merge() as shown below.

Syntax

merge(left, right, how='inner', on=None)

 

merge() method

import pandas as pd

dataset1 = {

    'NumID': [1, 2, 3],
    'Name': ['One', 'Two', 'Three']
}

dataset2 = {

    'NumID': [1, 2, 3],
    'Value': [100, 200, 300]
}

df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)

mergedDF = pd.merge(df1, df2, on='NumID', how='outer')
print(f"Merged Dataframe:\n {mergedDF}")

OUTPUT

Merged Dataframe:
  NumID Name  Value
0 1      One   100
1 2      Two   200
2 3      Three 300
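The how= argument controls the join type ('inner', 'outer', 'left' or 'right'), which matters when the keys do not fully overlap. A hypothetical sketch with two small DataFrames whose NumID values only partly match:

import pandas as pd

left = pd.DataFrame({'NumID': [1, 2, 3], 'Name': ['One', 'Two', 'Three']})
right = pd.DataFrame({'NumID': [2, 3, 4], 'Value': [200, 300, 400]})

# inner keeps only the matching keys (2 and 3);
# outer keeps all keys (1 to 4) and fills the gaps with NaN
print(pd.merge(left, right, on='NumID', how='inner'))
print(pd.merge(left, right, on='NumID', how='outer'))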

 

9. isnull()

In Pandas, the isnull() method is used to detect missing or null values in a DataFrame. It returns a DataFrame of the same shape as the input, where each element is a Boolean value indicating whether the corresponding element in the original DataFrame was null. We use isnull() to check whether there are any missing or null values in our dataset. If any are present, the records containing them are printed; otherwise the message ‘No records with null values found.’ is printed.

Syntax

isnull()

 

isnull() method

if response_status == 200:

    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)

    # Check for null values in the DataFrame
    if df.isnull().values.any():
        print(df[df.isnull().any(axis=1)])

    else:
        print("No records with null values found.")
OUTPUT
No records with null values found.
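A common follow-up is to count the missing values per column instead of listing the affected rows. A short sketch on the same df:

# Number of null values in each column (all zeros for this dataset)
print(df.isnull().sum())

# Total number of null values in the whole DataFrame
print(df.isnull().sum().sum())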

 

10. fillna()

In Pandas, the fillna() method is used to fill missing or NaN (null) values in a DataFrame with specified values, or by using a method like forward fill or backward fill. In the example below, we fill all of the None values in dataset1 with ‘Unknown’.

Syntax

fillna(value=None, method=None, axis=None, inplace=False)

 

fillna() method

import pandas as pd

dataset1 = {

    'NumID': [1, 2, 3],
    'Name': ['One', None, None]
}

dataset2 = {

    'NumID': [1, 2, 3],
    'Value': [100, 200, 300]
}

df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)

# Filling missing values in the 'Name' column with 'Unknown'
df1['Name'] = df1['Name'].fillna('Unknown')

mergedDF = pd.merge(df1, df2, on='NumID', how='outer')
print(f"Merged Dataframe with Filled Missing Values:\n {mergedDF}")
OUTPUT
Merged Dataframe with Filled Missing Values:
  NumID Name    Value
0 1     One     100
1 2     Unknown 200
2 3     Unknown 300
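fillna() can also take a computed value such as a column mean, and the related ffill()/bfill() methods propagate the previous or next valid value. A hypothetical sketch with a small numeric column (prices is an illustrative name, not part of the dataset above):

import pandas as pd

prices = pd.DataFrame({'Price': [100.0, None, 300.0, None]})

# Replace missing prices with the column mean
filled_mean = prices['Price'].fillna(prices['Price'].mean())

# Forward-fill: carry the last valid observation forward
filled_ffill = prices['Price'].ffill()

print(filled_mean)
print(filled_ffill)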

 

Summary

In this tutorial, we walked through ten Pandas methods that Data Scientists reach for constantly: read_csv(), head(), describe(), drop(), loc[], iloc[], groupby(), merge(), isnull() and fillna(). Together they cover loading, inspecting, cleaning, selecting, grouping and combining tabular data.

Reference: pandas.pydata.org
