Day 23: Pandas Introduction

🎯 Enterprise Objective

Pandas is the industry standard for tabular data manipulation. It brings SQL-like power to Python in memory. Today we learn how to create DataFrames, load external files, perform Exploratory Data Analysis (EDA), and execute vectorized column math.

📋 Strategic Overview

#	Topic	Concept
1	DataFrames	Tables & Series
2	EDA	`info()`, `describe()`
3	Operations	Column math & renaming

1. Pandas Series & DataFrames : Data Structures

🔍 What is it?
Pandas is the ultimate tool for tabular data (like Excel/SQL in Python). It is built on top of NumPy. A Series is a 1D column with row labels (an index). A DataFrame is a 2D table composed of multiple Series that share the same index.

import pandas as pd
# Creating a DataFrame from a dictionary
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

💼 Why Data Analysts Care

• SQL Replacement: Pandas can perform JOINs, GROUP BYs, and aggregations directly in Python memory

• Data Cleaning: Pandas has hundreds of built-in methods to handle missing data and transform formats

⚠️ Looping over DataFrames

Never use a for loop to iterate over rows in a DataFrame (e.g., `iterrows()`). It destroys Pandas' vectorized performance. Always use column-level vectorized math or `.apply()`.

In [ ]:

🧪 Concept Checks: `DataFrames`

Q1. Import pandas as pd. Create a Series from [10, 20, 30] and print it. Notice the index column.

In [ ]:

Q2. Create a DataFrame from a dictionary of lists: {"Product": ["A", "B"], "Price": [10.5, 20.0]}. Print it.

In [ ]:

Q3. Create a DataFrame from a list of dictionaries (JSON style): [{"A": 1, "B": 2}, {"A": 3, "B": 4}]. Print it.

In [ ]:

Q4. Extract the "Price" column from the Q2 DataFrame and print its type(). It should be a Series.

In [ ]:

Q5. Print the df.index and df.columns attributes of your DataFrame.

In [ ]:

2. Reading & Exploring Data : I/O and EDA

🔍 What is it?

In the real world, you don't create DataFrames by hand; you read them from CSVs, SQL, or JSON. Once loaded, you use Exploratory Data Analysis (EDA) methods to understand the shape, data types, and missing values in your dataset.

Method	Purpose
`pd.read_csv()`	Load data from a CSV file
`df.head(n)`	View the first n rows (default 5)
`df.info()`	Check column types and missing (Null) values
`df.describe()`	Summary statistics (mean, min, max) for numeric columns
`df.shape`	Tuple of (rows, columns)

💼 Why Data Analysts Care

• Initial Audit: df.info() is always the first command you run to see if your numeric columns accidentally loaded as strings

• Data Distribution: df.describe() instantly shows if you have massive outliers in your data

🧠 Pro Tip

If you have a very wide DataFrame, df.head() will truncate columns. You can fix this by running pd.set_option('display.max_columns', None).

In [ ]:

🧪 Concept Checks: `Exploring Data`

Q1. Write the theoretical command to read a file named "sales_data.csv" into a DataFrame df.

In [ ]:

Q2. Create a random DataFrame with 20 rows. Print df.head() and df.tail(3).

In [ ]:

Q3. Run df.info() on your DataFrame. What information does the Non-Null Count column provide?

In [ ]:

Q4. Run df.describe(). What is the 50% row representing? (Hint: The Median).

In [ ]:

Q5. Extract the total number of rows from df.shape and print "Total rows: [N]".

In [ ]:

3. Basic Column Operations : Vectorized Math

🔍 What is it?

Because Pandas is built on NumPy, you can perform math on entire columns instantly without looping. You can easily create new columns by calculating combinations of existing columns.

# Creating a new column
df['Profit'] = df['Revenue'] - df['Cost']

💼 Why Data Analysts Care

• Feature Engineering: Creating a Price_Per_Unit column by dividing Total_Price by Quantity

• Date Math: Calculating days since a purchase by subtracting Purchase_Date from Today

🧠 Pro Tip

To drop a column, use df.drop(columns=['ColName']). Remember that Pandas methods usually return a NEW DataFrame. To save it, overwrite the variable: df = df.drop(...).

In [ ]:

🧪 Concept Checks: `Column Math`

Q1. Given df with Cost and Revenue, create a new column Profit = Revenue - Cost.

In [ ]:

Q2. Create a column Margin which is Profit / Revenue. Print the DataFrame.

In [ ]:

Q3. Drop the Cost column. Ensure you save the result back to df or a new variable.

In [ ]:

Q4. Add 100 to every value in the Revenue column using df["Revenue"] = df["Revenue"] + 100.

In [ ]:

Q5. Rename the column "Revenue" to "Total_Sales" using df.rename(columns={"Old": "New"}).

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Dictionary to DF): Create a dictionary containing data for 5 employees (Name, Department, Salary). Convert it to a Pandas DataFrame. Print the DataFrame.

In [ ]:

Task 2 (Summary Stats): Create a DataFrame with 1000 rows of random numbers (np.random.rand(1000)). Use df.describe() to find the mean and standard deviation. Print them.

In [ ]:

Task 3 (Currency Conversion): Given a DataFrame with a Price_USD column, create a new column Price_EUR assuming an exchange rate of 0.85. Drop the original USD column.

In [ ]:

Task 4 (Boolean Column): Given a DataFrame with a Score column (0-100), create a new boolean column Passed which is True if Score >= 60, and False otherwise.

In [ ]:

Task 5 (CSV Simulation): Use pathlib to write a small CSV string to data.csv. Then use pd.read_csv('data.csv') to load it into a DataFrame and print df.head().

In [ ]:

💻 Pure Coding Interview Questions

Q1.

What is the difference between a Pandas Series and a Pandas DataFrame?

In [ ]:

Q2.

Why is Pandas built on top of NumPy? What benefits does it provide?

In [ ]:

Q3.

Explain why iterating over a DataFrame with `iterrows()` is considered an anti-pattern.

In [ ]:

Q4.

What does `df.info()` show that `df.describe()` does not?

In [ ]:

Q5.

How do you read an Excel file in Pandas? What dependency is required? (`openpyxl`).

In [ ]:

Q6.

What is the `inplace=True` argument? Why is the Pandas team discouraging its use in newer versions?

In [ ]:

Q7.

How do you change the data type of a column from string to integer? (`astype`).

In [ ]:

Q8.

Write code to rename multiple columns in a DataFrame using a dictionary.

In [ ]:

Q9.

What happens if you try to add a new column using dot notation `df.NewCol = 10` instead of bracket notation `df['NewCol'] = 10`?

In [ ]:

Q10.

How do you write a DataFrame back to a CSV file without including the index column? (`index=False`).

In [ ]:

Q11.

Explain the difference between `df['A']` and `df[['A']]` in terms of the object type returned.

In [ ]:

Q12.

How do you sample a random 10% of your DataFrame? (`df.sample(frac=0.1)`).

In [ ]:

Q13.

What is the Pandas Index? How is it different from a regular column?

In [ ]:

Q14.

How do you set a specific column to be the index of the DataFrame? (`set_index`).

In [ ]:

Q15.

Write code to drop multiple columns at once.

In [ ]:

Q16.

How do you check memory usage of a DataFrame? (`df.info(memory_usage='deep')`).

In [ ]:

Q17.

Explain how Pandas handles missing data natively. What object represents a missing number?

In [ ]:

Q18.

Write a vectorized operation that squares the values in Column A and adds them to Column B.

In [ ]:

Q19.

How do you read data from a SQL database directly into a Pandas DataFrame? (`read_sql`).

In [ ]:

Q20.

What is a categorical data type in Pandas, and when should you use it to save memory?

In [ ]:

Q21.

How do you apply a custom Python function to an entire column? (`.apply()`).

In [ ]:

Q22.

Explain the difference between `.map()` and `.apply()` on a Series.

In [ ]:

Q23.

How do you read a JSON file into Pandas where the records are nested deeply?

In [ ]:

Q24.

Write code to extract all the column names of a DataFrame into a Python list.

In [ ]:

Q25.

What is `pd.to_datetime()` used for?

In [ ]:

📊 Day 23 Executive Summary

#	Topic	Key Takeaway
1	Structures	DataFrames are collections of Series sharing an index
2	I/O	Use `pd.read_csv()` to load data, `df.info()` to check it
3	Math	`df['New'] = df['A'] + df['B']` calculates row-by-row instantly

✅ Instructor's End-of-Day Checklist

• [ ] I can create a DataFrame from a dictionary.

• [ ] I can check data types and nulls using df.info().

• [ ] I can create new calculated columns.

🚀 Continue to Day 24 - Pandas Selection: Loc, Iloc, and Filtering

📊 Day 23 : Pandas Introduction

🎯 Enterprise Objective

📋 Strategic Overview

1. Pandas Series & DataFrames : Data Structures

💼 Why Data Analysts Care

⚠️ Looping over DataFrames

🧪 Concept Checks: DataFrames

2. Reading & Exploring Data : I/O and EDA

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Exploring Data

3. Basic Column Operations : Vectorized Math

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Column Math

🛠️ Professional Practice Tasks

💻 Pure Coding Interview Questions

What is the difference between a Pandas Series and a Pandas DataFrame?

Why is Pandas built on top of NumPy? What benefits does it provide?

Explain why iterating over a DataFrame with iterrows() is considered an anti-pattern.

What does df.info() show that df.describe() does not?

How do you read an Excel file in Pandas? What dependency is required? (openpyxl).

What is the inplace=True argument? Why is the Pandas team discouraging its use in newer versions?

How do you change the data type of a column from string to integer? (astype).

Write code to rename multiple columns in a DataFrame using a dictionary.

What happens if you try to add a new column using dot notation df.NewCol = 10 instead of bracket notation df['NewCol'] = 10?

How do you write a DataFrame back to a CSV file without including the index column? (index=False).

Explain the difference between df['A'] and df[['A']] in terms of the object type returned.

How do you sample a random 10% of your DataFrame? (df.sample(frac=0.1)).

What is the Pandas Index? How is it different from a regular column?

How do you set a specific column to be the index of the DataFrame? (set_index).

Write code to drop multiple columns at once.

How do you check memory usage of a DataFrame? (df.info(memory_usage='deep')).

Explain how Pandas handles missing data natively. What object represents a missing number?

Write a vectorized operation that squares the values in Column A and adds them to Column B.

How do you read data from a SQL database directly into a Pandas DataFrame? (read_sql).

What is a categorical data type in Pandas, and when should you use it to save memory?

How do you apply a custom Python function to an entire column? (.apply()).

Explain the difference between .map() and .apply() on a Series.

How do you read a JSON file into Pandas where the records are nested deeply?

Write code to extract all the column names of a DataFrame into a Python list.

What is pd.to_datetime() used for?

📊 Day 23 Executive Summary

✅ Instructor's End-of-Day Checklist

🧪 Concept Checks: `DataFrames`

🧪 Concept Checks: `Exploring Data`

🧪 Concept Checks: `Column Math`

Explain why iterating over a DataFrame with `iterrows()` is considered an anti-pattern.

What does `df.info()` show that `df.describe()` does not?

How do you read an Excel file in Pandas? What dependency is required? (`openpyxl`).

What is the `inplace=True` argument? Why is the Pandas team discouraging its use in newer versions?

How do you change the data type of a column from string to integer? (`astype`).

What happens if you try to add a new column using dot notation `df.NewCol = 10` instead of bracket notation `df['NewCol'] = 10`?

How do you write a DataFrame back to a CSV file without including the index column? (`index=False`).

Explain the difference between `df['A']` and `df[['A']]` in terms of the object type returned.

How do you sample a random 10% of your DataFrame? (`df.sample(frac=0.1)`).

How do you set a specific column to be the index of the DataFrame? (`set_index`).

How do you check memory usage of a DataFrame? (`df.info(memory_usage='deep')`).

How do you read data from a SQL database directly into a Pandas DataFrame? (`read_sql`).

How do you apply a custom Python function to an entire column? (`.apply()`).

Explain the difference between `.map()` and `.apply()` on a Series.

What is `pd.to_datetime()` used for?