๐ Day 23 : Pandas Introduction
๐ฏ Enterprise Objective
Pandas is the industry standard for tabular data manipulation. It brings SQL-like power to Python in memory. Today we learn how to create DataFrames, load external files, perform Exploratory Data Analysis (EDA), and execute vectorized column math.
๐ Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | DataFrames | Tables & Series |
| 2 | EDA | info(), describe() |
| 3 | Operations | Column math & renaming |
1. Pandas Series & DataFrames : Data Structures
Pandas is the ultimate tool for tabular data (like Excel/SQL in Python). It is built on top of NumPy. A Series is a 1D column with row labels (an index). A DataFrame is a 2D table composed of multiple Series that share the same index.
import pandas as pd
# Creating a DataFrame from a dictionary
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
๐ผ Why Data Analysts Care
โข SQL Replacement: Pandas can perform JOINs, GROUP BYs, and aggregations directly in Python memory
โข Data Cleaning: Pandas has hundreds of built-in methods to handle missing data and transform formats
โ ๏ธ Looping over DataFrames
for loop to iterate over rows in a DataFrame (e.g., `iterrows()`). It destroys Pandas' vectorized performance. Always use column-level vectorized math or `.apply()`.
๐งช Concept Checks: DataFrames
Q1. Import pandas as pd. Create a Series from [10, 20, 30] and print it. Notice the index column.
Q2. Create a DataFrame from a dictionary of lists: {"Product": ["A", "B"], "Price": [10.5, 20.0]}. Print it.
Q3. Create a DataFrame from a list of dictionaries (JSON style): [{"A": 1, "B": 2}, {"A": 3, "B": 4}]. Print it.
Q4. Extract the "Price" column from the Q2 DataFrame and print its type(). It should be a Series.
Q5. Print the df.index and df.columns attributes of your DataFrame.
2. Reading & Exploring Data : I/O and EDA
In the real world, you don't create DataFrames by hand; you read them from CSVs, SQL, or JSON. Once loaded, you use Exploratory Data Analysis (EDA) methods to understand the shape, data types, and missing values in your dataset.
| Method | Purpose |
|---|---|
pd.read_csv() | Load data from a CSV file |
df.head(n) | View the first n rows (default 5) |
df.info() | Check column types and missing (Null) values |
df.describe() | Summary statistics (mean, min, max) for numeric columns |
df.shape | Tuple of (rows, columns) |
๐ผ Why Data Analysts Care
โข Initial Audit: df.info() is always the first command you run to see if your numeric columns accidentally loaded as strings
โข Data Distribution: df.describe() instantly shows if you have massive outliers in your data
๐ง Pro Tip
If you have a very wide DataFrame, df.head() will truncate columns. You can fix this by running pd.set_option('display.max_columns', None).
๐งช Concept Checks: Exploring Data
Q1. Write the theoretical command to read a file named "sales_data.csv" into a DataFrame df.
Q2. Create a random DataFrame with 20 rows. Print df.head() and df.tail(3).
Q3. Run df.info() on your DataFrame. What information does the Non-Null Count column provide?
Q4. Run df.describe(). What is the 50% row representing? (Hint: The Median).
Q5. Extract the total number of rows from df.shape and print "Total rows: [N]".
3. Basic Column Operations : Vectorized Math
Because Pandas is built on NumPy, you can perform math on entire columns instantly without looping. You can easily create new columns by calculating combinations of existing columns.
# Creating a new column
df['Profit'] = df['Revenue'] - df['Cost']
๐ผ Why Data Analysts Care
โข Feature Engineering: Creating a Price_Per_Unit column by dividing Total_Price by Quantity
โข Date Math: Calculating days since a purchase by subtracting Purchase_Date from Today
๐ง Pro Tip
To drop a column, use df.drop(columns=['ColName']). Remember that Pandas methods usually return a NEW DataFrame. To save it, overwrite the variable: df = df.drop(...).
๐งช Concept Checks: Column Math
Q1. Given df with Cost and Revenue, create a new column Profit = Revenue - Cost.
Q2. Create a column Margin which is Profit / Revenue. Print the DataFrame.
Q3. Drop the Cost column. Ensure you save the result back to df or a new variable.
Q4. Add 100 to every value in the Revenue column using df["Revenue"] = df["Revenue"] + 100.
Q5. Rename the column "Revenue" to "Total_Sales" using df.rename(columns={"Old": "New"}).
๐ ๏ธ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Dictionary to DF): Create a dictionary containing data for 5 employees (Name, Department, Salary). Convert it to a Pandas DataFrame. Print the DataFrame.
Task 2 (Summary Stats): Create a DataFrame with 1000 rows of random numbers (np.random.rand(1000)). Use df.describe() to find the mean and standard deviation. Print them.
Task 3 (Currency Conversion): Given a DataFrame with a Price_USD column, create a new column Price_EUR assuming an exchange rate of 0.85. Drop the original USD column.
Task 4 (Boolean Column): Given a DataFrame with a Score column (0-100), create a new boolean column Passed which is True if Score >= 60, and False otherwise.
Task 5 (CSV Simulation): Use pathlib to write a small CSV string to data.csv. Then use pd.read_csv('data.csv') to load it into a DataFrame and print df.head().
๐ป Pure Coding Interview Questions
Q1.
What is the difference between a Pandas Series and a Pandas DataFrame?
Q2.
Why is Pandas built on top of NumPy? What benefits does it provide?
Q3.
Explain why iterating over a DataFrame with iterrows() is considered an anti-pattern.
Q4.
What does df.info() show that df.describe() does not?
Q5.
How do you read an Excel file in Pandas? What dependency is required? (openpyxl).
Q6.
What is the inplace=True argument? Why is the Pandas team discouraging its use in newer versions?
Q7.
How do you change the data type of a column from string to integer? (astype).
Q8.
Write code to rename multiple columns in a DataFrame using a dictionary.
Q9.
What happens if you try to add a new column using dot notation df.NewCol = 10 instead of bracket notation df['NewCol'] = 10?
Q10.
How do you write a DataFrame back to a CSV file without including the index column? (index=False).
Q11.
Explain the difference between df['A'] and df[['A']] in terms of the object type returned.
Q12.
How do you sample a random 10% of your DataFrame? (df.sample(frac=0.1)).
Q13.
What is the Pandas Index? How is it different from a regular column?
Q14.
How do you set a specific column to be the index of the DataFrame? (set_index).
Q15.
Write code to drop multiple columns at once.
Q16.
How do you check memory usage of a DataFrame? (df.info(memory_usage='deep')).
Q17.
Explain how Pandas handles missing data natively. What object represents a missing number?
Q18.
Write a vectorized operation that squares the values in Column A and adds them to Column B.
Q19.
How do you read data from a SQL database directly into a Pandas DataFrame? (read_sql).
Q20.
What is a categorical data type in Pandas, and when should you use it to save memory?
Q21.
How do you apply a custom Python function to an entire column? (.apply()).
Q22.
Explain the difference between .map() and .apply() on a Series.
Q23.
How do you read a JSON file into Pandas where the records are nested deeply?
Q24.
Write code to extract all the column names of a DataFrame into a Python list.
Q25.
What is pd.to_datetime() used for?
๐ Day 23 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | Structures | DataFrames are collections of Series sharing an index |
| 2 | I/O | Use pd.read_csv() to load data, df.info() to check it |
| 3 | Math | df['New'] = df['A'] + df['B'] calculates row-by-row instantly |
โ Instructor's End-of-Day Checklist
โข [ ] I can create a DataFrame from a dictionary.
โข [ ] I can check data types and nulls using df.info().
โข [ ] I can create new calculated columns.