โณ Loading Python Engine...

๐Ÿ“Š Day 25 : Pandas Cleaning

๐ŸŽฏ Enterprise Objective

Data Analysts spend 80% of their time cleaning data. Missing values crash algorithms. Duplicates ruin financial aggregations. Strings masquerading as numbers prevent math. Today we master the unglamorous but critical art of Data Cleaning.

๐Ÿ“‹ Strategic Overview

#TopicConcept
1Missing Dataisna(), fillna()
2Duplicatesdrop_duplicates()
3Typesastype(), .str accessor

1. Handling Missing Data : The Reality of Data

๐Ÿ” What is it?

Real-world data is messy. Missing values are represented as NaN (Not a Number) or None. Pandas provides methods to detect (isna()), remove (dropna()), or replace (fillna()) these missing values.

MethodActionCommon Use
df.isna()Returns Boolean DFIdentifying where data is missing
df.dropna()Drops rows/colsRemoving rows where critical data is null
df.fillna(val)Replaces NaNsReplacing missing ages with the median age

๐Ÿ’ผ Why Data Analysts Care

โ€ข Data Quality: Machine Learning models will crash if you feed them NaN values. You must handle them first!

โ€ข Imputation: Filling missing salaries with the average salary of the employee's department

โš ๏ธ Blindly Dropping

Never run df.dropna() without thinking. If 30% of your data has a missing 'Phone Number', dropping those rows destroys 30% of your valid 'Email' and 'Name' data. Drop the column, or fill it with 'Unknown'.

In [ ]:

๐Ÿงช Concept Checks: Missing Data

Q1. Run df.isna().sum() on your DataFrame to see which columns have missing values.

In [ ]:

Q2. Use df.dropna(subset=["Age"]) to drop ONLY rows where "Age" is missing. Print it.

In [ ]:

Q3. Fill all NaN values in the entire DataFrame with "Unknown" using df.fillna("Unknown").

In [ ]:

Q4. Fill the "Salary" column with the median salary of that column using df["Salary"].fillna(...).

In [ ]:

Q5. Use df.ffill() (forward fill) on a Series with a missing value. What does it do?

In [ ]:

2. Handling Duplicates : Deduplication

๐Ÿ” What is it?

Duplicate rows can skew aggregations and ruin analysis. df.duplicated() returns a boolean mask of duplicate rows, and df.drop_duplicates() removes them. You can specify whether to keep the 'first' occurrence, the 'last', or False (drop all duplicates completely).

# Keep only the first occurrence of a User_ID
df.drop_duplicates(subset=['User_ID'], keep='first')

๐Ÿ’ผ Why Data Analysts Care

โ€ข ETL Pipelines: When pulling data daily, you might accidentally pull the same transactions twice. Deduplication fixes this

โ€ข Latest Records: Sorting by 'Update_Date' and keeping the 'last' duplicate ensures you only have the most recent data for each user

๐Ÿง  Pro Tip

Always sort your DataFrame BEFORE dropping duplicates if you want to keep the highest/lowest/newest value. E.g., Sort by Date ascending, then drop_duplicates(keep='last').

In [ ]:

๐Ÿงช Concept Checks: Duplicates

Q1. Create a DataFrame with 3 identical rows. Run df.duplicated() to see the boolean mask.

In [ ]:

Q2. Run df.drop_duplicates() on the DataFrame from Q1 to clean it.

In [ ]:

Q3. Given a DF with ID and Score, sort by Score descending, then drop duplicates on ID keeping the first.

In [ ]:

Q4. What happens if you use keep=False in drop_duplicates()? Try it.

In [ ]:

Q5. Count the number of duplicate IDs by chaining df.duplicated(subset=["ID"]).sum().

In [ ]:

3. Data Type Conversion : Fixing Bad Formatting

๐Ÿ” What is it?

When reading from CSVs, numbers might be loaded as strings (e.g., '1,000'). You cannot perform math on strings! The .astype() method converts columns to proper types. For strings requiring cleaning, the .str accessor is your best friend.

MethodActionExample
astype(type)Cast to typedf['Age'].astype(int)
pd.to_numeric()Safe castingpd.to_numeric(df['Price'], errors='coerce')
.str.replace()String cleanupdf['Price'].str.replace('$', '')

๐Ÿ’ผ Why Data Analysts Care

โ€ข Currency Parsing: Converting '$1,200.50' into the float 1200.50 so you can sum it

โ€ข Memory Optimization: Downcasting an int64 column to int8 to save 8x the RAM on a billion rows

๐Ÿง  Pro Tip

If .astype(float) fails because of a weird string like 'N/A', use pd.to_numeric(col, errors='coerce'). It will force the bad strings into NaN so you can proceed!

In [ ]:

๐Ÿงช Concept Checks: Data Types

Q1. Given df["Age"] = ["25", "30"] (strings), cast it to integer using .astype(int).

In [ ]:

Q2. Given df["Cost"] = ["1,000", "2,500"], remove the comma using .str.replace(",", "").

In [ ]:

Q3. Cast the cleaned Cost column to float.

In [ ]:

Q4. Use pd.to_numeric(..., errors="coerce") on ["10", "bad", "20"]. Print the result.

In [ ]:

Q5. Check the memory usage of a column using df["Col"].memory_usage(). Cast it to float32 and check again.

In [ ]:

๐Ÿ› ๏ธ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (The Complete Pipeline): Create a dirty DataFrame: {'ID': [1,2,2,3], 'Price': ['$10', '$20', '$20', 'NaN'], 'Qty': [1,2,np.nan,3]}. Write a 4-step pipeline: 1) Drop exact duplicates. 2) Clean Price '$' and cast to float. 3) Fill missing Qty with 1. 4) Calculate Total = Price * Qty.

In [ ]:

Task 2 (Threshold Dropping): Create a DF with 5 rows and 3 columns of NaNs/data. Use df.dropna(thresh=2) to drop rows that do not have AT LEAST 2 valid non-NaN values. Print it.

In [ ]:

Task 3 (Categorical Conversion): Given a column df['Size'] = ['S', 'M', 'L', 'S', 'M'] with 100,000 rows. Convert its type to 'category' using .astype('category'). Check df.info() to see memory savings.

In [ ]:

Task 4 (String Extraction): Given df['Code'] = ['Item-123', 'Item-456']. Use .str.split('-').str[1] to extract just the numbers. Cast them to integers.

In [ ]:

Task 5 (Interpolation): Given a Time Series df['Temp'] = [20, np.nan, np.nan, 26]. Use df['Temp'].interpolate() to fill the missing values with a linear progression. Print it.

In [ ]:

๐Ÿ’ป Pure Coding Interview Questions

Q1.

Explain the difference between NaN, None, and NaT in Pandas.

In [ ]:

Q2.

What does the errors='coerce' argument do in pd.to_numeric()?

In [ ]:

Q3.

How do you drop columns that contain MORE than 50% missing values?

In [ ]:

Q4.

Explain the thresh parameter in df.dropna().

In [ ]:

Q5.

Why might filling missing values with the Mean be a bad idea for highly skewed data?

In [ ]:

Q6.

Write code to fill missing values in column 'A' with the median of column 'A'.

In [ ]:

Q7.

How do you perform a Forward Fill (ffill)? In what scenario (e.g., time series) is it useful?

In [ ]:

Q8.

What is the difference between df.drop_duplicates() and df.duplicated()?

In [ ]:

Q9.

Write code to find the number of exact duplicate rows in a DataFrame.

In [ ]:

Q10.

How do you keep the LAST occurrence of a duplicate based on a specific subset of columns?

In [ ]:

Q11.

Explain how converting a string column with low cardinality to category saves memory.

In [ ]:

Q12.

Write code to remove all whitespace from the beginning and end of a string column using .str.strip().

In [ ]:

Q13.

How do you extract a substring from a column using a regular expression in Pandas? (.str.extract()).

In [ ]:

Q14.

What happens if you use .astype(int) on a column that contains NaN values? (Hint: It fails).

In [ ]:

Q15.

Explain the Int64 (capital I) nullable integer data type introduced in newer Pandas versions.

In [ ]:

Q16.

Write code to replace all negative values in a DataFrame with NaN using df.where() or np.where().

In [ ]:

Q17.

How do you rename the index of a DataFrame?

In [ ]:

Q18.

What is Data Imputation? Name two advanced methods beyond simple mean/median filling (e.g., KNN, Regression).

In [ ]:

Q19.

Write code to convert a column containing 'Yes'/'No' strings into boolean True/False.

In [ ]:

Q20.

How do you parse a column of dates in the format 'YYYY/MM/DD' into Pandas datetime objects?

In [ ]:

Q21.

Explain the inplace=True argument. Why is it generally avoided by Pandas core devs?

In [ ]:

Q22.

Write code to calculate the percentage of missing values in every column of a DataFrame.

In [ ]:

Q23.

How do you apply a custom data-cleaning function to every element in a single column?

In [ ]:

Q24.

Explain the difference between .map(), .apply(), and .applymap() in Pandas.

In [ ]:

Q25.

What is the performance implication of using .str accessors on a column of 10 million rows?

In [ ]:

๐Ÿ“Š Day 25 Executive Summary

#TopicKey Takeaway
1NaNsNever drop NaNs blindly. Think about why they are missing.
2DedupeSort by date first to ensure you keep the latest record.
3CastingUse pd.to_numeric(..., errors='coerce') for messy strings.

โœ… Instructor's End-of-Day Checklist

โ€ข [ ] I can find and fill missing values.

โ€ข [ ] I can drop duplicates while keeping the latest record.

โ€ข [ ] I can clean strings and cast them to floats.