Day 20 : Capstone Project
🎯 Enterprise Objective
You have mastered Core Python. Today, you prove it. You will build a professional, object-oriented Log Analysis Pipeline that integrates File I/O, Generators, Regex, and JSON. This is the bridge between learning syntax and writing real-world software.
Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Architecture | System Design |
| 2 | Implementation | OOP + Regex |
| 3 | Execution | Generators + JSON |
1. Phase 1 Capstone : Python Core Mastery
You have completed the core Python phase. You understand data structures, control flow, functions, OOP, and file I/O. Today is a pure coding challenge day. No new theory, just application. You will build a complete, robust, object-oriented application from scratch.
🎯 The Mission: Log Analyzer System
You will build a system that reads messy server logs, parses them using Regex, cleans them using Comprehensions, models them using OOP, and exports a clean JSON report.
💼 Why Data Analysts Care
• Portfolio Building: This is a realistic engineering task you can put on your resume.
• Knowledge Integration: Forces you to use Lists, Dicts, Regex, OOP, and JSON together.
⚠️ Blank Page Syndrome
Don't freeze. Break the problem into tiny pieces. Write one small function at a time. Test it. Then move to the next.
🧪 Concept Checks: Capstone
Q1. Before coding, plan: What attributes should a LogEntry class have? (e.g., timestamp, level, message).
Q2. What regex pattern would you use to extract the timestamp, level, and message from a line like [2023-10-01] ERROR: Disk full?
Q3. How will you handle missing or corrupt lines in the file? (Hint: try/except).
Q4. How will you aggregate the data? (Hint: collections.Counter for counting error types).
Q5. Set up your working directory. Create a dummy server.log file with 10 lines of fake data to test against.
2. Capstone Architecture : System Design
A good system is modular. Let's design the architecture before writing the full implementation.
1. LogEntry(timestamp, level, message): A class that defines `__str__` for readable printing.
2. parse_line(line): A function or static method that takes a string, runs regex, and returns a LogEntry (or raises a ValueError if invalid).
3. process_file(path): A generator that opens the file safely and yields parsed LogEntry objects.
4. generate_report(entries): A function that takes the generator, counts error levels, and saves to report.json.
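The first two components in the list above can be sketched as follows. This is a minimal illustration, not a reference solution: the line format is assumed from the sample line [2023-10-01] ERROR: Disk full, and the pattern name LOG_PATTERN and its named groups are hypothetical choices.

```python
import re

# Assumption: lines look like "[2023-10-01] ERROR: Disk full".
# LOG_PATTERN and its group names are illustrative, not prescribed by the course.
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+(?P<level>[A-Z]+):\s*(?P<message>.*)"
)


class LogEntry:
    """One parsed log line."""

    def __init__(self, timestamp: str, level: str, message: str) -> None:
        self.timestamp = timestamp
        self.level = level
        self.message = message

    def __str__(self) -> str:
        # Human-readable form, used by print().
        return f"[{self.timestamp}] {self.level}: {self.message}"

    def __repr__(self) -> str:
        # Unambiguous form, useful while debugging.
        return f"LogEntry({self.timestamp!r}, {self.level!r}, {self.message!r})"


def parse_line(line: str) -> LogEntry:
    """Return a LogEntry, or raise ValueError if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"Unparseable log line: {line!r}")
    return LogEntry(**match.groupdict())
```

Calling parse_line("[2023-10-01] ERROR: Disk full") returns an entry whose level is "ERROR", while a line such as "garbage" raises ValueError, which leaves the decision about bad data to the caller.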
💼 Why Data Analysts Care
• Clean Code: Separation of concerns. The parser shouldn't write files. The writer shouldn't parse Regex.
🧪 Concept Checks: Architecture
Q1. Implement the LogEntry class with `__init__` and `__repr__` methods.
Q2. Implement the parse_line method using re.match. Return a LogEntry. Raise ValueError if it fails.
Q3. Test parse_line on a valid string and an invalid string. Catch the error.
Q4. Implement process_file(path). Use with open and a for loop to yield parsed entries, and wrap the parse call in try/except so that bad lines are skipped.
Q5. Loop over process_file and print the entries.
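One possible shape for the generator stage is sketched below. To keep the block runnable on its own, it uses compact stand-ins for LogEntry (a namedtuple) and parse_line; both stand-ins, the pattern, and the assumed line format are illustrative only.

```python
import re
from collections import namedtuple
from typing import Iterator

# Stand-ins so this sketch runs by itself; a real solution would reuse
# the full LogEntry class and parser from the architecture list.
LogEntry = namedtuple("LogEntry", ["timestamp", "level", "message"])
_PATTERN = re.compile(r"\[(.+?)\]\s+([A-Z]+):\s*(.*)")  # assumed line format


def parse_line(line: str) -> LogEntry:
    match = _PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"Unparseable log line: {line!r}")
    return LogEntry(*match.groups())


def process_file(path: str) -> Iterator[LogEntry]:
    """Lazily yield one LogEntry per valid line, skipping lines that fail to parse."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # the file object hands over one line at a time
            try:
                yield parse_line(line)
            except ValueError:
                continue  # corrupt or blank line: ignore it and keep going
```

Because the function yields instead of building a list, only one line is held in memory at a time, and the file is closed once the loop over process_file("server.log") finishes.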
3. Capstone Execution : Putting it Together
Now integrate the components and produce the final output.
Your final task is to write the aggregator. It should loop over the generator, group messages by their Log Level (e.g., 5 ERRORs, 10 WARNs), and export this summary as beautifully formatted JSON.
🧪 Concept Checks: Execution
Q1. Import collections.Counter.
Q2. Create a generate_report(entries) function. Initialize counters.
Q3. Loop through entries. Increment counters based on entry.level.
Q4. Build the final summary dictionary from the counters, with one count per log level as described above.
Q5. Use json.dump(..., indent=4) to save the dictionary to "report.json".
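The aggregation steps above might come together roughly like this. The report keys (total_entries, counts_by_level), the output_path parameter, and the returned dictionary are assumptions made for illustration; the course only specifies counting by level and saving to report.json. The namedtuple is a stand-in for the real LogEntry class.

```python
import json
from collections import Counter, namedtuple
from typing import Iterable

# Stand-in entry type so the sketch is self-contained.
LogEntry = namedtuple("LogEntry", ["timestamp", "level", "message"])


def generate_report(entries: Iterable[LogEntry], output_path: str = "report.json") -> dict:
    """Count entries per level in a single pass and save the summary as JSON."""
    # Counter consumes the iterable (for example a generator) exactly once.
    level_counts = Counter(entry.level for entry in entries)

    # Hypothetical report layout; adapt the keys to whatever format is required.
    report = {
        "total_entries": sum(level_counts.values()),
        "counts_by_level": dict(level_counts),
    }

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=4)  # indent=4 gives readable output
    return report
```

Returning the dictionary as well as writing it is a design choice that makes the function easier to test without reading the file back.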
🛠️ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Capstone Step 1): Generate fake log data. Write a script to write 100 random log lines to server.log (mix of INFO, WARN, ERROR, and some garbage lines).
Task 2 (Capstone Step 2): Write the LogEntry class and regex parsing logic. Test it thoroughly.
Task 3 (Capstone Step 3): Write the generator pipeline to read the file efficiently and yield objects.
Task 4 (Capstone Step 4): Write the JSON aggregation logic. Run the full pipeline.
Task 5 (Capstone Step 5): Refactor. Add type hints (-> str). Add docstrings. Make it look like professional enterprise code.
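As a starting point for Task 1, the test data could be produced with something like the sketch below. The message texts, the garbage lines, the 10% garbage rate, and the function names make_fake_lines and write_fake_log are all made up for illustration; the line format is assumed to be [date] LEVEL: message.

```python
import random
from pathlib import Path
from typing import List, Optional

# Illustrative sample values; replace them with anything realistic.
LEVELS = ["INFO", "WARN", "ERROR"]
MESSAGES = ["Disk full", "User logged in", "Connection timed out", "Cache refreshed"]
GARBAGE = ["???", "corrupted line without brackets", "ERROR no timestamp"]


def make_fake_lines(count: int = 100, garbage_rate: float = 0.1,
                    seed: Optional[int] = None) -> List[str]:
    """Return `count` log lines, a fraction of which are deliberately malformed."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    lines = []
    for _ in range(count):
        if rng.random() < garbage_rate:
            lines.append(rng.choice(GARBAGE))
        else:
            day = rng.randint(1, 28)
            lines.append(f"[2023-10-{day:02d}] {rng.choice(LEVELS)}: {rng.choice(MESSAGES)}")
    return lines


def write_fake_log(path: str = "server.log", count: int = 100,
                   seed: Optional[int] = None) -> None:
    """Write the generated lines to `path`, one per line."""
    text = "\n".join(make_fake_lines(count, seed=seed)) + "\n"
    Path(path).write_text(text, encoding="utf-8")
```

Running write_fake_log("server.log", count=100, seed=42) produces the same file every time, which makes later pipeline runs comparable.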
💻 Pure Coding Interview Questions
Q1. How do you approach debugging a system that consists of multiple interacting classes?
Q2. Why did we use a generator to read the file instead of returning a list of all LogEntry objects?
Q3. If the log file was 500GB, how would your code handle it? Would it crash?
Q4. How would you modify this system to read from a continuous stream of logs (like a network socket) instead of a static file?
Q5. Explain the importance of Separation of Concerns in software architecture.
Q6. How would you write unit tests for the parse_line function?
Q7. What edge cases might break your Regex pattern in production?
Q8. How would you handle timezone differences if the logs came from servers in different regions?
Q9. If you needed to store this data permanently, would you choose JSON, CSV, or a Database? Why?
Q10. How would you use Python's logging module instead of print() statements for debugging this application?
Q11. Explain the tradeoff between using regex vs basic string splitting (.split()) for parsing logs.
Q12. How could you make the file reading multi-threaded if you had to process 100 different log files?
Q13. What is cyclomatic complexity and why should we avoid deep nesting?
Q14. How do you ensure your classes are easily extensible in the future (e.g., adding a CriticalLogEntry type)?
Q15. Explain the Single Responsibility Principle (SRP) from SOLID.
Q16. How would you package this script so another team could pip install it?
Q17. What are Python Type Hints (a: int) and why are they useful in enterprise projects?
Q18. How would you handle a PermissionError when trying to write the report.json file?
Q19. If the JSON export gets too large for memory, how do you stream JSON to a file?
Q20. How do you benchmark the execution speed of your pipeline?
Q21. What is the difference between a functional approach and an OOP approach for this specific log parsing task?
Q22. How would you refactor the code to use pathlib exclusively?
Q23. Explain how you would containerize this script using Docker.
Q24. How would you automate this script to run every midnight? (Cron/Task Scheduler).
Q25. What did you find most challenging about integrating all these Python concepts together?
Day 20 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | Planning | Always separate concerns (Parser vs Writer) |
| 2 | Robustness | Catch bad data early, don't let it crash the pipeline |
| 3 | Execution | Code is useless until it solves a business problem |
✅ Instructor's End-of-Day Checklist
• [ ] I successfully implemented an OOP architecture.
• [ ] I processed data efficiently using generators.
• [ ] I exported a clean JSON report.