Python for Data Engineering Training
Build production data pipelines with this intensive 3-day training. Master pandas and NumPy for data manipulation, design robust ETL processes, and learn to handle real-world data at scale with Python's powerful data ecosystem. Participants build a complete, production-grade pipeline by the end of the course.
Training Details
| Detail | Value |
|---|---|
| Duration | 3 days (24 hours) |
| Level | Intermediate |
| Delivery | In-person, Live online, Hybrid |
| Certification | N/A |
Who Is This For?
- Python developers moving into data engineering
- Data analysts scaling beyond spreadsheets and SQL
- Backend engineers building data pipelines
- DevOps engineers automating data workflows
- Anyone responsible for data quality and delivery
Learning Outcomes
After completing this training, participants will be able to:
- Manipulate and transform data efficiently with pandas DataFrames
- Perform numerical computations with NumPy arrays and vectorized operations
- Design and implement ETL pipelines for batch and incremental processing
- Read and write data across formats including CSV, JSON, Parquet, and databases
- Validate data quality with schema checks and automated tests
- Orchestrate multi-step pipelines with error handling and retry logic
Detailed Agenda
Day 1: Data Manipulation with pandas and NumPy
Module 1: NumPy Foundations
- ndarray creation and data types
- Indexing, slicing, and boolean masking
- Vectorized operations and broadcasting
- Hands-on: Process sensor data with NumPy
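A minimal sketch of the techniques in this module: boolean masking and vectorized arithmetic applied to a small made-up sensor array (the readings and thresholds are illustrative, not course data).

```python
import numpy as np

# Hypothetical sensor readings in degrees C; NaN marks a dropped sample
readings = np.array([21.5, 22.1, np.nan, 85.0, 21.9, 22.4, np.nan, 21.7])

valid = readings[~np.isnan(readings)]        # boolean mask drops missing samples
clean = valid[(valid > 0) & (valid < 50)]    # mask out obvious sensor glitches
fahrenheit = clean * 9 / 5 + 32              # vectorized: no Python loop needed

mean_c = clean.mean()
```

Every operation above works on whole arrays at once, which is the core habit the module builds before moving to pandas.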
Module 2: pandas DataFrames
- Series and DataFrame creation
- Indexing with loc, iloc, and boolean selection
- Column operations, dtypes, and missing data
- Hands-on: Clean and explore a messy dataset
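The cleaning steps this module covers can be sketched on a tiny made-up dataset (column names and imputation choice are illustrative): enforce dtypes, drop rows missing a key field, impute the rest, then select with `loc`.

```python
import pandas as pd

# Hypothetical messy dataset: ages stored as strings, missing values mixed in
df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Dana"],
    "age": ["34", "28", "41", None],
})

df["age"] = pd.to_numeric(df["age"], errors="coerce")  # bad values become NaN
df = df.dropna(subset=["name"])                        # a row without a name is unusable
df["age"] = df["age"].fillna(df["age"].median())       # simple median imputation
adults = df.loc[df["age"] >= 30, ["name", "age"]]      # label-based boolean selection
```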
Module 3: Data Transformation
- Filtering, sorting, and grouping
- Aggregations, pivot tables, and crosstabs
- Merging, joining, and concatenating DataFrames
- Hands-on: Combine and summarize multi-source sales data
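A compact sketch of the merge-then-aggregate pattern from this module, using two invented tables standing in for the multi-source sales data:

```python
import pandas as pd

# Hypothetical source tables: transactional orders and a region lookup
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region_id": [10, 10, 20],
    "amount": [100.0, 250.0, 75.0],
})
regions = pd.DataFrame({"region_id": [10, 20], "region": ["North", "South"]})

# Left join keeps every order even if a region is unknown
merged = orders.merge(regions, on="region_id", how="left")

# Group and aggregate to a per-region summary
summary = merged.groupby("region", as_index=False)["amount"].sum()
```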
Day 2: ETL Pipeline Design
Module 4: Reading and Writing Data
- CSV, JSON, and Excel I/O with pandas
- Working with Parquet and columnar formats
- Database connections with SQLAlchemy
- Hands-on: Build a multi-format data ingestion layer
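A sketch of the database round-trip covered here. For brevity it uses a plain in-memory SQLite DBAPI connection, which pandas accepts directly; in the course's production scenario a SQLAlchemy engine plays the same role. Table and column names are illustrative.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# In production: engine = sqlalchemy.create_engine("postgresql://...")
conn = sqlite3.connect(":memory:")

df.to_sql("events", conn, index=False, if_exists="replace")  # load
loaded = pd.read_sql_query("SELECT * FROM events ORDER BY id", conn)  # extract
```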
Module 5: Data Validation and Quality
- Schema validation and type enforcement
- Detecting duplicates, outliers, and anomalies
- Data quality metrics and reporting
- Hands-on: Build a data validation framework
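A minimal sketch of the validation framework idea: a declared schema checked against incoming frames, plus a duplicate check. The schema contract and column names here are invented for illustration.

```python
import pandas as pd

# Hypothetical contract: column name -> expected pandas dtype
EXPECTED_SCHEMA = {"user_id": "int64", "email": "object"}

def validate(df: pd.DataFrame) -> list:
    """Return a list of data-quality violations; an empty list means clean."""
    errors = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if df.duplicated().any():
        errors.append("duplicate rows found")
    return errors

good = pd.DataFrame({"user_id": [1, 2], "email": ["a@x.com", "b@x.com"]})
bad = pd.DataFrame({"user_id": ["1", "1"], "email": ["a@x.com", "a@x.com"]})
```

Returning a list of violations rather than raising on the first error lets a quality report show everything wrong with a batch at once.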
Module 6: ETL Pipeline Patterns
- Extract-Transform-Load vs ELT approaches
- Incremental loading and change data capture
- Idempotent pipeline design
- Hands-on: Implement a complete ETL pipeline from raw files to database
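Idempotent design, one of the module's key patterns, can be shown in a few lines: keying the load on a natural key so that replaying the same batch changes nothing. SQLite and the table name are stand-ins for the target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 100.0), (2, 250.0)]

def load(rows):
    # Upsert keyed on order_id: re-running a batch is a no-op, not a duplicate
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?)", rows)
    conn.commit()

load(batch)
load(batch)  # replay after a crash or retry is safe
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```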
Day 3: Production Pipelines
Module 7: Pipeline Orchestration
- Task dependencies and execution order
- Scheduling with cron and APScheduler
- Error handling, retries, and dead-letter queues
- Hands-on: Orchestrate a multi-step pipeline with dependency management
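The retry behavior this module covers can be sketched as a small wrapper with exponential backoff. The helper name, the delay values, and the flaky task are all illustrative.

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=0.01):
    """Run task(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_extract():
    # Simulated transient failure: succeeds on the third attempt
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "payload"

result = run_with_retries(flaky_extract)
```

In a real pipeline a task that exhausts its retries would be routed to a dead-letter queue for later inspection rather than silently dropped.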
Module 8: Performance Optimization
- Chunked processing for large datasets
- Memory optimization with dtypes and categories
- Parallel processing with multiprocessing and Dask
- Hands-on: Optimize a slow pipeline to handle 10x data volume
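Chunked processing, the first technique in this module, keeps memory flat by streaming a file in fixed-size pieces and aggregating per chunk. The ten-row CSV here is a stand-in for a file too large to load at once.

```python
import io
import pandas as pd

# Simulated large CSV (in practice: a multi-GB file on disk)
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()  # aggregate each piece, discard it, move on
```

Only one chunk is ever in memory, so peak usage is bounded by `chunksize` rather than by the file size.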
Module 9: Testing and Monitoring
- Unit testing data transformations with pytest
- Integration testing with test fixtures and sample data
- Pipeline monitoring, logging, and alerting
- Hands-on: Add comprehensive tests and monitoring to a production pipeline
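A sketch of unit-testing a transformation in the pytest style this module uses: the transformation and its test live side by side, and the test also checks that the input frame is not mutated. Function and column names are illustrative.

```python
import pandas as pd

def normalize_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Transformation under test: strip whitespace and lowercase the email column."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    return out

# pytest discovers functions named test_*; shown inline for brevity
def test_normalize_emails():
    raw = pd.DataFrame({"email": ["  Alice@X.COM ", "bob@x.com"]})
    result = normalize_emails(raw)
    assert list(result["email"]) == ["alice@x.com", "bob@x.com"]
    assert list(raw["email"]) == ["  Alice@X.COM ", "bob@x.com"]  # input untouched
```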
Prerequisites
- Python Fundamentals or equivalent programming experience
- Comfort with functions, classes, and file I/O
- Basic understanding of SQL and relational databases
- Familiarity with CSV and JSON data formats
Delivery Formats
| Format | Description |
|---|---|
| In-Person | On-site at your company's location, hands-on with direct interaction |
| Live Online | Interactive virtual sessions with screen sharing and real-time labs |
| Hybrid | Combination of on-site and remote sessions, flexible scheduling |
All formats include hands-on labs, course materials, sample datasets, and post-training support.
Ready to get started?
Request a training quote for your team — in-person, live-online, or hybrid.