Database Fundamentals
Introduction to Databases
Relational Database Management Systems
Structured Query Language (SQL)
Database Security and Administration
Understanding Database Transactions
OLAP and OLTP
ACID and BASE
SQL Database Fundamentals
Data types
Table Creation
Data Manipulation
Joins
Aggregate Functions
NoSQL Database Fundamentals
Document-oriented databases
Key-value stores
Graph databases
Column-family stores
Data Modeling and Database Design
Conceptual Data Modeling
Logical Data Modeling
Physical Data Modeling
Normalization
Indexing
Query Optimization
Data Integrity
Python (Fundamentals)
Essential Syntax and Data Structures
Functions
Object-Oriented Programming
File Handling
Basic Data Analysis
Connecting to Databases
Executing Basic SQL Commands
Python (Intermediate)
Data Wrangling with Pandas
Data Preparation, Cleaning, and Transformation
Leveraging Multiprocessing and Multithreading for Improved Performance
Python (Advanced)
Advanced Data Manipulation with Pandas
Advanced Data Analysis
Grouping and Aggregation Techniques
Introduction to Data Engineering
Overview of Data Engineering
Data Storage and Retrieval
Data Processing and Transformation
Data Pipelines and Workflows
Data Quality and Governance
Data Integration and ETL (Extract, Transform, Load)
ETL (Extract, Transform, Load) processes
Extracting Insights from Raw Data
Transforming Data for Efficient Processing
Loading Data with Confidence
Data Profiling for Better Decision Making
Cleaning Data for Actionable Insights
Integrating Data for a Holistic View
Error Handling for Seamless Data Management
Data Extraction
Identifying the data sources
Selecting the appropriate data extraction tools and techniques
Defining the data extraction requirements
Validating the extracted data
Data Integration and Transformation
Overview of Data Integration and Transformation
Data Integration Techniques and Best Practices
Semantic Transformation in Data Integration
Common Challenges and Pitfalls of Data Integration and Transformation
Data Transformation Techniques and Tools
Data Warehouse Storage and Management
Data Loading
Initial load
Incremental load
Verifying the referential integrity between the dimensions and the fact tables
Big Data Processing and Open Source Tools
Overview of Big Data Processing and Open Source Tools
Hadoop: A Framework for Storing and Processing Data
Apache Spark: An Analytics Engine for Large Scale Data
Cassandra and MongoDB: NoSQL Databases for Big Data
HPCC and Apache Storm: Distributed Computing Platforms for Big Data
KNIME Analytics Platform, RapidMiner, and RStudio: Open Source Tools for Big Data Analytics
Distributed Data Processing using AWS Apache Spark
Overview of Distributed Data Processing using AWS Apache Spark
Apache Spark: A Distributed Processing Framework for Big Data
Amazon EMR: A Managed Service for Running Apache Spark on AWS
Spark Libraries for Machine Learning, Stream Processing, and Graph Analytics
Data Processing with Apache Spark on Amazon SageMaker
AWS Fundamentals
Overview of Cloud Computing and AWS
AWS Core Services
Storage and Management
AWS Security and Compliance
AWS Pricing and Support
AWS Architecture and Design
AWS Operations and Management
Securing AWS resources using IAM
AWS Identity and Access Management (IAM) Introduction
AWS IAM Policies and Permissions
Managing AWS IAM Roles, Users and Groups
Accessing AWS via Command Line Interface
Configure and Validate AWS CLI
AWS Storage (S3 and Glacier)
Getting started with S3
Storage: Deep dive into S3
Setting up Local Development Environment
Setting up local development environment for AWS on Windows
Setting up local development environment for AWS on Mac
Setting up environment for practice using Cloud9
Cloud9 features you must know
Connecting and Working with Cloud9
Working with EC2 Instances
Launching and connecting to EC2
Managing EC2 instances
Securing EC2 connections and resources
Advanced EC2 Instance Management
Metadata, Querying, Filtering, and Bootstrapping
Creating and Validating AMIs
Data Ingestion using Lambda Functions
Introduction to Serverless Computing and AWS Lambda
Developing and Deploying Lambda Functions for Data Ingestion
Automating data processing with AWS Lambda and Event-Driven Architectures
Optimization of AWS Lambda functions
Development Lifecycle for PySpark
Introduction to PySpark and Spark Session
Data Ingestion with PySpark
Data Processing with PySpark APIs
Data Export with PySpark
Productionizing PySpark Code
Developing Your First ETL Job with AWS Glue
AWS Glue Job and Basic Configuration
Creating and Running Your First AWS Glue ETL Job
Spark History Server for Glue Jobs
Setting up Spark History Server for Glue Jobs
Running AWS Glue Jobs with Spark UI Container
Mastering AWS Glue Catalog
Creating and Managing Glue Catalog Tables
Managing Glue Catalog Programmatically
Programmatically Interacting with AWS Glue Using API
Updating IAM Role and Creating a Baseline Glue Job
Partitioning Data with Glue Script
Incremental Data Processing with AWS Glue Job Bookmark
Overview of Glue Job Bookmark
Running jobs with Glue Job Bookmarks
Incremental Data Processing with Glue Job Bookmarks
Getting started with AWS EMR
AWS EMR Cluster Fundamentals
EMR Cluster Configuration and Management
Deploying Spark applications using AWS EMR
Deploying Applications using AWS EMR
Running Spark Applications on AWS EMR Cluster
Managing Applications on AWS EMR Clusters
Optimizing Data on EMR
Security and Networking
Optimization and Performance Tuning
Managing data in EMR
Troubleshooting, Debugging, and Best Practices
Building a Streaming Pipeline using Kinesis
Streaming Data Processing with AWS Kinesis
Setting up the Streaming Pipeline
Using Kinesis Firehose for Data Delivery
Setting up Kinesis Delivery Stream for S3
Accessing and Reading S3 Objects with Kinesis and Boto3
Setting Up Access and Reading S3 Objects with Kinesis and Boto3
Working with DynamoDB using Boto3
Getting the Most out of Amazon Athena
Amazon Athena and Glue Catalog
Creating Tables and Populating Data in Athena
Partitioning Data in Athena
Amazon Athena using AWS CLI
Utilizing Amazon Athena using AWS CLI
Managing Athena with AWS CLI
Running Athena Queries with AWS CLI
Amazon Athena using Python boto3
Managing Amazon Athena using Python boto3
Run Amazon Athena Queries using Python boto3
Getting started with Amazon Redshift
Setting up and Managing Redshift Cluster
Querying Redshift Tables
Redshift Tables Management
Copy data from S3 into Redshift tables
Introduction and Overview of Redshift Copy Command
Setting up and Running the Redshift Copy Command
Copying Data using IAM Role and JSON Dataset
Develop applications using Redshift cluster
Setting up Redshift Cluster and Access
Working with Redshift Databases and Tables
Interacting with Redshift using Python
Redshift Tables with Distkeys and Sortkeys
Redshift Architecture and Cluster Creation
Redshift Tables and Distribution Strategies
Maintenance and Troubleshooting
Redshift Federated Queries and Spectrum
Setting up RDS and Redshift Integration
Data Processing with Redshift Federated Queries
Running Queries with Redshift Spectrum
Clean up and Maintenance
Preview
This is a short preview of the Data Engineering on AWS Masters Program.
The Introduction
Introduction to Course
Introduction to Data Engineering
Cloud Fundamentals
Quick Review of AWS
Build your Database Foundations
Concepts of DBMS
SQL Fundamentals
NoSQL Fundamentals
Understanding Data Modelling and Designing
Introduction to Data Modelling and Database Design
Conceptual data modeling and Entity-Relationship diagrams
Types of Data Modelling, Normalization and Denormalization
Indexing and Query Optimization Techniques
Constraints and Triggers for Data Integrity
Choosing the Appropriate Design for a Given Scenario
Best Practices for Data Modelling and Database Design
Python for Data Engineering
Python Essentials
Python for Data Engineering – Foundations
Python for Data Engineering – Advanced
Individual Project – 1
This section consists of one individual project. Learners gain practical knowledge of topics such as database fundamentals and Python for data engineering.
Understanding the concepts of Data Engineering
Introduction to Data Engineering
Understanding ETL Processes
Data Extraction, Integration, Transformation, and Loading
Explore in-depth concepts of Data Engineering
Exploring Big Data Processing and Open-Source Tools
Introduction to Big Data Processing and Open-Source Tools
Hadoop Framework and Apache Spark Analytics Engine
Cassandra and MongoDB: NoSQL databases for Big Data storage and retrieval
Distributed computing platforms – HPCC and Apache Storm
Working with KNIME Analytics Platform, RapidMiner, and RStudio
Dive into Cloud Computing and AWS
Concepts of Cloud Computing
AWS Fundamentals
Working with AWS
Mastering Distributed Data Processing using AWS Apache Spark
Overview of distributed data processing using AWS Apache Spark
Apache Spark as a distributed processing framework for big data
Data processing with Apache Spark on Amazon SageMaker
Amazon EMR for running Spark on AWS
Building a Spark application on AWS
Best practices for distributed data processing using AWS Apache Spark
Group Project – 1
This section consists of a group project. Learners gain hands-on experience with data engineering tasks using AWS tools such as Lambda, Glue, PySpark, and EMR.
Working with the Tools for Data Engineering Part – 1 (Hands-on)
Working with AWS Lambda
Development Lifecycle for AWS PySpark
Working with AWS Glue
Mastering AWS EMR
Tools for Data Engineering Part – 2 (Hands-on)
Building a Streaming Pipeline using Kinesis
Working with DynamoDB using Boto3
Getting the Most out of Amazon Athena
Getting started with Amazon Redshift
Group Project – 2
This section consists of a group project. Learners gain hands-on experience with data engineering tasks using AWS tools such as Kinesis, DynamoDB, Athena, and Redshift.