Airflow is a beautiful orchestrator that integrates seamlessly with a wide variety of connectors in the market, while MWAA (Amazon Managed Workflows for Apache Airflow) makes it easier to set up and operate end-to-end data pipelines in the cloud at scale. With most companies driving towards multi-cloud architectures, it is important to choose an orchestration service that can talk to pretty much every cloud provider (AWS, Azure, GCP, Oracle, etc.). So we chose Airflow as our frontliner.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. The core of Athena is PrestoDB, which was designed and developed at Facebook and later became community driven under the Apache license. The beauty of Athena is that it supports a wide variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet, along with many compression types. Athena is ideal for interactive querying and can also handle complex analysis, including large joins, window functions, and arrays.

In this article Athena will be the heavy lifter that takes care of all transformations. One important reason for choosing Athena over Glue Spark is ease of code and maintenance: a data engineer can build the transformations in ANSI SQL, which can then be maintained by an analytics engineer or a data analyst. And because Athena shares the Glue catalog, other AWS services such as Redshift Spectrum, EMR Presto, EMR Hive, and Glue Spark can all talk to the transformed S3 dataset.

I will not cover the MWAA setup or the role and policy creation in this article; it is pretty simple, and you can refer to the AWS documentation.

In this article let's clean and dedupe S3 JSON data in our transformation layer. The source is uncleaned, duplicated, complex nested JSON data dumped into daily folders in S3. On top of this day-level partitioned S3 data, create an Athena table with the JSON SerDe. The nested columns in the DDL look like this:

```sql
`col3` map<string,string> COMMENT 'from deserializer',
`col4` array<string> COMMENT 'from deserializer'
```

The source table is all set; refer to the Athena documentation for a better understanding of the map, struct, and array datatypes.

Now let's talk about orchestrating our transformation using Airflow. The AWSAthenaOperator is used to fire the Athena transformation queries, where the cleansed, deduped, and flattened data is dumped into a transformed Parquet table:

```python
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
```

The transformation itself is plain SQL, such as:

```sql
INSERT INTO schema1.table2 SELECT * FROM schema1.table1
```

with `output_location='s3://bucket_name/prefix_folder_path'` passed to the operator so Athena knows where to stage the query results.

When you drop an external table, the underlying data remains intact, because all tables in Athena are EXTERNAL.
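To make the source table concrete, here is a minimal sketch of what the full DDL could look like. Only the table name, `col3`, `col4`, the JSON SerDe, and the day-level partitioning come from the walkthrough above; the remaining columns, the partition key `dt`, and the S3 location are hypothetical placeholders.

```sql
-- Sketch of the source table; col1/col2, the partition key `dt`, and the
-- LOCATION path are hypothetical. Only the table name, col3/col4, the JSON
-- SerDe, and day-level partitioning come from the article.
CREATE EXTERNAL TABLE schema1.table1 (
  `col1` string COMMENT 'from deserializer',
  `col2` string COMMENT 'from deserializer',
  `col3` map<string,string> COMMENT 'from deserializer',
  `col4` array<string> COMMENT 'from deserializer'
)
PARTITIONED BY (`dt` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket_name/raw/';
```

After creating the table, run `MSCK REPAIR TABLE schema1.table1` (or configure partition projection) so Athena picks up the daily partition folders.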
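The `INSERT INTO ... SELECT *` above copies rows as-is. One common Presto/Athena pattern for the dedupe-and-flatten step, a sketch rather than the author's exact query, combines a `ROW_NUMBER()` window with `CROSS JOIN UNNEST`. The dedupe key `id` and recency column `updated_at` are assumed for illustration.

```sql
-- Sketch of a dedupe-and-flatten query; `id` and `updated_at` are
-- hypothetical columns, everything else follows the tables above.
INSERT INTO schema1.table2
SELECT t.id,
       kv.k,     -- each map entry of col3 becomes a row
       kv.v,
       a.elem    -- each element of col4 becomes a row
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM schema1.table1
) t
CROSS JOIN UNNEST(t.col3) AS kv (k, v)
CROSS JOIN UNNEST(t.col4) AS a (elem)
WHERE t.rn = 1;   -- keep only the latest version of each record
```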
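A minimal MWAA DAG wiring the pieces together could look like the sketch below. The operator import, the query, and the `output_location` come from the article; the DAG id, schedule, database, and workgroup are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator

# Minimal sketch of the orchestration DAG. The dag_id, schedule, database,
# and workgroup are hypothetical; the query and output_location come from
# the walkthrough above.
with DAG(
    dag_id="athena_transform",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean_dedupe_flatten = AWSAthenaOperator(
        task_id="clean_dedupe_flatten",
        query="INSERT INTO schema1.table2 SELECT * FROM schema1.table1",
        database="schema1",
        output_location="s3://bucket_name/prefix_folder_path",
        workgroup="primary",  # the default Athena workgroup
    )
```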
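Finally, since every Athena table is EXTERNAL, dropping one is a catalog-only operation, which makes schema changes low-risk:

```sql
DROP TABLE schema1.table1;
-- The JSON files in S3 are untouched; re-creating the table and loading
-- its partitions brings the data right back.
```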