dbt is a very good tool when the backend is a database or data warehouse and the transformations can be expressed as SQL statements. dbt supports most traditional databases, the cloud data warehouses such as Redshift and Snowflake, and the big data warehouses such as Hive and Databricks.
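Most dbt models are plain SQL SELECT statements that dbt materializes as tables or views; on Spark-based warehouses such as Databricks, dbt also accepts Python models. A minimal sketch of what such a transformation looks like, with hypothetical model and column names:

```python
# models/order_totals.py -- a dbt Python model (Spark-based adapters)
def model(dbt, session):
    dbt.config(materialized="table")
    # dbt.ref() resolves an upstream model to a Spark DataFrame.
    orders = dbt.ref("stg_orders")  # hypothetical upstream model
    # Equivalent to: SELECT customer_id, SUM(amount) FROM stg_orders GROUP BY customer_id
    return orders.groupBy("customer_id").agg({"amount": "sum"})
```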
The lakehouse stack on AWS, with dbt handling the transformations:
Data storage: S3
Ingestion: Glue catalog (streaming, JDBC, or S3 files) + Glue job
Transformation: dbt
Data warehouse: Redshift or Redshift Spectrum
Data catalog: Glue catalog
Job scheduling/Automation: Airflow (using BashOperators or the generic AWS operators); see the DAG sketch after this list
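A minimal sketch of the scheduling piece, assuming an Airflow 2.x deployment with dbt installed on the worker; the DAG id, project path, and schedule are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lakehouse_dbt",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Build all dbt models against the warehouse (e.g. Redshift).
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/lakehouse && dbt run",  # hypothetical project path
    )
    # Validate the freshly built models with dbt's tests.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/lakehouse && dbt test",
    )
    dbt_run >> dbt_test
```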
An alternative stack that replaces dbt with Glue jobs:
Data storage: S3
Ingestion: Glue catalog (streaming, JDBC, or S3 files) + Glue job
Transformation: Glue jobs (Spark SQL); see the job sketch after this list
Load: Athena, Redshift, or Redshift Spectrum
Data catalog: Glue catalog
Job scheduling/Automation: Airflow (using BashOperators that run AWS CLI commands)
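A minimal sketch of a Glue job doing the same aggregation as the dbt sketch above, this time with Spark SQL. It assumes the job is configured to use the Glue Data Catalog as its Hive metastore; the database, table, and bucket names are hypothetical:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Aggregate a table registered in the Glue catalog with plain Spark SQL.
df = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw_db.orders              -- hypothetical catalog table
    GROUP BY customer_id
""")

# Write the result to S3 as Parquet so Athena or Redshift Spectrum can query it.
df.write.mode("overwrite").parquet("s3://my-bucket/curated/order_totals/")

job.commit()
```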
The reason this site is not hosted on S3 is that S3 static website hosting does not provide a static IP address: you can only route traffic to the hosted files through AWS CloudFront, and to serve over SSL you also need AWS Certificate Manager.
The current solution instead hosts the files on GitHub Pages, with Cloudflare acting as the nameserver and SSL certificate provider; this is much simpler than hosting the static pages on S3.
The idea is from here.