Set up fine-grained permissions for your data pipeline using MWAA and EKS

This is a guest blog post co-written with Patrick Oberherr from Contentful and Johannes Günther from Netlight Consulting.

This blog post shows how to improve security in a data pipeline architecture based on Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Amazon Elastic Kubernetes Service (Amazon EKS) by setting up fine-grained permissions, using HashiCorp Terraform for infrastructure as code.

Many AWS customers use Amazon EKS to run their data workloads. The advantages of Amazon EKS include different compute and storage options depending on workload needs, higher resource utilization through shared underlying infrastructure, and a vibrant open-source community that provides purpose-built extensions. The Data on EKS project provides a series of templates and other resources to help customers get started on this journey, including a description of using Amazon MWAA as a job scheduler.

Contentful is an AWS customer and AWS Partner Network (APN) partner. Behind the scenes of their Software-as-a-Service (SaaS) product, the Contentful Composable Content Platform, Contentful uses insights from data to improve business decision-making and customer experience. Contentful engaged Netlight, an APN consulting partner, to help set up a data platform to gather these insights.

Most of Contentful’s application workloads run on Amazon EKS, and knowledge of this service and Kubernetes is widespread in the organization. That’s why Contentful’s data engineering team decided to run its data pipelines on Amazon EKS as well. For job scheduling, the team started with self-managed Apache Airflow on an Amazon EKS cluster and later switched to Amazon MWAA to reduce engineering and operations overhead. Job execution remained on Amazon EKS.

Contentful runs a complex data pipeline on this infrastructure, including ingestion from multiple data sources and different transformation jobs, for example using dbt. The entire pipeline shares a single Amazon MWAA environment and a single Amazon EKS cluster. With a diverse set of workloads in a single environment, it is critical to apply the principle of least privilege, ensuring that individual tasks or components have only the permissions they need to function.

By segmenting permissions according to roles and tasks, Contentful’s data engineering team was able to create a more robust and secure data processing environment, which is essential for maintaining the integrity and confidentiality of the data being handled.

In this blog post, we walk through setting up the infrastructure from scratch and deploying a sample application using Terraform, Contentful’s tool of choice for infrastructure as code.

Prerequisites

To follow along with this blog post, you need the latest versions of the following tools installed: Terraform and the AWS Command Line Interface (AWS CLI), both of which are used throughout this walkthrough, configured with credentials for your AWS account.

Overview

In this blog post, you will create a sample application with the following infrastructure:

Architecture drawing of the sample application deployed in this blog post

The sample Airflow workflow lists objects in the source bucket, temporarily stores this list using Airflow XComs, and writes the list as a file to the destination bucket. This application is executed using Amazon EKS pods, scheduled by an Amazon MWAA environment. You deploy the EKS cluster and the MWAA environment into a virtual private cloud (VPC) and apply least-privilege permissions to the EKS pods using IAM roles for service accounts. The configuration bucket for Amazon MWAA contains runtime requirements, as well as the application code specifying an Airflow Directed Acyclic Graph (DAG).

Initialize the project and create buckets

Create a file main.tf with the following content in an empty directory:

locals {
  region = "us-east-1"
}

provider "aws" {
  region = local.region
}

resource "aws_s3_bucket" "source_bucket" {
  bucket_prefix = "source"
}

resource "aws_s3_object" "dummy_object" {
  bucket  = aws_s3_bucket.source_bucket.bucket
  key     = "dummy.txt"
  content = ""
}

# Export the bucket names to SSM Parameter Store so the pipeline tasks can look them up at runtime
resource "aws_ssm_parameter" "source_bucket" {
  name  = "mwaa_source_bucket"
  type  = "SecureString"
  value = aws_s3_bucket.source_bucket.bucket
}

resource "aws_s3_bucket" "destination_bucket" {
  bucket_prefix = "destination"
  force_destroy = true
}

resource "aws_ssm_parameter" "destination_bucket" {
  name  = "mwaa_destination_bucket"
  type  = "SecureString"
  value = aws_s3_bucket.destination_bucket.bucket
}

This file defines the Terraform AWS provider as well as the source and destination buckets, whose names are exported as AWS Systems Manager parameters. It also tells Terraform to upload an empty object named dummy.txt into the source bucket, which allows the Airflow sample application we create later to receive a result when listing bucket contents.

Initialize the Terraform project and download the module dependencies by issuing the following command:
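terraform init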

Create the infrastructure:
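terraform apply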

Terraform asks you to acknowledge the changes to the environment and then starts deploying resources in AWS. Upon successful deployment, you should see the following success message:

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

Create VPC

Create a new file vpc.tf in the same directory as main.tf and insert the following:

knowledge "aws_availability_zones" "accessible" {}

locals {
  cidr = "10.0.0.0/16"
  azs  = slice(knowledge.aws_availability_zones.accessible.names, 0, 3)
}

module "vpc" {
  title               = "data-vpc"
  supply             = "terraform-aws-modules/vpc/aws"
  model            = "~> 4.0"
  cidr               = native.cidr
  azs                = native.azs
  public_subnets     = [for k, v in local.azs : cidrsubnet(local.cidr, 8, k + 48)]
  private_subnets    = [for k, v in local.azs : cidrsubnet(local.cidr, 4, k)]
  enable_nat_gateway = true
}

This file defines the VPC, a virtual network that will later host the Amazon EKS cluster and the Amazon MWAA environment. Note that we use an existing Terraform module for this, which wraps the configuration of underlying network resources like subnets, route tables, and NAT gateways.

Download the VPC module:
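terraform init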

Deploy the new resources:
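terraform apply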

Note which resources are being created. By using the VPC module in our Terraform file, much of the underlying complexity is taken away when defining our infrastructure, but it is still useful to know what exactly is being deployed.

Note that Terraform now handles the resources we defined in both files, main.tf and vpc.tf, because Terraform includes all .tf files in the current working directory.

Create the Amazon MWAA environment

Create a new file mwaa.tf and insert the following content:

locals {
  requirements_filename = "requirements.txt"
  airflow_version       = "2.6.3"
  requirements_content  = <<EOT
apache-airflow[cncf.kubernetes]==${local.airflow_version}
EOT
}

module "mwaa" {
  source = "github.com/aws-ia/terraform-aws-mwaa?ref=1066050"

  name              = "mwaa"
  airflow_version   = local.airflow_version
  environment_class = "mw1.small"

  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = slice(module.vpc.private_subnets, 0, 2)

  webserver_access_mode = "PUBLIC_ONLY"

  requirements_s3_path = local.requirements_filename
}

resource "aws_s3_object" "requirements" {
  bucket  = module.mwaa.aws_s3_bucket_name
  key     = local.requirements_filename
  content = local.requirements_content

  etag = md5(local.requirements_content)
}

As before, we use an existing module to save configuration effort for the Amazon MWAA environment. The module also creates the configuration bucket, which we use to specify the runtime dependency of the application (apache-airflow[cncf.kubernetes]) in the requirements.txt file. This package, together with the preinstalled Amazon provider package, allows interaction with Amazon EKS.

Download the MWAA module:
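terraform init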

Deploy the new resources:
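terraform apply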

This operation takes 20–30 minutes to complete.

Create the Amazon EKS cluster

Create a file eks.tf with the following content:

module "cluster" {
  supply = "github.com/aws-ia/terraform-aws-eks-blueprints?ref=8a06a6e"

  cluster_name    = "data-cluster"
  cluster_version = "1.27"

  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnets
  enable_irsa        = true

  managed_node_groups = {
    node_group = {
      node_group_name = "node-group"
      desired_size    = 1
    }
  }
  application_teams = {
    mwaa = {}
  }

  map_roles = [{
    rolearn  = module.mwaa.mwaa_role_arn
    username = "mwaa-executor"
    groups   = []
  }]
}

knowledge "aws_eks_cluster_auth" "this" {
  title = module.cluster.eks_cluster_id
}

supplier "kubernetes" {
  host                   = module.cluster.eks_cluster_endpoint
  cluster_ca_certificate = base64decode(module.cluster.eks_cluster_certificate_authority_data)
  token                  = knowledge.aws_eks_cluster_auth.this.token
}

useful resource "kubernetes_role" "mwaa_executor" {
  metadata {
    title      = "mwaa-executor"
    namespace = "mwaa"
  }

  rule {
    api_groups = [""]
    sources  = ["pods", "pods/log", "pods/exec"]
    verbs      = ["get", "list", "create", "patch", "delete"]
  }
}

useful resource "kubernetes_role_binding" "mwaa_executor" {
  metadata {
    title      = "mwaa-executor"
    namespace = "mwaa"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    type      = "Function"
    title      = kubernetes_role.mwaa_executor.metadata[0].title
  }
  topic {
    type      = "Consumer"
    title      = "mwaa-executor"
    api_group = "rbac.authorization.k8s.io"
  }
}

output "configure_kubectl" {
  description = "Configure kubectl: ensure you're logged in with the right AWS profile and run the next command to replace your kubeconfig"
  worth       = "aws eks --region ${native.area} update-kubeconfig --name ${module.cluster.eks_cluster_id}"
}

To create the cluster itself, we take advantage of the Amazon EKS Blueprints for Terraform project. We also define a managed node group with one node as the target size. Note that in cases with fluctuating load, scaling your cluster with Karpenter instead of the managed node group approach shown above makes the cluster scale more flexibly. We used managed node groups primarily because of the ease of configuration.

We define the identity that the Amazon MWAA execution role assumes in Kubernetes using the map_roles variable. After configuring the Terraform Kubernetes provider, we give the Amazon MWAA execution role permissions to manage pods in the cluster.
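If you want to double-check the RBAC side of this mapping once the cluster is up, a quick sanity check along the following lines can help (an optional step, assuming your kubeconfig already points at the new cluster, for example via the configure_kubectl output defined above):

# Show the Role and RoleBinding created in the mwaa namespace
kubectl -n mwaa get role,rolebinding

# Impersonate the mapped mwaa-executor user and check pod permissions
kubectl auth can-i create pods -n mwaa --as mwaa-executor

The second command should print "yes" if the Role and RoleBinding are in place.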

Download the EKS Blueprints for Terraform module:
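terraform init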

Deploy the new resources:
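terraform apply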

This operation takes about 12 minutes to complete.

Create IAM roles for service accounts

Create a file roles.tf with the following content:

knowledge "aws_iam_policy_document" "source_bucket_reader" {
  assertion {
    actions   = ["s3:ListBucket"]
    sources = ["${aws_s3_bucket.source_bucket.arn}"]
  }
  assertion {
    actions   = ["ssm:GetParameter"]
    sources = [aws_ssm_parameter.source_bucket.arn]
  }
}

useful resource "aws_iam_policy" "source_bucket_reader" {
  title   = "source_bucket_reader"
  path   = "/"
  coverage = knowledge.aws_iam_policy_document.source_bucket_reader.json
}

module "irsa_source_bucket_reader" {
  supply = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/irsa"

  eks_cluster_id              = module.cluster.eks_cluster_id
  eks_oidc_provider_arn       = module.cluster.eks_oidc_provider_arn
  irsa_iam_policies           = [aws_iam_policy.source_bucket_reader.arn]
  kubernetes_service_account  = "source-bucket-reader-sa"
  kubernetes_namespace        = "mwaa"
  create_kubernetes_namespace = false
}

knowledge "aws_iam_policy_document" "destination_bucket_writer" {
  assertion {
    actions   = ["s3:PutObject"]
    sources = ["${aws_s3_bucket.destination_bucket.arn}/*"]
  }
  assertion {
    actions   = ["ssm:GetParameter"]
    sources = [aws_ssm_parameter.destination_bucket.arn]
  }
}

useful resource "aws_iam_policy" "destination_bucket_writer" {
  title   = "irsa_destination_bucket_writer"
  coverage = knowledge.aws_iam_policy_document.destination_bucket_writer.json
}

module "irsa_destination_bucket_writer" {
  supply = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/irsa"

  eks_cluster_id              = module.cluster.eks_cluster_id
  eks_oidc_provider_arn       = module.cluster.eks_oidc_provider_arn
  irsa_iam_policies           = [aws_iam_policy.destination_bucket_writer.arn]
  kubernetes_service_account  = "destination-bucket-writer-sa"
  kubernetes_namespace        = "mwaa"
  create_kubernetes_namespace = false
}

This file defines two Kubernetes service accounts, source-bucket-reader-sa and destination-bucket-writer-sa, and their permissions against the AWS API, using IAM roles for service accounts (IRSA). Again, we use a module from the Amazon EKS Blueprints for Terraform project to simplify IRSA configuration. Note that both roles get only the minimal permissions they need, defined using AWS IAM policies.
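To confirm the IRSA wiring after deployment, you can inspect one of the service accounts the module created; the associated IAM role appears in the standard eks.amazonaws.com/role-arn annotation (again assuming your kubeconfig points at the cluster):

# The role-arn annotation links the Kubernetes service account to its IAM role
kubectl -n mwaa get serviceaccount source-bucket-reader-sa -o yaml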

Download the new module:
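terraform init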

Deploy the new resources:
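terraform apply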

Create the DAG

Create a file dag.py defining the Airflow DAG:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

dag = DAG(
    "dag_with_fine_grained_permissions",
    description="DAG with fine-grained permissions",
    default_args={
        "cluster_name": "data-cluster",
        "namespace": "mwaa",
        "get_logs": True,
        "is_delete_operator_pod": True,
    },
    schedule="@hourly",
    start_date=datetime(2023, 1, 1),
    catchup=False,
)

# List the source bucket and write the result to the XCom return file so Airflow picks it up
read_bucket = EksPodOperator(
    task_id="read-bucket",
    pod_name="read-bucket",
    service_account_name="source-bucket-reader-sa",
    image="amazon/aws-cli:latest",
    cmds=[
        "sh",
        "-xc",
        "aws s3api list-objects --output json --bucket $(aws ssm get-parameter --name mwaa_source_bucket --with-decryption --query 'Parameter.Value' --output text) > /airflow/xcom/return.json",
    ],
    do_xcom_push=True,
    dag=dag,
)

# Airflow renders the Jinja expression below before the pod starts, substituting the
# object list pulled from the read-bucket task's XCom.
write_bucket = EksPodOperator(
    task_id="write-bucket",
    pod_name="write-bucket",
    service_account_name="destination-bucket-writer-sa",
    image="amazon/aws-cli:latest",
    cmds=[
        "sh",
        "-xc",
        "echo '{{ ti.xcom_pull(task_ids='read-bucket') | tojson }}' > list.json; aws s3 cp list.json s3://$(aws ssm get-parameter --name mwaa_destination_bucket --with-decryption --query 'Parameter.Value' --output text)",
    ],
    dag=dag,
)

read_bucket >> write_bucket

The DAG is defined to run on an hourly schedule, with two tasks: read_bucket, with service account source-bucket-reader-sa, and write_bucket, with service account destination-bucket-writer-sa, running one after the other. Both use the EksPodOperator, which is responsible for scheduling the tasks on Amazon EKS, using the AWS CLI Docker image to run commands. The first task lists files in the source bucket and writes the list to Airflow XCom. The second task reads the list from XCom and stores it in the destination bucket. Note that the service_account_name parameter differentiates what each task is permitted to do.

Create a file dag.tf to upload the DAG code to the Amazon MWAA configuration bucket:

locals {
  dag_filename = "dag.py"
}

resource "aws_s3_object" "dag" {
  bucket = module.mwaa.aws_s3_bucket_name
  key    = "dags/${local.dag_filename}"
  source = local.dag_filename

  etag = filemd5(local.dag_filename)
}

Deploy the changes:
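terraform apply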

The Amazon MWAA environment automatically imports the file from the S3 bucket.

Run the DAG

In your browser, navigate to the Amazon MWAA console and select your environment. In the top right-hand corner, select Open Airflow UI. You should see the following:

Screenshot of the MWAA user interface

To trigger the DAG, in the Actions column, select the play symbol and then select Trigger DAG. Click on the DAG name to explore the DAG run and its results.

Navigate to the Amazon S3 console and choose the bucket whose name starts with “destination”. It should contain a file list.json recently created by the write_bucket task. Download the file to explore its content, a JSON list with a single entry.
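If you prefer the AWS CLI, a command along these lines downloads the file and prints it to the terminal; <destination-bucket-name> is a placeholder for your actual bucket name:

# Stream the object created by the write-bucket task to stdout
aws s3 cp s3://<destination-bucket-name>/list.json -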

Clean up

The resources you created in this walkthrough incur AWS costs. To delete the created resources, issue the following command:
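terraform destroy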

Then approve the changes in the Terraform CLI dialog.

Conclusion

In this blog post, you learned how to improve the security of your data pipeline running on Amazon MWAA and Amazon EKS by narrowing the permissions of each individual task.

To dive deeper, use the working example created in this walkthrough to explore the topic further: What happens if you remove the service_account_name parameter from an Airflow task? What happens if you swap the service account names of the two tasks?

For simplicity, in this walkthrough we used a flat file structure, with Terraform and Python files inside a single directory, and did not adhere to the standard module structure proposed by Terraform, which is generally recommended. In a real-life project, splitting the project into multiple Terraform projects or modules may also increase flexibility, speed, and independence between teams owning different parts of the infrastructure.

Finally, make sure to review the Data on EKS documentation, which provides other valuable resources for running your data pipeline on Amazon EKS, as well as the Amazon MWAA and Apache Airflow documentation for implementing your own use cases. In particular, have a look at the sample implementation of a Terraform module for Amazon MWAA and Amazon EKS, which contains a more mature approach to Amazon EKS configuration and node autoscaling, as well as networking.

If you have any questions, you can start a new thread on AWS re:Post or reach out to AWS Support.


About the Authors

Ulrich Hinze is a Solutions Architect at AWS. He partners with software companies to architect and implement cloud-based solutions on AWS. Before joining AWS, he worked for AWS customers and partners in software engineering, consulting, and architecture roles for more than 8 years.

Patrick Oberherr is a Staff Data Engineer at Contentful with more than 4 years of experience with AWS and more than 10 years in the data field. At Contentful he is responsible for the infrastructure and operations of the data stack, which is hosted on AWS.

Johannes Günther is a cloud and data consultant at Netlight with more than 5 years of experience with AWS. He has helped clients across various industries design sustainable cloud platforms and is AWS certified.
