Automating S3 Data Ingestion with GitHub Actions and Terraform: A Practical Guide

Mike Vincent
3 min read · Aug 12, 2024


Ready to streamline your data workflow? Let’s set up an automated system to create an S3 bucket and upload CSV data using GitHub Actions and Terraform. This guide will walk you through each step, making your data ingestion process smoother and more efficient.

What You’ll Need

Before we start, make sure you have:

  1. An AWS account
  2. A GitHub account
  3. Basic knowledge of Python, YAML, and Terraform

Step 1: Setting Up AWS Credentials

First, let’s create AWS credentials:

  1. Log into your AWS Management Console
  2. Go to IAM (Identity and Access Management)
  3. Click “Users” in the sidebar, then “Add user”
  4. Set a username (e.g., “github-actions-user”)
  5. Select “Programmatic access” (in the current IAM console, create an access key from the user’s “Security credentials” tab after the user exists)
  6. For permissions, attach the “AmazonS3FullAccess” policy; Terraform calls the S3 API directly, so CloudFormation permissions are not needed
  7. Complete the user creation process
  8. Save the Access Key ID and Secret Access Key securely
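
Before adding these keys to GitHub, it is worth sanity-checking them locally. Here is a minimal sketch with boto3, assuming you have exported the keys as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables (the file name is just a suggestion):

# check_credentials.py (hypothetical helper, not part of the workflow)
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
sts = boto3.client("sts")

# get_caller_identity succeeds with any valid credentials, regardless of permissions
identity = sts.get_caller_identity()
print(f"Authenticated as: {identity['Arn']}")

If this prints the new user’s ARN, the keys are good to go.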

Step 2: Adding Secrets to GitHub

Now, let’s add these credentials to your GitHub repository:

  1. Go to your GitHub repository
  2. Click “Settings” > “Secrets and variables” > “Actions” > “New repository secret”
  3. Add two secrets:
  • Name: AWS_ACCESS_KEY_ID, Value: Your AWS Access Key ID
  • Name: AWS_SECRET_ACCESS_KEY, Value: Your AWS Secret Access Key

Step 3: Creating the Terraform Configuration

Create a file named main.tf in your repository:

provider "aws" {
region = "us-west-2"
}
resource "aws_s3_bucket" "data_bucket" {
bucket = "my-data-bucket-${random_string.bucket_suffix.result}"
acl = "private"
versioning {
enabled = true
}
tags = {
Name = "Data Ingestion Bucket"
Environment = "Dev"
}
}
resource "random_string" "bucket_suffix" {
length = 8
special = false
upper = false
}
output "bucket_name" {
value = aws_s3_bucket.data_bucket.id
}

This Terraform configuration creates a versioned, private-by-default S3 bucket; the random suffix keeps the bucket name globally unique.
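
Once terraform apply has run, you can confirm versioning is enabled with a quick boto3 check. This is a sketch only; the bucket name below is a placeholder for whatever the bucket_name output gives you:

# verify_bucket.py (hypothetical helper, not part of the workflow)
import boto3

bucket_name = "my-data-bucket-abc12345"  # placeholder: use the bucket_name output value

s3 = boto3.client("s3")

# get_bucket_versioning returns {"Status": "Enabled"} once versioning is on
response = s3.get_bucket_versioning(Bucket=bucket_name)
print(f"Versioning status: {response.get('Status', 'Not enabled')}")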

Step 4: Writing the Python Upload Script

Create a file named upload_csv.py:

import boto3
import os
import sys

def upload_to_s3(file_name, bucket, object_name=None):
    if object_name is None:
        object_name = os.path.basename(file_name)
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
        print(f"{file_name} successfully uploaded to {bucket}")
        return True
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python upload_csv.py <bucket_name>")
        sys.exit(1)
    bucket_name = sys.argv[1]
    file_name = "data.csv"  # Ensure this file exists in your repository
    if upload_to_s3(file_name, bucket_name):
        print("Data upload complete")
    else:
        print("Data upload failed")
        sys.exit(1)

This script uploads your CSV file to the S3 bucket.
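
To double-check that an upload actually landed, you can list the bucket’s contents with the same credentials. A small sketch (the bucket name is a placeholder):

# list_uploads.py (hypothetical helper, not part of the workflow)
import boto3

bucket_name = "my-data-bucket-abc12345"  # placeholder: use your bucket's name

s3 = boto3.client("s3")

# list_objects_v2 returns up to 1,000 keys per call, plenty for this check
response = s3.list_objects_v2(Bucket=bucket_name)
for obj in response.get("Contents", []):
    print(f"{obj['Key']} ({obj['Size']} bytes)")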

Step 5: Setting Up GitHub Actions

Create a file at .github/workflows/ingest_data.yml:

name: S3 Data Ingestion

on:
  push:
    branches: [ main ]
  workflow_dispatch:

jobs:
  ingest_data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          # Disable the wrapper so `terraform output` below returns clean stdout
          terraform_wrapper: false

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve

      - name: Get Bucket Name
        id: get_bucket
        run: echo "bucket_name=$(terraform output -raw bucket_name)" >> "$GITHUB_OUTPUT"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install boto3

      - name: Run upload script
        run: python upload_csv.py ${{ steps.get_bucket.outputs.bucket_name }}

This workflow automates creating the S3 bucket and uploading the data. One caveat: the workflow keeps no remote Terraform state, so every run starts from an empty state and will try to create a fresh bucket. For repeated runs, configure a remote backend (for example, an S3 backend) so Terraform can track the bucket it already created.

Step 6: Adding Sample Data

Create a file named data.csv in your repository with some sample data:

id,name,value
1,Item 1,100
2,Item 2,200
3,Item 3,300
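
If you would rather generate the file than type it by hand, Python’s standard csv module can write the same rows. This is purely a convenience script, not part of the workflow:

# make_sample_data.py (optional convenience script)
import csv

rows = [
    {"id": 1, "name": "Item 1", "value": 100},
    {"id": 2, "name": "Item 2", "value": 200},
    {"id": 3, "name": "Item 3", "value": 300},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "value"])
    writer.writeheader()
    writer.writerows(rows)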

Step 7: Commit and Push

Commit all these files to your repository and push to the main branch:

git add .
git commit -m "Set up automated S3 data ingestion"
git push origin main

What Happens Next

  1. When you push to the main branch, GitHub Actions starts automatically
  2. Terraform creates or updates the S3 bucket
  3. The Python script uploads your data.csv file to the new bucket

Keeping Things Secure

  • AWS credentials are stored as GitHub Secrets for safety
  • The S3 bucket is set to private by default
  • Consider using AWS IAM roles for even better security
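
If you do move to IAM roles, the upload script can trade its long-lived keys for temporary credentials at runtime. Here is a minimal sketch using STS, assuming you have created a role; the role ARN and bucket name below are placeholders:

# assume_role_upload.py (sketch only; ARN and bucket name are placeholders)
import boto3

sts = boto3.client("sts")

# Exchange the current credentials for short-lived ones tied to the role
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/github-actions-s3-role",
    RoleSessionName="s3-data-ingestion",
)
creds = assumed["Credentials"]

# Build an S3 client from the temporary credentials
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("data.csv", "my-data-bucket-abc12345", "data.csv")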

You’ve now set up an automated process for S3 data ingestion. This setup can save time and reduce errors in your data workflow. Keep exploring and adapting this process to fit your specific needs!

“That’s all Folks!”


Written by Mike Vincent

Mike Vincent is an American software engineer and writer based in Los Angeles. He writes about tech leadership and holds degrees in Linguistics and Management.
