Automating S3 Data Ingestion with GitHub Actions and Terraform: A Practical Guide

Mike Vincent
3 min read · Aug 12, 2024


Ready to streamline your data workflow? Let’s set up an automated system to create an S3 bucket and upload CSV data using GitHub Actions and Terraform. This guide will walk you through each step, making your data ingestion process smoother and more efficient.

What You’ll Need

Before we start, make sure you have:

  1. An AWS account
  2. A GitHub account
  3. Basic knowledge of Python, YAML, and Terraform

Step 1: Setting Up AWS Credentials

First, let’s create AWS credentials:

  1. Log into your AWS Management Console
  2. Go to IAM (Identity and Access Management)
  3. Click “Users” in the sidebar, then “Add user”
  4. Set a username (e.g., “github-actions-user”)
  5. Select “Programmatic access” (in the current IAM console, create an access key from the user’s “Security credentials” tab after the user exists)
  6. For permissions, attach the “AmazonS3FullAccess” policy; Terraform calls the S3 API directly, so CloudFormation permissions are not needed
  7. Complete the user creation process
  8. Save the Access Key ID and Secret Access Key securely
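
Before adding these keys to GitHub, it is worth sanity-checking them locally. Here is a minimal sketch with boto3, assuming you have exported the keys as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables (the file name is just a suggestion):

# check_credentials.py (hypothetical helper, not part of the workflow)
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
sts = boto3.client("sts")

# get_caller_identity succeeds with any valid credentials, regardless of permissions
identity = sts.get_caller_identity()
print(f"Authenticated as: {identity['Arn']}")

If this prints the new user’s ARN, the keys are good to go.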

Step 2: Adding Secrets to GitHub

Now, let’s add these credentials to your GitHub repository:

  1. Go to your GitHub repository
  2. Click “Settings” > “Secrets and variables” > “Actions” > “New repository secret”
  3. Add two secrets:
  • Name: AWS_ACCESS_KEY_ID, Value: Your AWS Access Key ID
  • Name: AWS_SECRET_ACCESS_KEY, Value: Your AWS Secret Access Key

Step 3: Creating the Terraform Configuration

Create a file named main.tf in your repository:

provider "aws" {
region = "us-west-2"
}
resource "aws_s3_bucket" "data_bucket" {
bucket = "my-data-bucket-${random_string.bucket_suffix.result}"
acl = "private"
versioning {
enabled = true
}
tags = {
Name = "Data Ingestion Bucket"
Environment = "Dev"
}
}
resource "random_string" "bucket_suffix" {
length = 8
special = false
upper = false
}
output "bucket_name" {
value = aws_s3_bucket.data_bucket.id
}

This Terraform configuration creates a versioned, private-by-default S3 bucket; the random suffix keeps the bucket name globally unique.
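
Once terraform apply has run, you can confirm versioning is enabled with a quick boto3 check. This is a sketch only; the bucket name below is a placeholder for whatever the bucket_name output gives you:

# verify_bucket.py (hypothetical helper, not part of the workflow)
import boto3

bucket_name = "my-data-bucket-abc12345"  # placeholder: use the bucket_name output value

s3 = boto3.client("s3")

# get_bucket_versioning returns {"Status": "Enabled"} once versioning is on
response = s3.get_bucket_versioning(Bucket=bucket_name)
print(f"Versioning status: {response.get('Status', 'Not enabled')}")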

Step 4: Writing the Python Upload Script

Create a file named upload_csv.py:

import boto3
import os
import sys

def upload_to_s3(file_name, bucket, object_name=None):
    if object_name is None:
        object_name = os.path.basename(file_name)
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
        print(f"{file_name} successfully uploaded to {bucket}")
        return True
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python upload_csv.py <bucket_name>")
        sys.exit(1)
    bucket_name = sys.argv[1]
    file_name = "data.csv"  # Ensure this file exists in your repository
    if upload_to_s3(file_name, bucket_name):
        print("Data upload complete")
    else:
        print("Data upload failed")
        sys.exit(1)

This script uploads your CSV file to the S3 bucket.
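
To double-check that an upload actually landed, you can list the bucket’s contents with the same credentials. A small sketch (the bucket name is a placeholder):

# list_uploads.py (hypothetical helper, not part of the workflow)
import boto3

bucket_name = "my-data-bucket-abc12345"  # placeholder: use your bucket's name

s3 = boto3.client("s3")

# list_objects_v2 returns up to 1,000 keys per call, plenty for this check
response = s3.list_objects_v2(Bucket=bucket_name)
for obj in response.get("Contents", []):
    print(f"{obj['Key']} ({obj['Size']} bytes)")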

Step 5: Setting Up GitHub Actions

Create a file at .github/workflows/ingest_data.yml:

name: S3 Data Ingestion

on:
  push:
    branches: [ main ]
  workflow_dispatch:

jobs:
  ingest_data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          # Disable the wrapper so `terraform output` below returns clean stdout
          terraform_wrapper: false

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve

      - name: Get Bucket Name
        id: get_bucket
        run: echo "bucket_name=$(terraform output -raw bucket_name)" >> "$GITHUB_OUTPUT"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install boto3

      - name: Run upload script
        run: python upload_csv.py ${{ steps.get_bucket.outputs.bucket_name }}

This workflow automates creating the S3 bucket and uploading the data. One caveat: the workflow keeps no remote Terraform state, so every run starts from an empty state and will try to create a fresh bucket. For repeated runs, configure a remote backend (for example, an S3 backend) so Terraform can track the bucket it already created.

Step 6: Adding Sample Data

Create a file named data.csv in your repository with some sample data:

id,name,value
1,Item 1,100
2,Item 2,200
3,Item 3,300
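
If you would rather generate the file than type it by hand, Python’s standard csv module can write the same rows. This is purely a convenience script, not part of the workflow:

# make_sample_data.py (optional convenience script)
import csv

rows = [
    {"id": 1, "name": "Item 1", "value": 100},
    {"id": 2, "name": "Item 2", "value": 200},
    {"id": 3, "name": "Item 3", "value": 300},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "value"])
    writer.writeheader()
    writer.writerows(rows)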

Step 7: Commit and Push

Commit all these files to your repository and push to the main branch:

git add .
git commit -m "Set up automated S3 data ingestion"
git push origin main

What Happens Next

  1. When you push to the main branch, GitHub Actions starts automatically
  2. Terraform creates or updates the S3 bucket
  3. The Python script uploads your data.csv file to the new bucket

Keeping Things Secure

  • AWS credentials are stored as GitHub Secrets for safety
  • The S3 bucket is set to private by default
  • Consider using AWS IAM roles for even better security
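
If you do move to IAM roles, the upload script can trade its long-lived keys for temporary credentials at runtime. Here is a minimal sketch using STS, assuming you have created a role; the role ARN and bucket name below are placeholders:

# assume_role_upload.py (sketch only; ARN and bucket name are placeholders)
import boto3

sts = boto3.client("sts")

# Exchange the current credentials for short-lived ones tied to the role
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/github-actions-s3-role",
    RoleSessionName="s3-data-ingestion",
)
creds = assumed["Credentials"]

# Build an S3 client from the temporary credentials
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("data.csv", "my-data-bucket-abc12345", "data.csv")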

You’ve now set up an automated process for S3 data ingestion. This setup can save time and reduce errors in your data workflow. Keep exploring and adapting this process to fit your specific needs!

“That’s all Folks!”


Written by Mike Vincent

Mike Vincent is an American software engineer and writer based in Los Angeles. He writes about tech leadership and holds degrees in Linguistics and Management.
