Automating S3 Data Ingestion with GitHub Actions and Terraform: A Practical Guide
Ready to streamline your data workflow? Let’s set up an automated system to create an S3 bucket and upload CSV data using GitHub Actions and Terraform. This guide will walk you through each step, making your data ingestion process smoother and more efficient.
What You’ll Need
Before we start, make sure you have:
- An AWS account
- A GitHub account
- Basic knowledge of Python, YAML, and Terraform
Step 1: Setting Up AWS Credentials
First, let’s create AWS credentials:
- Log into your AWS Management Console
- Go to IAM (Identity and Access Management)
- Click “Users” in the sidebar, then “Add user”
- Set a username (e.g., “github-actions-user”)
- Skip console access; this user only needs programmatic access via an access key
- For permissions, attach the “AmazonS3FullAccess” policy (Terraform calls the S3 API directly, so CloudFormation permissions are not required)
- Complete the user creation process
- Open the new user’s “Security credentials” tab, create an access key, and save the Access Key ID and Secret Access Key securely (you can verify the key with the quick check below)
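A minimal verification sketch, assuming boto3 is installed and the new key is exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your shell:
import boto3

# Sanity check: confirm the new access key authenticates before storing it in GitHub.
# Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are exported in your shell.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])
If this prints the IAM user’s ARN, the key works and is ready to add to GitHub in the next step.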
Step 2: Adding Secrets to GitHub
Now, let’s add these credentials to your GitHub repository:
- Go to your GitHub repository
- Click “Settings” > “Secrets and variables” > “Actions” > “New repository secret”
- Add two secrets:
- Name: AWS_ACCESS_KEY_ID, Value: Your AWS Access Key ID
- Name: AWS_SECRET_ACCESS_KEY, Value: Your AWS Secret Access Key
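These names matter: in Step 5, the aws-actions/configure-aws-credentials action turns the secrets into environment variables, and boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment automatically, so the upload script never handles credentials itself. If you ever need to debug a credentials problem, a throwaway snippet like this (not something to commit) shows what boto3 is looking for:
import os

# Illustration only: boto3 resolves credentials from these environment variables,
# which the workflow's credentials step populates from the repository secrets.
for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"):
    print(name, "is set" if os.environ.get(name) else "is NOT set")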
Step 3: Creating the Terraform Configuration
Create a file named main.tf in your repository:
provider "aws" {
  region = "us-west-2"
}

resource "random_string" "bucket_suffix" {
  length  = 8
  special = false
  upper   = false
}

resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-data-bucket-${random_string.bucket_suffix.result}"

  tags = {
    Name        = "Data Ingestion Bucket"
    Environment = "Dev"
  }
}

# New S3 buckets are private by default; with AWS provider v4 and later,
# versioning is configured through its own resource.
resource "aws_s3_bucket_versioning" "data_bucket" {
  bucket = aws_s3_bucket.data_bucket.id

  versioning_configuration {
    status = "Enabled"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.data_bucket.id
}
This Terraform configuration creates a private, versioned S3 bucket with a randomized name suffix and exposes the bucket name as an output for the workflow to read.
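After an apply, you can double-check the result from Python if you like. This is an optional sketch; the bucket name below is a placeholder, so substitute the value printed by terraform output:
import boto3

# Optional check: confirm the bucket Terraform created has versioning enabled.
# Replace the placeholder with the real name from `terraform output -raw bucket_name`.
bucket = "my-data-bucket-abc12345"  # placeholder
s3 = boto3.client("s3")

versioning = s3.get_bucket_versioning(Bucket=bucket)
print("Versioning:", versioning.get("Status", "Disabled"))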
Step 4: Writing the Python Upload Script
Create a file named upload_csv.py:
import boto3
import os
import sys

def upload_to_s3(file_name, bucket, object_name=None):
    if object_name is None:
        object_name = os.path.basename(file_name)

    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
        print(f"{file_name} successfully uploaded to {bucket}")
        return True
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python upload_csv.py <bucket_name>")
        sys.exit(1)

    bucket_name = sys.argv[1]
    file_name = "data.csv"  # Ensure this file exists in your repository

    if upload_to_s3(file_name, bucket_name):
        print("Data upload complete")
    else:
        print("Data upload failed")
        sys.exit(1)
This script uploads your CSV file to the S3 bucket.
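If you want to try the script locally before wiring it into the workflow, run python upload_csv.py <bucket_name> with your AWS credentials configured, then confirm the object landed. The snippet below is a rough check; the bucket name is a placeholder:
import boto3

# Quick follow-up check: confirm data.csv now exists in the bucket.
# Replace the placeholder with your real bucket name.
s3 = boto3.client("s3")
response = s3.head_object(Bucket="my-data-bucket-abc12345", Key="data.csv")
print(f"data.csv found ({response['ContentLength']} bytes)")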
Step 5: Setting Up GitHub Actions
Create a file at .github/workflows/ingest_data.yml:
name: S3 Data Ingestion

on:
  push:
    branches: [ main ]
  workflow_dispatch:

jobs:
  ingest_data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          # Disable the wrapper so `terraform output` can be captured cleanly below
          terraform_wrapper: false

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve

      - name: Get Bucket Name
        id: get_bucket
        # ::set-output is deprecated; write to $GITHUB_OUTPUT instead
        run: echo "bucket_name=$(terraform output -raw bucket_name)" >> "$GITHUB_OUTPUT"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install boto3

      - name: Run upload script
        run: python upload_csv.py ${{ steps.get_bucket.outputs.bucket_name }}
This workflow automates the process of creating the S3 bucket and uploading data.
Step 6: Adding Sample Data
Create a file named data.csv in your repository with some sample data:
id,name,value
1,Item 1,100
2,Item 2,200
3,Item 3,300
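You can type this file by hand, or generate it with a short script. The sketch below is optional and simply reproduces the same three rows:
import csv

# Optional: write data.csv programmatically instead of by hand.
rows = [
    {"id": 1, "name": "Item 1", "value": 100},
    {"id": 2, "name": "Item 2", "value": 200},
    {"id": 3, "name": "Item 3", "value": 300},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "value"])
    writer.writeheader()
    writer.writerows(rows)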
Step 7: Commit and Push
Commit all these files to your repository and push to the main branch:
git add .
git commit -m "Set up automated S3 data ingestion"
git push origin main
What Happens Next
- When you push to the main branch, GitHub Actions starts automatically
- Terraform creates or updates the S3 bucket
- The Python script uploads your data.csv file to the new bucket
Keeping Things Secure
- AWS credentials are stored as GitHub Secrets for safety
- The S3 bucket is set to private by default
- Consider replacing long-lived access keys with an IAM role assumed through GitHub’s OIDC integration for even better security
You’ve now set up an automated process for S3 data ingestion. This setup can save time and reduce errors in your data workflow. Keep exploring and adapting this process to fit your specific needs!