DATA COLLECTION USING AWS GLUE AND AMAZON ATHENA (A sports data lake)

Challenge of the day: Build a data collection system that automates building a data lake for NBA analytics.

The NBA Sports Data Lake Analytics project is a cloud-native solution that builds a scalable data lake for NBA analytics. It uses AWS services to automate data ingestion, cataloging, and querying, so NBA-related data can be stored and analyzed efficiently.

AWS SERVICES

  1. Amazon Simple Storage Service (S3):

    S3 is an object storage service that stores data in buckets. There are different bucket types to accommodate different needs. In this case, we will be using a general purpose bucket.

  2. Amazon Athena:

    Amazon Athena is a serverless query service that enables you to analyze data directly from Amazon S3 using standard SQL. It is easy to set up and use, eliminating the need to manage clusters or servers, making it ideal for data analysts and business users.

  3. AWS Glue:

    AWS Glue is a serverless data integration service that simplifies the process of preparing and combining data for analytics, machine learning, and application development. It enables you to create and run ETL (Extract, Transform, Load) jobs to move data between various sources, transform it as needed, and load it into your data warehouse or data lake.

WORKFLOW:

  1. Data is fetched from a website, i.e. sports.io, through an API call made directly to the site.

  2. The raw data is stored in an S3 bucket.

  3. AWS Glue is used to create a database and a schema for the data.

  4. Amazon Athena is used to query the data for analytics.
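Steps 1 and 2 can be sketched in Python roughly as follows. This is a minimal sketch and not the project's actual code: the function names, the URL, the headers, the bucket name, and the object key are all hypothetical placeholders, and the API is assumed to return a JSON list of objects. The records are written one JSON object per line because Athena's JSON SerDes expect each record on a single line.

```python
import json
import urllib.request


def fetch_records(url, headers=None):
    """Call the API and return the decoded JSON payload.

    Assumption: the endpoint returns a JSON list of objects.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record) for record in records)


def upload_raw_data(bucket, key, body):
    """Store the raw payload in S3 (needs boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without the AWS SDK

    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
```

A run would chain the three helpers, for example `upload_raw_data("my-sports-data-lake", "raw-data/nba_players.jsonl", to_json_lines(fetch_records(url, headers)))`, where the bucket, key, `url`, and `headers` are placeholders you supply yourself.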

These services make analytics easier because Athena queries the data in place, so it never has to leave the S3 bucket. I was particularly excited to use AWS Glue and Athena for the first time and to experience ETL job processing first-hand.
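To illustrate how Athena can query data that stays in S3, here is a minimal Python sketch of registering a table in the Glue Data Catalog and starting a query with boto3. The function names, database and table names, column names, and S3 paths are hypothetical, and the raw data is assumed to be newline-delimited JSON. The table definition only points at the S3 location; no data is copied.

```python
def build_table_input(table_name, s3_location, columns):
    """Build the TableInput dict for glue.create_table.

    columns is a list of (name, type) pairs, e.g. [("playerid", "int")].
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": type_} for name, type_ in columns],
            "Location": s3_location,  # data stays here; Glue stores only metadata
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


def create_catalog(database, table_input):
    """Create the Glue database and table (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_table_input works without the AWS SDK

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database})
    glue.create_table(DatabaseName=database, TableInput=table_input)


def start_query(database, sql, output_location):
    """Start an Athena query and return its execution ID."""
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

With a table built from, say, `build_table_input("nba_players", "s3://my-sports-data-lake/raw-data/", [("playerid", "int"), ("team", "string")])`, a query such as `SELECT team, COUNT(*) FROM nba_players GROUP BY team` could then be passed to `start_query`. Athena writes the results to the output location you give it, which is also an S3 path.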

SKILLS LEARNED:

  1. Integrating serverless query and data integration services so that they work together seamlessly.

  2. Using AWS CloudShell to create services from a script instead of creating them manually.

  3. Integrating external APIs into cloud-based workflows.

Here is the link to my GitHub repo for this project:

https://github.com/mbengiivy/Sports-Data-Lake