Responsible for designing, implementing, and maintaining data solutions on the Amazon Web Services (AWS) platform. Focuses on building scalable, efficient data pipelines, storage, and processing systems to support data analytics, machine learning, and business intelligence initiatives. Collaborates with cross-functional teams to understand data requirements, architect data solutions, and ensure the availability, reliability, and security of data assets.
Key Deliverables:
1. Implement data ingestion processes to extract data from various sources, including databases, APIs, and streaming platforms.
2. Develop and maintain Extract, Transform, Load (ETL) pipelines using AWS Glue, AWS Lambda, or custom scripts (a minimal Glue job sketch covering items 1 and 2 follows this list).
3. Perform data cleansing activities such as trimming whitespace, normalizing letter case, dropping duplicate records, and renaming columns.
4. Transform data using PySpark, Python, and SQL (a combined cleansing and transformation sketch covering items 3 and 4 follows this list).
5. Configure and optimize storage systems such as Amazon S3, Amazon Redshift, or Amazon DynamoDB to meet performance, scalability, and cost requirements (a lifecycle-rule sketch follows this list).
6. Tune the performance of long-running PySpark and SQL queries (see the tuning sketch after this list).
7. Monitor data pipelines, storage systems, and processing jobs to identify performance bottlenecks, data quality issues, and potential failures (a CloudWatch alarm sketch follows this list).
8. Optimize data processing workflows and infrastructure configurations to improve performance, reliability, and cost efficiency.
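A minimal sketch of items 1 and 2: an AWS Glue PySpark job that ingests a table registered in the Glue Data Catalog and lands it in S3 as partitioned Parquet. The database, table, bucket, and partition-key names are placeholder assumptions; a production job would also add error handling and job bookmarks.

```python
# AWS Glue ETL sketch: catalog table -> partitioned Parquet in S3.
# Database, table, and bucket names below are placeholder assumptions.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a source table registered in the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",      # placeholder catalog database
    table_name="orders",    # placeholder source table
)

# Load: write to S3 as Parquet, partitioned by ingestion date.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": "s3://example-data-lake/curated/orders/",  # placeholder bucket
        "partitionKeys": ["order_date"],
    },
    format="parquet",
)

job.commit()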
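A hedged PySpark example of the cleansing and transformation work in items 3 and 4. The source path and column names are illustrative only; real column subsets for deduplication would come from the target schema.

```python
# PySpark cleansing sketch: trim whitespace, normalize case,
# drop duplicates, and rename columns. Paths/columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-orders").getOrCreate()

df = spark.read.parquet("s3://example-data-lake/curated/orders/")  # placeholder path

cleaned = (
    df
    # Remove leading/trailing spaces from a string column.
    .withColumn("customer_name", F.trim(F.col("customer_name")))
    # Normalize letter case for consistent joins and grouping.
    .withColumn("country_code", F.upper(F.col("country_code")))
    # Drop duplicate rows, keyed on the business identifier.
    .dropDuplicates(["order_id"])
    # Rename a column to match the target schema.
    .withColumnRenamed("cust_id", "customer_id")
)

# Example SQL-based transformation on the cleansed data.
cleaned.createOrReplaceTempView("orders")
daily_totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total_amount "
    "FROM orders GROUP BY order_date"
)
```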
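For the storage cost optimization in item 5, one common control is an S3 lifecycle rule that tiers aging objects to cheaper storage classes. The boto3 sketch below assumes a bucket name, prefix, and day thresholds; the right values depend on the data's access patterns.

```python
# boto3 sketch: lifecycle rule that tiers older objects to cheaper storage.
# Bucket name, prefix, and day thresholds are placeholder assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    # Move to Infrequent Access after 30 days,
                    # then to Glacier after 180 days.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```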
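The tuning work in item 6 often comes down to join strategy and partitioning. The sketch below shows two standard PySpark levers, broadcast joins and explicit repartitioning, with the table paths and partition count assumed for illustration.

```python
# PySpark tuning sketch: broadcast a small dimension table so the join
# avoids a shuffle, and repartition before a wide aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

facts = spark.read.parquet("s3://example-data-lake/curated/orders/")    # large table (placeholder)
dims = spark.read.parquet("s3://example-data-lake/curated/countries/")  # small table (placeholder)

# Broadcast the small table: each executor gets a full copy,
# so the join runs without shuffling the large table.
joined = facts.join(F.broadcast(dims), on="country_code")

# Repartition on the grouping key to spread the aggregation evenly.
result = (
    joined.repartition(200, "country_code")
    .groupBy("country_code")
    .agg(F.sum("amount").alias("total_amount"))
)

# Inspect the physical plan to confirm the broadcast join was chosen.
result.explain()
```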
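For the monitoring in item 7, a typical pattern is a CloudWatch alarm on pipeline failures. The boto3 sketch below alarms on errors from an ingestion Lambda; the function name, alarm name, and SNS topic ARN are placeholder assumptions.

```python
# boto3 sketch: CloudWatch alarm that fires when an ingestion Lambda
# reports any errors. Function name and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="ingest-orders-errors",  # placeholder alarm name
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "ingest-orders"}],  # placeholder function
    Statistic="Sum",
    Period=300,            # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-alerts"],  # placeholder topic
)
```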