Responsibilities
- Design, develop, and maintain data pipelines using Apache Spark to efficiently process and transform large volumes of data.
- Collaborate with data architects and other stakeholders to define data architecture and best practices.
- Ensure data models and structures align with business requirements and are scalable for future needs.
- Build and maintain real-time data processing and streaming pipelines using Spark Streaming.
- Optimize Spark jobs and Java code for performance, scalability, and resource utilization.
- Monitor and troubleshoot data pipeline issues to ensure minimal downtime and maximum efficiency.
- Implement data quality checks, data validation, and error handling mechanisms to maintain data integrity.
- Ensure compliance with data governance and security policies.
- Document data engineering processes, data flows, and configurations for future reference.
- Collaborate with data scientists, analysts, and business stakeholders to understand data requirements and deliver solutions that meet their needs.
- Set up monitoring and alerting systems to proactively identify and address data pipeline issues.
- Perform routine maintenance tasks and keep software and systems up to date.
Requirements
- Bachelor's degree or higher in Computer Science, Information Technology, or a related field.
- Proficiency in Java for software development.
- Extensive experience with Apache Spark, including Spark SQL and Spark Streaming.
- Proficiency in big data technologies and frameworks such as Hadoop and HDFS.
- Knowledge of data warehousing concepts and technologies.
- Experience with database systems (SQL and NoSQL).
- Strong problem-solving skills and the ability to work in a collaborative, team-oriented environment.
- Excellent communication and documentation skills.
- Understanding of data security, privacy, and compliance best practices.