About The Candidate
Essential Skills
Candidates should possess experience and/or knowledge of the following:
Data Engineering Fundamentals
- Advanced Python
- Python libraries (Pandas, NumPy)
- Strong SQL
- Bash / Shell scripting
Current Data Ecosystem
- Experience with big data and distributed processing technologies:
  - Apache Hadoop
  - Apache Hive
  - Apache Kafka
  - Apache Spark
- Understanding of batch and streaming data processing
- Exposure to data pipeline design and data workflows
Modern Data Engineering Practices
- Building and maintaining data pipelines
- Understanding of data modelling and scalable data systems
- Awareness of data quality, validation, and governance
AI-Ready Data Foundations
- Understanding of how data platforms support analytics and machine learning workloads
- Experience preparing datasets for ML/AI use cases
- Awareness of how data quality impacts model performance
Desirable Skills
Core Technologies
- MS Excel
- ETL / Data Warehousing
- Linux / Linux Administration
- Basic Networking and IT Security
- Experience with ServiceNow, Jira, and Confluence
Cloud & Platforms
- Experience with Snowflake, AWS, Microsoft Azure, or Google Cloud Platform
- Familiarity with cloud-native data services
AI & Emerging Capabilities (Highly Desirable)
Modern & Emerging Architectures
- Streaming and event-driven architectures
- Feature engineering and feature store concepts
Generative AI & Advanced Data Use Cases
- Awareness of:
  - The data lifecycle in AI applications
  - Vector databases and embeddings
  - Retrieval-Augmented Generation (RAG)
- Working with unstructured data
Governance & Responsible AI
- Understanding of data privacy, security, and ethical considerations
Coaching & Training Capability (Critical)
- Ability to clearly explain complex technical concepts
- Strong passion for teaching and mentoring
- Ability to translate industry trends into practical training content
- Strong communication and stakeholder management skills
Additional Information
The role currently includes training across the Hadoop ecosystem (Hadoop, Hive, Kafka, Spark), with a growing emphasis on cloud-native platforms, real-time processing, and AI-enabled data engineering as industry demand evolves.