
Experienced Data Engineer with over 8 years of expertise in designing and implementing end-to-end data solutions, with a strong foundation in Python development. Skilled in building scalable architectures on cloud platforms, real-time data processing, predictive modelling, and machine learning engineering. Extensive experience in leveraging Python for data manipulation, feature engineering, and creating reusable frameworks to streamline business rules and data pipelines. Adept at leading cross-functional teams, collaborating with C-level executives, and delivering actionable insights through innovative solutions. Proficient in tools and platforms such as AWS, GCP, Apache Airflow, and PySpark, driving operational excellence and impactful results in data-driven environments.
Led the design and maintenance of the company’s data systems, ensuring scalability and efficiency in pipelines capable of processing large volumes of real-time data.
• Developed robust solutions on AWS, integrating services such as SNS, SQS, Lambda, and Athena to optimize message traffic, automate processes, and enable efficient data querying for business intelligence needs (see sketch below).
• Utilized Databricks for advanced data processing and analytics, streamlining workflows and enabling seamless integration with machine learning pipelines.
• Managed data movement and large datasets with tools such as Apache Airflow, PySpark, Apache Kafka, and Debezium.
• Implemented real-time data ingestion and processing strategies, providing a solid foundation for fast, data-driven decision-making.
• Collaborated actively with MLOps and DBA teams, transforming complex data into actionable insights and aligning solutions with business needs.
• Streamlined workflows by integrating GitHub for version control and Nexus for artifact management.
• Designed and optimized scalable data pipelines to meet growing demands, ensuring operational efficiency and reducing latency in critical data flows.
Stack: AWS (SNS, SQS, Lambda, Athena), Apache Airflow, PySpark, EMR, Apache Kafka, Debezium, Spark Streaming, Databricks, GitHub, Nexus.
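A minimal sketch of the kind of SNS/SQS/Lambda-to-Athena ingestion flow described above. The bucket name, key prefix, and payload shape are illustrative assumptions, not values from the actual project.

```python
# Hypothetical Lambda handler: drains SQS-delivered events and lands them in S3
# as partitioned JSON objects that Athena can query. Names are placeholders.
import json
import datetime

import boto3

s3 = boto3.client("s3")

RAW_BUCKET = "example-raw-events"   # placeholder bucket
RAW_PREFIX = "events"               # placeholder key prefix


def handler(event, context):
    """Triggered by SQS; each record body is a JSON event published via SNS."""
    now = datetime.datetime.utcnow()
    partition = f"dt={now:%Y-%m-%d}/hour={now:%H}"

    records = event.get("Records", [])
    for record in records:
        body = json.loads(record["body"])
        # SNS -> SQS subscriptions wrap the original payload in a "Message" field.
        payload = json.loads(body["Message"]) if "Message" in body else body

        key = f"{RAW_PREFIX}/{partition}/{record['messageId']}.json"
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=key,
            Body=json.dumps(payload).encode("utf-8"),
        )

    return {"processed": len(records)}
```

Writing one object per message keeps the sketch simple; a production variant would typically batch records and register the dt/hour partitions with Athena or Glue.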
Designed and implemented the foundation of the GCP environment for the company, adhering to best practices and organizational policies, including IAM roles, billing accounts, project structures, and secure network communication with on-premises systems.
• Developed a multi-cloud architecture combining AWS and Google Cloud Platform (GCP), ensuring seamless integration of critical datasets for machine learning projects.
• Built and maintained a comprehensive feature store, supplying machine learning algorithms with high-quality, production-ready data.
• Designed and deployed a Python-based framework for processing 2 TB of data per day on an on-premises Hive environment, including real-time orchestration and integration with data pipelines.
• Established a research and development environment using Jupyter Notebook, empowering data scientists to prototype and refine machine learning models effectively.
• Managed the end-to-end feature delivery lifecycle, including requirements gathering, data modelling, pipeline implementation, and production deployment.
• Engineered machine learning models using Python and libraries such as LightGBM, deploying solutions on AWS SageMaker and Google Vertex AI.
• Implemented efficient ETL pipelines leveraging BigQuery, AWS S3, Dataflow, and Composer to process diverse behavioural data sources (see sketch below).
• Optimized pipeline execution time, reducing it from 240 minutes to 3 minutes by streamlining workloads and resource allocation.
Stack: GCP (BigQuery, Vertex AI, Dataflow, Composer), AWS (SageMaker, S3, Athena), Hive, Python, Jupyter Notebook, LightGBM.
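A minimal sketch of a Cloud Composer (Airflow) DAG building a behavioural feature table in BigQuery, in the spirit of the ETL pipelines described above. The project, dataset, table, and column names are hypothetical.

```python
# Hypothetical Composer DAG: nightly rebuild of a behavioural feature table
# in BigQuery for the feature store. All identifiers are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

FEATURE_SQL = """
CREATE OR REPLACE TABLE `example-project.features.user_daily_activity` AS
SELECT
  user_id,
  DATE(event_ts)            AS activity_date,
  COUNT(*)                  AS events,
  COUNTIF(event = 'click')  AS clicks
FROM `example-project.raw.behaviour_events`
GROUP BY user_id, activity_date
"""

with DAG(
    dag_id="build_user_features",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_features = BigQueryInsertJobOperator(
        task_id="build_user_features",
        configuration={"query": {"query": FEATURE_SQL, "useLegacySql": False}},
    )
```

Pushing the aggregation into BigQuery itself, rather than pulling raw data into the orchestrator, is the usual way to keep such a job fast and cheap to rerun.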
• Interacted with C-level executives to gather business requirements and translate them into technical specifications for data products.
• Led the development of a Data Warehouse structured with a star schema, hosted on MS SQL Server, to centralize company-wide business intelligence needs.
• Designed and implemented an ETL framework using Python, automating data ingestion and transformation while ensuring data consistency and quality (see sketch below).
• Created dashboards and reports using QlikView, delivering actionable insights and driving decision-making across the organization.
• Developed an internal website using Django and JavaScript to share key performance indicators (KPIs) and reports across teams.
• Engineered an innovative KPI, "Time in Call," which influenced the company’s perception of productivity and improved performance evaluation processes.
• Reduced reporting and analysis lead time by consolidating data from multiple sources into the centralized Data Warehouse, ensuring accuracy and accessibility.
Stack: MS SQL Server, Python, QlikView, Django, JavaScript, REST API.
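A minimal sketch of the kind of Python ETL step described above: extract a source export, derive the "Time in Call" measure, and load a star-schema fact table on MS SQL Server. The connection string, file name, and table/column names are assumptions made for illustration.

```python
# Hypothetical extract/transform/load step into a star-schema warehouse.
# Credentials, files, and table names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

ENGINE = create_engine(
    "mssql+pyodbc://etl_user:***@dwh-server/CallCenterDW"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)


def extract(path: str) -> pd.DataFrame:
    # Raw call log export from the telephony system (placeholder file).
    return pd.read_csv(path, parse_dates=["call_start", "call_end"])


def transform(calls: pd.DataFrame) -> pd.DataFrame:
    # Derive the "Time in Call" measure and the date key used by the star schema.
    calls["time_in_call_sec"] = (calls["call_end"] - calls["call_start"]).dt.total_seconds()
    calls["date_key"] = calls["call_start"].dt.strftime("%Y%m%d").astype(int)
    return calls[["date_key", "agent_id", "queue_id", "time_in_call_sec"]]


def load(fact: pd.DataFrame) -> None:
    # Append to the fact table; dimension lookups are omitted for brevity.
    fact.to_sql("FactCalls", ENGINE, if_exists="append", index=False)


if __name__ == "__main__":
    load(transform(extract("daily_call_log.csv")))
```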