AgriData Pipeline
Senior Data Engineer
2023
An ETL pipeline that processes terabytes of multi-spectral satellite imagery to monitor crop health, predict yields, and alert smallholder farmers to potential pest infestations before they become visible to the naked eye.
Context
Smallholder farmers in Kenya lose up to 40% of their yield to pests and diseases. Early detection is crucial, but satellite data is noisy, expensive, and hard to process at scale.
The Engineering
My role was to make the data pipeline reliable and cost-effective.
- Orchestration: Used Airflow DAGs to coordinate the retrieval of imagery from Planet Labs.
- Processing: Normalized satellite bands and derived vegetation indices (NDVI, EVI) using serverless Cloud Functions, which absorb burst loads during satellite passes without paying for idle compute.
- Data Warehousing: Structured the geospatial data in BigQuery so agronomy researchers can run SQL queries directly over map data.
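-- Sample BQ query logic
-- Aggregate per farm first: GEOGRAPHY columns cannot appear in GROUP BY.
-- Assumes each farm_id maps to a single geometry.
WITH farm_health AS (
  SELECT farm_id, ANY_VALUE(geometry) AS geometry, AVG(ndvi_mean) AS health_index
  FROM `agri_data.satellite_reads`
  WHERE read_date BETWEEN '2023-01-01' AND '2023-01-31'
  GROUP BY farm_id
)
-- Cluster farms lying within 50 m of each other (at least 2 per cluster).
SELECT farm_id, ST_CLUSTERDBSCAN(geometry, 50, 2) OVER () AS cluster_id, health_index
FROM farm_health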
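As a rough illustration of the Processing step, the sketch below shows how NDVI and EVI are commonly computed from per-pixel reflectance using the standard formulas. It is not the production code: the `Reflectance` type, the `scale_band` helper, and the 1e-4 scale factor for converting raw integer pixel values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflectance:
    """Surface reflectance for one pixel, each band scaled to 0..1 (hypothetical type)."""
    blue: float
    red: float
    nir: float


def scale_band(digital_number: int, scale_factor: float = 1e-4) -> float:
    """Convert a raw integer pixel value to reflectance.

    The 1e-4 default is an assumption; check the imagery product's metadata.
    """
    return digital_number * scale_factor


def ndvi(px: Reflectance) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = px.nir + px.red
    # Guard against empty or masked pixels where both bands are zero.
    return 0.0 if denom == 0 else (px.nir - px.red) / denom


def evi(px: Reflectance) -> float:
    """Enhanced Vegetation Index with the standard coefficients (G=2.5, C1=6, C2=7.5, L=1)."""
    denom = px.nir + 6.0 * px.red - 7.5 * px.blue + 1.0
    return 0.0 if denom == 0 else 2.5 * (px.nir - px.red) / denom


if __name__ == "__main__":
    # Raw values for one pixel of dense vegetation (made-up numbers).
    pixel = Reflectance(
        blue=scale_band(500),
        red=scale_band(1000),
        nir=scale_band(5000),
    )
    print(f"NDVI={ndvi(pixel):.3f} EVI={evi(pixel):.3f}")
```

Because each pixel is computed independently, a function like this is stateless and can run per image tile, which is what makes the step a natural fit for serverless execution.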
Result
The pipeline reduced data latency from 2 weeks to 48 hours, so interventions can happen within days of a satellite pass rather than weeks later. It currently monitors over 14,000 hectares of maize and coffee farms.
Tech Stack
- Apache Airflow
- Google Cloud Platform
- BigQuery
- Planet Labs API
- Terraform