Quickstart - Yield Study
This guide walks you through setting up and running a yield study with our framework, which simulates crop yields across different regions.
Prerequisites
Before starting, ensure you have:
- Completed the setup and installed all requirements
- Generated LAI data for your region of interest
- (Optional) If using CHIRPS: downloaded historical and current CHIRPS precipitation data.
- (Optional) If using ERA5 met data: run `earthengine authenticate` and open the link in your browser to sign in to Earth Engine.
Setup Process
Setting up a yield study involves four main steps:
- Defining your regions of interest
- Configuring your simulation parameters
- Preparing your base directory structure
- Optional: Adding reference (validation) data
The individual steps are detailed on other pages of this living document; this page outlines the quickstart using our setup helper.
If you have already prepared a base directory and a configuration file, you can skip to step 4. Otherwise, follow the steps below. You will typically want to run the setup on your local machine rather than a remote cluster; you can transfer the final setup to the cluster after creation.
1. Defining Regions of Interest
Your yield study will run simulations for each defined region, typically specified in a shapefile.
Important requirements for your shapefile:
- Contains geometries at the same administrative level only (e.g., all geometries are districts OR all counties)
- Includes an attribute column with region names (e.g., `NAME_2` containing "Cook County", "Orleans Parish")
[Experimental] If your shapefile contains mixed administrative levels, use the interactive helper script to normalize it to a single level:
```bash
python apsim/prepare_shapefile.py --shp_fpath /path/to/your.shp --output_dir /path/to/save/dir
```
2. Defining Your Configuration
Create a configuration file that controls simulation parameters:
- Create an empty directory `new_basedir_path` that will be the base directory of your study. We recommend giving it a meaningful name.
- Navigate to `snakemake/example_setup/` and copy one of the example configurations to `new_basedir_path/config_template.yaml`.
- Modify the parameters according to your study needs (years, date ranges, meteorological data sources). For now, leave the regions fields empty; they will be filled in by the setup helper.
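For example, in a shell (the study directory name and the example config filename are placeholders; pick whichever example configuration fits your study):

```bash
# Placeholder names -- substitute your own study directory and example config.
mkdir -p ~/studies/my_yield_study
cp snakemake/example_setup/<example_config>.yaml \
   ~/studies/my_yield_study/config_template.yaml
```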
Note: The script for matching remotely sensed LAI and APSIM-predicted LAI is not publicly available in this repository. You will have to set the path to the true matching script in your config file under `scripts.match_sim_real`.
The full configuration options are documented in the Inputs documentation (Section 6).
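To give a rough sense of what such a file contains, here is a minimal, illustrative sketch. Only `sim_study_head_dir`, `regions`, `scripts.match_sim_real`, and `eval_params.aggregation_levels` are keys referenced elsewhere in this guide; the remaining key names and all values are placeholders, so start from the copied example configuration rather than from this sketch:

```yaml
# Illustrative sketch only -- start from an example config in
# snakemake/example_setup/ and consult the Inputs documentation (Section 6).
sim_study_head_dir: /path/to/new_basedir_path   # referenced again in step 5
regions: []                  # leave empty; filled in by the setup helper
years: [2023, 2024]          # placeholder key name for the simulation years
scripts:
  match_sim_real: /path/to/true/matching_script.py  # not public; see note above
eval_params:
  aggregation_levels:        # used for validation (see step 4)
    State: ADMIN_1
```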
3. Setting Up Your Base Directory
The base directory (your study head dir) organizes region-specific geometries and APSIM simulations by year, timepoint, and region (see Details).
Use the provided Jupyter notebook (`vercye_setup_helper.ipynb`) to create this structure: just set the parameters below in the first cell and run it.
- Input shapefile & region names
  - `SHAPEFILE_PATH`: Path to your `.shp` containing all simulation-level regions (all geometries must share the same admin level).
  - `ADMIN_COLUMN_NAME`: Attribute column holding each region's identifier (e.g. `NAME_2`).
- (Optional) Subset regions
  If you only want a subset of regions (e.g. counties in Texas & Colorado), set:
  - `FILTER_COL_NAME`: Column for the higher-level admin unit (e.g. `NAME_1`).
  - `FILTER_COL_VALUES`: List of values to keep, e.g. `['Texas', 'Colorado']`.

  To include all regions, set `FILTER_COL_NAME = None` and leave `FILTER_COL_VALUES = []`.
- Intermediate & output folders
  - `GEOJSONS_FOLDER`: Temporary folder where the notebook extracts each region as a GeoJSON polygon.
  - `OUTPUT_DIR`: Your new base directory. Place `config_template.yaml` (your Snakemake config) here.
  - `SNAKEFILE_CONFIG`: Path to that prefilled `config_template.yaml` (it lives in `OUTPUT_DIR/config_template.yaml`; you can leave its `regions:` field empty).
- APSIM configuration templates
  Rather than manually copying and editing an APSIM file for each year/region, the helper will:
  - Copy a template for each higher-level region (e.g. state) into every year's folder.
  - Auto-adjust the simulation dates.

  Configure this by setting:
  - `APSIM_TEMPLATE_PATHS_FILTER_COL_NAME`: Admin column that groups regions sharing a template (e.g. `NAME_1`).
  - `APSIM_TEMPLATE_PATHS`: Dictionary mapping column values to template paths, e.g.

    ```yaml
    APSIM_TEMPLATE_PATHS:
      Texas: /path/to/texas_template.yaml
      Colorado: /path/to/colorado_template.yaml
    ```
- Single-template setup: If you only require one APSIM file for all regions, set:

  ```yaml
  APSIM_TEMPLATE_PATHS_FILTER_COL_NAME: None
  APSIM_TEMPLATE_PATHS:
    all: /your/path/to/generalApsimTemplate.yaml
  ```
Once all parameters are defined, run the notebook. It will:
- Create your `year/timepoint/region` directory tree under `OUTPUT_DIR`.
- Generate a final `config.yaml` that merges your Snakemake settings with the selected regions.
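For orientation, the first cell might be filled in roughly as follows; the parameter names are those described above, while all paths and values are placeholders:

```python
# First cell of vercye_setup_helper.ipynb -- all paths/values are placeholders.

# Input shapefile & region names
SHAPEFILE_PATH = "/path/to/your.shp"
ADMIN_COLUMN_NAME = "NAME_2"

# (Optional) subset regions; set FILTER_COL_NAME = None and
# FILTER_COL_VALUES = [] to include all regions
FILTER_COL_NAME = "NAME_1"
FILTER_COL_VALUES = ["Texas", "Colorado"]

# Intermediate & output folders
GEOJSONS_FOLDER = "/tmp/region_geojsons"
OUTPUT_DIR = "/path/to/new_basedir_path"
SNAKEFILE_CONFIG = f"{OUTPUT_DIR}/config_template.yaml"

# APSIM configuration templates (one per higher-level region)
APSIM_TEMPLATE_PATHS_FILTER_COL_NAME = "NAME_1"
APSIM_TEMPLATE_PATHS = {
    "Texas": "/path/to/texas_template.yaml",
    "Colorado": "/path/to/colorado_template.yaml",
}
```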
Note: Sometimes you might want to add custom conditionals or processing; that is why we provide this code as a Jupyter notebook. In that case, make sure to read the input documentation to understand the required structure.
4. Adding Reported Validation Data
The VeRCYE pipeline can automatically generate validation metrics (e.g., R², RMSE) if reported data is available. To enable this, you must manually add validation data for each year.
Validation data can be provided at different geographic scales. It may be available at the smallest unit (e.g., ROI level used in simulations) or at a coarser level (e.g., government statistics). You must specify the scale so VeRCYE can aggregate predictions accordingly.
Define aggregation levels in your config file under `eval_params.aggregation_levels`. For each level, provide a key-value pair where the key is a descriptive name and the value is the column in your original shapefile used for aggregation. For example, if state-level ground truth uses the `ADMIN_1` column, specify `State: ADMIN_1`. If the validation data is at ROI level, no specification is needed; it will be recognized automatically.
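Using that example, the corresponding config entry would be:

```yaml
eval_params:
  aggregation_levels:
    State: ADMIN_1   # key = descriptive name, value = shapefile column
```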
For each year and aggregation level, create a CSV file named `{year}/groundtruth_{aggregation_name}-{year}.csv`, where `aggregation_name` matches the key in your config (case-sensitive!).
Example: For 2024 state-level data, the file should be `basedirectory/2024/groundtruth_State-2024.csv`.
For simulation ROI-level data, use `primary` as the aggregation name: `basedirectory/2024/groundtruth_primary-2024.csv`.
CSV Structure
- `region`: Name matching the GeoJSON folder (for the `primary` aggregation level) or matching the attribute-table column values for a custom aggregation level (the column specified under `eval_params.aggregation_levels` in your `config.yaml`).
- `reported_mean_yield_kg_ha`: Mean yield in kg/ha. If unavailable, provide `reported_production_kg` instead; the mean yield will then be calculated using the cropmask area (note: subject to cropmask accuracy).
- If you do not have validation data for certain regions, simply do not include them in your CSV.
- If your reference data contains area, it is recommended to also include it under `reported_area`, even though this is not yet used in the evaluation pipeline.
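A minimal example with made-up values, using state names as regions for the `State` aggregation level:

```csv
region,reported_mean_yield_kg_ha,reported_area
Texas,3150,1200000
Colorado,2890,450000
```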
5. Running the Yield Study
Once your setup is complete:
- Transfer your base directory to your HPC cluster (if using one).
- Adjust the `sim_study_head_dir` path in `config.yaml` to match the location you copied the directory to.
- Navigate to the snakemake directory: `cd vercye_ops/vercye_ops/snakemake`.
- Open a `tmux` session or similar to start the long-running job: `tmux new -s vercye`
- Ensure you have activated your virtual environment, if applicable.
- Run the simulation in the tmux shell (this example expects 110 CPU cores, as defined in the `profile` file):

```bash
snakemake --profile profiles/hpc --configfile /path/to/your/config.yaml
```
For custom CPU core allocation, add the `-c` flag (e.g. with 20 cores) or adapt the `profiles/hpc/config.yaml` file:

```bash
snakemake --profile profiles/hpc --configfile /path/to/your/config.yaml -c 20
```
Output
When the simulation completes, results will be available in your base directory. See the Outputs Documentation for details on interpreting the results.
To re-run the pipeline over the same region(s), either use Snakemake's `-F` flag or delete the log files at `vercye_ops/snakemake/logs_*`. Runtimes are recorded in `vercye_ops/snakemake/benchmarks`.
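For example, either of the following would trigger a fresh run (the `rm` path assumes the default log location named above):

```bash
# Option 1: force all rules to re-run
snakemake --profile profiles/hpc --configfile /path/to/your/config.yaml -F

# Option 2: delete the run logs so the pipeline starts from scratch
rm -rf vercye_ops/snakemake/logs_*
```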