The NICE-Food data pipeline is the code belonging to the manuscript *NICE-Food KG: A knowledge graph for the analysis of the Nutritional, Ingredient, Contaminant, and Environmental characteristics of food for food system research*. In this publication we discuss how a KG approach to the food domain can help food researchers perform analyses over multiple datasets at once. In this data pipeline we integrate and FAIRify data related to:
- Nutrition: food composition data from the Nederlands Voedingsstoffenbestand (NEVO)
- Ingredients: branded food ingredients from the Levensmiddelendatabank (LEDA)
- Contaminants: chemical food safety contaminants from the Quality Programme for Agricultural Products (KAP)
- Environment: life-cycle analysis data from Environmental impact of food (LCA)
The nice-processing pipeline presents the methods to get from a tabular dataset related to food to a food RDF. Some programming skills are required; however, the Jupyter notebooks are intended to be easy to use. We only show the complete data pipeline for NEVO; for the other datasets (LEDA, KAP and LCA), only data integration and RDF creation are provided. Use cases related to NICE-Food, i.e. querying the different subgraphs in one KG, are available in nice_food_analysis. Intermediate data sources needed to run the pipeline smoothly are available and need to be downloaded from NICE-Food intermediate data (linked below). The final datasets can be found on Zenodo: NICE-Food KG files.
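As an illustration of consuming the final data, here is a minimal sketch that loads one of the published RDF files with rdflib and lists a few labels. The file name `nevo.ttl` is a placeholder for whichever file you downloaded from Zenodo.

```python
# Minimal sketch: load a downloaded NICE-Food RDF file and list a few labels.
# "nevo.ttl" is a placeholder file name, not necessarily the published one.
from rdflib import Graph

g = Graph()
g.parse("nevo.ttl", format="turtle")

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label
WHERE { ?entity rdfs:label ?label . }
LIMIT 10
"""
for entity, label in g.query(query):
    print(entity, label)
```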
These are the high-level contents of the repository. nevo_processing.ipynb shows the whole pipeline, including:
- data wrangling for loading and annotation
- extracting tabular data from FoodOn (ontology_processing_tool)
- creating embeddings (create_embeddings_tool; see the sketch after this list)
- selecting embedding classes (class_selection_tool)
- merging annotated data into the original dataset
- RDF creation
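The embedding steps pair each food description with a vector so that candidate FoodOn classes can be ranked by cosine similarity. Below is a minimal sketch of the idea, assuming the OpenAI Python client (v1); the food descriptions and the comparison are illustrative, not the notebook's actual code.

```python
# Minimal sketch of embedding-based matching; illustrative only.
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="XXX")  # key from llm_config.json

texts = ["apple, raw, with skin", "apple"]  # e.g. a NEVO description and a FoodOn label
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
a, b = (np.array(d.embedding) for d in resp.data)

# Cosine similarity: higher means a better candidate class
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```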
lca_processing.ipynb
- data wrangling for loading and annotation
- merging annotated data into the original data
- RDF creation (see the sketch after this list)
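For the RDF-creation step, the general pattern is to walk over the long-format table and emit triples. A minimal sketch with pandas and rdflib; the column names and the example namespace are placeholders, not the ones used in the notebooks.

```python
# Minimal sketch of turning a long-format table into RDF; names are placeholders.
import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("https://example.org/nicefood/")  # placeholder namespace

df = pd.DataFrame({"food": ["apple"], "indicator": ["co2_eq_kg"], "value": [0.4]})

g = Graph()
g.bind("ex", EX)
for _, row in df.iterrows():
    food = EX[row["food"]]
    g.add((food, RDF.type, EX.Food))
    g.add((food, EX[row["indicator"]], Literal(row["value"], datatype=XSD.double)))

g.serialize("lca_example.ttl", format="turtle")
```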
leda_processing.ipynb
- data wrangling for loading and annotation
- Large language model (LLM) for data structuring (OpenAI GPT-4o)
- LLM for translation (OpenAI GPT-4o; see the sketch after this list)
- merging the annotated dataset into the original data
- RDF creation
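The LEDA ingredient lists are Dutch free text, hence the LLM steps. A minimal sketch of a translation call, assuming the OpenAI Python client (v1); the prompt here is illustrative, the actual prompts live in prompts/.

```python
# Minimal sketch of the LLM translation step; the prompt is illustrative.
from openai import OpenAI

client = OpenAI(api_key="XXX")  # key from llm_config.json

ingredients = "tarwebloem, water, gist, zout"  # a Dutch ingredient list
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate this Dutch ingredient list to English."},
        {"role": "user", "content": ingredients},
    ],
)
print(resp.choices[0].message.content)
```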
kap_processing.ipynb
- data loading
- data averaging
- OntoPortal API for annotations (see the sketch after this list)
- merging the annotated dataset into the original data
- RDF creation
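The annotation step calls the OntoPortal annotator over the KAP sample descriptions. A minimal sketch, assuming a BioPortal-style REST endpoint (check the base URL of your own OntoPortal instance) and the BIO_KEY from the .env file described below.

```python
# Minimal sketch of an OntoPortal annotator call; the endpoint URL is an assumption.
import os
import requests

resp = requests.get(
    "https://data.bioontology.org/annotator",  # BioPortal-style endpoint
    params={
        "text": "cadmium in wheat",
        "ontologies": "FOODON",
        "apikey": os.environ["BIO_KEY"],  # the key stored in .env
    },
)
resp.raise_for_status()
for annotation in resp.json():
    print(annotation["annotatedClass"]["@id"])
```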
```
data/
├── aggregated_data       # Data not publishable directly
├── cosine                # Sample data for ontology mapper
├── graph_input           # Annotated CSVs in long format
├── mappings              # Unique/annotated entities (food, chemicals, measures)
├── ontology              # foodon.owl, tabular version for embeddings
├── ontology_mapper       # Example ontology mapper outputs
├── original_data         # Raw data from RIVM
└── translation           # Example GPT translation outputs
functions/                # Helper functions for notebooks
create_embeddings_tool/   # Standalone: generate embeddings
ontology_processing_tool/ # Standalone: extract ontology labels etc.
class_selection_tool/     # Standalone: select annotation shortcuts
llm_config.json           # Example LLM config (store outside repo for safety)
utils/                    # Columns of interest per dataset
prompts/                  # Prompts for cleaning and translating ingredient lists
```
To create embeddings, an OpenAI API key is required, which can be obtained from the OpenAI website. To configure the key, fill in llm_config.json and place it in the parent folder of the repository. Other locations are also possible, but this should then be updated in the load_config function.
The OntoPortal API key must be saved in a .env file in the root of this repository (see below).
Create a parent directory in your terminal (optional but recommended):

```bash
mkdir nicefood_project
cd nicefood_project
```

Clone the NICE-Food data processing repository:

```bash
git clone https://github.com/rivm-syso/nicekg_processing
```

Clone the NICE-Food analysis repository (go back to the parent folder and clone nice_food_analysis):

```bash
git clone https://github.com/rivm-syso/nicekg_analysis
```

Download the intermediate data from Zenodo. This data provides insight into the data cleaning and enables running the pipeline. If one is only interested in the output data, download only the RDF files. Put the files in the nicekg_processing folder.

- RDF files: NICE-Food KG files
- Intermediate data: NICE-Food KG intermediate data

```bash
cd nicekg_processing
mkdir data
# copy the downloaded data into this directory
```

Place your LLM config: put your llm_config.json file in the parent directory (nicefood_project/). You can adjust file paths in the code if needed. Edit llm_config.json and add your OpenAI API key:
```json
{
    "PUBLIC_OPENAI_API_KEY": "XXX",
    "PUBLIC_OPENAI_DEPLOYMENT_EMBEDDINGS": "text-embedding-ada-002",
    "PUBLIC_OPENAI_MODEL": "type openai model",
    "LOCAL_OPENAI_API_KEY": "type key here",
    "LOCAL_OPENAI_ENDPOINT": "XXX",
    "LOCAL_OPENAI_API_VERSION": "XXX",
    "LOCAL_OPENAI_DEPLOYMENT_ID": "gpt-4o-mini-research",
    "LOCAL_OPENAI_DEPLOYMENT_EMBEDDINGS": "text-embedding-ada-002"
}
```
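For reference, a minimal sketch of what the load_config function could look like; the notebooks ship their own version, this only illustrates the expected file location (the parent folder of the repository).

```python
# Minimal sketch of loading llm_config.json from the parent folder; illustrative only.
import json
from pathlib import Path

def load_config(path=Path("..") / "llm_config.json"):
    """Read the LLM configuration from the parent folder of the repository."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

config = load_config()
api_key = config["PUBLIC_OPENAI_API_KEY"]
```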
Create the conda environment:

```bash
cd nicekg_processing
conda env create -f environment.yml
conda activate nicekg_processing
```

Add your OntoPortal API key:
In the root of this repository, create a file called .env with your OntoPortal API key:
```bash
nano .env
# add the line BIO_KEY=your_ontoportal_api_key
# to close and save the file, use Ctrl+X
```
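For reference, a minimal sketch of how the key can be read back in Python, assuming the python-dotenv package:

```python
# Minimal sketch: python-dotenv reads .env from the working directory by default.
import os
from dotenv import load_dotenv

load_dotenv()  # loads BIO_KEY from .env into the environment
bio_key = os.getenv("BIO_KEY")
```

To save files locally, create a file local_path.py with the following content: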
```python
local_leda = "local path here"
local_nevo21 = "local path here"
local_nevo23 = "local path here"
local_kap = "local path here"
local_lca = "local path here"
```
NICE-Food is part of the BigFood project, funded by the strategic programme of the Dutch National Institute for Public Health and the Environment (RIVM). In this project we aim to accelerate protein transition research through food data FAIRification and artificial intelligence.
When citing the code, please use the original publication: *NICE-Food KG: A knowledge graph for the analysis of the Nutritional, Ingredient, Contaminant, and Environmental characteristics of food for food system research*.
EUPL-1.2
The authors thank the BigFood project team for their valuable input at various stages of the project.
During the development of this work, the authors used OpenAI GPT-4o to create and enhance code. After using this tool/service, the authors reviewed and edited the content as needed.