The NICE-Food data pipeline is the code belonging to the manuscript *NICE-Food KG: A knowledge graph for the analysis of the Nutritional, Ingredient, Contaminant, and Environmental characteristics of food for food system research*. In this publication we discuss how a KG approach to the food domain can help food researchers perform analyses over multiple datasets at once. In this data pipeline we integrate and FAIRify data related to:
- Nutrition: food composition data from the Nederlands Voedingsstoffenbestand (NEVO)
- Ingredients: branded food ingredients from the Levensmiddelendatabank (LEDA)
- Contaminants: chemical food safety contaminants from the Quality Programme for Agricultural Products (KAP)
- Environment: life-cycle analysis data from Environmental impact of food (LCA)
The nice-processing pipeline presents the methods to get from a tabular dataset related to food to a food RDF. Some programming skills are required; however, the Jupyter notebooks are intended to be easy to use. We only show the complete data pipeline for NEVO; for the other datasets (LEDA, KAP and LCA), only data integration and RDF creation are provided. Use cases related to NICE-Food, i.e. querying the different subgraphs in one KG, are available in nice_food_analysis. Intermediate data sources needed to run the pipeline smoothly are available and need to be downloaded from NICE-Food intermediate data (linked below). The final datasets can be found on Zenodo: NICE-Food KG files.
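As an illustration of consuming the final data, here is a minimal sketch that loads one of the published RDF files with rdflib and lists a few labels. The file name `nevo.ttl` is a placeholder for whichever file you downloaded from Zenodo.

```python
# Minimal sketch: load a downloaded NICE-Food RDF file and list a few labels.
# "nevo.ttl" is a placeholder file name, not necessarily the published one.
from rdflib import Graph

g = Graph()
g.parse("nevo.ttl", format="turtle")

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label
WHERE { ?entity rdfs:label ?label . }
LIMIT 10
"""
for entity, label in g.query(query):
    print(entity, label)
```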
These are the high-level contents of the repository. nevo_processing.ipynb shows the whole pipeline, including:
- data wrangling for loading and annotation
- extracting tabular data from FoodOn (ontology_processing_tool)
- creating embeddings (create_embeddings_tool; see the sketch after this list)
- selecting embedding classes (class_selection_tool)
- merging annotated data into the original dataset
- RDF creation
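The embedding steps pair each food description with a vector so that candidate FoodOn classes can be ranked by cosine similarity. Below is a minimal sketch of the idea, assuming the OpenAI Python client (v1); the food descriptions and the comparison are illustrative, not the notebook's actual code.

```python
# Minimal sketch of embedding-based matching; illustrative only.
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="XXX")  # key from llm_config.json

texts = ["apple, raw, with skin", "apple"]  # e.g. a NEVO description and a FoodOn label
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
a, b = (np.array(d.embedding) for d in resp.data)

# Cosine similarity: higher means a better candidate class
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```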
lca_processing.ipynb
- data wrangling for loading and annotation
- merging annotated data into the original data
- RDF creation (see the sketch after this list)
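For the RDF-creation step, the general pattern is to walk over the long-format table and emit triples. A minimal sketch with pandas and rdflib; the column names and the example namespace are placeholders, not the ones used in the notebooks.

```python
# Minimal sketch of turning a long-format table into RDF; names are placeholders.
import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("https://example.org/nicefood/")  # placeholder namespace

df = pd.DataFrame({"food": ["apple"], "indicator": ["co2_eq_kg"], "value": [0.4]})

g = Graph()
g.bind("ex", EX)
for _, row in df.iterrows():
    food = EX[row["food"]]
    g.add((food, RDF.type, EX.Food))
    g.add((food, EX[row["indicator"]], Literal(row["value"], datatype=XSD.double)))

g.serialize("lca_example.ttl", format="turtle")
```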
leda_processing.ipynb
- data wrangling for loading and annotation
- Large language model (LLM) for data structuring (OpenAI GPT-4o)
- LLM for translation (OpenAI GPT-4o; see the sketch after this list)
- merging the annotated dataset into the original data
- RDF creation
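The LEDA ingredient lists are Dutch free text, hence the LLM steps. A minimal sketch of a translation call, assuming the OpenAI Python client (v1); the prompt here is illustrative, the actual prompts live in prompts/.

```python
# Minimal sketch of the LLM translation step; the prompt is illustrative.
from openai import OpenAI

client = OpenAI(api_key="XXX")  # key from llm_config.json

ingredients = "tarwebloem, water, gist, zout"  # a Dutch ingredient list
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate this Dutch ingredient list to English."},
        {"role": "user", "content": ingredients},
    ],
)
print(resp.choices[0].message.content)
```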
kap_processing.ipynb
- data loading
- data averaging
- OntoPortal API for annotations (see the sketch after this list)
- merging the annotated dataset into the original data
- RDF creation
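The annotation step calls the OntoPortal annotator over the KAP sample descriptions. A minimal sketch, assuming a BioPortal-style REST endpoint (check the base URL of your own OntoPortal instance) and the BIO_KEY from the .env file described below.

```python
# Minimal sketch of an OntoPortal annotator call; the endpoint URL is an assumption.
import os
import requests

resp = requests.get(
    "https://data.bioontology.org/annotator",  # BioPortal-style endpoint
    params={
        "text": "cadmium in wheat",
        "ontologies": "FOODON",
        "apikey": os.environ["BIO_KEY"],  # the key stored in .env
    },
)
resp.raise_for_status()
for annotation in resp.json():
    print(annotation["annotatedClass"]["@id"])
```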
```
data/
├── aggregated_data       # Data not publishable directly
├── cosine                # Sample data for ontology mapper
├── graph_input           # Annotated CSVs in long format
├── mappings              # Unique/annotated entities (food, chemicals, measures)
├── ontology              # foodon.owl, tabular version for embeddings
├── ontology_mapper       # Example ontology mapper outputs
├── original_data         # Raw data from RIVM
└── translation           # Example GPT translation outputs
functions/                # Helper functions for notebooks
create_embeddings_tool/   # Standalone: generate embeddings
ontology_processing_tool/ # Standalone: extract ontology labels etc.
class_selection_tool/     # Standalone: select annotation shortcuts
llm_config.json           # Example LLM config (store outside repo for safety)
utils/                    # Columns of interest per dataset
prompts/                  # Prompts for cleaning and translating ingredient lists
```
To create embeddings, an OpenAI API key is required, which can be obtained from the OpenAI website. To configure the key, fill in llm_config.json and place it in the parent folder of the repository. Other locations are also possible, but this should then be updated in the load_config function.
The OntoPortal API key must be saved in a .env file in the root of this repository (see below).
Create a parent directory in your terminal (optional but recommended):

```bash
mkdir nicefood_project
cd nicefood_project
```

Clone the NICE-Food data processing repository:

```bash
git clone https://github.com/rivm-syso/nicekg_processing
```

Clone the NICE-Food analysis repository (go back to the parent folder and clone nice_food_analysis):

```bash
git clone https://github.com/rivm-syso/nicekg_analysis
```

Download the intermediate data from Zenodo. This data provides insight into the data cleaning and enables running the pipeline. If one is only interested in the output data, download only the RDF files. Put the files in the nicekg_processing folder.

- RDF files: NICE-Food KG files
- Intermediate data: NICE-Food KG intermediate data

```bash
cd nicekg_processing
mkdir data
# copy the downloaded data into this directory
```

Place your LLM config: put your llm_config.json file in the parent directory (nicefood_project/). You can adjust file paths in the code if needed. Edit llm_config.json and add your OpenAI API key:
```json
{
    "PUBLIC_OPENAI_API_KEY": "XXX",
    "PUBLIC_OPENAI_DEPLOYMENT_EMBEDDINGS": "text-embedding-ada-002",
    "PUBLIC_OPENAI_MODEL": "type openai model",
    "LOCAL_OPENAI_API_KEY": "type key here",
    "LOCAL_OPENAI_ENDPOINT": "XXX",
    "LOCAL_OPENAI_API_VERSION": "XXX",
    "LOCAL_OPENAI_DEPLOYMENT_ID": "gpt-4o-mini-research",
    "LOCAL_OPENAI_DEPLOYMENT_EMBEDDINGS": "text-embedding-ada-002"
}
```
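For reference, a minimal sketch of what the load_config function could look like; the notebooks ship their own version, this only illustrates the expected file location (the parent folder of the repository).

```python
# Minimal sketch of loading llm_config.json from the parent folder; illustrative only.
import json
from pathlib import Path

def load_config(path=Path("..") / "llm_config.json"):
    """Read the LLM configuration from the parent folder of the repository."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

config = load_config()
api_key = config["PUBLIC_OPENAI_API_KEY"]
```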
Create the conda environment:

```bash
cd nicekg_processing
conda env create -f environment.yml
conda activate nicekg_processing
```

Add your OntoPortal API key:
In the root of this repository, create a file called .env with your OntoPortal API key:
```bash
nano .env
# add the line BIO_KEY=your_ontoportal_api_key
# to close and save the file, use Ctrl+X
```
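For reference, a minimal sketch of how the key can be read back in Python, assuming the python-dotenv package:

```python
# Minimal sketch: python-dotenv reads .env from the working directory by default.
import os
from dotenv import load_dotenv

load_dotenv()  # loads BIO_KEY from .env into the environment
bio_key = os.getenv("BIO_KEY")
```

To save files locally, create a file local_path.py with the following content: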
```python
local_leda = "local path here"
local_nevo21 = "local path here"
local_nevo23 = "local path here"
local_kap = "local path here"
local_lca = "local path here"
```
NICE-Food is part of the BigFood project, funded by the strategic programme of the Dutch National Institute for Public Health and the Environment (RIVM). In this project we aim to accelerate protein transition research through food data FAIRification and artificial intelligence.
When citing the code, please use the original publication: *NICE-Food KG: A knowledge graph for the analysis of the Nutritional, Ingredient, Contaminant, and Environmental characteristics of food for food system research*.
EUPL-1.2
The authors thank the BigFood project team for their valuable input at various stages of the project.
During the development of this work, the authors used OpenAI GPT-4o to create and enhance code. After using this tool/service, the authors reviewed and edited the content as needed.