Biocomputing-Research-Group/WinnowNet

WinnowNet

This algorithm was implemented and tested on Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-84-generic, x86_64).

Note:

This repository contains the development version of WinnowNet. For the code used to reproduce the experiments in the paper, please refer to the following repository: https://github.com/Biocomputing-Research-Group/WinnowNet4Review

Overview

WinnowNet is designed for advanced processing of mass spectrometry data with two core methods: a CNN-based approach and a self-attention-based approach. The repository includes scripts for feature extraction, model training, prediction (inference), and evaluation. A toy example is included to help users get started.

Setup and installation

1. Create a new conda environment and activate it.

It is recommended to use Conda for dependency management. Run the following commands in your terminal:

conda create --name WinnowNet python=3.8
conda activate WinnowNet

2. Install dependencies:

CUDA version: 11.8. Install a GPU build of PyTorch that is compatible with this CUDA version.

pip install -r ./requirements.txt
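
After installation, it can help to confirm that the installed PyTorch build actually sees the GPU. The following is a minimal sketch and not part of the repository: the helper name `describe_torch_env` is hypothetical, and it relies only on standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`).

```python
import importlib.util


def describe_torch_env():
    """Summarize the installed PyTorch build (hypothetical helper, not part of WinnowNet)."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False, "cuda_build": None, "cuda_available": False}

    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `cuda_build` does not start with 11.8, or `cuda_available` is False, the PyTorch build likely does not match the CUDA setup described above.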

Requirements

  • Operating system: Linux
  • GPU Memory
    • Inference Mode: At least 8 GB (adjust batch size if necessary)
    • Training Mode: At least 20 GB
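
The memory requirements above can be checked programmatically before starting a long run. This is an illustrative sketch only: the function names are hypothetical, and the thresholds are copied from the list above. `torch.cuda.get_device_properties` is a standard PyTorch call.

```python
import importlib.util

# Minimum GPU memory in GB, taken from the requirements listed above.
MIN_GPU_MEMORY_GB = {"inference": 8, "training": 20}


def meets_memory_requirement(total_gb, mode):
    """Return True if a GPU with `total_gb` of memory meets the stated minimum for `mode`."""
    return total_gb >= MIN_GPU_MEMORY_GB[mode]


def detected_gpu_memory_gb(device_index=0):
    """Return the total memory of the given GPU in GB, or None if no usable GPU is found."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3


if __name__ == "__main__":
    memory = detected_gpu_memory_gb()
    if memory is None:
        print("No usable GPU detected.")
    else:
        for mode in MIN_GPU_MEMORY_GB:
            print(mode, "OK" if meets_memory_requirement(memory, mode) else "insufficient")
```

The memory reported by the driver is usually slightly below the nominal card size, so treat a near miss as a prompt to reduce the batch size rather than as a hard failure.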

Download Required Files

Input pre-processing

Extract fragment ion matching features along with 11 additional features derived from both theoretical and experimental spectra. The PSM (peptide-spectrum match) candidate information should be provided in a tab-delimited file (e.g., a TSV file output from Percolator).

python SpectraFeatures.py -i <tsv_file> -s <ms2_file> -o spectra.pkl -t 48 -f cnn
  • Replace <tsv_file> with the path to your PSM candidates file.
  • Replace <ms2_file> with the path to your experimental spectra file.
  • The -t 48 option sets the number of threads (adjust this value as needed).
  • Use -f cnn when preparing input for the CNN-based architecture or -f att for the self-attention-based model.
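
When the feature-extraction step is driven from a Python workflow rather than typed by hand, the command can be assembled as an argument list. The sketch below is an assumption about how one might wrap the call; the helper names are hypothetical, and only the flags documented above (-i, -s, -o, -t, -f) are used.

```python
import subprocess


def build_feature_command(tsv_file, ms2_file, output_pkl, threads=48, feature_type="cnn"):
    """Return the SpectraFeatures.py command line as a list of arguments."""
    if feature_type not in ("cnn", "att"):
        raise ValueError("feature_type must be 'cnn' or 'att'")
    return [
        "python", "SpectraFeatures.py",
        "-i", str(tsv_file),     # PSM candidates (tab-delimited)
        "-s", str(ms2_file),     # experimental spectra
        "-o", str(output_pkl),   # extracted features
        "-t", str(threads),      # number of threads
        "-f", feature_type,      # cnn or att
    ]


def run_feature_extraction(*args, **kwargs):
    """Run the command; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(build_feature_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.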

Training WinnowNet Models

This folder contains scripts, datasets, and instructions for training two variants of the WinnowNet deep learning model: a self-attention-based model and a CNN-based model. Training is carried out in two phases to enable curriculum learning from synthetic (easy) to real-world metaproteomic (difficult) datasets.

Requirements

  • Python 3.7+
  • PyTorch
  • NumPy, Pandas, scikit-learn

Datasets


Self-Attention-Based WinnowNet

Phase 1: Training on Easy Tasks (Synthetic Data)

python SpectraFeatures_training.py -i filename.tsv -s filename.ms2 -o spectra_feature.pkl -t 20 -f att
python WinnowNet_Att.py -i spectra_feature_directory -m prosit_att.pt

Explanation of options:

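  • -i: For SpectraFeatures_training.py, the input tab-delimited file with PSMs, including labels and weights. For WinnowNet_Att.py, the directory containing the extracted pkl feature files.
  • -s: Corresponding MS2 file (filename should match TSV).
  • -o: Output file to store extracted features as a pkl file.
  • -t: Number of threads for parallel processing.
  • -f: Feature type (att for self-attention model).
  • -m: Filename to save the trained model.
  • When the dataset consists of multiple TSV files, run the feature-extraction command once per file (e.g., in a for-loop) so that every TSV file is converted to a pkl file.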
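
The per-file conversion loop mentioned above could look like the following. This is a sketch under stated assumptions, not code from the repository: the function name is hypothetical, and it assumes each <name>.tsv sits next to a matching <name>.ms2 in the same directory.

```python
from pathlib import Path


def plan_feature_jobs(input_dir, output_dir, feature_type="att", threads=20):
    """Build one SpectraFeatures_training.py command per TSV file in `input_dir`.

    Assumption: each <name>.tsv has a matching <name>.ms2 in the same directory.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for tsv in sorted(input_dir.glob("*.tsv")):
        ms2 = tsv.with_suffix(".ms2")
        if not ms2.exists():
            # Skip PSM files without a matching spectra file rather than guessing.
            continue
        out = output_dir / (tsv.stem + ".pkl")
        jobs.append([
            "python", "SpectraFeatures_training.py",
            "-i", str(tsv), "-s", str(ms2), "-o", str(out),
            "-t", str(threads), "-f", feature_type,
        ])
    return jobs
```

Each returned command can be executed with `subprocess.run(job, check=True)` after creating the output directory; that directory is then what the training script receives through -i.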

Phase 2: Training on Difficult Tasks (Real Data)

python SpectraFeatures_training.py -i filename.tsv -s filename.ms2 -o spectra_feature.pkl -t 20 -f att
python WinnowNet_Att.py -i spectra_feature_directory -m marine_att.pt -p prosit_att.pt
  • -p: Pre-trained model from Phase 1.
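  • When the dataset consists of multiple TSV files, run the feature-extraction command once per file (e.g., in a for-loop) so that every TSV file is converted to a pkl file.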

Pre-trained model: marine_att.pt, https://figshare.com/articles/dataset/Models/25513531
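
The two training phases can be chained so that the Phase 1 model is passed to Phase 2 automatically. The sketch below only assembles the two commands shown above; the function names and the feature directory names in the usage note are hypothetical, and the same pattern should apply to WinnowNet_CNN.py since it accepts the same flags in this README.

```python
import subprocess


def curriculum_commands(train_script, easy_feature_dir, hard_feature_dir,
                        phase1_model, phase2_model):
    """Return the Phase 1 and Phase 2 training commands, in order."""
    phase1 = [
        "python", train_script,
        "-i", str(easy_feature_dir),   # features from synthetic (easy) data
        "-m", str(phase1_model),       # where to save the Phase 1 model
    ]
    phase2 = [
        "python", train_script,
        "-i", str(hard_feature_dir),   # features from real (difficult) data
        "-m", str(phase2_model),       # where to save the final model
        "-p", str(phase1_model),       # initialize from the Phase 1 model
    ]
    return [phase1, phase2]


def run_curriculum(commands):
    """Run the phases in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `curriculum_commands("WinnowNet_Att.py", "synthetic_features", "marine_features", "prosit_att.pt", "marine_att.pt")` reproduces the two self-attention training commands above with placeholder directory names.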


CNN-Based WinnowNet

Phase 1: Training on Easy Tasks (Synthetic Data)

python SpectraFeatures_training.py -i filename.tsv -s filename.ms2 -o spectra_feature.pkl -t 20 -f cnn
python WinnowNet_CNN.py -i spectra_feature_directory -m prosit_cnn.pt

Phase 2: Training on Difficult Tasks (Real Data)

python SpectraFeatures_training.py -i filename.tsv -s filename.ms2 -o spectra_feature.pkl -t 20 -f cnn
python WinnowNet_CNN.py -i spectra_feature_directory -m cnn_pytorch.pt -p prosit_cnn.pt

Pre-trained model: cnn_pytorch.pt, https://figshare.com/articles/dataset/Models/25513531


Notes

  • All input MS2/TSV files must be preprocessed properly.
  • Models trained in Phase 1 are reused to initialize weights in Phase 2.
  • Training with GPU is recommended for performance.

Inference

PSM Rescoring

Self-Attention-Based WinnowNet

To generate input representations for PSM candidates and perform re-scoring using the self-attention model, run:

python SpectraFeatures.py -i tsv_file -s ms2_file -o spectra.pkl -t 48 -f att 
python Prediction.py -i spectra.pkl -o rescore.out.txt -m att_pytorch.pt  

CNN-Based WinnowNet

To generate input representations for PSM candidates and perform re-scoring using the CNN model, run:

python SpectraFeatures.py -i filename.tsv -s filename.ms2 -o spectra.pkl -t 48 -f cnn
python Prediction_CNN.py -i spectra.pkl -o rescore.out.txt -m cnn_pytorch.pt 

Explanation of options:

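  • -i: For SpectraFeatures.py, the input tab-delimited file with PSMs. For Prediction.py and Prediction_CNN.py, the pkl file with the extracted features.
  • -s: Corresponding MS2 file (filename should match TSV).
  • -o: For SpectraFeatures.py, the output pkl file for the extracted features. For Prediction.py and Prediction_CNN.py, the output file for the re-scoring results.
  • -t: Number of threads for parallel processing.
  • -f: Feature type (att for self-attention model, cnn for CNN model).
  • -m: Trained model file used for prediction.
  • When there are multiple TSV files, run the feature-extraction command once per file (e.g., in a for-loop) so that every TSV file is converted to a pkl file.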
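
The two-step rescoring workflow above (feature extraction, then prediction) can be assembled from Python. This is a minimal sketch using only the scripts and flags shown in this section; the function name and the default file names for the intermediate outputs are assumptions carried over from the example commands.

```python
# Prediction script for each feature type, as used in the commands above.
PREDICTION_SCRIPTS = {"att": "Prediction.py", "cnn": "Prediction_CNN.py"}


def rescoring_commands(tsv_file, ms2_file, model_file, feature_type,
                       features_pkl="spectra.pkl", output_file="rescore.out.txt",
                       threads=48):
    """Return the feature-extraction and prediction commands, in order."""
    if feature_type not in PREDICTION_SCRIPTS:
        raise ValueError("feature_type must be 'att' or 'cnn'")
    extract = [
        "python", "SpectraFeatures.py",
        "-i", str(tsv_file), "-s", str(ms2_file),
        "-o", str(features_pkl), "-t", str(threads), "-f", feature_type,
    ]
    predict = [
        "python", PREDICTION_SCRIPTS[feature_type],
        "-i", str(features_pkl),   # features produced by the first step
        "-o", str(output_file),    # re-scoring results
        "-m", str(model_file),     # trained model to load
    ]
    return [extract, predict]
```

Choosing the prediction script from the feature type keeps the two steps consistent, so features extracted with -f att are never passed to the CNN prediction script by mistake.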

Evaluation

FDR Control at the PSM/Peptide Levels

Filter the re-scored PSM candidates to control the false discovery rate (FDR) at both the PSM and peptide levels (targeted at 1% FDR). You will need both the original PSM file and the re-scoring results.

python filtering.py -i rescore.out.txt -p tsv_file -o filtered -d Rev_ -f 0.01

Explanation of options:

  • -i: Rescoring file from WinnowNet
  • -p: Input tab-delimited file with PSMs
  • -o: Prefix for the filtered result files
  • -d: Decoy prefix used for target-decoy strategy. Default: Rev_
  • -f: False Discovery Rate. Default: 0.01
  • The filtered output files include updated PSM information (new predicted scores, spectrum IDs, identified peptides, and corresponding proteins).

Assembling Filtered Identified Peptides into Proteins

Run this script from the working directory that contains the filtered results at the PSM and peptide levels:

python sipros_peptide_assembling.py
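
For readers unfamiliar with the target-decoy strategy used in the filtering step, the following sketch shows one common way to choose a score cutoff at a given FDR. It is an illustration only and is not taken from filtering.py: the function name, the (score, protein ID) input layout with a single protein ID per PSM, and the decoys/targets FDR estimate are all assumptions.

```python
from itertools import groupby


def score_cutoff_at_fdr(psms, fdr=0.01, decoy_prefix="Rev_"):
    """Return the lowest score cutoff whose estimated FDR is at or below `fdr`.

    `psms` is an iterable of (score, protein_id) pairs; higher scores are better.
    FDR is estimated as (#decoys / #targets) among PSMs at or above the cutoff.
    Returns None if no cutoff satisfies the threshold.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    # Process tied scores together so a cutoff never splits a group of equal scores.
    for score, group in groupby(ranked, key=lambda p: p[0]):
        for _, protein in group:
            if protein.startswith(decoy_prefix):
                decoys += 1
            else:
                targets += 1
        if targets > 0 and decoys / targets <= fdr:
            best_cutoff = score
    return best_cutoff
```

PSMs scoring at or above the returned cutoff would be retained. The same idea applies at the peptide level after collapsing PSMs to their best-scoring peptide.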

When assembling filtered, identified peptides into proteins, the overall protein-level FDR depends on the quality of the filtered peptide list. An initial peptide-level FDR (for example, 1%) may lead to a protein-level FDR that is higher than desired. In such cases, you need to re-filter the peptides using a stricter (i.e., lower) FDR threshold until you achieve a 1% protein-level FDR.
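
The re-filtering procedure described above can be sketched as a loop that tightens the peptide-level threshold step by step. This is an illustrative sketch only: `protein_fdr_at` is a hypothetical callback that stands in for running filtering.py and sipros_peptide_assembling.py at a given threshold and reading off the resulting protein-level FDR. The repository does not provide such a function, and the step size is an arbitrary choice.

```python
def find_peptide_fdr(protein_fdr_at, target_protein_fdr=0.01, start=0.01, step=0.001):
    """Lower the peptide-level FDR threshold until the protein-level FDR meets the target.

    `protein_fdr_at(peptide_fdr)` must return the protein-level FDR obtained when
    peptides are filtered at `peptide_fdr` (hypothetical callback).
    Returns the first (largest) threshold that works, or None if none does.
    """
    # Count in integer multiples of `step` to avoid floating-point drift.
    n = round(start / step)
    while n >= 1:
        peptide_fdr = n * step
        if protein_fdr_at(peptide_fdr) <= target_protein_fdr:
            return peptide_fdr
        n -= 1
    return None
```

Starting from the loosest threshold and stopping at the first success keeps as many peptides as possible while still meeting the protein-level target.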

Contact and Support

For further assistance, please consult the GitHub repository or reach out to the project maintainers.
