Yonsei-Wave-Dectection/pre-processing

Seismic Data Preprocessing Pipeline

1. Code Execution Description

This code is designed to be executed on Google Colab.

  • Before running the code, upload the raw dataset folder named kquake_dataset.
  • Create an empty folder named preprocessed_csv, where the preprocessed .csv files will be automatically saved.

⚠️ Note: The raw dataset kquake_dataset is stored separately in our Git repository under the path jiyun/raw_data (due to its large size). Please ensure the entire dataset is placed inside the kquake_dataset folder before execution.
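The folder setup described above could be checked with a short helper before running the scripts. This is only a sketch: the function name `prepare_folders` and the idea of passing a base directory are assumptions for illustration, not part of the repository code.

```python
from pathlib import Path


def prepare_folders(base_dir, raw_name="kquake_dataset", out_name="preprocessed_csv"):
    """Check that the raw dataset folder exists and create the output folder.

    `base_dir` is whatever directory you uploaded the data to
    (hypothetical parameter, not taken from the original scripts).
    """
    base = Path(base_dir)
    raw_dir = base / raw_name
    out_dir = base / out_name
    if not raw_dir.is_dir():
        raise FileNotFoundError(f"Upload the raw dataset to {raw_dir} first")
    # exist_ok=True makes the call safe to repeat across Colab sessions
    out_dir.mkdir(parents=True, exist_ok=True)
    return raw_dir, out_dir
```

On Colab this might be called as `prepare_folders("/content")`, assuming the dataset was uploaded to the default working directory.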


2. [pre-processing_final.py] Description

Because the dataset originates from South Korea, all timestamps are converted from UTC to KST (Korea Standard Time, UTC+9).
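As an illustration of the UTC-to-KST conversion, the sketch below shifts the start time of the sample file by nine hours using only the standard library. The helper name `utc_to_kst` is hypothetical; the actual script may perform the conversion differently.

```python
from datetime import datetime, timedelta, timezone

# KST is a fixed UTC+9 offset with no daylight saving time
KST = timezone(timedelta(hours=9), name="KST")


def utc_to_kst(utc_iso):
    """Parse an ISO-8601 UTC timestamp ending in 'Z' and return it in KST."""
    utc_time = datetime.strptime(utc_iso, "%Y-%m-%dT%H:%M:%S.%fZ")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(KST)


start_kst = utc_to_kst("2023-07-29T10:08:11.258390Z")
print(start_kst.isoformat())  # 2023-07-29T19:08:11.258390+09:00
```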

2.1 Preprocessing Pipeline

The preprocessing consists of the following three steps:

  • DC offset removal
  • Cosine tapering (20%)
  • Bandpass filtering (0.1 Hz – 10 Hz)
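The three steps above can be sketched with NumPy and SciPy as follows. This is not the code in pre-processing_final.py. Several choices here are assumptions: the function name `preprocess`, a Tukey window as the cosine taper with the 20% read as the total tapered fraction (10% at each end), and a 4th-order zero-phase Butterworth filter for the bandpass.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.signal.windows import tukey


def preprocess(trace, fs=100.0, taper_fraction=0.2, band=(0.1, 10.0), order=4):
    """Demean, cosine-taper, then bandpass a 1-D waveform (illustrative sketch)."""
    x = np.asarray(trace, dtype=float)

    # 1. DC offset removal: subtract the mean of the trace
    x = x - x.mean()

    # 2. Cosine taper: alpha is the fraction of the window inside the
    #    tapered region (assumption: 20% in total, 10% per end)
    x = x * tukey(x.size, alpha=taper_fraction)

    # 3. Bandpass 0.1-10 Hz: Butterworth in second-order sections,
    #    applied forward and backward so the result has zero phase shift
    sos = butter(order, band, btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```

Each of the three channels would be passed through `preprocess` separately. If the original script uses a different taper convention or filter order, the output will differ slightly near the ends of the trace and around the band edges.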

2.2 Bandpass Filter Justification

The cutoff frequencies are based on the following study:

Study: “Analysis of Frequency-Specific Sources of Background Noise in Seismic Observatories”

“Signals recorded in ground vibration data typically fall within characteristic frequency bands. For example, P-waves and S-waves from local earthquakes, as well as anthropogenic noise, are most prominent in the 0.1–1 second period range, while surface waves generated by earthquakes are generally dominant in the 1–20 second range.”
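The study states its ranges as periods, while the filter cutoffs are frequencies. The short calculation below converts between the two (f = 1/T) and shows how the 0.1–10 Hz passband overlaps each range. The helper names are illustrative only.

```python
def period_band_to_frequency_band(period_min_s, period_max_s):
    """Convert a period range in seconds to the equivalent range in Hz."""
    # The longest period gives the lowest frequency and vice versa
    return 1.0 / period_max_s, 1.0 / period_min_s


def overlap(band_a, band_b):
    """Return the overlapping part of two (low, high) bands, or None."""
    low = max(band_a[0], band_b[0])
    high = min(band_a[1], band_b[1])
    return (low, high) if low < high else None


passband_hz = (0.1, 10.0)
body_wave_hz = period_band_to_frequency_band(0.1, 1.0)      # (1.0, 10.0)
surface_wave_hz = period_band_to_frequency_band(1.0, 20.0)  # (0.05, 1.0)

print(overlap(passband_hz, body_wave_hz))     # (1.0, 10.0)
print(overlap(passband_hz, surface_wave_hz))  # (0.1, 1.0)
```

In other words, the passband covers the full 1–10 Hz range quoted for local P- and S-waves, plus the part of the surface-wave range above 0.1 Hz (periods shorter than 10 seconds).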

2.3 Output

  • Preprocessed signals are saved as .csv files per sample.
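A per-sample CSV export might look like the sketch below. The column layout (one column per channel, named by channel code) and the function name `save_sample_csv` are hypothetical; the README does not specify the actual file layout, so check the script's output before relying on this.

```python
import csv
from pathlib import Path


def save_sample_csv(out_dir, sample_name, channels):
    """Write one CSV for one sample.

    `channels` maps a channel code (e.g. "HGZ") to a list of values;
    all lists are assumed to have the same length.
    """
    names = list(channels)
    rows = zip(*(channels[name] for name in names))  # one row per time step
    out_path = Path(out_dir) / f"{sample_name}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)   # header: channel codes
        writer.writerows(rows)
    return out_path
```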

2.4 Important

  • Lines 17 and 18 of the code must be edited to reflect the correct file paths on your system (Colab or local).

3. [pre-processing_demo.py] Description

This script demonstrates the preprocessing pipeline on a single sample file:

  • File: KMA20230026_KG.BOG..HG.raw.mseed

To preprocess all 651 data files, use pre-processing_final.py.


4. [pre-processed_data] Folder Description

This folder contains the final preprocessed data, organized into training, validation, and test sets.

Due to file size limitations, the data has been split into multiple compressed .zip files:

  • Train: 9 zip files
  • Validation: 2 zip files
  • Test: 3 zip files

Each subset contains the following number of files:

  • Train: 1,368 files (70%)
  • Validation: 195 files (10%)
  • Test: 390 files (20%)
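As a quick check of the split sizes listed above, the percentages can be recomputed from the file counts:

```python
split_counts = {"train": 1368, "validation": 195, "test": 390}

total = sum(split_counts.values())
shares = {name: round(100 * n / total, 1) for name, n in split_counts.items()}

print(total)   # 1953
print(shares)  # {'train': 70.0, 'validation': 10.0, 'test': 20.0}
```

The total of 1,953 files equals 3 × 651, three times the number of raw data files reported in the Dataset Summary. Whether this reflects one file per channel is not stated in this README.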

⚠️ Note: The division into multiple zip files per split is solely due to upload size restrictions. Please extract and merge the zip files within each split (train/val/test) before using them as input for your deep learning model.
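The extract-and-merge step can be sketched with the standard `zipfile` module. The function name and the folder arguments are placeholders; point them at wherever the zip files for one split were downloaded.

```python
import zipfile
from pathlib import Path


def merge_split_archives(zip_dir, out_dir):
    """Extract every .zip in `zip_dir` into the single folder `out_dir`.

    Returns the number of files found in `out_dir` afterwards.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out)  # files from all archives land in one folder
    return sum(1 for p in out.rglob("*") if p.is_file())
```

Run it once per split (train, validation, test) and compare the returned count with the file counts listed above. Note that files with the same name in different archives would overwrite each other.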


5. [plot_images] Folder Description

This folder contains visualizations of the data before and after preprocessing, generated from the sample file KMA20230026_KG.BOG..HG.raw.mseed.

| File Name | Description |
| --- | --- |
| sample.png | Raw 3-channel (HGZ, HGN, HGE) waveform before preprocessing |
| sample_fft.png | Frequency-power spectrum graph after DC offset removal (used to inspect frequency distribution) |
| sample_preprocessed.png | Preprocessed 3-channel waveform. Raw and processed waveforms are overlaid for clear comparison |

📌 These visualizations help validate and understand the effect of each preprocessing step.
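A frequency-power spectrum like the one in sample_fft.png could be computed along these lines. This is a sketch using NumPy's real FFT, with the power left unnormalized; the function name `power_spectrum` is an assumption and the script that produced the image may scale or plot the spectrum differently.

```python
import numpy as np


def power_spectrum(trace, fs=100.0):
    """Return (frequencies in Hz, unnormalized power) of a demeaned trace."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()                              # DC offset removal
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)   # 0 Hz up to Nyquist (fs / 2)
    power = np.abs(np.fft.rfft(x)) ** 2
    return freqs, power
```

Plotting `power` against `freqs` (for example with matplotlib) for each channel shows where the signal energy sits relative to the 0.1–10 Hz passband.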

6. Data Resource

  • The dataset is sourced from the K-ESM (KIGAM Engineering Strong Motion) DB on the Geo Big Data Open Platform (https://data.kigam.re.kr/quake/data/kesmdb).
  • Each file contains raw acceleration waveform data recorded at an individual seismic station.

Dataset Characteristics

  • Raw, unprocessed format
  • Time segments selected using normalized Arias intensity (Arias, 1970)
  • Based on the Korea Meteorological Administration earthquake catalog (https://www.weather.go.kr/)
  • Extracted: 600 seconds of continuous waveform data after the earthquake origin time
  • Signal segments defined as 1–99% range of normalized Arias intensity
  • Pre- and post-noise included to ensure full P-wave and coda coverage
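To illustrate the 1–99% criterion, the sketch below finds the sample indices at which the normalized Arias intensity first reaches 1% and 99%. Arias intensity is proportional to the time integral of squared acceleration, so after normalization the constant factor and the sample spacing cancel and only the cumulative sum of squared samples matters. The function name `arias_window` is illustrative; this is not the K-ESM DB's own implementation.

```python
def arias_window(acc, lower=0.01, upper=0.99):
    """Return (start_index, end_index) of the `lower`..`upper` range of the
    normalized Arias intensity of an acceleration trace."""
    energy = [a * a for a in acc]
    total = sum(energy)
    if total == 0:
        raise ValueError("trace has no energy")

    cumulative = 0.0
    start = None
    end = len(energy) - 1  # fallback if `upper` is never reached numerically
    for i, e in enumerate(energy):
        cumulative += e
        fraction = cumulative / total  # normalized Arias intensity at sample i
        if start is None and fraction >= lower:
            start = i
        if fraction >= upper:
            end = i
            break
    return start, end
```

Dividing the returned indices by the sampling frequency (100 Hz for this dataset) gives the segment boundaries in seconds.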

7. Sample Data Analysis

  • File: KMA20230026_KG.BOG..HG.raw.mseed
  • Channels: 3
  • Sampling Frequency: 100 Hz
  • Data Length: 8940 samples
  • Start Time (UTC): 2023-07-29T10:08:11.258390Z
  • End Time (UTC): 2023-07-29T10:09:40.648390Z

Note: Timestamps are converted to KST during preprocessing.
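The values listed above are mutually consistent, which can be verified with a short calculation: 8940 samples at 100 Hz span 8939 sample intervals, matching the difference between the start and end times.

```python
from datetime import datetime

n_samples = 8940
fs_hz = 100.0
fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
start = datetime.strptime("2023-07-29T10:08:11.258390Z", fmt)
end = datetime.strptime("2023-07-29T10:09:40.648390Z", fmt)

span_s = (end - start).total_seconds()     # time from first to last sample
expected_span_s = (n_samples - 1) / fs_hz  # n samples cover n - 1 intervals
duration_s = n_samples / fs_hz             # nominal record length

print(span_s, expected_span_s, duration_s)  # 89.39 89.39 89.4
```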


8. Dataset Summary

| Attribute | Value |
| --- | --- |
| Number of Data Files | 651 |
| Sampling Frequency | 100 Hz |
| Data Length Range | 38.4 – 462.1 sec |
| Channels per File | 3 |
| Number of Seismic Stations | 31 |

9. Seismic Station Classification

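| Station Code | Location | Station Code | Location |
| --- | --- | --- | --- |
| AJD | Anjwado (안좌도) | GHR | Gahak-ri (가학리) |
| BBK | Bangbanggol (방방골) | GKP1 | Kyungpook National University (경북대) |
| BGD | Bogildo (보길도) | GKP2 | Kyungpook National University (경북대) |
| BOG | Bonggye (봉계) | GRE | Gurye (구례) |
| BRN | North Baengnyeongdo (북백령도) | GSU | Gyeongsang National University (경상대) |
| BRS | South Baengnyeongdo (남백령도) | HAK | Hakgye-ri (학계리) |
| CGD | Cheongdo (청도) | HCH | Hakcheon (학천) |
| CGU | Cheongun (천군) | HDB | Hyodong-ri (효동리) |
| CHNB | Cheorwon (철원) | HKU | Korea National University of Education (교원대) |
| CHS | Cheongsong (청송) | HSB | Hongseong (홍성) |
| CRB | Wonju KSRS (원주KSRS) | HWSB | Hwasun (화순) |
| DES | Deokseong (덕성) | IBA | Ibamsan (입암산) |
| DKJ | Deokjeong-ri (덕정리) | JJB | Jeju Island (제주도) |
| DKJ2 | Deokjeong-ri (덕정리) | JRB | Jirisan (지리산) |
| DOKDO | Dokdo (독도) | JSB | Jeongseon (정선) |
| DUC | Deokcheon (덕천) | JUC | Jukcheon (죽천) |
| GCN | Geoncheon (건천) | KIP | Gimpo (김포) |
| KJM | Geoje (거제) | SNU | Seoul National University (서울대) |
| KMC | Gimcheon (김천) | SIG | Singye (신계) |
| KNUC | Kangwon National University (강원대) | SND | Sangdong (상동) |
| KNUD | Dogye (도계) | TJN | Daejeon (대전) |
| KSA | Ganseong (간성) | UNI | UNIST (울산과기원) |
| MAK | Maegok-ri (매곡리) | WDL | Wondal-ri (원달리) |
| MGB | Mungyeong (문경) | WID | Wido (위도) |
| MKL | Myeonggye-ri (명계리) | YIN | Yongin (용인) |
| MRD | Marado (마라도) | YKB | Yanggu (양구) |
| MUN | Muan (무안) | YPD | Yeonpyeongdo (연평도) |
| NPR | Napo-ri (나포리) | YSB | Yangsan (양산) |
| OJR | Okjeong-ri (옥정리) | YSUK | Yonsei University, International Campus (연세대 국제) |
| PCH | Pocheon (포천) | YSUM | Yonsei University, Mirae Campus (연세대 미래) |
| PKNU | Pukyong National University (부경대) | | |
| POHB | Pohang (포항) | | |
| POSB | POSTECH (포항공대) | | |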
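The station code of a file can be read from its name. The sketch below assumes that every file follows the pattern of the sample file, `<event>_<network>.<station>.<location>.<channel>.raw.mseed`; this pattern is inferred from that single example and is not documented in the README.

```python
def parse_sample_name(filename):
    """Split a file name such as 'KMA20230026_KG.BOG..HG.raw.mseed' into parts.

    Assumes the naming pattern of the sample file (hypothetical for other files).
    """
    event_id, stream_id = filename.split("_", 1)
    # The location field may be empty, which shows up as two consecutive dots
    network, station, location, channel = stream_id.split(".")[:4]
    return {
        "event": event_id,
        "network": network,
        "station": station,
        "location": location,
        "channel": channel,
    }


info = parse_sample_name("KMA20230026_KG.BOG..HG.raw.mseed")
print(info["station"])  # BOG
```

The resulting station code can then be looked up in the table above.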

About

pre-processing seismic data for efficient deep learning
