slopefields/IVC-Research
Prompt Language Effects on LLM Reasoning

Research project presented at the IVC/Saddleback Symposium investigating how prompt language influences large language model (LLM) reasoning accuracy, performance, and token cost.


Overview

This project analyzes how the language used in prompts affects:

  • Reasoning accuracy
  • Reading comprehension performance
  • Token consumption
  • API cost efficiency

We evaluated multiple LLMs using SAT Math and SAT Reading questions to measure performance differences across prompt languages.


Research Question

Does the language of a prompt (e.g., English vs. Chinese) significantly impact:

  • Logical reasoning accuracy?
  • Reading comprehension performance?
  • Token usage?
  • Inference cost?

Experimental Design

Dataset

  • Official SAT Math questions
  • Official SAT Reading comprehension passages

Independent Variable

  • Prompt language (English vs. non-English)

Dependent Variables

  • Accuracy (correct vs. incorrect answers)
  • Token usage (input + output tokens)
  • Estimated API cost

Controlled Variables

  • Same question sets
  • Same models
  • Same temperature and generation parameters
  • Same evaluation criteria
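Holding generation parameters fixed across models and languages can be done with one shared settings dictionary passed to every API call. A minimal sketch — the values below are illustrative, not the study's actual settings:

```python
# Shared generation settings applied identically to every model,
# language, and question (illustrative values, not the study's).
GEN_PARAMS = {
    "temperature": 0.0,  # deterministic decoding for reproducibility
    "max_tokens": 512,   # cap response length
    "top_p": 1.0,        # disable nucleus-sampling truncation
}
```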

Metrics

  • Accuracy percentage
  • Average tokens per response
  • Cost per 100 questions
  • Cost vs. accuracy tradeoff analysis
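The metrics above reduce to a few arithmetic helpers. In this sketch, the per-1K-token prices are parameters rather than hard-coded, since published API pricing varies by model:

```python
def accuracy_pct(results):
    """Share of correct answers, as a percentage (results: booleans)."""
    return 100.0 * sum(results) / len(results)

def avg_tokens(token_counts):
    """Mean total tokens (input + output) per response."""
    return sum(token_counts) / len(token_counts)

def cost_per_100(prompt_tokens, completion_tokens,
                 in_price_per_1k, out_price_per_1k, n_questions):
    """Estimated API cost, scaled to a standard batch of 100 questions."""
    cost = (prompt_tokens / 1000) * in_price_per_1k \
         + (completion_tokens / 1000) * out_price_per_1k
    return cost * 100 / n_questions
```

For example, 50,000 input and 20,000 output tokens over 100 questions at $0.005 / $0.015 per 1K tokens comes to $0.55 per 100 questions.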

Models Tested

  • OpenAI GPT models
  • Google Gemini models
  • (Add additional models if applicable)

Methodology

  1. Selected SAT Math and Reading questions.
  2. Generated structured prompts in multiple languages.
  3. Queried each model using identical parameters.
  4. Logged:
    • Model responses
    • Correct/incorrect results
    • Input/output token counts
  5. Calculated cost using published API pricing.
  6. Compared accuracy and cost tradeoffs across languages.
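The query-and-log loop (steps 3–4) can be sketched as below. Here `ask` stands in for the actual API call (OpenAI or Gemini, invoked with the fixed generation parameters) and returns the answer plus token counts; the question fields are illustrative, not the repository's actual schema:

```python
def run_trial(ask, questions):
    """Query a model on each question and log the fields used in analysis.

    `ask(prompt)` returns (answer, prompt_tokens, completion_tokens);
    in the real experiment it wraps an API call with fixed parameters.
    Each question is a dict with hypothetical keys: id, prompt, gold.
    """
    log = []
    for q in questions:
        answer, p_tok, c_tok = ask(q["prompt"])
        log.append({
            "id": q["id"],
            # Grade by comparing the normalized answer letter to the key
            "correct": answer.strip().upper() == q["gold"],
            "prompt_tokens": p_tok,
            "completion_tokens": c_tok,
        })
    return log
```

Injecting `ask` as a function keeps the loop identical across providers and languages, so only the prompt text (the independent variable) changes between runs.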

Key Findings

(Add your actual results here)

Example format:

  • Chinese prompts improved math reasoning accuracy by X%.
  • English prompts used fewer tokens on reading tasks.
  • Certain models showed higher sensitivity to linguistic structure.
  • Cost-efficiency varied depending on prompt language.

Why This Matters

Prompt language is often overlooked in LLM evaluation.

This experiment provides insights into:

  • Prompt engineering optimization
  • Multilingual LLM behavior
  • Cost-efficient AI deployment
  • Model sensitivity to linguistic structure

The findings are relevant to AI researchers, engineers, and product builders.


Tech Stack

  • Python
  • OpenAI API
  • Google Gemini API
  • Pandas
  • Matplotlib
  • Jupyter Notebooks

Repository Structure

/data → SAT question datasets
/experiments → Prompt testing scripts
/analysis → Evaluation + cost analysis
/results → Accuracy + token metrics


Author

Lucas Trinh
Computer Science Student, Irvine Valley College
IVC/Saddleback Symposium Presenter


License

This project is for academic and research purposes.

Data (sat-en.jsonl and sat-math.jsonl) from: https://github.com/ruixiangcui/AGIEval
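The AGIEval files store one JSON object per line, so they can be loaded with a short helper (exact field names should be checked against the files themselves):

```python
import json

def load_jsonl(path):
    """Load a JSON-Lines file (e.g. sat-en.jsonl / sat-math.jsonl):
    one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```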
