- We are going live with Human annotation as part of Subjective evals
- Alongside that, we need another 1-2 metrics that are computed automatically
- Based on a spot check of a few samples from Raj's data, Semantic similarity (via LLM) looked like a sensible metric alongside WER
- But we can only finalise the choice after checking more data across languages.
ref: https://discord.com/channels/1014768296257654865/1463434906712670446/1471472813411012723
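
For reference, a minimal sketch of how WER is usually computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration, not necessarily what our pipeline does. Whitespace splitting in particular will not suit languages written without spaces, so treat this as a starting point only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping is applied here).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref_words)
```

Example: `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6. Note that WER can exceed 1.0 when the hypothesis has many extra words.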