Description
This is more of a design consideration for future development. Our current pair generator tries to minimize the score difference between the answers in each pair. With expert judges, this actually improves ranking reliability, since they can make fine distinctions between answers. With untrained judges, however, it hurts ranking reliability, since they find it harder to tell the answers apart.
For untrained judges, a gap between the scores of two answers makes it easier to tell them apart, and also gives 'incorrectly' judged answers more of a chance to climb back up the ranking. We should consider implementing such a gap in our pair generator.
There are two additional factors to consider, given that ComPAIR is a learning tool rather than an assessment tool:
- There might be more pedagogical benefit in having students try to distinguish between two very similar quality answers.
- Even with a score gap, around 12-15 rounds of comparisons are recommended for a reliable ranking, which is far more than ComPAIR's default of 3 rounds.
So perhaps the size of the gap could be made configurable.
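As a rough illustration, a configurable gap could be expressed as a window `[min_gap, max_gap]` on the score difference of candidate pairs. The sketch below is a hypothetical, simplified generator (the names `generate_pair`, `min_gap`, and `max_gap` are assumptions, not ComPAIR's actual API); setting `min_gap=0` reproduces the current "most similar scores" behaviour, while raising it enforces the gap discussed above.

```python
import itertools

def generate_pair(answers, scores, min_gap=0.0, max_gap=float("inf")):
    """Pick the pair whose score difference best fits the configured gap.

    Prefers pairs whose absolute score difference falls inside
    [min_gap, max_gap], choosing the one closest to the window's
    midpoint; falls back to the closest pair overall if none fit.
    Hypothetical sketch, not ComPAIR's real pair generator.
    """
    target = min_gap if max_gap == float("inf") else (min_gap + max_gap) / 2
    in_window, fallback = [], []
    for a, b in itertools.combinations(answers, 2):
        diff = abs(scores[a] - scores[b])
        (in_window if min_gap <= diff <= max_gap else fallback).append((diff, a, b))
    pool = in_window or fallback
    _, a, b = min(pool, key=lambda t: abs(t[0] - target))
    return a, b
```

For example, with scores `{"a1": 0.1, "a2": 0.15, "a3": 0.9}`, the default `min_gap=0` pairs the two near-identical answers, while `min_gap=0.5, max_gap=1.0` instead selects a pair with a clear quality gap.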
Thanks to Peter Thwaites (UCLouvain) for bringing this up and providing the papers below:
- Rangel-Smith and Lynch (2018), "Addressing the issue of bias in the measurement of…"
- Bramley (2015), "Investigating the reliability of adaptive comparat…"
- Bramley and Vitello (2019), "The effect of adaptivity on the reliability coeffi…"
Paper 1 provides recommendations for the score gap size. Papers 2 and 3 detail the issues with 'highly adaptive' pair generators like ComPAIR's.