Abstract: Automatic evaluation of sequence generation, which has traditionally relied on metrics such as BLEU and ROUGE, often struggles to capture the semantic accuracy of generated text due to an ...