A paper published by researchers at Princeton University in April introduced a large data corpus of sarcastic statements for training and evaluating natural language processing (NLP) systems for sarcasm detection.
The Self-Annotating Reddit Corpus (SARC) presents data mined from social media site Reddit, where users often tag (self-annotate) their sarcastic comments with “/s” so other users know they are not being serious. The SARC corpus contains 1.3 million sarcastic statements, 10 times more than any previous dataset. Previous datasets that mined social media focused on Twitter, where users can self-annotate with “#sarcasm” and other labels. But the Princeton research group found Reddit more reliable because a much larger percentage of its users are likely to self-annotate (0.927% on Reddit vs. 0.002% on Twitter).
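To make the self-annotation idea concrete, here is a minimal sketch of how a comment ending in “/s” could be turned into a labeled example. The function name `label_comment` and the regular expression are illustrative assumptions, not the researchers' actual extraction code, and a real pipeline would likely need further filtering.

```python
import re

# Assumption: the sarcasm tag is a standalone "/s" token at the very end of
# the comment. This pattern is illustrative, not taken from the SARC paper.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$", re.IGNORECASE)


def label_comment(text):
    """Return (clean_text, is_sarcastic) for a single comment string."""
    match = SARCASM_TAG.search(text)
    if match is None:
        # No trailing tag: keep the text and treat it as non-sarcastic.
        return text.strip(), False
    # Strip the tag so a classifier cannot simply learn to look for "/s".
    return text[:match.start()].strip(), True


if __name__ == "__main__":
    print(label_comment("Great, another Monday. /s"))
    print(label_comment("See https://example.com/s"))
```

Requiring whitespace (or the start of the comment) before “/s” keeps URLs and paths that merely end in those characters from being counted as sarcasm tags.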
Detecting sarcasm in text is difficult even for humans, so it poses a major stumbling block for NLP programs. The research group says its corpus can be directly applied to training and evaluating sarcasm detection systems. For example, a system can learn to detect sarcasm by using smooth inverse frequency (SIF) weighting on words like “sure,” “totally” and “wow” to determine the likelihood of a sarcastic statement.
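The article does not spell out the weighting scheme, so the following is only a sketch of frequency-based word weighting of the kind mentioned above. It assumes the commonly used smooth inverse frequency form, weight(w) = a / (a + p(w)), where p(w) is a word's relative frequency and a is a small constant. The function names, the toy corpus, and the two-dimensional embeddings are all hypothetical, and the principal-component removal step often paired with this weighting is omitted.

```python
from collections import Counter


def sif_weights(tokenized_docs, a=1e-3):
    """Map each word to a / (a + p(word)); frequent words get smaller weights."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def weighted_sentence_vector(tokens, weights, embeddings, dim):
    """Average the word embeddings of a sentence, scaled by their weights."""
    vec = [0.0] * dim
    known = [t for t in tokens if t in weights and t in embeddings]
    for tok in known:
        for i, value in enumerate(embeddings[tok]):
            vec[i] += weights[tok] * value
    # Unknown-only or empty sentences fall back to the zero vector.
    return [v / len(known) for v in vec] if known else vec


if __name__ == "__main__":
    # Toy corpus for illustration only; not drawn from SARC.
    docs = [
        ["sure", "that", "will", "totally", "work"],
        ["that", "is", "the", "plan"],
        ["wow", "that", "is", "great"],
    ]
    weights = sif_weights(docs)
    # Made-up 2-dimensional embeddings, purely for demonstration.
    embeddings = {"wow": [1.0, 0.0], "that": [0.0, 1.0]}
    print(weighted_sentence_vector(["wow", "that"], weights, embeddings, 2))
```

The resulting sentence vectors would then be fed to an ordinary classifier trained on the sarcastic and non-sarcastic labels; the weighting itself only controls how much each word contributes.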
“Since sarcasm often involves humans stating something opposed to their beliefs or wants, it is important for chatbots and intelligent assistants to be able to understand when a person is being sarcastic,” lead author Mikhail Khodak said in an interview with The Register.
“It is quite difficult for both machines and humans to distinguish sarcasm without context,” Khodak went on. “One of the advantages of our corpus is that we provide the text preceding each statement as well as the author of the statement, so algorithms can see whether it is sarcastic in the context of the conversation or in the context of the author’s past statements.”