You may recall Shawn Rutledge’s post from back in April, Humans vs. AI in a Sentiment Bout. And the winner is…, which explains human versus machine sentiment scoring by comparing it to a boxing match. Shawn recently helped us put together a “Quick Take” on this very topic; it explains the sentiment scoring experiment behind his post in a bit more detail.
Below you’ll find the content of the Quick Take. If you’d like a PDF of this piece please contact us at: community@visibletechnologies.com.
Can a Machine Really Score Sentiment?
It’s a classic case of office politics: the email that was sent with good intentions but was misinterpreted by its recipients, or interpreted in several different ways. And email has been around for decades. We shouldn’t be surprised, then, that the same kind of complexity is present in social media posts. Sentiment is a complicated thing to rate: it can be positive, negative, mixed, or neutral, and to varying degrees. In a single social media post, an author might intend to express a mixture of sentiments. And even if the sentiment is intended to be purely positive or negative, readers of a post can respond in very different ways. All of this makes sentiment scoring particularly challenging.

Visible has many ways to review, assess, and refine our scoring of social media content. As part of our ongoing review process, we decided to do a thorough, systematic review of one month’s worth of data for one customer. This Quick Take is a brief overview of what we found.
The Question
How did automated sentiment scoring perform when compared to human scoring?
The Data
We looked at one month’s worth of sentiment scoring for one customer. This included 778 social media posts that had been scored by both a person and a machine. The people doing the scoring are trained professionals, part of dedicated teams that have been labeling social media posts for more than five years. The teams have established processes to ensure high-quality labeling.

The machine scoring uses Visible’s state-of-the-art automated technology, developed by a team of staff scientists with extensive experience in text analytics, natural language processing, machine learning, and information retrieval. And because of Visible’s considerable human labeling practice, our scientists have access to an extensive set of sentiment data.
The Audit
We started with 778 social media posts that had been labeled both by machine and by people who are experienced labelers. We recruited social media professionals to serve as auditors. Each auditor was given an even (50/50) mix of posts that had been scored by machine and by the experienced human labelers. Each of the posts was reviewed by two auditors, and the auditors didn’t know which posts were scored by machine and which were scored by people.
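For readers who like to see things in code, here is a rough sketch of how a blind assignment like this could be set up. This is purely illustrative Python, not Visible’s actual tooling; the post IDs, auditor names, sample scores, and balancing logic are all assumptions made for the example.

```python
import random

def assign_audits(posts, auditors, reviews_per_post=2, seed=0):
    """Assign each post to `reviews_per_post` auditors, keeping auditors blind
    to whether the score they are checking came from a machine or a person."""
    rng = random.Random(seed)
    assignments = {name: [] for name in auditors}
    # Handle machine-scored and human-scored posts separately so that every
    # auditor ends up with a roughly even (50/50) mix of the two sources.
    for source in ("machine", "human"):
        group = [p for p in posts if p["source"] == source]
        rng.shuffle(group)
        for post in group:
            # Give the post to the two least-loaded auditors to keep workloads even.
            chosen = sorted(auditors, key=lambda a: len(assignments[a]))[:reviews_per_post]
            for auditor in chosen:
                # The score's source is deliberately left out of what auditors see.
                assignments[auditor].append(
                    {"id": post["id"], "text": post["text"], "score": post["score"]}
                )
    return assignments

# Hypothetical data: 778 posts, half carrying a machine score and half a human score.
posts = [{"id": i, "text": "...", "score": "pos",
          "source": "machine" if i % 2 == 0 else "human"}
         for i in range(778)]
auditors = ["auditor_a", "auditor_b", "auditor_c", "auditor_d"]
assignments = assign_audits(posts, auditors)
```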
The Results
First, we compared the machine scoring with the initial human scoring and found that the two agreed 89 percent of the time. The auditors agreed with the human labels 74 percent of the time, and they agreed with the machine labels 74 percent of the time as well; there was no statistically significant difference between the two. In fact, the auditors agreed with each other at about the same rate (73 percent of the time).
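To make the arithmetic behind these percentages concrete, here is a minimal sketch of how such pairwise agreement rates can be computed. The label values and lists below are made up for illustration; they are not the audit data.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of posts on which two sets of sentiment labels match exactly."""
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)

# Hypothetical sentiment labels for six posts (pos / neg / neutral / mixed).
machine = ["pos", "neg", "neutral", "pos", "mixed", "neg"]
human   = ["pos", "neg", "neutral", "neg", "mixed", "neg"]
auditor = ["pos", "neg", "mixed",   "neg", "mixed", "neg"]

print(f"machine vs. human:   {agreement_rate(machine, human):.0%}")   # 83% here
print(f"auditor vs. machine: {agreement_rate(auditor, machine):.0%}")  # 67% here
print(f"auditor vs. human:   {agreement_rate(auditor, human):.0%}")    # 83% here
```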
The accompanying chart shows how often one or both auditors agreed with the machine and human assessments of sentiment.
As the chart shows, the auditors’ assessments were very similar for machine and human scoring. At least one auditor agreed with the initial score 91 percent of the time, and the figure was identical for the two types of scoring. Both auditors agreed with the initial score 57 to 58 percent of the time; again, the results were statistically the same.
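Here is a similar sketch, again with made-up labels, showing how the “at least one auditor agreed” and “both auditors agreed” rates could be tallied when each post has been reviewed by two auditors.

```python
def auditor_agreement(original, auditor1, auditor2):
    """Return (at_least_one, both): the share of posts where at least one of the
    two auditors matched the original score, and where both of them did."""
    n = len(original)
    at_least_one = sum(1 for o, a1, a2 in zip(original, auditor1, auditor2)
                       if o == a1 or o == a2) / n
    both = sum(1 for o, a1, a2 in zip(original, auditor1, auditor2)
               if o == a1 and o == a2) / n
    return at_least_one, both

# Hypothetical scores for six posts.
original = ["pos", "neg", "neutral", "pos",     "mixed", "neg"]
auditor1 = ["pos", "neg", "mixed",   "pos",     "mixed", "pos"]
auditor2 = ["pos", "pos", "neutral", "neutral", "mixed", "neutral"]

at_least_one, both = auditor_agreement(original, auditor1, auditor2)
print(f"at least one auditor agreed: {at_least_one:.0%}")  # 83% in this toy example
print(f"both auditors agreed:        {both:.0%}")          # 33% in this toy example
```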
What We Learned
In the data that we looked at, there was no detectable difference between Visible’s machine scoring and scoring by experienced professionals. While the auditors agreed with each other 73 percent of the time, at least one auditor agreed with the machine and human evaluators 91 percent of the time. One interesting question is why the auditors agreed with each other only 73 percent of the time. Inter-annotator agreement has been widely studied in many domains. Annotators (auditors) can reach very high agreement levels with training and practice, but that isn’t likely to happen across all social media monitoring practitioners. With some effort, however, a team could reach that level of agreement for its own brand, within its own department. Both human and machine labelers can achieve greater accuracy by learning from consistent training and feedback.
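The audit above reports raw percent agreement. When inter-annotator agreement is studied more formally, a chance-corrected measure such as Cohen’s kappa is often used; the sketch below shows how it could be computed for two auditors. This is an illustration, not something reported in the Quick Take, and the labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random,
    # given each annotator's own label distribution.
    expected = sum((counts_a[lbl] / n) * (counts_b[lbl] / n)
                   for lbl in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two auditors.
auditor1 = ["pos", "neg", "neutral", "pos", "mixed", "neg", "pos", "neutral"]
auditor2 = ["pos", "neg", "mixed",   "pos", "mixed", "pos", "pos", "neutral"]
print(f"kappa = {cohens_kappa(auditor1, auditor2):.2f}")  # 0.65 for these toy labels
```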
As we noted at the beginning, an expression of sentiment is often complex, and the assessment of sentiment is not universal. Human assessors do not always agree with one another, and, not surprisingly, machine assessments do not always agree with human labeling. But in our audit, Visible’s automated scoring did as well as trained, experienced people in deciphering the sentiment expressed in social media posts.
More About Scoring Sentiment
To learn more about assessing and scoring social media sentiment, and Visible’s approach, read some of our other white papers at: http://www.visibletechnologies.com/resources/white-papers