Tuesday, December 25, 2018

Natural Language Processing hackathon, or don't judge the wine by the shape of the bottle

In April of 2018 I went to a Natural Language Processing hackathon. Organized by Women in Data Science Austin, it took place at Dell, where one of the organizers worked. This was not the kind of hackathon where you hack for the whole weekend straight, crashing on a beanbag to catch a few winks in the breakroom of some hipster startup. No, this was a hackathon with work-life balance. It lasted from 10 am to 3 pm on a Saturday, which is just enough time for you to get deeply enough immersed in a subject to fire up your appetite for it, but not get sick of it. There were no minimal viable products produced, and no prizes, but I got to sink my teeth into the basics of Natural Language Processing.

A data scientist named Becky, who does Natural Language Processing for an Austin company, introduced us to the three cornerstone approaches of NLP -- summarization, topic modeling, and sentiment analysis.

Data scientist Becky talks about topic modeling
Data scientist Becky talks about topic modeling.

Sentiment analysis quantifies the subjective emotion in a text, e. g. did the majority of reviewers like or didn't like a particular wine? Data scientists don't take into account just the words, but also such nonverbal information as capitalization (a word in all caps is likely to mean the author feels strongly about it), and emoji. Topic modeling finds abstract concepts that occur in a body of texts, a. k. a. corpus. For exaple, if it finds the words milk, meow, and kitten, it might decide one of the topic of this text is cat. If it finds the words bone, bark, and puppy, it might decide one of the topics is dog.

Summarization reduces a text to several key phrases or a representative sentence. Summarization can be extractive or abstractive. Extractive summarization selects a few representative sentences from the text, while abstractive summarization creates a summary of the text.

As an example, Becky gave a phrase: "The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during the storm, according to documents obtained by the Associated Press."

Extractive summarization would extract such phrases from it as:

  • Army Corps of Engineers
  • President Bush
  • New Orleans
  • defective flood-control pumps

In contrast, abstractive summarization would generate such phrases as:

  • government agency
  • presidential orders
  • defective equipment
  • storm preparation
  • hurricane Katrina
Natural Language Processing hackathon hosted by Women in Data Science Austin
As many of the hackathon attendees as could fit in the picture.

I can't quite put my finger on it, but it seems that extractive summarization extracts names of specific entities, but not much information as to what happened to those entities or what did they do. But abstractive summarization seems to "understand" what those entities actually represent and what they do, and thereby extracts more "gist" from the paragraph. I could be wrong about it, of course.

According to Becky, extractive summarization is a mostly solved problem by now. TextRank algorithm takes care of it. But abstractive summarization is a very difficult, unsolved problem, though knowledge graphs help.

At the organizers' suggestion, the attendees arranged themselves into three teams, each focusing on one of those three pillars. The organizers brought with them the corpora, a. k. a. texts to be analyzed. Specifically, they brought wine reviews, lots and lots of them. I suppose that's the second best to bringing the actual wine.

Summarizing wine reviews means extracting an "essence" of what the bulk of the reviewers said about a particular wine. It means identifying certain qualities that most reviewers noticed in a given wine. Sentiment analysis meant identifying whether the reviewers thought mostly positively or mostly negatively about the wine.

I ended up in the summarization team. Lead by Randi, who is a data scientist at a big company, we analyzed the wine reviews. By that I mean we called a bunch of functions from pandas, textacy, sumy and other relevant Python packages. The results were mixed. For example, sumy summarized reviews of Moscato in two sentences, but we had no way to tell whether this summarization is good, i.e. whether those were the most representatives sentences from the reviews. It's funny how this is the kind of problem that one has no way of verifying -- at least none that I learned in my 5 hours of NLP bootcamp. Sure, you could read hundreds of reviews and try to get a "feel" whether those sentences were the most representative, but your "feel" would be subjective.

It makes Natural Language Processing feel like black box, and almost like magic -- until you notice that when you ask for 5-sentence summary, the summary includes duplicates for first two sentences. That looks odd, so you take a closer look at the texts and notice that there are duplicate sentences in the document itself. For all its magic, sumy can't figure that out.

Within sumy, you can choose which summarizer to use. First we used LexRank, and it turned out to be very slow. Then we tried another, LuhnSummarizer, and it was much faster, but the results not nearly as accurate. But how would you decide how accurate a summarization is, given that there are no exact criteria for accuracy that I know of? Well, the first summary described mouthfeel and acidity of Moscato. The second included things like the shape and color of the bottle. It left me with the same feeling one often gets interacting with artificial intelligence, that it's both very smart and very stupid at the same time.

No comments: