Tuesday, March 20, 2018

Introduction to Natural Language Processing with Women Who Code

In 2016 Women Who Code Austin hosted a series of five presentations on Natural Language Processing. The presenter was our member Diana, who has a Ph.D. in linguistics and has worked in the area of computational linguistics for many years. She demonstrated some basic text analysis one can do with the Python Natural Language Toolkit, or NLTK for short.

She presented all this in a Python notebook. A Python notebook is software that lets you combine text, code, and the output of that code on one page. You can run a code snippet right there in the notebook, and the results update automatically. So equipped, Diana introduced us to the basics of what computational linguists do. Or if that sounds too ambitious, let's just say she showed some simple things one can do with NLTK.

For example:

  • read in the text,
  • tokenize,
  • tag,
  • remove punctuation,
  • remove stopwords,
  • build a frequency hash table from the remaining words.
The first Austin Women Who Code meeting on natural language processing, with our instructor Diana standing in the center
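
The steps in the list above can be sketched in a few lines of plain Python. This is a dependency-free illustration of the idea, not the code from Diana's notebook: the tiny stopword list and the sample sentence are made up, and the tagging step is left out because it needs a trained model. In NLTK itself the corresponding pieces would be calls such as `nltk.word_tokenize`, `nltk.corpus.stopwords.words("english")` and `nltk.FreqDist`.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK ships a much longer one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def clean(tokens):
    """Lower-case the tokens, then drop punctuation and stopwords."""
    words = [t.lower() for t in tokens if t.isalnum()]
    return [w for w in words if w not in STOPWORDS]


def frequency_table(words):
    """Map each remaining word to the number of times it occurs."""
    return Counter(words)


# Made-up sample text, standing in for a file read from disk.
text = "The cat sat on the mat. The cat is happy, and the mat is warm."
tokens = tokenize(text)
freq = frequency_table(clean(tokens))
print(freq.most_common(3))
```

A `Counter` is a hash table under the hood, which is why it fits the "frequency hash table" step.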

She introduced concepts such as bigrams and collocations. Bigrams are pairs of words that are next to each other in a text. Collocations are pairs of words that naturally occur together in the language, i.e., the chance of them occurring together is greater than random. An example of a bigram that's not a collocation is Trump's use of the phrase "Lyin' Ted" (this was the spring of 2016, the height of the Republican fight for the presidential nomination). If a bigram is not a collocation in the language at large, but occurs more often than chance in a particular text, it can help identify the author or the speaker. It can be a fingerprint of sorts.
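
To make "greater than random" concrete, here is a minimal plain-Python sketch. It uses pointwise mutual information (PMI), which is one common way to score how strongly two words stick together; I am not claiming it is the measure Diana used. The sample sentence is made up. NLTK offers `nltk.bigrams` and, in `nltk.collocations`, `BigramCollocationFinder` with `BigramAssocMeasures` for the same job on real corpora.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))


def pmi(pair, tokens):
    """Pointwise mutual information of a bigram within `tokens`.

    Positive values mean the pair occurs together more often than it
    would if the two words were independent. Assumes the pair occurs
    at least once in `tokens`.
    """
    unigram_counts = Counter(tokens)
    pairs = bigrams(tokens)
    bigram_counts = Counter(pairs)
    p_pair = bigram_counts[pair] / len(pairs)
    p_first = unigram_counts[pair[0]] / len(tokens)
    p_second = unigram_counts[pair[1]] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


# Made-up sample text.
tokens = "new york is big and new york is busy but the town is new".split()
print(pmi(("new", "york"), tokens))  # the pair that keeps recurring
print(pmi(("is", "new"), tokens))    # a pair that occurs only once
```

In this toy text ("new", "york") scores higher than ("is", "new"), because "york" never appears without "new" in front of it.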

The aforementioned tagging is something we do after tokenizing (roughly speaking, breaking the text up into words). Tagging assigns a short tag, such as NN or VBP, to each word, marking it as a part of speech, such as noun, adverb, etc. "You can use a big list of tags, or a simplified one. Using a simplified list of tags can help with speed of analysis of your corpus," said Diana.

Here Diana noted that tagging words as parts of speech is inherently ambiguous -- exactly the kind of thing that makes language and its computational processing so interesting. Here is an example of part-of-speech ambiguity in a sentence: "They refuse to permit us to obtain the refuse permit". Still, NLTK correctly tags the first "refuse" and "permit" as verbs (VBP and VB, respectively) and the second instance of each as a noun (NN).

This slide shows how NLTK correctly identifies parts of speech in the sentence above. The first instances of "refuse" and "permit" are tagged as verbs (VBP and VB), whereas the second ones are tagged "NN" -- nouns.

At the second meeting we did all those actions with a corpus of -- wait for it -- Hillary Clinton's emails. Her emails were available for download from the Kaggle site. This was still the spring of 2016, and we did not yet know how sad the implications of those emails would turn out to be, so the choice of the subject wasn't as... emotionally loaded as it would have been just half a year later. And to say "we" did this is an exaggeration, because it was actually Diana who did all the processing and presented the code and the results to us in a Python notebook.

Here was the complete agenda of the meeting:

  • Getting data: Hillary Clinton's emails;
  • Reading files;
  • Using Pandas to create a DataFrame in Python;
  • Cleaning data: eliminating punctuation, eliminating stopwords, normalizing data (converting to lower case), tokenizing words;
  • Visualizing data.

All of this pre-processing of data was done in Python with the Natural Language Toolkit (NLTK).

I must say I would have preferred it if Diana had set up this mini-course as a series of exercises for us to do in class, so that we would write some code calling NLTK methods ourselves. But if we had done that, we would not have been able to cover even half as much in those four meetups. So I appreciate what Diana did. At least she showed us what kind of beast NLTK is and which fork to eat it with. In the process we learned some basic NLP lingo, such as:

  • corpus -- a body of text, plural corpora; it's what you process to extract words and do computations with them;
  • lexicon -- words and their meanings; example: an English dictionary. However, you need to consider that different fields have different lexicons. For example, to a financial investor the first meaning of the word "bull" is someone who is confident about the market, whereas in the common English lexicon the first meaning of "bull" is an animal. As such, there are special lexicons for financial investors, doctors, mechanics, and so on;
  • token -- each "entity" that results from splitting up a text according to some rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize a paragraph into sentences;
  • frequency distribution -- the frequency distribution method of NLTK counts the frequency of each vocabulary item in the text. It helps identify the most informative words in a corpus.

So overall I got a little familiar with the very basics of what natural language scientists do. But somehow, during those four meetings I was still hoping that we would get past collecting statistics about words and get to some mysterious insights about how language works, evolves, and transforms our thoughts, insights that only computer analysis of language can provide. Of course, my expectations were unrealistically inflated for a set of introductory lessons.

Going back to Hillary Clinton's emails, here is how you would analyze them. This is an "Exploratory Analysis: Getting and Cleaning Data" slide. Here you see the metadata fields that were extracted from the emails. There are quite a few of them.

Python dataframe with the metadata fields extracted from Hillary Clinton's emails.
This slide, "Slicing dataframe to extract subject", shows Python method calls that you would use to extract the email subjects from the dataframe shown in the previous image. Presented in a Python notebook, it alternates code with results of that code. The results can be updated on the fly if you make changes to the code. The MetaDataSubject and MetaDataTo fields contain some familiar names and topics that made the news...

The next slide shows the use of the NLTK method "concordance". It lists every occurrence of a given word in the text, along with the passage where it occurs. So if you want all occurrences of the word "surprize" (Austen's spelling of "surprise") in Jane Austen's "Emma", with snippets of context, you can call

emmaText.concordance("surprize")

(Here, emmaText is the variable that holds the text of Jane Austen's novel "Emma".) From this example you can also see that NLTK comes with corpora of texts from Project Gutenberg, which is pretty handy.

Concordance: all the places in Jane Austen's "Emma" where the word "surprize" is used, obtained by calling the "concordance" method of NLTK.
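
To show what a concordance does under the hood, here is a minimal plain-Python sketch. The function name, the `width` parameter and the sample sentence are all my own inventions for illustration (the sentence is not a quote from "Emma"), and the real NLTK method formats its output differently. As far as I can tell, a variable like emmaText is typically built with something like `nltk.Text(nltk.corpus.gutenberg.words("austen-emma.txt"))`, though I don't have Diana's exact code.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]      # context before the hit
            right = tokens[i + 1:i + 1 + width]     # context after the hit
            lines.append(" ".join(left + [token] + right))
    return lines


# Made-up sample sentence, already split into tokens.
sample = "It was a great surprize to her and the surprize pleased them all".split()
for line in concordance(sample, "surprize"):
    print(line)
```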

Venturing deeper into natural language processing

The easiest texts to analyze are news articles, Diana said. News has very good structure. Sentences tend to be short, and tend to have classical structure: subject, verb, object, etc. Medication instructions are also easy to analyze, since they are required to have readability scores high enough to be suitable for 9-12 year olds. But in literature the sentences are often unconventional and much harder to parse.

At the last meetup we talked a little bit about analyzing texts "for real". And by that I mean a little deeper analysis than just breaking up sentences into parts of speech and gathering statistics about them.

One example where computational linguistics is used is grading student essays. If you have so many essays that hiring human graders would be cost-prohibitive, natural language processing can help. For example, if an essay is supposed to be on the US Declaration of Independence, the script would check whether certain words are present in it in a certain way, and conclude that the student might have a certain level of understanding of the topic. (Yes, I know, this raises lots of questions about creativity versus clichéd, cookie-cutter texts: the latter would be more likely to hit all the points that a grading program is looking for, whereas the former might be difficult for a program to discern. But we didn't cover such questions at the meeting, since it's uncharted territory.)

We touched upon sentiment analysis, which helps determine how customers feel about an experience they had with a brand or a company. Companies like HomeAway use it to analyze customer reviews of their rental properties. And they discover unexpected things that way. For example, analysis of customer reviews of B&B-type places showed that the greatest predictor of customer satisfaction is whether a house has pots and pans.

Sentiment analysis also shows that, for example, if you try to infer customer satisfaction from the reviews by searching for wait times, you'll get inconsistent results. 15 minutes would be bad for a restaurant, but lightning-fast for an emergency room.

And this is where people try to determine degrees and ways of relatedness or similarity between concepts.

For that, they can use ontologies.

What is Ontology?

A consensus is now established about the definition and the role of an ontology in knowledge engineering: "An ontology is a formal, explicit specification of a shared conceptualization".

It is used in cognitive modeling.

More about Ontologies

An ontology is a schema (model) describing the types (and possibly some individuals) in a domain, the relationships that may exist between types and individuals, and constraints on the way individuals and properties may be combined.

Here is an example of what an ontology can contain:

  • Classes: Project, Person, ProjectManager. ProjectManager is a subclass of Person. People and Projects are disjoint.
  • Relationships: worksOn, manages. Manages is a sub-property of worksOn.
  • Constraints: People work on Projects, not the other way around. Only ProjectManagers can manage Projects.

This simple example enables machine inferences, e.g. if X manages Y, then we can infer that Y is Project, and X is a ProjectManager and therefore a Person.
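
The inference above can be sketched in plain Python. This is only a toy illustration under my own assumptions: the dictionary layout and the "isA" label are hypothetical, and real ontologies are usually written in dedicated languages and processed by dedicated reasoners rather than hand-rolled code like this.

```python
# Class hierarchy, child -> parent: ProjectManager is a subclass of Person.
SUBCLASS_OF = {"ProjectManager": "Person"}

# Property hierarchy: manages is a sub-property of worksOn.
SUBPROPERTY_OF = {"manages": "worksOn"}

# Constraints: property -> (class of the subject, class of the object).
DOMAIN_RANGE = {
    "worksOn": ("Person", "Project"),
    "manages": ("ProjectManager", "Project"),
}


def ancestors(name, hierarchy):
    """Return `name` followed by everything above it in `hierarchy`."""
    chain = [name]
    while chain[-1] in hierarchy:
        chain.append(hierarchy[chain[-1]])
    return chain


def infer(subject, prop, obj):
    """Return the set of facts implied by the statement (subject, prop, obj)."""
    facts = set()
    for p in ancestors(prop, SUBPROPERTY_OF):
        facts.add((subject, p, obj))
        subject_class, object_class = DOMAIN_RANGE[p]
        for c in ancestors(subject_class, SUBCLASS_OF):
            facts.add((subject, "isA", c))
        for c in ancestors(object_class, SUBCLASS_OF):
            facts.add((obj, "isA", c))
    return facts


facts = infer("X", "manages", "Y")
for fact in sorted(facts):
    print(fact)
```

From the single statement "X manages Y" the sketch derives that X also works on Y, that X is a ProjectManager and a Person, and that Y is a Project.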

Ontologies allow people to create trees representing relationships between concepts, like this:

This is an example of a tree that expresses relationships between concepts in academia (Student, Employee, Faculty, etc.)

Some people propose ways to measure the similarity of concepts by graph metrics, such as the shortest path between two nodes.

Measure of similarity between two concepts in a graph, expressed in terms of a shortest path between two concepts.
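
Here is a minimal plain-Python sketch of the shortest-path idea. The tree below is hypothetical, only loosely modelled on the academic example (the slide's actual tree is not reproduced here), and turning the distance into a score with 1 / (1 + distance) is one common choice among several, not necessarily the formula from the slide.

```python
from collections import deque

# Hypothetical concept tree: each pair is (parent, child).
EDGES = [
    ("Person", "Student"),
    ("Person", "Employee"),
    ("Employee", "Faculty"),
    ("Employee", "Staff"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (parent, child) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Score in (0, 1]: the shorter the path, the more similar the concepts."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1 / (1 + dist)


graph = build_graph(EDGES)
print(path_similarity(graph, "Faculty", "Staff"))    # siblings under Employee
print(path_similarity(graph, "Faculty", "Student"))  # further apart in the tree
```

In this toy tree Faculty comes out as more similar to Staff than to Student, because the path between them is shorter.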

More pictures from the Women Who Code Austin meetup series on Natural Language Processing are in my photo gallery.