Tuesday, March 20, 2018

Introduction to Natural Language Processing with Women Who Code

In 2016 Women Who Code Austin hosted a series of five presentations on Natural Language Processing. The presenter was our member Diana, who has a Ph.D. in linguistics and has worked in the area of computational linguistics for many years. She did demos of some basic text analysis one can do with the Python Natural Language Toolkit, or NLTK for short.

She presented all this as a Python notebook. A Python notebook is software that lets you combine text, code, and the output of that code on one page. You can run a code snippet right there in the notebook, and the results will get updated automatically. So equipped, Diana introduced us to the basics of what computational linguists do. Or if that sounds too ambitious, let's just say she showed some simple things one can do with NLTK.

For example:

  • read in the text,
  • tokenize it,
  • tag the tokens,
  • remove punctuation,
  • remove stopwords,
  • build a frequency hash table from the remaining words.
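Here is a minimal sketch of that pipeline in NLTK (the sample text and variable names are my own, not from Diana's notebook):

    import string
    import nltk
    from nltk.corpus import stopwords

    # One-time downloads of the tokenizer, tagger, and stopword list
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("stopwords")

    text = "This is a small sample text. It stands in for a real corpus."

    # Tokenize (break the text up into words) and tag parts of speech
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)

    # Remove punctuation and stopwords
    words = [t.lower() for t in tokens if t not in string.punctuation]
    words = [w for w in words if w not in stopwords.words("english")]

    # Build a frequency table from the remaining words
    freq = nltk.FreqDist(words)
    print(freq.most_common(5))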
The first Austin Women Who Code meeting on natural language processing, with our instructor Diana standing in the center

She introduced such concepts as collocations and bigrams. Bigrams are pairs of words that are next to each other in a text. Collocations are pairs of words that naturally occur in the language together, i.e., the chance of them occurring together is greater than random. An example of a bigram that's not a collocation is Trump's use of the phrase "Liar Ted" (this was the spring of 2016, the height of the fight for the Republican presidential nomination). If a bigram is not a collocation, but occurs more often than randomly in a text, that can help to identify who the author or the speaker is, and other such qualities. It can be a fingerprint of sorts.
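NLTK can also find collocations for you. Here is a minimal sketch, my own example on the "Emma" text from NLTK's built-in corpora, not something from the presentation:

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    nltk.download("gutenberg")
    words = nltk.corpus.gutenberg.words("austen-emma.txt")

    # Rank bigrams by pointwise mutual information: a high score means
    # the pair co-occurs far more often than chance, i.e. a collocation
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(5)  # ignore rare pairs
    print(finder.nbest(BigramAssocMeasures().pmi, 10))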

The aforementioned tagging is something we do after tokenizing (roughly speaking, breaking the text up into words). Tagging assigns a short tag to each word, marking it as a part of speech, such as noun, adverb, etc. "You can use a big list of tags, or a simplified one. Using a simplified list of tags can help with speed of analysis of your corpus," said Diana.

Here Diana noted that tagging words as parts of speech has inherent ambiguity in it -- exactly the kind of thing that makes language and its computational processing so interesting. Here is an example of part-of-speech ambiguity in a sentence: "They refuse to permit us to obtain the refuse permit". Still, the Python Natural Language Toolkit correctly tags the first "refuse" as a verb (VBP), the first "permit" as a verb (VB), and the second instance of each as a noun (NN).

NLTK correctly identifies parts of speech in the sentence 'They refuse to permit us to obtain the refuse permit'
This slide shows how NLTK correctly identifies parts of speech in the sentence above. The first instances of "refuse" and "permit" are tagged as verbs ("VBP" and "VB"), whereas the second ones are "NN" -- nouns.
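In code, that check is just a couple of lines (a sketch; the exact tags depend on the tagger model installed):

    import nltk

    sentence = "They refuse to permit us to obtain the refuse permit"
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
    # The first "refuse" and "permit" come back as verbs (VBP, VB),
    # the second instances as nouns (NN)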

At the second meeting we did all those actions with a corpus of -- wait for it -- Hillary Clinton's emails. Her emails were available for download from the Kaggle site. This was still the spring of 2016, and we did not yet know how sad the implications of those emails would turn out to be, so the choice of the subject wasn't as... emotionally loaded as it would have been just half a year later. And to say "we" did this is an exaggeration, because it was actually Diana who did all the processing and presented the code and the results to us in a Python notebook.

Here was the complete agenda of the meeting:

  • Getting data: Hillary Clinton's emails;
  • Reading files;
  • Using Pandas to create a DataFrame in Python;
  • Cleaning data: eliminating punctuation, eliminating stopwords, normalizing data (converting to lower case), tokenizing words;
  • Visualizing data.

All of this pre-processing of data was done in the Python Natural Language Toolkit (NLTK).
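A hedged sketch of that workflow, where the CSV file name and the column names are my assumptions (adjust them to the actual Kaggle download):

    import string
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Read the Kaggle email dump into a DataFrame
    emails = pd.read_csv("Emails.csv")  # file name is an assumption

    def clean(text):
        # Normalize: lower-case, tokenize, drop punctuation and stopwords
        tokens = word_tokenize(str(text).lower())
        stops = set(stopwords.words("english"))
        return [t for t in tokens if t not in string.punctuation and t not in stops]

    # "RawText" is an assumed column name for the email body
    emails["tokens"] = emails["RawText"].apply(clean)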

I must say I would have preferred it if Diana had set up this mini-course as a series of exercises for us to do in class, writing some code that calls NLTK methods ourselves. But if we had done that, we would not have been able to cover even half as much in those four meetups. So I appreciate what Diana did: at least she showed us what kind of beast NLTK is and which fork to eat it with. In the process we learned some basic NLP lingo, such as:

  • corpus -- a body of text (plural: corpora); it's what you process to extract words and do computations with them;
  • lexicon -- words and their meanings, for example an English dictionary. Keep in mind that different fields have different lexicons: to a financial investor, the first meaning of the word "bull" is someone who is confident about the market, whereas in the common English lexicon the first meaning is an animal. So there are special lexicons for financial investors, doctors, mechanics, and so on;
  • token -- each "entity" that results when text is split up according to some rules. For example, each word is a token when a sentence is "tokenized" into words; each sentence can also be a token, if you tokenize a paragraph into sentences;
  • frequency distribution -- a count of how often each vocabulary item occurs in the text. NLTK's frequency distribution method helps identify the most informative words in a corpus.

So overall I got a little familiar with the very basics of what natural language scientists do. But somehow, during those four meetings I kept hoping that we would get past collecting statistics about words and arrive at some mysterious insights about how language works, evolves, and transforms our thoughts, the kind of insights that only computer analysis of language can provide. Of course, my expectations were unrealistically inflated for a set of introductory lessons.

Going back to Hillary Clinton's emails, here is how you would analyze them. This is an "Exploratory Analysis: Getting and Cleaning Data" slide. Here you see the metadata fields that were extracted from the emails. There are quite a few of them.

Python DataFrame with the metadata fields extracted from Hillary Clinton's emails.
This slide, "Slicing dataframe to extract subject", shows Python method calls that you would use to extract the email subjects from the dataframe shown in the previous image. Presented in a Python notebook, it alternates code with results of that code. The results can be updated on the fly if you make changes to the code. The MetaDataSubject and MetaDataTo fields contain some familiar names and topics that made the news...

The next slide shows the use of the NLTK method "concordance". It lists every occurrence of a given word in the text, together with the passage where it occurs. So if you want all occurrences of the word "surprize" (Austen's spelling of "surprise") in Jane Austen's "Emma", with snippets of context, you can call

emmaText.concordance("surprize")

(Here, emmaText is the variable that holds the text of Jane Austen's novel "Emma".) From this example you can also see that NLTK comes with corpora of texts from Project Gutenberg, which is pretty handy.

Concordance: all the places in Jane Austen's 'Emma' where the word 'surprize' is used
Concordance: all the places in Jane Austen's 'Emma' where the word 'surprize' is used, obtained by calling the 'concordance' method of NLTK.
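For instance, here is a minimal sketch of how emmaText can be built from NLTK's Gutenberg corpus (the variable name follows the slide):

    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg")

    # Wrap the tokens in a Text object to get methods like concordance()
    emmaText = nltk.Text(gutenberg.words("austen-emma.txt"))
    emmaText.concordance("surprize")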

Venturing deeper into natural language processing

The easiest texts to analyze are the news, Diana said. News stories have very good structure: sentences tend to be short, and tend to follow the classical pattern of subject, verb, object. Medication instructions are also easy to analyze, since they are required to have readability scores that make them suitable for 9-12 year olds. But in literature the sentences are often unconventional and much harder to parse.

At the last meetup we talked a little bit about analyzing texts "for real", and by that I mean somewhat deeper analysis than just breaking up sentences into parts of speech and gathering statistics about them.

One example of where computational linguistics is used is grading student essays. If you have so many essays that hiring human graders would be cost-prohibitive, natural language processing can help. For example, if an essay is supposed to be on the US Declaration of Independence, the script would check whether certain words are present in it in a certain way, and conclude that the student might have a certain level of understanding of the topic. (Yes, I know, this raises lots of questions about creativity versus clichéd, cookie-cutter texts: the latter would be more likely to hit all the points a grading program is looking for, whereas the former might be difficult for a program to discern. But we didn't cover such questions at the meeting, since it's uncharted territory.)
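A crude sketch of the idea, entirely my own toy illustration (real essay-grading systems are of course far more sophisticated), with a made-up term list:

    # Toy keyword-coverage check: what fraction of expected topic terms
    # appear in the essay?
    EXPECTED = {"independence", "declaration", "colonies", "jefferson", "rights"}

    def coverage_score(essay):
        words = {w.strip(".,;:!?") for w in essay.lower().split()}
        return len(EXPECTED & words) / len(EXPECTED)

    print(coverage_score("The Declaration of Independence asserted the rights of the colonies."))  # 0.8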

We touched upon sentiment analysis, which helps determine how customers feel about an experience they had with a brand or a company. Companies like HomeAway use it to analyze customer reviews of their rental properties. And they discover unexpected things that way. For example, analysis of customer reviews of B&B-type places showed that the greatest predictor of customer satisfaction is whether a house has pots and pans.

Sentiment analysis also shows that, for example, if you try to infer customer satisfaction from reviews by looking at wait times, you'll get inconsistent results: 15 minutes would be bad for a restaurant, but lightning-fast for an emergency room.
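NLTK ships a simple rule-based sentiment scorer that one could try on review text. Here is a sketch using the VADER analyzer (my choice for illustration, not something we covered at the meetup):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")

    sia = SentimentIntensityAnalyzer()
    review = "The kitchen had every pot and pan we needed. Wonderful stay!"
    # The compound score ranges from -1 (most negative) to +1 (most positive)
    print(sia.polarity_scores(review))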

And this is where people try to determine degrees and ways of relatedness or similarity between concepts.

For that, they can use ontologies.

What is Ontology?

A consensus has now been established about the definition and the role of an ontology in knowledge engineering: "An ontology is a formal, explicit specification of a shared conceptualization."

It is used in cognitive modeling.

More about Ontologies

An ontology is a schema (model) describing the types (and possibly some individuals) in a domain, the relationships that may exist between types and individuals, and constraints on the way individuals and properties may be combined.

Here is an example of a simple ontology:

  • Classes: Project, Person, ProjectManager. ProjectManager is a subclass of Person. People and Projects are disjoint.
  • Relationships: worksOn, manages. manages is a sub-property of worksOn.
  • Constraints: People work on Projects, not the other way around. Only ProjectManagers can manage Projects.

This simple example enables machine inference, e.g. if X manages Y, then we can infer that Y is a Project, and that X is a ProjectManager and therefore a Person.
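A toy sketch of that inference in plain Python (my own illustration; real ontology work would use a standard language such as OWL and a reasoner):

    # Toy ontology: domain/range constraints on the "manages" property
    DOMAIN = {"manages": "ProjectManager"}   # only ProjectManagers manage
    RANGE = {"manages": "Project"}           # only Projects are managed
    SUBCLASS = {"ProjectManager": "Person"}  # ProjectManager is a subclass of Person

    def infer_types(subject, prop, obj):
        # From one fact (X, manages, Y), infer the types of X and Y
        types = {subject: {DOMAIN[prop]}, obj: {RANGE[prop]}}
        cls = DOMAIN[prop]
        while cls in SUBCLASS:               # walk up the subclass chain
            cls = SUBCLASS[cls]
            types[subject].add(cls)
        return types

    print(infer_types("X", "manages", "Y"))
    # {'X': {'ProjectManager', 'Person'}, 'Y': {'Project'}}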

Ontologies allow people to create trees representing relationships between concepts, like this:

A tree that expresses relationships between concepts in academia (Student, Employee, Faculty, etc.)
This is an example of a tree that expresses relationships between concepts in academia (Student, Employee, Faculty, etc.)

Some people propose measuring the similarity of concepts by graph metrics, such as the shortest path between two nodes.

Measure of similarity between two concepts in a graph, expressed in terms of the shortest path between the two concepts.
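As a sketch of that metric, here is a tiny example using the networkx library and a made-up fragment of the academic tree:

    import networkx as nx

    # A tiny concept tree like the one on the slide
    G = nx.Graph()
    G.add_edges_from([
        ("Person", "Student"), ("Person", "Employee"),
        ("Employee", "Faculty"), ("Employee", "Staff"),
    ])

    # Fewer hops between two concepts suggests greater similarity
    print(nx.shortest_path_length(G, "Student", "Faculty"))  # 3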

More pictures from the Women Who Code Austin meetup series on Natural Language Processing are in my photo gallery.

Tuesday, February 20, 2018

Margaret Atwood at the Texas Book Festival in 2015

Margaret Atwood was interviewed at the Texas Book Festival in October of 2015. I have only read one of her books, The Handmaid's Tale, and as we know, it's a depressing and scary book. Considering that, the interview was surprisingly (to me) light-hearted and revolved heavily around pop culture. I got the impression that Margaret Atwood is quite engaged with it: she participates in art and experimental projects that revolve around books and reading.

One such project is the Future Library in Oslo. It was started by the artist Katie Paterson. In May of 2014 she planted 1000 trees near a forest in Oslo. These trees will grow for 100 years. Every year a different writer from around the world, invited by a committee, each writing in a different language and a different genre, will contribute a manuscript in a sealed box to the Future Library. 100 years later all the boxes will be opened, and there will be enough wood from the grown trees to make the paper to print an anthology of those stories.

As Margaret Atwood explained, the stories can be in any form: one word, a poem, a short story. No images. And you cannot tell anybody what is in the box, except for the title. But these boxes will be in the Future Library with the author's name and the title visible. You can go into the library, see the names and titles, and imagine what could be in them. "So in May (of 2016), I'm going to Norway with my box, tied with a nice blue ribbon," said Margaret Atwood. "I imagine there might be a moment at the immigration checkpoint where they're going to ask me what is in that box, and I'm going to have to tell them, I don't know," she said, adding that that might not go over well.

She also noted that the success of this project was based on a number of assumptions: that people will want to read and will be able to read, that Oslo will still be there. (Not to mention an even more questionable assumption that books in a hundred years will still be printed on paper -- E.)

Margaret Atwood seems to encourage all the ways in which people consume and produce the written word nowadays, including mashups and remakes. For example, she wrote her own version of Shakespeare's play "The Tempest" for the Hogarth Shakespeare project, in which modern writers reimagine Shakespeare's works. She had a fan fiction contest for her latest book. (And no, she replied, she wasn't going to read all the thousands of entries herself. She had slush readers for that.) When asked if she was ready for other people to take over her characters, she indicated she had no problem with that. She said: "Fanfiction is very very old, except it wasn't called fanfiction. It started with the Greek mythology. When Don Quixote was published, there were a lot of other books published about Don Quixote by other authors. So Cervantes had to put out a notice that those other books aren't authentic."

She also contributed, even if in a small way, to the Zombies, Run! app. It's an interactive exercise app based on the premise that a zombie apocalypse is taking place and you are running from the zombies. At one point the run takes you to Canada, but the entire Canadian government has been zombified, and the whole NHL is made up of zombies on skates. However, you can establish contact with Margaret Atwood. Naomi Alderman, co-creator of the Zombies, Run! app, wrote her into the game. The way Margaret Atwood explained it, "I'm a pushover. You want to put me in a zombie game? Okay."

Margaret Atwood at the Texas Book Festival in October of 2015, surrounded by the audience members
Margaret Atwood (left) at the Texas Book Festival in October of 2015, surrounded by the audience members.

Despite the lighthearted tone of the conversation, the interviewer couldn't help but note that we were at the Texas Capitol, the place where the Texas Legislature makes laws, and some of the laws passed there recently resonated strongly with the themes of Margaret Atwood's most famous dystopian novel, "The Handmaid's Tale". You could get the impression that the Texas Legislature used "The Handmaid's Tale", um, aspirationally. So, not surprisingly, the interviewer brought up political topics.

"Margaret, you do a lot of advocacy work. And we are in the Texas state capitol, so I want to ask you about how far we have come and how far we have to go," said the interviewer, Kelly. (I don't remember her last name -- E.)

Margaret Atwood quipped something about making a law from here. (The interview took place literally in the House Chamber of the Texas Legislature. All the audience were sitting at the lawmakers' desks.) Then she said:

"The people who passed it (referring, I think, to a recent law severely restricting availability of abortion -- E.) don't think about the effect there will be down the line. Real people will have to live with these things. The effects will turn out to be not what they thought to be. For example, California reversed its draconian prison legislation because they couldn't afford it. I don't think you can really sustain the society if you alienate a lot of young people, because they're going to move somewhere else, and then who's going to pay for your old age? If you are prohibiting abortions, you may think that there will be lots of babies born, lots of poof children, future serfs? That might not work out that way."

As usual, there was time for audience questions.

A question from the audience. Oslo is building a huge library, but a few hundred feet from here there is a huge library that's mostly empty; there's nobody there. (I think he might have been referring to the Austin Public Library's central location. -- E.) So why do you think the Oslo Future Library will be successful?

Margaret Atwood replied that some libraries were very heavily used, for example, the New York or Toronto public library systems. "So I don't think it's a question of library or no library, it's a question of what kind of library it is, how accessible it is, and what kind of interactivity they offer. I believe that access to books and reading is one of the cornerstones of democracy," she said.

A woman from the audience says she's getting her PhD in literature, and (if I understood correctly) is teaching literature to freshmen. Making them read feels like she's murdering them. She asks if Margaret Atwood sees this unwillingness to read as a general rule for this generation, and if so, whether she has any advice.

Margaret Atwood. Freshmen read all the time. You can't use the internet without being able to read. There are places where they can write anonymously and post what they're really interested in, which may be vampire stories. Another way you can help them is audiobooks. But sometimes they just want to put in the studying time. When I was teaching grammar to engineering students, I started them on Kafka's parables, which are very short. So you can start your students on flash fiction. They're all 18, it's a difficult age. When I taught the same class to returning students, there was a huge difference: they wanted me to challenge them, they argued with me.

Make your students write a zombie or vampire story. Or an article on the economics of vampires. Vampires are always rich. Why is that? They are immortal: if someone became a vampire in 1930, how much money would they have accumulated by now? Have them do a business plan for being a vampire. There are two vampire movies where this accumulation of riches is shown explicitly. 1. An Iranian vampire western called "A Girl Walks Home Alone at Night", about a feminist Iranian vampire who kills only bad people, but in the process accumulates a lot of diamond watches. 2. "Let the Right One In", with a 12-year-old girl vampire. There is a classic line in it: a little boy asks her when he [starts suspecting something]: "How old are you really?" She replies: "I'm really 12. I've been a child for a very long time."

A woman from the audience. What words of comfort do you have for readers who know they'll never lay their eyes on your contribution to the Future Library?

Margaret Atwood. There are many books you'll never lay your hands or eyes on, because you've never heard of them. As a tribute to that idea, find a book you never heard of, read it, and find other people who love it.