Monday, December 31, 2018

Book review: Jo Walton, "The Just City"

This was an easy and pleasant read. But it could have been so much better if it had actually tackled the premise that it promised. The premise seemed ambitious; it promised speculative fiction with a capital S. But it didn't deliver.

The Just City in the title of the book is the city that Plato talked about in the Republic. The story describes the social experiment of the Republic implemented in real life. It's set on an island somewhere, presumably in the Mediterranean Sea, at an undefined time in prehistory, before the rise of the classical world. Even the Iliad and the Odyssey had not yet been written. It doesn't matter, because the inhabitants of the island are completely shut off from the surrounding world and have no interaction with it. Most of the inhabitants are 10-year-old children (10080 of them) who were bought from slave traders of different eras and brought to this island, across time, to be raised according to the Platonic concepts of justice. They are expected to implement Plato's Just City in real life. They are schooled by a number of teachers from different eras of history, all of whom have one thing in common: at some point they prayed to Athene. For it was Athene that set up this experiment, and transported people through time to bring them here.

The plot of the book is rather uneventful, but I was hoping for plot twists based on the moral dilemmas these people face, and how they have to adjust their experiment when it is not turning out as planned. Of course, the experiment does not turn out as Plato envisioned, because people are people and they bring their human natures with them here. They also bring their prejudices, perceptions, beliefs, and ways of doing things from their eras. So, not surprisingly, even in the Just City rape victims are still held responsible for their rape.

What's odd is that the "masters" -- the teachers who are responsible for the upbringing of these 10080 children -- do not question the notion of justice beyond the Platonic ideal. This ideal was held by a person who lived 2000+ years ago, and much of it doesn't jibe with our modern notion of justice. And many of the masters were from eras historically close to ours, or even beyond ours.

So it's strange that none of the teachers entertain a more modern paradigm of justice, even in matters of life and death, such as the availability of modern medicine. Athene "imported" something from a technologically advanced era (I won't say what, because one of the plot developments hinges on that), so she could have imported advanced medicine as well. And yet they don't treat sick newborns, but "expose" them, i.e. leave them in the wilderness to die. They oddly think it's more humane than to kill them. They do it even to babies with small birth defects like cleft lip or palate, which are entirely correctable in our times. What about treatment of injuries and illnesses that surely must have occurred among those 10080 children, because of sheer statistical likelihood? Was their medicine as barbaric as the medicine of ancient times? To be fair, one of the teachers mentions "mold drugs", so apparently they did import antibiotics from the future. But what about everything else? I would think that realistically this question would have popped up very early in the existence of the city, and I also think that those teachers who came from more modern times would find it a gross violation of ethics to not provide lifesaving treatments when they could be brought in from the future. And if you are dedicating your whole life to putting a vision of justice into practice, then surely you would assign the utmost importance to ethical questions?

In other words, I expect that realistically in such a city there would be never-ending debates, serious arguments, maybe even fights over whose ethical system is the most just. Yet none of it happens. Everybody leads a largely untroubled existence filled with philosophy, music, arts and sports, and nobody runs into ethically ambiguous situations in which Plato's vision directly contradicts their own internal sense of justice.

To be fair, something similar does start to happen towards the end, but it was a bit too late to make me "buy" into the book. The whole book seemed like one big missed opportunity to get deeply into ambiguities and paradoxes of justice.

Tuesday, December 25, 2018

Natural Language Processing hackathon, or don't judge the wine by the shape of the bottle

In April of 2018 I went to a Natural Language Processing hackathon. Organized by Women in Data Science Austin, it took place at Dell, where one of the organizers worked. This was not the kind of hackathon where you hack for the whole weekend straight, crashing on a beanbag to catch a few winks in the breakroom of some hipster startup. No, this was a hackathon with work-life balance. It lasted from 10 am to 3 pm on a Saturday, which is just enough time to get immersed in a subject deeply enough to fire up your appetite for it, but not to get sick of it. There were no minimum viable products produced, and no prizes, but I got to sink my teeth into the basics of Natural Language Processing.

A data scientist named Becky, who does Natural Language Processing for an Austin company, introduced us to the three cornerstone approaches of NLP -- summarization, topic modeling, and sentiment analysis.

Data scientist Becky talks about topic modeling.

Sentiment analysis quantifies the subjective emotion in a text, e.g. did the majority of reviewers like or dislike a particular wine? Data scientists don't take into account just the words, but also such nonverbal information as capitalization (a word in all caps is likely to mean the author feels strongly about it), and emoji. Topic modeling finds abstract concepts that occur in a body of texts, a.k.a. a corpus. For example, if it finds the words milk, meow, and kitten, it might decide one of the topics of this text is cat. If it finds the words bone, bark, and puppy, it might decide one of the topics is dog.
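
To make the capitalization and emoji idea concrete, here is a minimal sketch of a lexicon-based sentiment scorer in plain Python. It is not what was used at the hackathon: the tiny word list, the emoji table, and the "all caps counts double" rule are all assumptions made up for illustration.

```python
import re

# Hypothetical toy lexicon (word -> polarity); real lexicons are far larger.
LEXICON = {"love": 1.0, "great": 1.0, "crisp": 0.5,
           "bad": -1.0, "awful": -1.0, "flat": -0.5}
# Hypothetical emoji polarities: slightly smiling and slightly frowning faces.
EMOJI = {"\U0001F642": 1.0, "\U0001F641": -1.0}


def sentiment_score(review: str) -> float:
    """Sum word polarities; a word written in ALL CAPS counts double."""
    score = 0.0
    for token in re.findall(r"[A-Za-z']+", review):
        polarity = LEXICON.get(token.lower(), 0.0)
        # Single letters such as "I" are not treated as shouting.
        weight = 2.0 if token.isupper() and len(token) > 1 else 1.0
        score += polarity * weight
    # Emoji are scored character by character.
    score += sum(EMOJI.get(ch, 0.0) for ch in review)
    return score


print(sentiment_score("I LOVE this wine, great finish"))
print(sentiment_score("awful and flat \U0001F641"))
```

A positive total suggests a favorable review and a negative total an unfavorable one; averaging the scores over all reviews of one wine would give a rough answer to the "did most reviewers like it" question.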

Summarization reduces a text to several key phrases or a representative sentence. Summarization can be extractive or abstractive. Extractive summarization selects a few representative sentences from the text, while abstractive summarization creates a summary of the text.

As an example, Becky gave a phrase: "The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during the storm, according to documents obtained by the Associated Press."

Extractive summarization would extract such phrases from it as:

  • Army Corps of Engineers
  • President Bush
  • New Orleans
  • defective flood-control pumps

In contrast, abstractive summarization would generate such phrases as:

  • government agency
  • presidential orders
  • defective equipment
  • storm preparation
  • hurricane Katrina
Natural Language Processing hackathon hosted by Women in Data Science Austin
As many of the hackathon attendees as could fit in the picture.

I can't quite put my finger on it, but it seems that extractive summarization extracts names of specific entities, but not much information as to what happened to those entities or what they did. Abstractive summarization, in contrast, seems to "understand" what those entities actually represent and what they do, and thereby extracts more "gist" from the paragraph. I could be wrong about it, of course.

According to Becky, extractive summarization is a mostly solved problem by now. The TextRank algorithm takes care of it. But abstractive summarization is a very difficult, unsolved problem, though knowledge graphs help.

At the organizers' suggestion, the attendees arranged themselves into three teams, each focusing on one of those three pillars. The organizers brought with them the corpora, a.k.a. texts to be analyzed. Specifically, they brought wine reviews, lots and lots of them. I suppose that's the next best thing to bringing the actual wine.

Summarizing wine reviews means extracting an "essence" of what the bulk of the reviewers said about a particular wine. It means identifying certain qualities that most reviewers noticed in a given wine. Sentiment analysis means identifying whether the reviewers thought mostly positively or mostly negatively about the wine.

I ended up on the summarization team. Led by Randi, who is a data scientist at a big company, we analyzed the wine reviews. By that I mean we called a bunch of functions from pandas, textacy, sumy and other relevant Python packages. The results were mixed. For example, sumy summarized reviews of Moscato in two sentences, but we had no way to tell whether this summarization was good, i.e. whether those were the most representative sentences from the reviews. It's funny how this is the kind of problem that one has no way of verifying -- at least none that I learned in my 5 hours of NLP bootcamp. Sure, you could read hundreds of reviews and try to get a "feel" whether those sentences were the most representative, but your "feel" would be subjective.

It makes Natural Language Processing feel like a black box, and almost like magic -- until you notice that when you ask for a 5-sentence summary, the first two sentences of the summary are duplicates. That looks odd, so you take a closer look at the texts and notice that there are duplicate sentences in the document itself. For all its magic, sumy can't figure that out.
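
The duplicate problem is easy to see in a toy version of extractive summarization. The sketch below is not sumy and not what we ran at the hackathon; it is a plain-Python stand-in that scores each sentence by the average frequency of its words and, as an assumed fix, drops exact duplicate sentences before scoring. The stopword list and the naive sentence splitter are simplifications.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "it", "this", "with", "in", "to"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=2):
    # Drop exact duplicate sentences while preserving their order.
    sentences = list(dict.fromkeys(split_sentences(text)))
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the chosen sentences in their original order.
    return [s for s in sentences if s in top]


reviews = "Sweet and fizzy. Sweet and fizzy. The bottle is blue. Sweet peach notes."
print(summarize(reviews, n_sentences=2))
```

Without the deduplication line, the repeated sentence would score identically both times and could appear twice in the summary, which is the behavior described above.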

Within sumy, you can choose which summarizer to use. First we used LexRank, and it turned out to be very slow. Then we tried another, LuhnSummarizer, and it was much faster, but the results were not nearly as accurate. But how would you decide how accurate a summarization is, given that there are no exact criteria for accuracy that I know of? Well, the first summary described mouthfeel and acidity of Moscato. The second included things like the shape and color of the bottle. It left me with the same feeling one often gets interacting with artificial intelligence, that it's both very smart and very stupid at the same time.

Tuesday, March 20, 2018

Introduction to Natural Language Processing with Women Who Code

In 2016 Women Who Code Austin hosted a series of five presentations on Natural Language Processing. The presenter was our member Diana, who has a Ph.D. in linguistics and has worked in the area of computational linguistics for many years. She did demos of some basic text analysis one can do with the Python Natural Language Toolkit, or in short, NLTK.

She presented all this as a Python notebook. A Python notebook is software that lets you combine text, code, and output of that code on one page. You can run a code snippet right there in the notebook, and the results will get updated automatically. So equipped, Diana introduced us to the basics of what computational linguists do. Or if that sounds too ambitious, let's just say she showed some simple things one can do with NLTK.

For example:

  • read in the text,
  • tokenize,
  • tag,
  • remove punctuation,
  • remove stopwords...
  • build a frequency hash table from the rest of words.
The first Austin Women Who Code meeting on natural language processing, with our instructor Diana standing in the center
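
The steps in the list above can be sketched without NLTK at all. The snippet below is a plain-Python stand-in, not Diana's notebook code: it tokenizes with a regular expression (which also drops punctuation), removes stopwords from a tiny made-up list, and builds the frequency table with collections.Counter. Part-of-speech tagging is left out because it needs a trained model.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; NLTK ships a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "it"}


def tokenize(text):
    # Lowercase, then keep runs of letters and apostrophes.
    # Punctuation is dropped as a side effect.
    return re.findall(r"[a-z']+", text.lower())


def frequency_table(text):
    # Count every token that is not a stopword.
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(tokens)


table = frequency_table("The cat sat on the mat. The cat is happy!")
print(table.most_common(3))
```

Counter behaves like a hash table from word to count, so table["cat"] looks up how often "cat" occurred.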

She introduced such concepts as collocations and bigrams. Bigrams are pairs of words that are next to each other in a text. Collocations are pairs of words that naturally occur in the language together, i.e., the chance of them occurring together is greater than random. An example of a bigram that's not a collocation is Trump's usage of the phrase "Liar Ted" (this was the spring of 2016, the height of the Republican strife for a presidential nomination). If a bigram is not a collocation, but occurs more often than randomly in a text, that can help to identify who the author or speaker is, and some such qualities. It can be a fingerprint of sorts.
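
One common way to put a number on "greater than random" is pointwise mutual information (PMI), which compares how often a pair actually occurs with how often it would occur if the two words were independent. The sketch below is an illustration in plain Python, not something shown at the meetup, and the sample sentence is made up.

```python
import math
from collections import Counter


def bigrams(tokens):
    # Adjacent word pairs, in order of appearance.
    return list(zip(tokens, tokens[1:]))


def pmi(tokens):
    """Pointwise mutual information for each bigram seen in the tokens."""
    word_counts = Counter(tokens)
    pair_counts = Counter(bigrams(tokens))
    n_words, n_pairs = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs
        p_independent = (word_counts[w1] / n_words) * (word_counts[w2] / n_words)
        # Positive PMI: the pair occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores


tokens = "new york is big and new york is old".split()
scores = pmi(tokens)
print(scores[("new", "york")])
```

PMI is known to favor rare words, so in practice it is usually combined with a minimum frequency cutoff before a pair is called a collocation.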

The aforementioned tagging is something we do after tokenizing (roughly speaking, breaking the text up into words). Tagging assigns a short tag (two or three letters) to each word, marking it as a part of speech, such as noun, adverb, etc. "You can use a big list of tags, or a simplified one. Using a simplified list of tags can help with speed of analysis of your corpus," said Diana.

Here Diana noted that tagging words as parts of speech has inherent ambiguity in it -- exactly the kind of thing that makes language and its computational processing so interesting. Here is an example of parts-of-speech ambiguity in a sentence: "They refuse to permit us to obtain the refuse permit". Still, the Python Natural Language Toolkit correctly tags the first "refuse" and "permit" as verbs (tags beginning with VB) and the second instance of each as a noun (NN).

NLTK correctly identifies parts of speech in the sentence 'They refuse to permit us to obtain the refuse permit'
This slide shows how NLTK correctly identifies parts of speech in the sentence above. The first instances of "permit" and "refuse" are "VB" -- verbs, whereas the second ones are "NN" -- nouns.

At the second meeting we did all those actions with a corpus of -- wait for it -- Hillary Clinton's emails. Her emails were available for download from the Kaggle site. This was still the spring of 2016, and we did not yet know how sad the implications of those emails would turn out to be, so the choice of the subject wasn't as... emotionally loaded as it would have been just half a year later. And to say "we" did this is an exaggeration, because it was actually Diana that did all the processing and presented the code and the results to us in a Python notebook.

Here was the complete agenda of the meeting:

  • Getting data: Hillary Clinton's emails;
  • Reading files;
  • Using Pandas to create a Dataframe in Python;
  • Cleaning data: eliminating punctuation, eliminating stopwords, normalizing data: converting to lower case, tokenizing words
  • Visualizing data.

All of this pre-processing of data was done in the Python Natural Language Toolkit (NLTK).

I must say I would have preferred it if Diana had set up this mini-course as a series of exercises for us to do in class, so that we would write some code calling NLTK methods ourselves. But if we had done that, we would not have been able to cover even half as much in those four meetups. So I appreciate what Diana did. At least she showed us what kind of beast NLTK is and which fork to eat it with. In the process I learned some basic NLP lingo, such as:

  • corpus -- a body of text, plural corpora; it's what you process to extract words and do computations with them;
  • lexicon -- words and their meanings; example: an English dictionary. However, you need to consider that different fields will have different lexicons. For example: to a financial investor, the first meaning of the word "bull" is someone who is confident about the market, as compared with the common English lexicon, where the first meaning of the word "bull" is an animal. As such, there are special lexicons for financial investors, doctors, mechanics, and so on.
  • token -- each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.
  • frequency distribution. The frequency distribution method of NLTK counts the frequency of each vocabulary item in the text. It helps identify the most informative words in a corpus.

So overall I got a little familiar with the very basics of what natural language scientists do. But somehow, during those four meetings I was still hoping that we would get past collecting statistics about words, and get to some mysterious insights about how language works, evolves, and transforms our thoughts, that only computer analysis of language can provide. Of course, my expectations were unrealistically inflated for a set of introductory lessons.

Going back to Hillary Clinton's emails, here is how you would analyze them. This is an "Exploratory Analysis: Getting and Cleaning Data" slide. Here you see the metadata fields that were extracted from the emails. There are quite a few of them.

Python dataframe with the metadata fields extracted from Hillary Clinton's emails.

This slide, "Slicing dataframe to extract subject", shows Python method calls that you would use to extract the email subjects from the dataframe shown in the previous image. Presented in a Python notebook, it alternates code with results of that code. The results can be updated on the fly if you make changes to the code. The MetaDataSubject and MetaDataTo fields contain some familiar names and topics that made the news...

The next slide shows the use of the NLTK method "concordance". It lists every occurrence of a given word in the text, together with the passage where it is used. So if you want all occurrences of the word "surprize" (Austen's spelling) in Jane Austen's "Emma", with snippets of context, you can call

    emmaText.concordance("surprize")

(Here, emmaText is the variable that holds the text of Jane Austen's novel "Emma".) From this example you can also see that NLTK has corpora of texts from Project Gutenberg, which is pretty handy.
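
To show what a concordance boils down to, here is a minimal plain-Python stand-in. It is not NLTK's implementation, and the sample text is a made-up sentence rather than a quote from the novel; the function name and the width parameter are assumptions for this sketch.

```python
import re


def concordance(text, word, width=30):
    """Return each occurrence of `word` with `width` characters of context on each side."""
    lines = []
    pattern = r"\b" + re.escape(word) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten line breaks so each hit prints on one line.
        lines.append(text[start:end].replace("\n", " "))
    return lines


# Made-up sample text, used only to exercise the function.
sample = "It was a great surprize to her. No surprize at all, said Emma."
for line in concordance(sample, "surprize", width=15):
    print(line)
```

Each returned line is the matched word with a window of surrounding text, which is the same idea as the slide, only without NLTK's alignment and tokenization.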

Concordance: all the places in Jane Austen's 'Emma' where the word 'surprize' is used, obtained by calling the 'concordance' method of NLTK.

Venturing deeper into natural language processing

The easiest texts to analyze are news stories, Diana said. News has very good structure. Sentences tend to be short, and tend to have classical structure: subject, verb, object, etc. Medication instructions are also easy to analyze, since they are required to have readability scores high enough to be suitable for 9-12 year olds. But in literature the sentences are often not conventional and much harder to parse.

At the last meetup we talked a little bit about analyzing texts "for real". And by that I mean a little deeper analysis than just breaking up sentences into parts of speech and gathering statistics about it.

One example where computational linguistics is used is grading student essays. If you have so many essays that hiring human graders would be cost-prohibitive, natural language processing can help. For example, if an essay is supposed to be on the US Declaration of Independence, the script would check to see if certain words are present in it in a certain way, and would conclude that the student might have a certain level of understanding of the topic. (Yes, I know, this raises lots of questions about creativity versus clichéd, cookie-cutter texts: the latter would be more likely to hit all the points that a grading program is looking for, whereas the former might be difficult for a program to discern. But we didn't cover such questions at the meeting, since it's uncharted territory.)

We touched upon sentiment analysis, which helps determine how customers feel about an experience they had with a brand or a company. Companies like HomeAway use it to analyze customer reviews of their rental properties. And they discover unexpected things that way. For example, analysis of customer reviews of B&B-type places showed that the greatest predictor of customer satisfaction is whether a house has pots and pans.

Sentiment analysis also shows that, for example, if you try to infer customer satisfaction from the reviews by searching for wait times, you'll get inconsistent results. 15 minutes would be bad for a restaurant, but lightning-fast for an emergency room.

And this is where people try to determine degrees and ways of relatedness or similarity between concepts.

For that, they can use ontologies.

What is Ontology?

A consensus is now established about the definition and the role of an ontology in knowledge engineering: "An ontology is a formal, explicit specification of a shared conceptualization".

It is used in cognitive modeling.

More about Ontologies

An ontology is a schema (model) describing the types (and possibly some individuals) in a domain, the relationships that may exist between types and individuals, and constraints on the way individuals and properties may be combined.

Here is an example of what an ontology contains:

  • Classes: Project, Person, ProjectManager. ProjectManager is a subclass of Person. People and Projects are disjoint.
  • Relationships: worksOn, manages. Manages is a sub-property of worksOn.
  • Constraints: People work on Projects, not the other way around. Only ProjectManagers can manage Projects.

This simple example enables machine inferences, e.g. if X manages Y, then we can infer that Y is a Project, and X is a ProjectManager and therefore a Person.
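
The inference just described can be sketched in a few lines of Python. This is only an illustration: real ontologies are usually written in a dedicated language and processed by a reasoner, and the dictionary encoding, the "is_a" label, and the function names below are assumptions for this sketch.

```python
# Class hierarchy from the example: child -> parent.
SUBCLASS_OF = {"ProjectManager": "Person"}
# Property hierarchy: manages is a sub-property of worksOn.
SUBPROPERTY_OF = {"manages": "worksOn"}
# Constraints: who may be the subject (domain) and object (range) of each property.
DOMAIN = {"worksOn": "Person", "manages": "ProjectManager"}
RANGE = {"worksOn": "Project", "manages": "Project"}


def ancestors(name, hierarchy):
    # Walk up the hierarchy, yielding the name itself and each ancestor.
    while name is not None:
        yield name
        name = hierarchy.get(name)


def infer(subject, prop, obj):
    """Return the set of facts implied by one (subject, property, object) statement."""
    facts = set()
    for p in ancestors(prop, SUBPROPERTY_OF):
        facts.add((subject, p, obj))
        for cls in ancestors(DOMAIN[p], SUBCLASS_OF):
            facts.add((subject, "is_a", cls))
        for cls in ancestors(RANGE[p], SUBCLASS_OF):
            facts.add((obj, "is_a", cls))
    return facts


for fact in sorted(infer("X", "manages", "Y")):
    print(fact)
```

From the single statement "X manages Y" the sketch derives that X also works on Y, that X is a ProjectManager and a Person, and that Y is a Project.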

Ontologies allow people to create trees representing relationships between concepts, like this:

This is an example of a tree that expresses relationships between concepts in academia (Student, Employee, Faculty, etc.)

Some people propose ways to measure the similarity of concepts by some graph metrics, such as the shortest path between two nodes.

Measure of similarity between two concepts in a graph, expressed in terms of a shortest path between two concepts.
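
As a rough sketch of the shortest-path idea, the snippet below runs a breadth-first search over a small concept tree. The tree itself is hypothetical: Student, Employee and Faculty come from the slide, while Person and Staff are assumed here to make the example complete. A shorter path is read as greater similarity.

```python
from collections import deque

# Hypothetical concept tree, loosely based on the academia example: child -> parent.
PARENT = {
    "Student": "Person",
    "Employee": "Person",
    "Faculty": "Employee",
    "Staff": "Employee",
}


def neighbors(node):
    # Treat the tree as undirected: a node's children plus its parent.
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked


def path_length(a, b):
    """Number of edges on the shortest path between two concepts, or None if unconnected."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(path_length("Faculty", "Staff"))
print(path_length("Faculty", "Student"))
```

In this toy tree Faculty is closer to Staff than to Student, because the first pair shares Employee as a direct parent while the second pair only meets at Person.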

More pictures from the Women Who Code Austin meetup series on Natural Language Processing are in my photo gallery.

Tuesday, February 20, 2018

Margaret Atwood at the Texas Book Festival in 2015

Margaret Atwood was interviewed at the Texas Book Festival in October of 2015. I have only read one of her books, The Handmaid's Tale, and as we know, it's a depressing and scary book. Considering that, the interview was surprisingly (to me) light-hearted and revolved heavily around pop culture. I got the impression that Margaret Atwood is quite engaged with it. She participates in art / experimental projects that revolve around books and reading.

One such project is the Future Library in Oslo. It was started by the artist Katie Paterson. In May of 2014 she planted 1000 trees near a forest in Oslo. These trees will grow for 100 years. Every year a different writer from around the world, invited by a committee, each writing in a different language and different genre, will contribute a manuscript in a sealed box to the future library. 100 years later all the boxes will be opened. There will be enough wood from the trees that have grown to make paper to print the anthology of those stories.

As Margaret Atwood explained, the stories can be in any form: one word, a poem, a short story. No images. And you cannot tell anybody what is in the box, except for the title. But these boxes will be in the future library with the author's name and title visible. You can go into the library, see the names and titles and imagine what could be in them. "So in May (of 2016), I'm going to Norway with my box, tied with a nice blue ribbon," said Margaret Atwood. "I imagine there might be a moment at the immigration checkpoint where they're going to ask me what is in that box, and I'm going to have to tell them, I don't know," she said, adding that that might not go over well.

She also noted that the success of this project was based on a number of assumptions: that people will want to read and will be able to read, that Oslo will still be there. (Not to mention an even more questionable assumption that books in a hundred years will still be printed on paper -- E.)

Margaret Atwood seems to encourage all the ways in which people consume and produce the written word nowadays, including mashups and remakes. For example, she wrote her own version of Shakespeare's play "The Tempest" for the Hogarth Shakespeare project, in which modern writers reimagined Shakespeare's works. She had a fan fiction contest for her latest book. (And no, she replied, she wasn't going to read all the thousands of entries herself. She had slush readers for that.) When asked if she was ready for other people to take over her characters, she indicated she had no problem with that. She said: "Fanfiction is very very old, except it wasn't called fanfiction. It started with the Greek mythology. When Don Quixote was published, there were a lot of other books published about Don Quixote by other authors. So Cervantes had to put out a notice that those other books aren't authentic."

She also contributed, even if in a small way, to the Zombies, Run! app. It's an interactive app for exercise, based on the premise that a zombie apocalypse is taking place, and you are running from the zombies. At one point the run takes you to Canada, but the entire Canadian government has been zombified, and the entire NHL hockey league are zombies on skates. However, you can establish contact with Margaret Atwood. Naomi Alderman, co-creator of the Zombies, Run! app, wrote her into the game. The way Margaret Atwood explained it, "I'm a pushover. You want to put me in a zombie game? Okay."

Margaret Atwood (left) at the Texas Book Festival in October of 2015, surrounded by the audience members.

Despite the lighthearted tone of the conversation, the interviewer couldn't help but note that we were at the Texas Capitol, the place where the Texas Legislature makes laws -- and some or many laws that they passed recently resonated strongly with the themes in Margaret Atwood's most famous dystopian novel "The Handmaid's Tale". You could get the impression that the Texas Legislature used "The Handmaid's Tale", um, aspirationally. So, not surprisingly, the interviewer brought up political topics.

"Margaret, you do a lot of advocacy work. And we are in the Texas state capitol, so I want to ask you about how far we have come and how far we have to go," said the interviewer, Kelly. (I don't remember her last name -- E.)

Margaret Atwood quipped something about making a law from here. (The interview took place literally in the House Chamber of the Texas Legislature. All the audience were sitting at the lawmakers' desks.) Then she said:

"The people who passed it (referring, I think, to a recent law severely restricting availability of abortion -- E.) don't think about the effect there will be down the line. Real people will have to live with these things. The effects will turn out to be not what they thought they would be. For example, California reversed its draconian prison legislation because they couldn't afford it. I don't think you can really sustain the society if you alienate a lot of young people, because they're going to move somewhere else, and then who's going to pay for your old age? If you are prohibiting abortions, you may think that there will be lots of babies born, lots of poor children, future serfs? That might not work out that way."

As usual, there was time for audience questions.

A question from the audience. Oslo is building a huge library, but a few hundred feet from here there is a huge library that's mostly empty, there's nobody there. (I think he might have been referring to the Austin Public Library central location. -- E.) So why do you think that the Oslo Future Library will be successful?

Margaret Atwood replied that some libraries were very heavily used, for example, the New York or Toronto public library systems. "So I don't think it's a question of library or no library, it's a question of what kind of library, how accessible it is, and what kind of interactivity do they do? I believe that access to books and reading is one of the cornerstones of the democracy," she said.

A woman from the audience says she's getting her PhD in literature, and (if I understood correctly) is teaching literature to freshmen. Making them read feels like she's murdering them. She asks if Margaret Atwood sees it as a general rule of thumb for this generation (unwillingness to read), and if so, does she have any advice?

Margaret Atwood. Freshmen read all the time. You can't use the internet without being able to read. There is a place where they can write anonymously, and post what they're really interested in, which may be vampire stories. Another way you can help them is audiobooks. But sometimes they just want to put in the studying time. When I was teaching grammar to engineering students, I started them on Kafka's parables, which are very short. So you can start your students on flash fiction. They're all 18, it's a difficult age. When I taught the same class to returning students, there was a huge difference. They wanted me to challenge them, they argued with me.

Make your students write a zombie or vampire story. Or an article on the economics of vampires. Vampires are always rich. Why is that? They are immortal -- if you became a vampire in 1930, how much money would you have accumulated? Have them do a business plan for being a vampire. There are two vampire movies where this accumulation of riches is shown explicitly. 1. An Iranian vampire western called "A Girl Walks Home Alone at Night" -- a feminist Iranian vampire, who was killing only bad people, but in the process she accumulated a lot of diamond watches. 2. "Let the Right One In", with a 12-year-old girl vampire. There is a classic line in it: a little boy says to her when he [starts suspecting something]: 'How old are you really?' She replies: 'I'm really 12. I've been a child for a very long time.'

A woman from the audience. What words of comfort do you have for readers who know they'll never lay their eyes on your contribution to the future library?

Margaret Atwood. There are many books you'll never lay your hands or eyes on, because you've never heard of them. As a tribute to that idea, find a book you never heard of, read it, and find other people who love it.