Tuesday, December 24, 2019

Book review: R. F. Kuang "The Poppy War"

I liked this book more than most other fantasy books I read this year. It drew me in from the beginning. The story is not exactly light-hearted, but -- at least at first -- it didn't lack humor. The heroine Runin's (Rin for short) situation somewhat resembles Harry Potter's: she is an orphan with unusual talents growing up in an adopted family that mistreats and undervalues her. The family wants to get rid of her as soon as possible by marrying her off as a teen. But her talents, persistence, and cunning let her escape her family and the looming marriage, and achieve a future that no one of her social class could dream of.

Hopefully this is not too much of a spoiler, because this happens relatively early in the book. Rin is admitted into the nation's top school, where, despite some teachers' attempts to derail her, she persists and gains exclusive, esoteric knowledge that's unattainable even for the elite students of that school. Throughout all that, the book has a Harry Potter-esque "wizard school novel" feel, except that Rin is more like Hermione than Harry. Clever and doggedly stubborn, she outwits the stodgy adults who consider her unworthy of being there and thwart her at every step.

But the tone of the book completely changes in the second chapter, about a third into the novel. It changes so much that I wondered whether the first chapter and the rest of the book were initially separate novels featuring different protagonists, and only later for some reason were fused into one. The humor of the first chapter is gone, and the book takes a dark turn. The country is at war, and Rin is now a member of a small squad called the Cike, which is roughly a roving band of wizards. The Cike have supernatural powers. In this book, magic comes in the form of connecting to a god (in this nation's pantheon there are several) and asking them to do the dirty work for you. Often the practitioners of "lore", or magic, need to take consciousness-altering drugs to connect with gods.

Their magic powers, however, don't make the Cike superheroes. For all their formidable abilities, they still are unable to stand up against the vicious aggressor armies. This is part of what I liked about this book. It shows the limitations of magic very clearly. And it shows how the wizards' superpowers can lead them down a tragic path. They can't help but spiral into the ultimate arms race. Since the very beginning, Rin's old lore teacher -- the one who taught her to connect with the gods -- tells her that she should not under any circumstances try to "weaponize" them, i. e. call on their powers in a war. What the gods will unleash on Earth will be far more terrible than the damage done by war, he warns her. And, as you might expect, the Cike -- who are in their teens and early twenties, and have knowledge, but not much wisdom -- quickly get drawn into the cycle of aggression and revenge. They pull the gods into the war to exact worse and worse punishment, which, in turn, provokes more aggression from the invading army.

The dilemma is presented in the book very vividly. The aggressor is so horribly cruel that in the reader's mind there is not even a doubt that it's worth calling upon gods to destroy them -- until a wise man like Rin's teacher Jiang makes a case that maybe you really, really shouldn't. The reader gets to see the points both pro and con, and those are not strawman arguments. They are weighty and well-balanced. True, the brutality of the enemy army can seem so excessive that it's at times ridiculous, but it's probably nothing that hasn't happened in some part of Earth at some point or another.

Being forced to choose between different evils, when it's hard to even tell which one is bigger, makes for a good tension source in a book. I liked that this book didn't have a happy ending. At most it had an ending that could be described as "not the worst possible".

Thursday, December 19, 2019

Book review: C. J. Cherryh "Foreigner"

It was a slow-paced book, and I was afraid I wasn't going to finish it -- I no longer force myself to finish books that don't sufficiently appeal to me -- but I finished it because, despite the slowness, it had some indescribable satisfactory quality. Perhaps because it was a book you could watch unfold before your eyes like a movie. Sometimes you read a book where every sentence falls apart into a pile of words as soon as you are finished with it, without adding up to an image in your head. This is the opposite. This book is highly immersive. Just for that quality you might like to continue reading it even when the plot is not very compelling. But those who like fast-moving narrative might not find it to their taste.

The main character, Bren, is an ambassador of sorts to an alien race called atevi, which lives on a planet where humans are guests. Or perhaps he is more like a translator between humans and atevi. His official title in the atevi language is paidhi, and that's how he is referred to throughout the book. He lives in the royal court of one of the planet's several kings, or aiji. Humans are permanent, though unwanted, guests on this world, because they ended up there by mistake and can't get off of it. Humans live on just one continent, maintain a truce with the atevi, and have been slowly trickling out their technologies to the atevi. At present they have brought the local technology up to roughly the level of early 21st century Earth. The locals are civil to the humans, but (as behooves aliens) inscrutable.

One day someone attempts to assassinate paidhi Bren. In response to that, the king / aiji quickly orders him whisked away to a remote corner of the country. It's done under the guise of the paidhi's protection, but it quickly becomes clear that it's more like imprisonment. He is exiled to a place where he is completely isolated and has no way to contact any humans.

This happens fairly early in the book, and then for the next 300-something pages neither he, nor we, the readers, know what was the true reason of his abduction, or where all this is going. The book slows down as Bren tries to figure out where he stands with his captor-protectors based on short, fragmented conversations he has with them.

He is not sure where their loyalty lies. Are they loyal to him? He strongly suspects not. Are they loyal to their employer(s), such as the aiji, or other organizations and alliances? Nor is he sure whether it is useful to them to keep him alive. He knows (but is not sure if the natives know) that the atevi can't use him as a pawn to extract something of value from the humans, because if his life is threatened, the human government will let him die. They said so from the start and he took the job with the full understanding of this. So, in this situation, he knows there is nothing protecting his life but his captors' whim.

He tries to probe their minds via short, fragmented conversations, but those conversations always skirt the essence of the topic. Yet they occupy the next 300-something pages of the book. Those talks are fraught with misunderstandings, some of them absurd, but not in a funny way. For example, the atevi can't fathom that the word "like" has many meanings, and that to like a food is very different than to like a person. This seemed rather unLIKEly to me. Bren even speculates that the locals don't have feelings. At the same time, it is obvious that they have feelings of dignity and pride, and that pride is rather easily wounded by a foreigner asking the wrong kinds of questions.

Those conversations don't go very far, and three quarters into the book we still don't have a clue whom Bren can or cannot trust. So we are still waiting for the other shoe to drop, which is to say we are waiting for this low-grade suspense to lead to a huge revelation. There are so many minor shoes dropping throughout the book that you can never tell which of them is "the real thing" as opposed to a random incident. Then, finally, at about that three-quarter mark, his situation goes from merely uncomfortable to much worse. Only then is the key point revealed, and we find out the real reason he is kept captive. The pace of the book picks up after that.

I didn't understand what conclusion he reached at the ending either. Maybe I need to reread it. It seems like he was faced with a hard conclusion that humans were not welcome on this planet, but found a way to negotiate with atevi that could lead to permanent peace. But if there was an a-ha! moment in this book, it was rather subtle.

To summarize, this is a book for those who like science fiction with lots of psychological nuance. I personally like it too, but this wasn't the kind of nuance I could relate to. But then I'm known to be a robot. If you can tolerate the plot advancing very slowly, and if you are intrigued by characters trying to figure out what another character meant by their every utterance or gesture, with cultural differences thrown in, then it may be a book for you. I have to say, for me, the character's ruminations supplied just enough intrigue not to put the book aside, but ultimately did not add up to something satisfying.

Friday, February 01, 2019

Editing like a boss with Tex Thompson: ArmadilloCon 2018 panel

This was another of the wonderful panels / mini-workshops on various aspects of writing -- this time, on editing your own work -- by Arianne "Tex" Thompson. Like everything by Tex Thompson, her advice on editing was broken down into bullet points and sub-bullet points, each of which contained examples of how to accomplish it.

Five hot tips for content and developmental editing

1. Eliminate happy coincidences. The coincidences that make the protagonist's life harder are mostly OK. Turn "but fortunately" into "oh, shit".

Example: if you have characters who are willing to help the protagonist, turn them into characters that are not really able to help. Or into characters that are able to help, but not willing. Why should I help you? You should earn it. Or characters that are able and willing to help, but their help comes with strings attached.

2. Blow up the boring parts. You are bored reading them, but you don't know how your story should get from part A to part B.

Here are some examples of how to make boring parts more exciting.

Instead of having a breakup conversation in a private place like home, or a Starbucks or a restaurant, have it in an unusual setting: in the middle of a traffic jam in a car, when no one could escape, on a whaling ship, or at an 8-year-old's birthday party at a roller rink. Can we do it at a paintball match? This can help you to spice it up and put some interesting twist on it. The world is dropping from under our feet, but we still have to do the hokey pokey, since it's an 8-year-old's birthday party. Or at the roller rink somebody falls and breaks their leg.

Arianne 'Tex' Thompson 'Editing like a Boss' panel

In a novel "Matterhorn" (by Karl Marlantes? There are other novels by that title, but I assume that's the one Tex meant -- E.), there is a long infodump when a character goes around a military camp and is introduced to lots of people and is told their military ranks and names. That would be boring, but at the same time there is a medical drama brewing, where somebody has to be medevacuated, but helicopters can't land because of high winds. So there is a ticking clock. The infodumpy introductions are alternated with the medical drama.

A race against time can definitely spice up the boring parts.

Another way to introduce suspense is to let your readers know that something dangerous or terrible is about to befall the characters, but the characters don't know it. For example, the audience knows there is a monster under a child's bed, but the kid doesn't know it. So any time when the kid rolls over and his arm drops off the bed, the audience winces.

3. Target accidental repetitions

Make them deliberate or delete them! A word or phrase repeated twice looks like accidental echo, but repeated three times sounds like you know what you are doing.

This applies not just to word usage, but to plot elements as well. For example: if the characters in your book take a road trip and are staying in motels, make the motels shabbier and shabbier as the characters run out of money. So when they are pulling up to the next motel, the reader will be cringing: what kind of bad things will be lurking at this place?

4. Sharpen relevant contrasts

Conflict is not enough, says Tex Thompson. Contrast is everything.

5. Multitask relentlessly

A great page should do at least two out of three: advance the story, develop the backstory or the setting, and build or reveal character.

Other tips

Line editing

Tex Thompson also gave tips on line editing, though I can't put them into nifty numbered-bullet-point format, because I didn't write all of them down. But here are some:

  • Before every editing pass, change the format of the manuscript, such as the font or font size. The words line up differently. That way you'll see it more like a new reader. You'll see more what's actually there, not what you think is there. Have Stephen Hawking's robotic voice read it out to you. If your book sounds good while read in robotic monotone, it's good.
  • Read it backwards (a basic rule of proofreading). Microsoft Office has a read-it-backward option.
  • Delete distancing words: thought, said, saw, heard, felt, realized, wondered. They emphasize the distance between the character and the reader. We want the opposite -- immersion. Too much of that distance and you feel like you are watching someone playing a video game. You can google "filter words fiction" or "distancing words fiction" to find out which words you should consider deleting.
  • Tex Thompson mentioned some software that can help with various aspects of writing, and the audience threw in their own suggestions. For example, ProWritingAid is a good program that shows you how many times you've used various words. Hemingway can tell you when your sentences are too complicated. It's also a good idea to get a readability score and a grade level for your text. In the early chapters, while the reader doesn't yet care about the story, it's good to keep the grade level lower.
  • Do at least one "fast pass". Read the whole thing in a day, the way a reader who binges on your work would read it. That's the best way to find overused words / phrases. Also, you will catch inconsistencies.

Tex Thompson also gave tips for gathering and interpreting feedback.

  • Try giving beta readers single chapters first. Don't give them the whole novel, because they will most likely get scared off if they were not prepared to read this much material.
  • Look for points of convergence. What comments do you keep getting? Are there common themes among them? Also remember this: people who notice a problem in your writing are usually right. People who suggest a solution are usually wrong.
  • Strive to have a mix of both readers AND writers among your beta readers. Each kind will be valuable in their own way. People who are just readers but not writers haven't internalized the rules of writers; they haven't chopped up the Hero's Journey and snorted it off of a mirror. They care more about the story. Does it hold their attention?

    Also ask readers-that-are-not-writers: what other books that you've read would you compare it to? Hopefully they won't say, it's like War and Peace: I didn't finish it.

    It is important to write at least as well as Dan Brown. If you pass the Dan Brown test, you're good. This is a guy who writes "he picked up the phone with one of his two hands", but his stories get people hooked.

Sunday, January 27, 2019

Writing dialogue: an ArmadilloCon 2018 panel

Authors Arianne "Tex" Thompson and Mark London Williams gave a panel on writing good dialogue. Here is some of their advice.

Mark London Williams. In emotional situations, the characters will often be indirect.

Tex Thompson agrees. When a horse approaches an object, it does not go straight to it, does not make a beeline. That's a predator move. A horse comes up at an angle to get a better view of the object. Similarly, good dialogue does not say something directly. It makes several approaches, several passes. It suggests (for smart readers to get), and then confirms, so that everyone can get on the wagon.

Specific dialog problems

Tex Thompson addresses the audience. How many of you have, in your own writing, struggled with a scene where you had a dialogue bouncing back and forth for pages and pages, but not getting to the point?

She then asks Mark London Williams: What advice would you have to overcome this?

Mark London Williams. Start a scene as late as possible. Start with a teacup already smashed on the floor, and a woman says to a man: "I can't believe you did it! You always do this!" -- now the readers are forced to wonder: he did what? What does he always do? Smashes teacups? Hurts her feelings?

Tex Thompson gives another example. Let's say the dialogue starts with a line: "So the school called again today". Now the readers want to read further, because they have a sense that someone is in trouble, and they are wondering who did what.

Tex Thompson. Instead of "he said, she said", put in a sentence describing action.

"You always do this." She picked up a broken piece.

There is a rule: one paragraph for one actor.

"You always do this." Her face was calm, but under the table she was picking at her 500 hundred dollar French manicure.


The discussion also covered dialects, accents, slang and vernacular. One piece of general advice on that topic: avoid writing out a dialect or an accent phonetically as it sounds (like "ze" instead of "the" in a stereotypical French character's speech), because that quickly becomes grating and annoying. In small amounts it can be OK, just don't write entire paragraphs like that. I don't remember most of the other advice, but I remember these interesting observations:

Where more than one language is spoken, the lower-prestige language contributes the grammar, while the higher-prestige language contributes the vocabulary. This happened, for example, to the English language after the Norman conquest of England, when French became the language of the court, while English remained the language of the peasantry.

Similarly, the names for raw foods come from the native / lower-prestige language (cow, pig), whereas the names for cooked food come from the higher-prestige language (beef, dessert).

Monday, December 31, 2018

Book review: Jo Walton "Just City"

This was an easy and pleasant read. But it could have been so much better if it had actually tackled the premise that it promised. The premise seemed ambitious; it promised speculative fiction with a capital S. But it didn't deliver.

The Just City in the title of the book is the city that Plato talked about in the Republic. The story describes the social experiment of the Republic implemented in real life. It's set on an island somewhere, presumably in the Mediterranean Sea, at an undefined time in prehistory, before the rise of the classical world. Even the Iliad and the Odyssey had not yet been written. It doesn't matter, because the inhabitants of the island are completely shut off from the surrounding world and have no interaction with it. Most of the inhabitants are 10-year-old children (10080 of them) who were bought from slave traders of different eras and brought to this island, across time, to be raised according to the Platonic concepts of justice. They are expected to implement Plato's Just City in real life. They are schooled by a number of teachers from different eras of history, all of whom have one thing in common: at some point they prayed to Athene. For it was Athene that set up this experiment, and transported people through time to bring them here.

The plot of the book is rather uneventful, but I was hoping for plot twists based on the moral dilemmas these people face, and how they have to adjust their experiment when it is not turning out as planned. Of course, the experiment does not turn out as Plato envisioned, because people are people and they bring their human natures with them here. They also bring their prejudices, perceptions, beliefs, and ways of doing things from their eras. So, not surprisingly, even in the Just City rape victims are still responsible for their rape.

What's odd is that the "masters" -- the teachers who are responsible for the upbringing of these 10080 children -- do not question the notion of justice beyond the Platonic ideal. This ideal was held by a person who lived 2000+ years ago, and much of it doesn't jibe with our modern notion of justice. And many of the masters were from eras historically close to ours, or even beyond ours.

So it's strange that none of the teachers entertain a more modern paradigm of justice, even in matters of life and death, such as the availability of modern medicine. Athene "imported" something from a technologically advanced era (I won't say what, because one of the plot developments hinges on that), so she could have imported advanced medicine as well. And yet they don't treat sick newborns, but "expose" them, i.e. leave them in the wilderness to die. They oddly think it's more humane than to kill them. They do it even to the babies with small birth defects like cleft lip or palate, which are entirely correctable in our times. What about treatment of injuries and illnesses that surely must have occurred among those 10080 children, because of sheer statistical likelihood? Was their medicine as barbaric as the medicine of ancient times? To be fair, one of the teachers mentions "mold drugs", so apparently they did import antibiotics from the future. But what about everything else? I would think that realistically this question would have popped up very early in the existence of the city, and I also think that those teachers who came from more modern times would find it a gross violation of ethics not to provide lifesaving treatments when they could be brought in from the future. And if you are dedicating your whole life to putting a vision of justice into reality, then surely you would assign the utmost importance to ethical questions?

In other words, I expect that realistically in such a city there would be never ending debates, serious arguments, maybe even fights over whose ethical system is considered the most just. Yet none of it happens. Everybody leads largely untroubled existences filled with philosophy, music, arts and sports, and nobody runs into ethically ambiguous situations, in which Plato's vision directly contradicts their own internal sense of justice.

To be fair, something similar does start to happen towards the end, but it was a bit too late to make me "buy" into the book. The whole book seemed like one big missed opportunity to get deeply into ambiguities and paradoxes of justice.

Tuesday, December 25, 2018

Natural Language Processing hackathon, or don't judge the wine by the shape of the bottle

In April of 2018 I went to a Natural Language Processing hackathon. Organized by Women in Data Science Austin, it took place at Dell, where one of the organizers worked. This was not the kind of hackathon where you hack for the whole weekend straight, crashing on a beanbag to catch a few winks in the breakroom of some hipster startup. No, this was a hackathon with work-life balance. It lasted from 10 am to 3 pm on a Saturday, which is just enough time for you to get deeply enough immersed in a subject to fire up your appetite for it, but not get sick of it. There were no minimum viable products produced, and no prizes, but I got to sink my teeth into the basics of Natural Language Processing.

A data scientist named Becky, who does Natural Language Processing for an Austin company, introduced us to the three cornerstone approaches of NLP -- summarization, topic modeling, and sentiment analysis.

Data scientist Becky talks about topic modeling

Sentiment analysis quantifies the subjective emotion in a text, e. g. did the majority of reviewers like or dislike a particular wine? Data scientists don't take into account just the words, but also such nonverbal information as capitalization (a word in all caps is likely to mean the author feels strongly about it), and emoji. Topic modeling finds abstract concepts that occur in a body of texts, a. k. a. corpus. For example, if it finds the words milk, meow, and kitten, it might decide one of the topics of this text is cat. If it finds the words bone, bark, and puppy, it might decide one of the topics is dog.
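To make the capitalization and emoji idea concrete, here is a toy sketch in Python. It is not what the hackathon teams ran; the word lists, the emoji scores, and the "all caps counts double" rule are all assumptions made up for illustration, and real sentiment analyzers rely on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicon; real systems use far larger word lists or trained models.
POSITIVE = {"love", "great", "delicious", "crisp"}
NEGATIVE = {"hate", "awful", "sour", "flat"}
EMOJI_SCORES = {"😍": 1.0, "👍": 1.0, "🤢": -1.0, "👎": -1.0}


def sentiment_score(text):
    """Return a rough score: above zero leans positive, below zero leans negative."""
    score = 0.0
    for word in re.findall(r"[A-Za-z']+", text):
        # Assumption: a word in all caps counts double, since the author
        # likely feels strongly about it.
        weight = 2.0 if word.isupper() and len(word) > 1 else 1.0
        lowered = word.lower()
        if lowered in POSITIVE:
            score += weight
        elif lowered in NEGATIVE:
            score -= weight
    # Emoji are scored separately because the word regex skips them.
    for emoji, value in EMOJI_SCORES.items():
        score += value * text.count(emoji)
    return score


reviews = ["I LOVE this Moscato 😍", "Flat and sour. 👎", "It is a wine."]
for review in reviews:
    print(review, "->", sentiment_score(review))
```

The shouted "LOVE" plus the heart-eyes emoji push the first review well above zero, while the third review, which has no opinion words at all, stays at zero.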

Summarization reduces a text to several key phrases or a representative sentence. Summarization can be extractive or abstractive. Extractive summarization selects a few representative sentences from the text, while abstractive summarization creates a summary of the text.
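For the curious, the extractive flavor can be sketched in a few lines of plain Python. This is a bare-bones word-frequency scorer, much cruder than TextRank or the summarizers in sumy; the stopword list, the naive sentence splitter, and the sample reviews are my own assumptions for illustration.

```python
import re
from collections import Counter

# Tiny hand-written stopword list, an assumption for this sketch.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "it", "this", "with", "in", "to", "has"}


def split_sentences(text):
    """Naive splitter: breaks after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of the sentence, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, sentence_count=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored just for length.
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_count])  # restore the original sentence order
    return [sentences[i] for i in chosen]


reviews = (
    "This Moscato is sweet and light. "
    "The bottle has a pretty label. "
    "Sweet peach notes make this Moscato light and refreshing."
)
print(extractive_summary(reviews, sentence_count=1))
```

Note that nothing new is written here: the summary is made of sentences copied verbatim from the input, which is exactly what makes it extractive rather than abstractive.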

As an example, Becky gave a phrase: "The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during the storm, according to documents obtained by the Associated Press."

Extractive summarization would extract such phrases from it as:

  • Army Corps of Engineers
  • President Bush
  • New Orleans
  • defective flood-control pumps

In contrast, abstractive summarization would generate such phrases as:

  • government agency
  • presidential orders
  • defective equipment
  • storm preparation
  • hurricane Katrina
Natural Language Processing hackathon hosted by Women in Data Science Austin
As many of the hackathon attendees as could fit in the picture.

I can't quite put my finger on it, but it seems that extractive summarization extracts names of specific entities, but not much information as to what happened to those entities or what they did. But abstractive summarization seems to "understand" what those entities actually represent and what they do, and thereby extracts more "gist" from the paragraph. I could be wrong about it, of course.

According to Becky, extractive summarization is a mostly solved problem by now. The TextRank algorithm takes care of it. But abstractive summarization is a very difficult, unsolved problem, though knowledge graphs help.

At the organizers' suggestion, the attendees arranged themselves into three teams, each focusing on one of those three pillars. The organizers brought with them the corpora, a. k. a. texts to be analyzed. Specifically, they brought wine reviews, lots and lots of them. I suppose that's the second best to bringing the actual wine.

Summarizing wine reviews means extracting an "essence" of what the bulk of the reviewers said about a particular wine. It means identifying certain qualities that most reviewers noticed in a given wine. Sentiment analysis meant identifying whether the reviewers thought mostly positively or mostly negatively about the wine.

I ended up in the summarization team. Led by Randi, who is a data scientist at a big company, we analyzed the wine reviews. By that I mean we called a bunch of functions from pandas, textacy, sumy and other relevant Python packages. The results were mixed. For example, sumy summarized reviews of Moscato in two sentences, but we had no way to tell whether this summarization was good, i.e. whether those were the most representative sentences from the reviews. It's funny how this is the kind of problem that one has no way of verifying -- at least none that I learned in my 5 hours of NLP bootcamp. Sure, you could read hundreds of reviews and try to get a "feel" whether those sentences were the most representative, but your "feel" would be subjective.

It makes Natural Language Processing feel like a black box, and almost like magic -- until you notice that when you ask for a 5-sentence summary, the first two sentences of the summary are duplicates. That looks odd, so you take a closer look at the texts and notice that there are duplicate sentences in the document itself. For all its magic, sumy can't figure that out.
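One way around the duplicates would be to clean the input before it ever reaches the summarizer. We didn't do this at the hackathon; the helper below is a hypothetical sketch that assumes the reviews have already been split into a list of sentences.

```python
def drop_duplicate_sentences(sentences):
    """Keep the first occurrence of each sentence, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for sentence in sentences:
        # Normalize case and whitespace so near-identical copies collapse together.
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


reviews = [
    "Sweet and fizzy, with notes of peach.",
    "sweet and fizzy,  with notes of peach.",
    "A light dessert wine.",
]
print(drop_duplicate_sentences(reviews))
```

Only exact repeats (up to case and spacing) are caught this way; two reviewers saying the same thing in different words would still both get through.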

Within sumy, you can choose which summarizer to use. First we used LexRank, and it turned out to be very slow. Then we tried another, LuhnSummarizer, and it was much faster, but the results were not nearly as accurate. But how would you decide how accurate a summarization is, given that there are no exact criteria for accuracy that I know of? Well, the first summary described mouthfeel and acidity of Moscato. The second included things like the shape and color of the bottle. It left me with the same feeling one often gets interacting with artificial intelligence, that it's both very smart and very stupid at the same time.

Tuesday, March 20, 2018

Introduction to Natural Language Processing with Women Who Code

In 2016 Women Who Code Austin hosted a series of five presentations on Natural Language Processing. The presenter was our member Diana, who has a Ph.D. in linguistics and has worked in the area of computational linguistics for many years. She did demos of some basic text analysis one can do with the Python Natural Language Toolkit, or in short, NLTK.

She presented all this as a Python notebook. A Python notebook is software that lets you combine text, code, and output of that code on one page. You can run a code snippet right there in the notebook, and the results will get updated automatically. So equipped, Diana introduced us to the basics of what computational linguists do. Or if that sounds too ambitious, let's just say she showed some simple things one can do with NLTK.

For example:

  • read in the text,
  • tokenize,
  • tag,
  • remove punctuation,
  • remove stopwords...
  • build a frequency hash table from the rest of words.
The first Austin Women Who Code meeting on natural language processing, with our instructor Diana standing in the center
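The steps in the list above can be imitated in plain Python, which may help to see what each one does. This is not Diana's notebook and it does not call NLTK: the regex tokenizer and the short stopword list are stand-ins I made up (NLTK has its own tokenizers and a much longer stopword list), and the tagging step is left out.

```python
import re
import string
from collections import Counter

# Small hand-picked stopword list (an assumption; NLTK ships a much longer one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}
PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_frequencies(text):
    """Tokenize, drop punctuation and stopwords, and count what remains."""
    tokens = tokenize(text.lower())
    words = [t for t in tokens if t not in PUNCTUATION]
    words = [w for w in words if w not in STOPWORDS]
    return Counter(words)


text = "The cat sat on the mat. The cat is happy, and the mat is warm."
frequencies = word_frequencies(text)
print(frequencies.most_common(3))
```

A `Counter` plays the role of the frequency hash table here: it maps each remaining word to the number of times it occurs.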

She introduced such concepts as collocations and bigrams. Bigrams are pairs of words that are next to each other in a text. Collocations are pairs of words that naturally occur in the language together, i. e., the chance of them occurring together is greater than random. An example of a bigram that's not a collocation is Trump's usage of the phrase "Liar Ted" (this was the spring of 2016, the height of the Republican strife for a presidential nomination). If a bigram is not a collocation, but occurs more often than randomly in a text, that can help to identify who the author or the speaker is, and other such qualities. It can be a fingerprint of sorts.
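Here is a small sketch of both ideas in plain Python. Counting bigrams is just pairing each word with its neighbor. For "occurs together more often than random" I use pointwise mutual information (PMI), which is one common measure; I don't remember which measure Diana's notebook used, so treat that choice, and the toy sentence, as my assumptions.

```python
import math
import re
from collections import Counter


def bigrams(words):
    """Return each pair of adjacent words."""
    return list(zip(words, words[1:]))


def pmi(pair, word_counts, bigram_counts):
    """Pointwise mutual information: log2 of observed vs. chance co-occurrence."""
    total_words = sum(word_counts.values())
    total_bigrams = sum(bigram_counts.values())
    p_pair = bigram_counts[pair] / total_bigrams
    p_first = word_counts[pair[0]] / total_words
    p_second = word_counts[pair[1]] / total_words
    return math.log2(p_pair / (p_first * p_second))


text = "strong tea and strong coffee and weak tea and strong tea"
words = re.findall(r"[a-z]+", text.lower())
word_counts = Counter(words)
bigram_counts = Counter(bigrams(words))

print(bigram_counts[("strong", "tea")])
print(pmi(("strong", "tea"), word_counts, bigram_counts) > 0)
```

A PMI above zero means the pair shows up together more often than the individual word frequencies would predict. On a text this tiny the number means little; it only becomes informative on a real corpus.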

The aforementioned tagging is something we do after tokenizing (roughly speaking, breaking the text up into words). Tagging assigns a short tag, usually two or three letters, to each word, marking it as a part of speech, such as noun, adverb, etc. "You can use a big list of tags, or a simplified one. Using a simplified list of tags can help with speed of analysis of your corpus," said Diana.

Here Diana noted that tagging words as parts of speech has inherent ambiguity in it -- exactly the kind of thing that makes language and its computational processing so interesting. Here is an example of parts-of-speech ambiguity in a sentence: "They refuse to permit us to obtain the refuse permit". Still, the Python Natural Language Toolkit correctly tags the first "refuse" and "permit" as verbs (VBP and VB) and the second instance of each as a noun (NN).

NLTK correctly identifies parts of speech in the sentence 'They refuse to permit us to obtain the refuse permit'
This slide shows how NLTK correctly identifies parts of speech in the sentence above. The first instances of "permit" and "refuse" are "VB" -- verbs, whereas the second ones are "NN" -- nouns.

At the second meeting we did all those actions with a corpus of -- wait for it -- Hillary Clinton's emails. Her emails were available for download from the Kaggle site. This was still the spring of 2016, and we did not yet know how sad the implications of those emails would turn out to be, so the choice of the subject wasn't as... emotionally loaded as it would have been just half a year later. And to say "we" did this is an exaggeration, because it was actually Diana who did all the processing and presented the code and the results to us in a Python notebook.

Here was the complete agenda of the meeting:

  • Getting data: Hillary Clinton's emails;
  • Reading files;
  • Using Pandas to create a Dataframe in Python;
  • Cleaning data: eliminating punctuation, eliminating stopwords, normalizing data (converting to lower case), tokenizing words;
  • Visualizing data.

All of this pre-processing of data was done in the Python Natural Language Toolkit (NLTK).
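
The cleaning steps from the agenda can be sketched in a few lines of plain Python. This is not Diana's notebook code and it does not use NLTK; it is a standard-library approximation, with a tiny stopword list and a sample subject line that I made up. Note that it lowercases first, so that the stopword comparison works regardless of capitalization.

```python
import string

# A tiny hand-picked stopword list, assumed for this example only;
# a real stopword list would be much longer.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "on", "for"}

def clean(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOPWORDS]

# Hypothetical subject line, not taken from the real dataset.
print(clean("Re: Notes for the Meeting on Tuesday!"))
```

Splitting on whitespace is the crudest possible tokenizer; a proper one handles contractions, hyphens, and the like.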

I must say I would have preferred it if Diana had set up this mini-course as a series of exercises for us to do in class, so that we would write some code calling NLTK methods ourselves. But if we had done that, we would not have been able to cover even half as much in those four meetups. So I appreciate what Diana did. At least she showed us what kind of beast NLTK is and which fork to eat it with. In the process we learned some basic NLP lingo, such as:

  • corpus -- a body of text, plural corpora; it's what you process to extract words and do computations with them;
  • lexicon -- words and their meanings; example: an English dictionary. However, you need to consider that different fields have different lexicons. For example, to a financial investor, the first meaning of the word "bull" is someone who is confident about the market, whereas in the common English lexicon the first meaning of "bull" is an animal. As such, there are special lexicons for financial investors, doctors, mechanics, and so on.
  • token -- each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.
  • frequency distribution. The frequency distribution method of NLTK counts the frequency of each vocabulary item in the text. It helps identify the most informative words in a corpus.
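
A frequency distribution is easy to imitate with the standard library. The sketch below uses collections.Counter rather than NLTK itself, and the sample sentence is invented, but the idea is the same: count each vocabulary item, then look at the most common ones.

```python
from collections import Counter

def frequency_distribution(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)

# Hypothetical sample sentence, invented for this example.
tokens = "the cat sat on the mat and the dog sat too".split()
freq = frequency_distribution(tokens)

# The two most frequent tokens with their counts.
print(freq.most_common(2))
```

In practice you would remove stopwords first; otherwise words like "the" dominate the top of the list and tell you little about the text.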

So overall I became a little familiar with the very basics of what natural language scientists do. But somehow, during those four meetings I was still hoping that we would get past collecting statistics about words and reach some mysterious insights about how language works, evolves, and transforms our thoughts, insights that only computer analysis of language can provide. Of course, my expectations were unrealistically inflated for a set of introductory lessons.

Going back to Hillary Clinton's emails, here is how you would analyze them. This is an "Exploratory Analysis: Getting and Cleaning Data" slide. Here you see the metadata fields that were extracted from the emails. There are quite a few of them.

Python dataframe with the metadata fields extracted from Hillary Clinton's emails
This slide, "Slicing dataframe to extract subject", shows Python method calls that you would use to extract the email subjects from the dataframe shown in the previous image. Presented in a Python notebook, it alternates code with results of that code. The results can be updated on the fly if you make changes to the code. The MetaDataSubject and MetaDataTo fields contain some familiar names and topics that made the news...

The next slide shows the use of the NLTK method "concordance". It lists every occurrence of a given word in the text, along with the passage surrounding each occurrence. So if you want all occurrences of the word "surprise" in Jane Austen's "Emma" (spelled "surprize" in the novel), with snippets of context, you can make a call along the lines of emmaText.concordance("surprize").


(Here, emmaText is the variable that holds the text of Jane Austen's novel "Emma".) From this example you can also see that NLTK includes corpora of texts from Project Gutenberg, which is pretty handy.
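
To get a feel for what a concordance involves, here is a rough stand-in written in plain Python. It is not NLTK's implementation, and the function name, the width parameter, and the sample sentence are all my own assumptions. It simply finds each occurrence of a word and keeps a few tokens of context on either side.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Hypothetical sample sentence, not a quotation from the novel.
sample = "It was a great surprize to everyone and the surprize did not fade".split()
for line in concordance(sample, "surprize"):
    print(line)
```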

Concordance: all the places in Jane Austen's 'Emma' where the word 'surprize' is used, obtained by calling the 'concordance' method of NLTK.

Venturing deeper into natural language processing

The easiest texts to analyze are news articles, Diana said. News has a very regular structure. Sentences tend to be short and to follow the classical pattern: subject, verb, object, etc. Medication instructions are also easy to analyze, since they are required to have readability scores high enough to be suitable for 9-12 year olds. But in literature the sentences are often unconventional and much harder to parse.

At the last meetup we talked a little bit about analyzing texts "for real". And by that I mean somewhat deeper analysis than just breaking up sentences into parts of speech and gathering statistics about them.

One example where computational linguistics is used is grading student essays. If you have so many essays that hiring human graders would be cost-prohibitive, natural language processing can help. For example, if an essay is supposed to be on the US Declaration of Independence, the script would check whether certain words are present in it in a certain way, and would conclude that the student might have a certain level of understanding of the topic. (Yes, I know, this raises lots of questions about creativity versus clichéd, cookie-cutter texts: the latter would be more likely to hit all the points that a grading program is looking for, whereas the former might be difficult for a program to discern. But we didn't cover such questions at the meeting, since it's uncharted territory.)

We touched upon sentiment analysis, which helps determine how customers feel about an experience they had with a brand or a company. Companies like HomeAway use it to analyze customer reviews of their rental properties. And they discover unexpected things that way. For example, analysis of customer reviews of B&B-type places showed that the greatest predictor of customer satisfaction is whether a house has pots and pans.

Sentiment analysis also shows that, for example, if you try to infer customer satisfaction from the reviews by searching for wait times, you'll get inconsistent results. 15 minutes would be bad for a restaurant, but lightning-fast for an emergency room.

This is where it becomes useful to determine how, and how closely, concepts are related or similar to one another.

For that, they can use ontologies.

What is Ontology?

A consensus is now established about the definition and the role of an ontology in knowledge engineering: "An ontology is a formal, explicit specification of a shared conceptualization".

It is used in cognitive modeling.

More about Ontologies

An ontology is a schema (model) describing the types (and possibly some individuals) in a domain, the relationships that may exist between types and individuals, and constraints on the way individuals and properties may be combined.

Here is an example of a simple ontology:

  • Classes: Project, Person, ProjectManager. ProjectManager is a subclass of Person. People and Projects are disjoint.
  • Relationships: worksOn, manages. Manages is a sub-property of worksOn.
  • Constraints: People work on Projects, not the other way around. Only ProjectManagers can manage Projects.

This simple example enables machine inferences, e.g. if X manages Y, then we can infer that Y is a Project, and X is a ProjectManager and therefore a Person.
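
That inference can be mimicked in a few lines of Python. This is only a toy sketch, not a real ontology language or reasoner: the dictionaries, the function names, and the way I encoded the constraints (each relationship fixes the type of its subject and object) are my own assumptions. The sub-property link between manages and worksOn is not modeled here.

```python
# Subclass links: child -> parent (from the example above).
SUBCLASS_OF = {"ProjectManager": "Person"}

# For each relationship: (required type of subject, required type of object).
DOMAIN_RANGE = {
    "manages": ("ProjectManager", "Project"),
    "worksOn": ("Person", "Project"),
}

def ancestors(cls):
    """Return cls followed by all of its superclasses."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain

def infer_types(facts):
    """Infer types of individuals from (subject, relationship, object) facts."""
    types = {}
    for subject, relation, obj in facts:
        subject_type, object_type = DOMAIN_RANGE[relation]
        types.setdefault(subject, set()).update(ancestors(subject_type))
        types.setdefault(obj, set()).update(ancestors(object_type))
    return types

# Individuals X and Y, as in the text.
inferred = infer_types([("X", "manages", "Y")])
print(sorted(inferred["X"]))
print(sorted(inferred["Y"]))
```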

Ontologies allow people to create trees representing relationships between concepts, like this:

This is an example of a tree that expresses relationships between concepts in academia (Student, Employee, Faculty, etc.)

Some people propose ways to measure the similarity of concepts using graph metrics, such as the shortest path between two nodes.
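
Here is a small sketch of the shortest-path idea, using a breadth-first search over a made-up concept tree. The edges are hypothetical (they are not copied from the slide), and the scoring formula, 1 / (1 + path length), is just one plausible choice that I am assuming for illustration.

```python
from collections import deque

# Hypothetical concept tree; the edges are invented for illustration.
EDGES = [
    ("Person", "Student"),
    ("Person", "Employee"),
    ("Employee", "Faculty"),
    ("Employee", "Staff"),
]

def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def similarity(graph, a, b):
    """Assumed score: 1 / (1 + path length), so identical concepts score 1.0."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)

graph = build_graph(EDGES)
print(similarity(graph, "Faculty", "Staff"))
print(similarity(graph, "Faculty", "Student"))
```

With these edges, Faculty comes out closer to Staff (two steps apart, via Employee) than to Student (three steps apart, via Employee and Person).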

Measure of similarity between two concepts in a graph, expressed in terms of a shortest path between two concepts.

More pictures from the Women Who Code Austin meetup series on Natural Language Processing are in my photo gallery.