Web Crawling: You Have Options!

If you have ever wanted to build a custom search engine, this is the post for you.

Web crawlers/scrapers/spiders might seem, at first glance, to be rather complicated. Not only must your crawler navigate its way through the sprawling landscape of the internet, but it must also manage to make sense of this landscape and parse it into readable text. However, once you delve into the problem, you realize that it’s actually quite simple. This is because all the tools you need to make a web crawler have already been made for you.

One powerful, open-source, out-of-the-box web crawler I recommend is Apache Nutch. Not only can you start running it pretty much right off the bat, but it is readily compatible with other Apache tools – for instance, Apache Solr, an open-source search engine. Combine Nutch with Solr, and you can have a fully functional search engine up in less than half an hour.

Of course, sometimes you don’t want or need all the extra features that come with a crawler like Nutch. Sometimes you just want a simple, flexible little web crawler whose source code you can easily modify to do whatever you want. In that case, I would say: “Write it yourself!” (But don’t write everything yourself. Do a Google search to see if you can find someone who’s done most – or all – of it for you already.)

Personally, I am partial to Perl for all things I/O related. Perl has a couple modules that can help you out with parsing HTML, like WWW::Mechanize, but I like Mojo. It makes things simple while still keeping it easy to modify, and also lets you select CSS elements and parse out their text, links, and/or images. Neat!

Here is a guide to help you write a web crawler in Perl using the Mojo module.

Advertisements

Prolog and Logic Programming

hasFlavor(F) :- instock(F).
instock(chocolate).

This is the first Prolog program I’ve written. Doesn’t look too impressive, does it? These two lines of code tell me that if a flavor is in stock, then we have a flavor. Then it states that the “chocolate” flavor is in stock, which therefore means that hasFlavor(chocolate) is true. If I were to query the program, saying

?- hasFlavor(chocolate)

it would print out Yes.

I wrote another prolog program using ASP (Answer Set Programming) that represents a family.

sibling(X,Y) :- parent(Z,X), parent(Z,Y).
mother(X,Y) :- parent(X,Y), female(X).
spouse(X,Y) :- parent(X,Z), parent(Y,Z).
female(kimberly). female(katherine). female(joanne). male(david).
parent(joanne, kimberly). parent(joanne, katherine).
parent(david, kimberly). parent(david, katherine).

This program prints out: 

parent(david,katherine) parent(david,kimberly) parent(joanne,katherine) parent(joanne,kimberly) male(david) female(joanne) female(katherine) female(kimberly) spouse(david,david) spouse(joanne,david) spouse(david,joanne) spouse(joanne,joanne) mother(joanne,katherine) mother(joanne,kimberly) sibling(katherine,katherine) sibling(kimberly,katherine) sibling(katherine,kimberly) sibling(kimberly,kimberly) 
True

Unlike other types of programming, in logic programming everything happens at once. There is no way to tell a program to perform steps in order, like you would in Java. How this works is that when it reaches statements like hasFlavor(F) :- instock(F), it searches for all terms that can substituted in for the argument (i.e. instock(chocolate)) and puts them in. In this way, the code is not read in a linear fashion, but rather “all at once”.

Free Software?

There is a controversial issue in the programming world that has been around for decades: whether it would be better for software to be free and open to everybody, or whether the current system of licensing and selling software is more beneficial. GNU (Gnu’s Not Unix) supports the former, whilst Bill Gates’ Letter to Hobbyists expresses the latter opinion.

While free and open software sounds tantalizing, there are serious drawbacks to the concept. Writing, debugging, and distributing software requires work – and the truth is, we live in a largely capitalist world where nobody does anything for free. Salaries and pay raises give people incentive to do the best work they can; without it, products that they churn out will be inferior at best. 

On the other hand, the power of the people is not to be underestimated. If software were made free to the public, and if enough competent people felt motivated to improve it, the results would not be unimpressive. However, these are big if’s. In the end it really comes down to your view on human nature; basically, if you handed a silver bracelet to the mob, would you expect to get it back spit-shined and clean, or would you expect it to be tarnished by the grubby hands of the masses?

Feeling the Heat?

While I cannot speak for my readers, the area I am in has been sweltering. The summer camp I am participating in has no air conditioning, and it’s always hot in the computer cluster. If only, I thought, we had a thermostat. Which brings me to my point: the difference between thermometers and thermostats.

Thermostats are able to take action on their surroundings and create change, such as by heating up or cooling down an area. Thermometers, on the other hand, can only note that it is hot and suffer silently (like me). The difference between thermostats and thermometers is a great analogy for the difference between agency and autonomy; if one has agency they are able to create change, whilst one with autonomy is not, but can make their own choices.

Computers, for example, have agency but little autonomy. The actions they perform are dictated by their software and by user input. Humans, on the other hand, have much more autonomy than computers. The amount of agency we have varies widely depending on the amount of clout we wield, but from the very moment we are born we are able to make our own choices.

Final Project: Story Generation

Paper Abstract/Summary

From legends to myths to folklore, stories have always been an integral part of human culture. It is natural for us to be able to draw on prior experiences and convey them in a story format. But if one provides an experience to a computer, how would it translate that into a structured tale? That is what the field of story generation aims to answer. Story generators use algorithms to generate literature that can be considered stories, in that they have some sort of character, premise, or plot. In this field, stories are defined in a functional rather than an aesthetic sense, meaning that generating elegant language is not a core focus. In this post, I propose several solutions of varying complexity for generating such functional stories.

Introduction/Background

There have been attempts to write story generators since the 70’s, and though these generators have grown increasingly more sophisticated, there have been no synthetic Shakespeares or Tolkiens. Early story generators were capable of generating stories with very basic structure; for example, a brief story by TALESPIN [Meehan 1977]:

“John Bear is somewhat hungry. John Bear wants to get some berries. John Bear wants to get near the blueberries. John Bear walks from a cave entrance to the bush by going through a pass through a valley through a meadow. John Bear takes the blueberries. John Bear eats the blueberries. The blueberries are gone. John Bear is not very hungry.”

Later generators like BRUTUS [Bringsjord & Ferruci 1999] are capable of generating short stories with more complex characters that focus on abstract concepts like betrayal. Yet even BRUTUS had some sentences that seemed stilted and unnatural, as well as some odd jumps in the story. In other words, the stories it generated were understandable but not human.

Application

Story generators offer an alternative to the heavily scripted storylines prevalent in video games. Were a computer capable of changing and altering a story with respect to user input, interactive storytelling would be taken to a whole new level. Game playthroughs would be truly unique for each individual player.

Previous Work

As stated previously, story generators have been around since the 70’s. Some well-known story generators would be: Novel Writer [Klein et al. 1973], TALESPIN [Meehan 1977], AUTHOR [Dehn 1981], UNIVERSE [Lebowitz 1983], BRUTUS [Bringsjord & Ferrucci 1999], and more recently FABULIST [Riedl & Young 2010].

Current Problems in the Area

In the field of computer programming, programmers usually give computers very clear instructions detailing what they are supposed to do. A definition of the computer’s goal is given, along with a set of instructions on how to achieve it. For this to be possible, the solution set must be defined in some way, there must be rules for choosing what order to consider each solution, and there must be a way to determine which solutions are actually viable. Each of these requirements appears equally intractable when considering story generation. The first requires one to explain what a story is – already difficult to do, given the broad array of literature the word encompasses, and doubly difficult when communicating with a computer, which only understands ones and zeros. The second and third requirements need some way to rank possible solutions in a way that allows the computer to generate stories with consistent characters, plotline, and even writing style. This requires intimate knowledge of not only plot structures and character building, but also language conventions and word associations.

Since a functional story is the goal, not an elegant one, writing style can be struck off the to-do list. However, this still leaves us with the problem of creating a coherent and believable series of events. Events in a story must follow certain laws; for example, the pigs in Three Little Pigs cannot suddenly sprout wings and fly away from the wolf, because it would make no sense given the laws of the story’s world. This means that the computer also needs to know some pre-defined rules which its story cannot break, and it must be able to identify possible solutions that would break these rules.

Algorithms that attempt to satisfy all these conditions are numerous, but very few, if any, are able to succeed. The sheer complexity and inherent fuzziness of the problem make it near impossible to do.

Proposed Solutions

Before considering solutions for generating actual words and sentences, we must consider how a computer is able to come up with a story. Stories are driven by conflict; conflict is driven by character motivations. Thus, it follows that the first goal a computer should achieve is to create a character who has some motivation – who wants something. From there, the plot will follow.

While the concept of this solution is simple, the implementation is not. A simple way to achieve this is to have a “character” object with attributes describing their motivations, personality, backstory, and relationships with other characters. These attributes would be picked from a pool of pre-determined data. From these characters arises conflicts of interest that then resolve themselves into stories; for instance, if two characters have opposing motivations, they can be selected as protagonist and antagonist.

Image

With the problem of generating characters and plot behind us, we can focus on generating the actual prose. There are several ways to do this.

1. Have pre-written prose that can be mixed and matched each time a story is generated. This approach is highly inflexible and requires a lot of manual work, but in terms of programming, it is the simplest.

2. Use pre-written template sentences that are tagged with categories describing the mood they imply (i.e. scary, angry, sad, happy) and with key words that can be swapped to fit the situation (i.e. if the template is “Bob kicked Alice on the shin”, it could be altered to “Jordan kicked Alyssa on the ankle”).

3. Generate sentences from scratch. This is much more difficult because it would require the computer to understand word associations and how to correctly put together a sentence.

None of these solutions are very optimal, but they serve their purpose.

Conclusion

Quantifying stories is already a difficult task, and instructing a machine to generate them is no simpler. There are countless varied approaches to solving the problem of story generation, but not a one of them is perfect. Like many other givens in human culture, storytelling is a skill that eludes the understanding of machines.

Future Work

As of now, story generators remain firmly entrenched in text-based mediums. However, given creative implementation, such stories could be translated into animations, games, or even short films. A successful story generator would also be a triumph for the ongoing struggle to understand what defines “natural language” – that is, language that appears to have been generated by a human as opposed to a machine. This could in turn elucidate how we interpret words to recognize sarcasm and hidden meanings.

References

Gervás, Pablo. “Story Generator Algorithms.” The Living Handbook of Narratology. N.p., 24 Sep 2012. Web. 26 Jul 2013. < http://wikis.sub.uni-hamburg.de/lhn/index.php/Story_Generator_Algorithms >.

Riedl, Mark. N.p.. Web. 26 Jul 2013. < http://www.cc.gatech.edu/~riedl/pubs/mm.pdf >.

A Book Related to the Project

http://www.amazon.com/Tell-Story-Narrative-Intelligence-Rethinking/dp/0810113139

Classifying Data

Image

Pictured above is the result of my attempt to use kNN to classify data. The orange and light blue dots are two different types of data in my “training set”, which is the data set that has already been classified. The red and darker blue dots are the data of my “testing set”, which is the data that I classified using the training set.

The data could also have been classified using a naive Bayes classifier. The placement of orange and light blue dots shows that the position of the dots correlates with their classification. Thus, it is possible to classify the dots according to their positions only; the data on the fringes of the set would be dark blue, and the dots closer to the center would be red. Of course, this approach would likely be less accurate than kNN.

Induction

Can you prove these equations? (I did. The link to my solutions is here.)

Though it may seem to be a bit of a jump, the study of algorithms like these ones is actually closely related to the study of artificial intelligence. After all, if you strip down famous AIs like Watson or Cleverbot, all they really are are a lot of algorithms and if-else statements. Powerful algorithms allow these AIs to give the illusion of human-like intelligence by stringing together sentences and analyzing auditory input, among other things.