
Web Crawling: You Have Options!

If you have ever wanted to build a custom search engine, this is the post for you.

Web crawlers/scrapers/spiders might seem, at first glance, to be rather complicated. Not only must your crawler navigate its way through the sprawling landscape of the internet, but it must also manage to make sense of this landscape and parse it into readable text. However, once you delve into the problem, you realize that it’s actually quite simple. This is because all the tools you need to make a web crawler have already been made for you.

One powerful, open-source, out-of-the-box web crawler I recommend is Apache Nutch. Not only can you start running it pretty much right off the bat, but it is readily compatible with other Apache tools – for instance, Apache Solr, an open-source search engine. Combine Nutch with Solr, and you can have a fully functional search engine up in less than half an hour.

Of course, sometimes you don’t want or need all the extra features that come with a crawler like Nutch. Sometimes you just want a simple, flexible little web crawler whose source code you can easily modify to do whatever you want. In that case, I would say: “Write it yourself!” (But don’t write everything yourself. Do a Google search to see if you can find someone who’s done most – or all – of it for you already.)

Personally, I am partial to Perl for all things I/O related. Perl has a couple of modules that can help you out with fetching and parsing HTML, like WWW::Mechanize, but I like Mojo (part of the Mojolicious toolkit). It keeps things simple while remaining easy to modify, and it lets you select elements with CSS selectors and pull out their text, links, and/or images. Neat!
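If you don't use Perl, the core loop of a small crawler looks much the same in any language. Here is a minimal sketch in Python using only the standard library. The `LinkExtractor` class, the `crawl` function, the `fetch` parameter, and the `example.test` pages are all made up for illustration; a real crawler would fetch pages over HTTP and respect robots.txt, which this sketch leaves out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag and the text nodes of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl. `fetch` is a hypothetical function that maps a
    URL to its HTML, or returns None when the page cannot be retrieved."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages we could not retrieve
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A tiny in-memory "web" (assumed data) so the sketch runs without a network.
FAKE_WEB = {
    "http://example.test/": '<h1>Home</h1><a href="/about">About</a><a href="/missing">Gone</a>',
    "http://example.test/about": '<p>About us</p><a href="/">Home</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
```

Passing `fetch` in as a parameter keeps the parsing and queueing logic separate from the network code, so you can swap in a real HTTP client later without touching the rest.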

Here is a guide to help you write a web crawler in Perl using the Mojo module.


Prolog and Logic Programming

hasFlavor(F) :- instock(F).
instock(chocolate).

This is the first Prolog program I’ve written. Doesn’t look too impressive, does it? These two lines of code tell me that if a flavor is in stock, then we have that flavor. Then it states that the “chocolate” flavor is in stock, which therefore means that hasFlavor(chocolate) is true. If I were to query the program, saying

?- hasFlavor(chocolate).

it would print out Yes.

I wrote another Prolog-style program, this time using ASP (Answer Set Programming), that represents a family.

sibling(X,Y) :- parent(Z,X), parent(Z,Y).
mother(X,Y) :- parent(X,Y), female(X).
spouse(X,Y) :- parent(X,Z), parent(Y,Z).
female(kimberly). female(katherine). female(joanne). male(david).
parent(joanne, kimberly). parent(joanne, katherine).
parent(david, kimberly). parent(david, katherine).

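This program prints out the answer set below. Note that the sibling and spouse rules never require X and Y to be different, so the output includes facts like sibling(kimberly,kimberly) and spouse(david,david):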

parent(david,katherine) parent(david,kimberly) parent(joanne,katherine) parent(joanne,kimberly) male(david) female(joanne) female(katherine) female(kimberly) spouse(david,david) spouse(joanne,david) spouse(david,joanne) spouse(joanne,joanne) mother(joanne,katherine) mother(joanne,kimberly) sibling(katherine,katherine) sibling(kimberly,katherine) sibling(katherine,kimberly) sibling(kimberly,kimberly) 
True
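To see where each derived fact comes from, the three rules can be mimicked with set comprehensions in Python. This is only a sketch of the joins the rules describe, not how an ASP solver actually works. The `strict_sibling` variant at the end is my own addition and is not part of the program above.

```python
# Facts copied from the program above.
parent = {("joanne", "kimberly"), ("joanne", "katherine"),
          ("david", "kimberly"), ("david", "katherine")}
female = {"kimberly", "katherine", "joanne"}

# sibling(X,Y) :- parent(Z,X), parent(Z,Y).
sibling = {(x, y) for (z1, x) in parent for (z2, y) in parent if z1 == z2}

# mother(X,Y) :- parent(X,Y), female(X).
mother = {(x, y) for (x, y) in parent if x in female}

# spouse(X,Y) :- parent(X,Z), parent(Y,Z).
spouse = {(x, y) for (x, c1) in parent for (y, c2) in parent if c1 == c2}

# Hypothetical variant: requiring X != Y drops the "sibling of yourself" pairs.
strict_sibling = {(x, y) for (x, y) in sibling if x != y}

print(sorted(sibling))
print(sorted(strict_sibling))
```

Nothing in the sibling rule stops X and Y from being the same person, which is why pairs like (kimberly, kimberly) appear in the set.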

Unlike most other styles of programming, logic programming (at least in ASP) has no notion of performing steps in order, as you would in Java. For a rule like hasFlavor(F) :- instock(F), the solver searches for every term that can be substituted for the variable F (e.g. chocolate, from instock(chocolate)) and fills it in. In this way, the code is not read in a linear fashion, but rather “all at once”.
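One rough way to picture the “all at once” idea is to apply every rule over and over until nothing new can be derived. The Python sketch below does this for single-variable rules only. The `derive` function and the extra `canOrder` rule are made up for illustration, and real solvers are far more sophisticated than this.

```python
def derive(facts, rules):
    """Apply every rule repeatedly until no new fact appears (a fixed point).

    Facts are (predicate, argument) pairs. Each rule is a (head, body) pair
    standing for a one-variable rule of the form head(F) :- body(F).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for pred, arg in list(known):
                if pred == body and (head, arg) not in known:
                    known.add((head, arg))
                    changed = True
    return known


facts = {("instock", "chocolate")}
rules = [
    ("canOrder", "hasFlavor"),  # hypothetical extra rule: canOrder(F) :- hasFlavor(F).
    ("hasFlavor", "instock"),   # hasFlavor(F) :- instock(F).
]

result = derive(facts, rules)
same_result = derive(facts, list(reversed(rules)))  # rule order does not matter
print(sorted(result))
```

Listing the rules in either order yields the same set of facts, which is the sense in which the program is not read line by line.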

Free Software?

There is a controversial issue in the programming world that has been around for decades: whether it would be better for software to be free and open to everybody, or whether the current system of licensing and selling software is more beneficial. GNU (GNU’s Not Unix) supports the former, whilst Bill Gates’ “An Open Letter to Hobbyists” expresses the latter opinion.

While free and open software sounds tantalizing, there are serious drawbacks to the concept. Writing, debugging, and distributing software requires work – and the truth is, we live in a largely capitalist world where nobody does anything for free. Salaries and pay raises give people incentive to do the best work they can; without them, the products that people churn out will be inferior at best.

On the other hand, the power of the people is not to be underestimated. If software were made free to the public, and if enough competent people felt motivated to improve it, the results could be impressive. However, these are big ifs. In the end it really comes down to your view on human nature; basically, if you handed a silver bracelet to the mob, would you expect to get it back spit-shined and clean, or would you expect it to be tarnished by the grubby hands of the masses?

Feeling the Heat?

While I cannot speak for my readers, the area I am in has been sweltering. The summer camp I am participating in has no air conditioning, and it’s always hot in the computer cluster. If only, I thought, we had a thermostat. Which brings me to my point: the difference between thermometers and thermostats.

Thermostats are able to take action on their surroundings and create change, such as by heating up or cooling down an area. Thermometers, on the other hand, can only note that it is hot and suffer silently (like me). The difference between thermostats and thermometers is a great analogy for the difference between agency and autonomy: something with agency is able to create change, whilst something with autonomy may not be, but can make its own choices.

Computers, for example, have agency but little autonomy. The actions they perform are dictated by their software and by user input. Humans, on the other hand, have much more autonomy than computers. The amount of agency we have varies widely depending on the amount of clout we wield, but from the very moment we are born we are able to make our own choices.

Induction

Can you prove these equations? (I did. The link to my solutions is here.)

Though it may seem to be a bit of a jump, the study of algorithms like these is actually closely related to the study of artificial intelligence. After all, if you strip down famous AIs like Watson or Cleverbot, all they really are is a lot of algorithms and if-else statements. Powerful algorithms allow these AIs to give the illusion of human-like intelligence by stringing together sentences and analyzing auditory input, among other things.

Privacy for Security: Fair Trade?

The main idea of government is that the people give up some of their power to the government in exchange for security. But a question has recently come to light as a result of the uncovered PRISM program and the potential for Google Glass to record what its wearer sees: how far can the government go to ensure its people’s security before it goes too far?

I would like to be able to offer a definite line dividing acceptable amounts of surveillance from unacceptable, but in truth there is no such thing. The line is fuzzy and liable to shift. In the end, it all boils down to how much the people trust the government. For instance, one argument in favor of the PRISM program is that the government will only look closely at data that could aid them in finding terrorists and such, and that the correspondence of the average Joe would fly under the radar. For a populace that does not trust its government, however, such assurances are not enough to allay their fears that their every secret and illicit affair might be exposed to prying eyes.

An article in the New York Times brushed against this topic when it mentioned that cameras in the hands of the people are a far different beast than they are in the hands of the government. As the article quotes from Jay Stanley, senior policy analyst with the American Civil Liberties Union in Washington, “In the hands of an individual, the video camera can be a very empowering thing. When it’s employed by the government to watch over the citizens, it has the opposite effect.”

Link to the article

Scientist, Engineer, Artist, Professional?

[Image: Where I live.]

Living in Silicon Valley means that I’m around a lot of scientists and engineers. But what exactly is the difference between a scientist and an engineer? The difference is that scientists are more interested in the theory behind something, and engineers prefer to actually build it. You could say that scientists like to learn new concepts while engineers like to apply them.

Artists are obviously different from scientists and engineers. Professionals, on the other hand, may not be so obvious. Professionals are educators who share knowledge, instead of dedicating their time to gathering or applying it.