Google+


A while back, I got my invite to Google’s new social network – Google+. I have to say that I’m really liking it so far, and favoring it over Facebook a lot! The negative side? My network isn’t on it quite yet…

If you’re interested in trying it out yourself, I’ve got plenty of invites left. Simply contact me somehow with your e-mail address and I’ll send you one of the invites. Comments should work just fine for that! ;-)

Labs: Uploadr


HTML5! Although it can’t be used in real projects yet due to a lack of browser support (or even finished definitions), I feel it’s important to stay up to date with its possibilities. On a quiet Thursday I decided to try to make a drag & drop upload interface, inspired by something I’d seen on Stolen Camera Finder. I quickly found out that the tools I would need are the HTML5 File API and XMLHttpRequest Level 2 (XMLHttpRequest2). So I went to work, and the result is Uploadr.

E-mail Aliases

Here’s a quick tip I found to be very useful myself:

When entering an e-mail address on websites (for example, in registration forms), it can be useful to use an e-mail alias. Take the first part of your e-mail address (before the “@” sign), add a “+” sign after it, and add an alias after the “+” sign. For example:

myemail@example.com

becomes

myemail+somewebsite@example.com

With e-mail providers that support this (Gmail does), the “+” sign and everything after it are ignored when the e-mail is delivered. This means that the e-mail still arrives at your normal e-mail address, even with the alias added. This is useful for identifying which websites are leaking your e-mail address. Combined with the powerful Gmail filters, it can also prove useful for filtering certain messages (or automatically forwarding/deleting them!).
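If you want to generate such aliases in code, here is a minimal sketch. The function names addAlias() and extractAlias() are made up for this example and are not part of PHP or any library; the code also assumes the address contains a single, plain local part.

```php
<?php
// Hypothetical helpers for illustration; not part of PHP or any library.

// Insert "+alias" right before the "@" of an e-mail address.
function addAlias(string $email, string $alias): string
{
    // Split at the last "@" so the part before it stays intact.
    $at = strrpos($email, '@');
    if ($at === false) {
        throw new InvalidArgumentException("Not an e-mail address: $email");
    }
    return substr($email, 0, $at) . '+' . $alias . substr($email, $at);
}

// Return the alias of an address, or null when it has none.
function extractAlias(string $email): ?string
{
    $at = strrpos($email, '@');
    $local = $at === false ? $email : substr($email, 0, $at);
    $plus = strpos($local, '+');
    return $plus === false ? null : substr($local, $plus + 1);
}

echo addAlias('myemail@example.com', 'somewebsite'), "\n";
echo extractAlias('myemail+somewebsite@example.com'), "\n";
```

The second helper is the interesting one for spotting leaks: run it on the recipient address of an incoming message and you know which website the address was given to.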


Scraping HTML documents in PHP

These days, most web services offer some sort of API to fetch and use information from them. But what do you do if you need information from a website that doesn’t offer such an API? Extracting it from the page itself is commonly referred to as screen scraping. Regular expressions may not be the best solution for that; here’s how to parse an HTML document using PHP’s DOMDocument and DOMXPath.

To solve this problem, one might first think of using regular expressions (I know I did). Although regular expressions are a powerful tool, I’ve learned they are not always the right one. Especially when parsing structured formats such as XML and HTML, it can be a good idea to use other available tools that are a better fit for the job.

PHP conveniently provides us with the DOMDocument class. The purpose of this class is to, as the name might suggest, help us work with DOM Documents. So how do we use this class to help us extract information from another website?

Step 1 – Loading the document

Let’s take this sample page. First of all, we have to retrieve the contents of this web page. Looking at the DOMDocument documentation, we can see that two ways are available to load HTML into the class: loadHTML() loads HTML from a string, and loadHTMLFile() loads HTML from a file. loadHTMLFile() also accepts a URL (provided PHP’s allow_url_fopen setting is enabled), so we can load the page directly; that’s settled then.

This means we have enough information to write our first piece of code:

<?php
$doc = new DOMDocument();
$doc->loadHTMLFile('http://labs.bobbievdheuvel.net/page/parsing-html-documents-in-php-sample.php');

It’s nice to have that piece of code, but it isn’t actually showing anything. Now I like to be able to actually see things happen, so just for debugging purposes, let’s add this to the end of the file:

echo $doc->saveHTML();

Run the script now, and you’ll see that it outputs essentially the same HTML as the original sample page (DOMDocument may normalize the markup slightly). Loading the document has now been taken care of – moving on!
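Real-world pages are rarely as tidy as the sample page, and DOMDocument reports every markup problem it finds as a PHP warning. Here is a minimal sketch of one way to keep those warnings out of your output, assuming the markup is already in a string (for example fetched with file_get_contents()). The helper name loadHtmlQuietly() and the inline markup are made up for this example.

```php
<?php
// Hypothetical helper: parse an HTML string without printing libxml warnings.
function loadHtmlQuietly(string $html): DOMDocument
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($ok === false) {
        throw new RuntimeException('Could not parse the HTML');
    }
    return $doc;
}

// Sloppy markup on purpose: the <p> tag is never closed.
$doc = loadHtmlQuietly('<html><body><p>Hello<br>world</body></html>');
echo $doc->saveHTML();
```

If you would rather log the problems than discard them, libxml_get_errors() returns them as a list before libxml_clear_errors() is called.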

Step 2 – Getting our data

Let’s say we want to get the current date and time from that page. We already have the source stored in our $doc variable; now we just need to filter out the information we want. In this example, we will be using DOMXPath to do just that.

Looking at the documentation, we see that a DOMXPath object requires a DOMDocument to be passed to the constructor. We already have one of those, so let’s do that then:

$doc = new DOMDocument();
$doc->loadHTMLFile('http://labs.bobbievdheuvel.net/page/parsing-html-documents-in-php-sample.php');
$xpath = new DOMXPath($doc);

Now that we’ve created the DOMXPath object, we can easily use it to filter out the elements we want to have. In this case, the element containing our data has an ID given to it, which makes this very easy for us! A DOMXPath object can be queried using its query() method. The query we will be using looks like this:

$element = $xpath->query("//*[@id='interesting_data']")->item(0);

Looks scary, but it isn’t all that complicated! Let’s go through it step by step, and let’s start with the expression we are using for our query.

  • The double slashes (“//”) simply mean “search on any level in the document”. This means it will look everywhere in the document, including any nested tags.
  • The asterisk (“*”) matches any tag name, so every element the search encounters is considered. Because our element happens to be a div, we could just as well have written div instead.
  • The square brackets contain a condition that the current tag must meet. To test a specific attribute of the tag, we prefix its name with an @ sign (in our case @id) and compare it to the value it should have.
  • The query method returns a list of items. We just want the first item (since the ID should be unique), so we immediately request the first item (zero-indexed) from the list.

If that seemed to go a bit too fast, it’s because it did! I will not go in-depth on XPath expressions, but if you want to learn more about them, this should be a nice read. XPath is mainly used for XML, but as you can see, it can be applied to HTML as well.
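To make the expression parts above a bit more tangible, here is a small sketch that runs a few variations against an inline HTML string instead of the sample page. The markup is made up for this example; only the interesting_data ID is taken from this post.

```php
<?php
// Made-up markup for illustration; only the id matches the one used in this post.
$html = '<html><body>'
      . '<div class="header"><span id="title">Sample</span></div>'
      . '<div id="interesting_data">2011-03-01 12:00</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Any element, on any level, whose id attribute has this value.
$any = $xpath->query("//*[@id='interesting_data']");

// Only div elements with that id: same result here, but more specific.
$div = $xpath->query("//div[@id='interesting_data']");

// Every div that has a class attribute at all, whatever its value.
$withClass = $xpath->query('//div[@class]');

echo $any->length, ' ', $div->length, ' ', $withClass->length, "\n"; // prints "1 1 1"
echo $any->item(0)->nodeValue, "\n"; // prints "2011-03-01 12:00"
```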

Now all that is left to do is to get the contents of the tag we just selected. Once again, the PHP documentation on the returned object shows us exactly what to do. The object has a nodeValue property containing the contents of the tag (also known as a node). Let’s echo it to make sure everything is working like it should:

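<?php
// Load the sample page into a DOM tree.
$doc = new DOMDocument();
$doc->loadHTMLFile('http://labs.bobbievdheuvel.net/page/parsing-html-documents-in-php-sample.php');
// Select the element with id="interesting_data"; item(0) is null when nothing matches.
$xpath = new DOMXPath($doc);
$element = $xpath->query("//*[@id='interesting_data']")->item(0);
if ($element !== null) {
    echo $element->nodeValue;
}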

When we run the script, we see that it is, in fact, that easy! $element holds the node we selected, its nodeValue property contains the contents we wanted to have, and now we can do whatever we want with it! Done!

That’s all?

Well, perhaps it isn’t (always) that simple, seeing as the information you want isn’t always contained so nicely all by itself in an element with an ID. However, with XPath expressions, we can go quite far in finding and selecting the right tags. Combine that with the wide range of string functions that PHP has in store for us, and it should be fairly easy to get any information we need!

…Well, as long as the website you’re reading from outputs something that vaguely resembles valid HTML.
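As a rough sketch of what combining XPath with string functions can look like, the example below picks a table cell that has no ID by looking at the cell next to it, and then strips a label from its text. The markup and the “Last update: ” label are made up for this example.

```php
<?php
// Made-up markup: the value we want sits in a table cell without an id.
$html = '<html><body><table>'
      . '<tr class="row odd"><td>Name</td><td> Server A </td></tr>'
      . '<tr class="row even"><td>Updated</td><td>Last update: 2011-03-01</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Second cell of the row whose first cell reads "Updated".
$cell = $xpath->query("//tr[td[1]='Updated']/td[2]")->item(0);

$date = null;
if ($cell !== null) {
    // Plain string functions take it from here: trim, then cut off the label.
    $text = trim($cell->nodeValue);
    $prefix = 'Last update: ';
    if (strpos($text, $prefix) === 0) {
        $date = substr($text, strlen($prefix));
    }
}

echo $date, "\n"; // prints "2011-03-01"
```

Checking for null before touching the node matters here: as soon as the other website changes its layout, the query simply stops matching.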

Do keep in mind that loading data from other websites means you are relying on those websites for your own website to load. Your website may take longer to load because your server needs to fetch data from an external source first. This can be solved by caching the data, or fetching the data on the client side instead.
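Caching can be as simple as writing the fetched page to a file and reusing it for a few minutes. The sketch below shows one way to do that; cachedFetch() is a made-up helper, the five-minute lifetime is an arbitrary choice, and the callable stands in for the real request so the example runs without a network connection.

```php
<?php
// Hypothetical helper: return the cached copy while it is fresh, otherwise refetch and store it.
function cachedFetch(string $cacheFile, int $maxAge, callable $fetch): string
{
    // Reuse the cache file if it exists and is younger than $maxAge seconds.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return file_get_contents($cacheFile);
    }
    $data = $fetch();
    file_put_contents($cacheFile, $data, LOCK_EX);
    return $data;
}

// Stand-in for the real request, e.g. file_get_contents() on the page's URL.
$calls = 0;
$fetch = function () use (&$calls): string {
    $calls++;
    return '<html><body><div id="interesting_data">cached</div></body></html>';
};

$cacheFile = sys_get_temp_dir() . '/scrape-cache-' . uniqid() . '.html';
$first  = cachedFetch($cacheFile, 300, $fetch); // fetches and writes the cache file
$second = cachedFetch($cacheFile, 300, $fetch); // served from the cache file

echo $calls, "\n"; // prints "1": the second call never reached $fetch
unlink($cacheFile);
```

The string that comes back can be handed to loadHTML() exactly like a freshly fetched page.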

Spy Sappin’ Mah Website!
Play nice! Stealing content from other websites is not a good idea. Make sure you have the author’s approval before you start taking their precious data. If there is an API available, always use that instead!

Comments, questions and other feedback are, of course, always appreciated!

Happy New Year!

Well, I had almost forgotten about this website, so here are my belated Happy New Year wishes to everybody for 2011!

2011 is promising to be an interesting and exciting year for me! This year I’m hoping to get 1 year and about 2000 kilometers closer to my goals. Also, I should definitely make sure to get that driver’s license this year.
