Christian Heilmann

Posts Tagged ‘php’

cURL – your “view source” of the web

Friday, December 18th, 2009

What follows here is a quick introduction to the magic of cURL. This was inspired by the comment of Bruce Lawson on my 24 ways article:

Seems very cool and will help me with a small Xmas project. Unfortunately, you lost me at “Do the curl call”. Care to explain what’s happening there?

What is cURL?

OK, here goes. cURL is your “view source” tool for the web. In essence it is a program that allows you to make HTTP requests from the command line or from within different programming languages.

The cURL homepage has all the information about it but here is where it gets interesting.

If you are on a Mac or on Linux, you are in luck – you already have cURL. If you are operating system challenged, you can download cURL in different packages.

On those systems you can go to the terminal and do your first cURL thing: load a web site and see its source. To do this, enter

curl "http://icant.co.uk"

And hit enter – you will get the source of icant.co.uk (that is the rendered source, like a browser would get it – not the PHP source code of course):

[Screenshot: showing the source with curl]

If you want the code in a file you can add a > filename.html at the end:

curl "http://icant.co.uk" > myicantcouk.html

[Screenshot: downloading with curl]

( The speed will vary of course – this is the Yahoo UK pipe :) )

That is basically what cURL does – it allows you to do any HTTP request from the command line. This includes simple things like loading a document, but also allows for clever stuff like submitting forms, setting cookies, authenticating over HTTP, uploading files, faking the referer and user agent, setting the content type and following redirects. In short, anything you can do with a browser.

I could explain all of that here, but that would be tedious, as it is well explained (if not nicely presented) on the cURL homepage.

How is that useful for me?

Now, where this becomes really cool is when you use it inside another language that you use to build web sites. PHP is my weapon of choice for a few reasons:

  • It is easy to learn for anybody who knows HTML and JavaScript
  • It comes with almost every web hosting package

The latter is also where the problem is. As a lot of people write terribly shoddy PHP, the web is full of insecure web sites. This is why a lot of hosters disallow some of the useful things PHP comes with. For example, you can load and display a file from the web with readfile():

<?php
  readfile('http://project64.c64.org/misc/assembler.txt');
?>

Actually, as this is a text file, it needs the right header:

<?php
  header('content-type: text/plain');
  readfile('http://project64.c64.org/misc/assembler.txt');
?>

You will find, however, that a lot of hosters will not allow you to read files from other servers with readfile(), fopen() or include() (usually by switching off PHP’s allow_url_fopen setting). Mine for example:

[Screenshot: readfile not allowed]

And this is where cURL comes in:

<?php
header('content-type:text/plain');
// define the URL to load
$url = 'http://project64.c64.org/misc/assembler.txt';
// start cURL
$ch = curl_init(); 
// tell cURL what the URL is
curl_setopt($ch, CURLOPT_URL, $url); 
// tell cURL that you want the data back from that URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
// run cURL
$output = curl_exec($ch); 
// end the cURL call (this also cleans up memory so it is 
// important)
curl_close($ch);
// display the output
echo $output;
?>

As you can see, the options are where things get interesting, and the ones you can set are legion.

So, instead of just including or loading a file, you can now alter the output in any way you want. Say you want for example to get some Twitter stuff without using the API. This will get the profile badge from my Twitter homepage:

<?php
$url = 'http://twitter.com/codepo8';
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$output = preg_replace('/.*(<div id="profile"[^>]+>)/msi','$1',$output);
$output = preg_replace('/<hr.>.*/msi','',$output);
echo $output;
?>

Notice that the HTML of Twitter uses a table for the stats, where a list would have done the trick. Let’s rectify that:

<?php
$url = 'http://twitter.com/codepo8';
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$output = preg_replace('/.*(<div id="profile"[^>]+>)/msi','$1',$output);
$output = preg_replace('/<hr.>.*/msi','',$output);
$output = preg_replace('/<\/?table>/','',$output);
$output = preg_replace('/<(\/?)tr>/','<$1ul>',$output);
$output = preg_replace('/<(\/?)td>/','<$1li>',$output);
echo $output;
?>
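
If you want to see what the three table replacements are meant to do without hitting Twitter at all, here is a rough sketch of the same substitutions in JavaScript, which you can paste into a console. The `sample` string and the `tableToList` name are made up for illustration; Twitter's real markup will have attributes and whitespace that these simple patterns do not cover.

```javascript
// Sketch only: the same three substitutions as the PHP preg_replace() calls,
// applied to a hypothetical fragment rather than Twitter's real HTML.
function tableToList(html) {
  return html
    .replace(/<\/?table>/g, '')        // drop <table> and </table>
    .replace(/<(\/?)tr>/g, '<$1ul>')   // <tr> becomes <ul>, </tr> becomes </ul>
    .replace(/<(\/?)td>/g, '<$1li>');  // <td> becomes <li>, </td> becomes </li>
}

// Hypothetical input, not what Twitter actually serves.
const sample = '<table><tr><td>120 following</td><td>5000 followers</td></tr></table>';
console.log(tableToList(sample));
// <ul><li>120 following</li><li>5000 followers</li></ul>
```

The captured group `(\/?)` is what keeps opening and closing tags apart: it is empty for `<tr>` and holds the slash for `</tr>`.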

Scraping stuff off the web is but one thing you can do with cURL. Most of the time what you will be doing is calling web services.

Say you want to search the web for donkeys. You can do that with Yahoo BOSS:

<?php
$search = 'donkeys';
$appid = 'appid=TX6b4XHV34EnPXW0sYEr51hP1pn5O8KAGs'.
         '.LQSXer1Z7RmmVrZouz5SvyXkWsVk-';
// encode the search term so spaces and special characters survive in the URL
$url = 'http://boss.yahooapis.com/ysearch/web/v1/'.
       rawurlencode($search).'?format=xml&'.$appid;
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
$output = curl_exec($ch); 
curl_close($ch);
$data = simplexml_load_string($output);
foreach($data->resultset_web->result as $r){
  echo "<h3><a href=\"{$r->clickurl}\">{$r->title}</a></h3>";
  echo "<p>{$r->abstract} <span>({$r->url})</span></p>";
}
?>

You can also do that for APIs that need POST requests or authentication. Say, for example, you want to use Placemaker to find locations in a text:

<?php
$content = 'Hey, I live in London, England and on Monday '.
           'I fly to Nuremberg via Zurich,Switzerland (sadly enough).';
$key = 'C8meDB7V34EYPVngbIRigCC5caaIMO2scfS2t'.
       '.HVsLK56BQfuQOopavckAaIjJ8-';
define('POSTURL',  'http://wherein.yahooapis.com/v1/document');
define('POSTVARS', 'appid='.$key.'&documentContent='.
                    urlencode($content).
                   '&documentType=text/plain&outputType=xml');
// start cURL with the endpoint and switch it to POST
$ch = curl_init(POSTURL);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, POSTVARS);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$x = curl_exec($ch);
curl_close($ch);
// turn the XML into an object, reading CDATA sections as text
$places = simplexml_load_string($x, 'SimpleXMLElement',
                                LIBXML_NOCDATA);
echo "<p>$content</p>";
echo "<ul>";
foreach($places->document->placeDetails as $p){
  $now = $p->place;
  echo "<li>{$now->name}, {$now->type} ";
  echo "({$now->centroid->latitude},{$now->centroid->longitude})</li>";
}
echo "</ul>";
?>

Why is all that necessary? I can do that with jQuery and Ajax!

Yes, you can, but can your users? Also, can you afford to have a page that is not indexed by search engines? Can you be sure that none of the other JavaScript on the page will cause an error that takes all of your functionality with it?

By sticking to your server to do the hard work, you can rely on things working. If you use web resources in JavaScript, you are first of all hoping that the user’s computer and browser understand what you want, and you also open yourself to all kinds of dangerous injections. JavaScript is not secure – every script executed in your page has the same rights. If you load third party content with JavaScript and you don’t filter it very cleverly, the maintainers of the third party code can inject malicious code that will allow them to steal information from your server and log in as your users or as you.
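
To give an idea of what the bluntest kind of filtering looks like, here is a minimal sketch that escapes third party text before it goes into a page. The `escapeHtml` name and the `untrusted` sample are made up for illustration, and escaping alone is only a starting point: it covers text placed between tags, not content written into attributes, URLs or scripts.

```javascript
// Sketch only: escape the characters that let text break out into HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')   // must come first, or it would re-escape the others
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical third party content.
const untrusted = '<script>alert("gotcha")</script>';
console.log(escapeHtml(untrusted));
// &lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;
```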

And why the C64 thing?

Well, the lads behind cURL actually used to do demos on C64 (as did I). Just look at the difference:

horizon 1990

haxx.se 2000

TTMMHTM: Email clients survey, Disney Steampunk, RFID luggage, search interfaces, Wave protocol open sourced and a great colour schemer

Thursday, July 30th, 2009

Things that made me happy this morning:

TTMMHTM: Evangelist Handbook, Billboard charts API, collaborative editing, IE6 bashing, pretty JSON, fancy fast food and terrible bugs.

Friday, July 24th, 2009

Things that made me happy this morning

Introducing Placemaker – a talk at the YDN Tuesday in London

Tuesday, July 7th, 2009

This is a talk I’ve given at the YDN Tuesday on 7th of July 2009 in London. It is an introduction to Placemaker, a new geolocation service by Yahoo. Check the slides, audio and the notes below.

Notes

Introducing Placemaker

Hello, I am Chris. Hacker by passion. When I went to the first WhereCamp about two years ago I thought nobody could out-geek me. I was wrong. Geolocation and geocoding are quite a hard-core branch of geekery. So let me tell you about a nice little product that makes things easy for you.

Placemaker

This is Yahoo Placemaker and it is an API. You give it a URL to get data from or a text to extract geographical information from. Here are the docs. Now go forth and build cool stuff. OK then… Let’s take a look at the need for something like Placemaker.

A web of information

The web is full of information. Which is cool. The problem is that we accumulated and still accumulate more and more information without giving it proper structure.

Searching and finding

Search engines help us find stuff. However, as being found means making money the first search results are not necessarily the best – only the ones that have been promoted the best way.

Analysing and deciding

Analysing all the data of the web is a massive job. And computers are stupid. Computers are decision engines that would be thoroughly stumped when asked “do I look fat in this dress” as they forget the underlying dangers in answering this question in one way or another.

Human additives

This is why we need humans. By enriching our content with structured, easier to parse data we make it easier for machines to harvest only the necessary parts of our documents. In the past that was keywords, now we use microformats and tagging. The latter is very useful as it can be crowdsourced. People tagging my photos on flickr or my site on del.icio.us make it easier for them to find them later on and give me an idea what keywords I hadn’t thought of.

Mobility

This is all fine and good, but the real change we see in the behaviour of web users is that we become more and more mobile. Laptops, mobile devices and netbooks are a very common sight, and wireless networks and fast 3G connectivity allow people to enjoy the web on the go.

This also means that people can locate themselves on the planet and expect information from their physical surroundings rather than just looking for words, matching and hoping the “night in paris” information they are looking for doesn’t end up in imagery of a disappointing night vision movie experience. In other words, for our content to be found we need to have geographical information in there that defines the locality of the text, not only what it talks about.

Finding the hidden goodies

And this is what Placemaker does for us – give it a text or a URL and it returns the geographical information in it, defined as names, as a Where On Earth ID (woeid) and as latitude and longitude.

Say I throw the following text at it:

First we take Manhattan and then we take Berlin.

If you get an API key you can POST this information to the Placemaker API endpoint like this:

http://wherein.yahooapis.com/v1/document

documentContent=First+we+take+Manhattan+and+then+we+take+Berlin.
documentType=text/plain
appid=my_appid
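
If you wonder how the three parameters end up as one POST body, here is a small sketch using the standard `URLSearchParams` object, which does the form encoding for you. It only builds the string and does not send anything; `my_appid` is the placeholder from above, not a working key.

```javascript
// Sketch only: build the form-encoded POST body for the request above.
// 'my_appid' is a placeholder, not a real API key.
const params = new URLSearchParams({
  documentContent: 'First we take Manhattan and then we take Berlin.',
  documentType: 'text/plain',
  appid: 'my_appid'
});

// Spaces become "+" and the slash in the MIME type is percent-encoded.
const body = params.toString();
console.log(body);
```

This string is what you would hand to CURLOPT_POSTFIELDS in PHP, or send as the body of any other HTTP client's POST request.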

Using different parameters

Placemaker takes different parameters that help you filter down the results to what you want.

  • appid – nothing happens without it!
  • inputLanguage – fr-CA, de-DE…
  • outputType – XML or RSS
  • documentContent – text to analyse
  • documentTitle – additional title
  • documentURL – URL to analyse
  • documentType – MIME type of the document
  • autoDisambiguate – remove duplicates, set to false to get more results
  • focusWoeid – filter around a woeid (400km radius)

Placemaker result sets

With the above data and parameters you will get the following XML document back:




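<contentlocation>
  <processingTime>0.001987</processingTime>
  <version>build 090508</version>
  <documentLength>48</documentLength>
  <document>
    <administrativeScope>
      <woeId>0</woeId>
      <type>Undefined</type>
      <name>…</name>
      <centroid>
        <latitude>0</latitude>
        <longitude>0</longitude>
      </centroid>
    </administrativeScope>
    <geographicScope>
      <woeId>1</woeId>
      <type>Supername</type>
      <name>…</name>
      <centroid>
        <latitude>0</latitude>
        <longitude>0</longitude>
      </centroid>
    </geographicScope>
    <extents>
      <center>
        <latitude>52.5161</latitude>
        <longitude>13.377</longitude>
      </center>
      <southWest>
        <latitude>40.6838</latitude>
        <longitude>-74.0477</longitude>
      </southWest>
      <northEast>
        <latitude>52.6675</latitude>
        <longitude>13.7262</longitude>
      </northEast>
    </extents>
    <placeDetails>
      <place>
        <woeId>638242</woeId>
        <type>Town</type>
        <name>…, Berlin, DE</name>
        <centroid>
          <latitude>52.5161</latitude>
          <longitude>13.377</longitude>
        </centroid>
      </place>
      <matchType>0</matchType>
      <weight>1</weight>
      <confidence>8</confidence>
    </placeDetails>
    <placeDetails>
      <place>
        <woeId>12589342</woeId>
        <type>County</type>
        <name>…, New York, NY, US</name>
        <centroid>
          <latitude>40.791</latitude>
          <longitude>-73.9659</longitude>
        </centroid>
      </place>
      <matchType>0</matchType>
      <weight>1</weight>
      <confidence>8</confidence>
    </placeDetails>
    <referenceList>
      <reference>
        <woeIds>12589342</woeIds>
        <start>14</start>
        <end>23</end>
        <isPlaintextMarker>1</isPlaintextMarker>
        <text>Manhattan</text>
        <type>plaintext</type>
        <xpath></xpath>
      </reference>
      <reference>
        <woeIds>638242</woeIds>
        <start>41</start>
        <end>47</end>
        <isPlaintextMarker>1</isPlaintextMarker>
        <text>Berlin</text>
        <type>plaintext</type>
        <xpath></xpath>
      </reference>
    </referenceList>
  </document>
</contentlocation>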





Working with Placemaker results

Placemaker results have a lot of cool things in them, all explained in detail in the docs. Let’s concentrate on the things we really want to play with here.

First up is a list of places the API found in the text. These are placeDetails elements with a nested place element:



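<placeDetails>
  <place>
    <woeId>12589342</woeId>
    <type>County</type>
    <name>…, New York, NY, US</name>
    <centroid>
      <latitude>40.791</latitude>
      <longitude>-73.9659</longitude>
    </centroid>
  </place>
  <matchType>0</matchType>
  <weight>1</weight>
  <confidence>8</confidence>
</placeDetails>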

This is cool, but it doesn’t tell us where this information came from. For this there is a referenceList element with an array of references:


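<reference>
  <woeIds>638242</woeIds>
  <start>41</start>
  <end>47</end>
  <isPlaintextMarker>1</isPlaintextMarker>
  <text>Berlin</text>
  <type>plaintext</type>
  <xpath></xpath>
</reference>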

Notice the element with the name woeIds, as – oh joy of joys – a reference can have several woeids it is connected with. In order to find out where the text Placemaker found as a match is located in the document, you either get start and end for text content or the XPath for structured content (XML/RSS). This is pretty sweet, of course.
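
To make the start and end values a bit more tangible, here is a small sketch that cuts the matched words back out of the sample sentence. It assumes that start and end are zero-based character offsets with the end being exclusive, which is what the numbers in the sample response suggest; the `matchedText` name is made up.

```javascript
// Sketch only: use the start and end of a reference to find the matched text.
// Assumption: zero-based offsets, end exclusive (as the sample numbers suggest).
const content = 'First we take Manhattan and then we take Berlin.';

// Offsets and woeids as they appear in the sample response.
const references = [
  { woeIds: '12589342', start: 14, end: 23 },
  { woeIds: '638242', start: 41, end: 47 }
];

function matchedText(text, ref) {
  return text.slice(ref.start, ref.end);
}

references.forEach(function (ref) {
  console.log(ref.woeIds + ': ' + matchedText(content, ref));
});
// 12589342: Manhattan
// 638242: Berlin
```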

Annoyances

There are a few annoyances when it comes to working with Placemaker.

The first is a limit of 50,000 bytes for the text to be analyzed, which is less than you think when you remember just how much we pack into our web documents.

The second is that the web is simply not a clean and nice dataset. When you read an HTML document from a live site you’ll find that in many cases Placemaker chokes – for starters only valid UTF-8 documents go through.

The third is that Placemaker has no JSON output at the moment, which means you cannot use results in JavaScript without writing your own converter.

The fourth is that Placemaker only allows POST requests, which makes it a bit harder to play with than GET-enabled APIs (which you can simply open in a browser window).

My biggest annoyance is the disconnect of places and references. This is not a problem of Placemaker as it wasn’t meant exclusively to match content to places – just find places. But it makes my favourite use cases – embedding geo location at the right spot in a document – harder.

Workarounds

Of course there are workarounds for all these issues (except for the 50,000-byte limit).
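
You cannot get around the byte limit, but you can at least check for it before you send anything. A minimal sketch, assuming the limit is counted in UTF-8 bytes (the names `LIMIT`, `byteLength` and `fitsLimit` are made up for illustration):

```javascript
// Sketch only: check a text against the 50,000 byte limit before sending it.
// Assumption: the limit is counted in UTF-8 bytes, not in characters.
const LIMIT = 50000;

function byteLength(text) {
  return new TextEncoder().encode(text).length;
}

function fitsLimit(text) {
  return byteLength(text) <= LIMIT;
}

console.log(byteLength('Zurich')); // 6
console.log(byteLength('Zürich')); // 7, as the umlaut takes two bytes
console.log(fitsLimit('Hey, I live in London, England')); // true
```

The point of counting bytes rather than characters is that any text with umlauts, accents or non-Latin scripts reaches the limit sooner than its length suggests.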

Fixes

As with anything on the web, there are ways to work around these annoyances.

Fixing the broken web with YQL

The first trick is to use YQL to load the HTML and filter it before you send it to Placemaker.


<?php
$key = 'YOUR_API_KEY';
if(isset($_GET['url'])){
  // ask YQL for the page; it returns the cleaned-up HTML wrapped in XML
  $realurl = 'http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%20%3D%20%22'.urlencode($_GET['url']).'%22&format=xml';
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $realurl);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $c = curl_exec($ch);
  curl_close($ch);
  if(strstr($c,'<')){
    // remove the YQL wrapper and the XML declaration (the tag names in
    // these two patterns were lost in the original listing and are
    // reconstructed here as an assumption)
    $c = preg_replace('/.*<results>|<\/results>.*/','',$c);
    $c = preg_replace('/<\?xml version="1\.0" encoding="UTF-8"\?>/','',$c);
    // remove all tags and collapse line breaks to save on bytes
    $c = strip_tags($c);
    $c = preg_replace('/[\r\n]+/',' ',$c);
    // send the remaining text to Placemaker
    define('POSTURL', 'http://wherein.yahooapis.com/v1/document');
    define('POSTVARS', 'appid='.$key.'&documentContent='.urlencode($c).
                       '&documentType=text/html&outputType=xml');
    $ch = curl_init(POSTURL);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, POSTVARS);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $x = curl_exec($ch);
    curl_close($ch);
    header('content-type:text/xml');
    echo $x;
  }
}
?>

The YQL implementation of select * from html runs the HTML through Tidy to fix issues and is encoding agnostic. That way Placemaker now can get data from sources it normally chokes on. Notice that it is a good idea to filter out tags and whitespace to save on byte-size.
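
How much the filtering saves is easy to try out. Here is a rough sketch in JavaScript, assuming a naive regular expression is good enough to stand in for PHP's strip_tags() (for hostile or badly broken markup it is not). The `slim` name and the `page` sample are made up for illustration.

```javascript
// Sketch only: strip tags and collapse whitespace to save on bytes.
// The tag pattern is naive and only stands in for PHP's strip_tags().
function slim(html) {
  return html
    .replace(/<[^>]*>/g, '')  // remove anything that looks like a tag
    .replace(/\s+/g, ' ')     // collapse runs of whitespace and line breaks
    .trim();
}

// Hypothetical markup.
const page = '<div>\n  <p>I live in <b>London</b>,\n  England.</p>\n</div>';
console.log(slim(page));
// I live in London, England.
console.log(page.length + ' characters before, ' + slim(page).length + ' after');
```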

Matching references and places

The best way to explain this is to build a small implementation. For example a simple form that allows a user to enrich a text with Geo Microformats:

The code is not hard. The main trick is to create an array from the known places and then match them with the IDs of a reference in a nested loop.



	

<?php
// if some text was sent through
if(isset($_POST['analyze'])){
  $content = $_POST['analyze'];
  $template = $_POST['template'];
  // define the API key and do the call to Placemaker
  $key = 'C8meDB7V34EYPVngbIRigCC5caaIMO2scfS2t'.
         '.HVsLK56BQfuQOopavckAaIjJ8-';
  define('POSTURL', 'http://wherein.yahooapis.com/v1/document');
  define('POSTVARS', 'appid='.$key.'&documentContent='.
                      urlencode($content).
                     '&documentType=text/plain&outputType=xml');
  $ch = curl_init(POSTURL);
  curl_setopt($ch, CURLOPT_POST, 1);
  curl_setopt($ch, CURLOPT_POSTFIELDS, POSTVARS);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $x = curl_exec($ch);
  curl_close($ch);

  // create an object from the XML
  $places = simplexml_load_string($x, 'SimpleXMLElement',
                                  LIBXML_NOCDATA);
  // loop over places and create an array with
  // the woeid as the key; appending '' turns each
  // SimpleXML node into a plain string
  $foundplaces = array();
  foreach($places->document->placeDetails as $p){
    $woeid = 'woeid'.$p->place->woeId;
    $foundplaces[$woeid] = array(
      'name' => str_replace(', ZZ','',$p->place->name.''),
      'type' => $p->place->type.'',
      'woeId' => $p->place->woeId.'',
      'lat' => $p->place->centroid->latitude.'',
      'lon' => $p->place->centroid->longitude.''
    );
  }

  // loop over the references and over the woeids
  $refs = $places->document->referenceList->reference;
  foreach($refs as $r){
    foreach($r->woeIds as $wi){
      // skip references to places we have no data for
      if(!isset($foundplaces['woeid'.$wi])){ continue; }
      // get dataset connected with the current woeid
      $currentloc = $foundplaces['woeid'.$wi];
      // check if all interesting data exists, fill the
      // template placeholders and swap the match in the text
      if($r->text != '' && $currentloc['name'] != '' &&
         $currentloc['lat'] != '' && $currentloc['lon'] != ''){
        $text = $r->text.'';
        $mf = str_replace(
          array('%place%','%lat%','%lon%'),
          array($text,$currentloc['lat'],$currentloc['lon']),
          $template);
        // str_replace avoids trouble with regex characters in names
        $content = str_replace($text,$mf,$content);
      }
    }
  }
}
?>
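
The nested loop is easier to see without the cURL and XML plumbing around it. Here is the same idea as a small JavaScript sketch: build a lookup keyed by woeid, then walk the references and pick the matching place for each. The two arrays are hand-written stand-ins using the woeids and coordinates from the sample response, the `text` values are what the start and end offsets point at, and `matchPlaces` is a made-up name.

```javascript
// Sketch only: match references to places via a lookup keyed by woeid.
// Hand-written stand-ins for the parsed Placemaker response.
const places = [
  { woeId: '638242', type: 'Town', lat: '52.5161', lon: '13.377' },
  { woeId: '12589342', type: 'County', lat: '40.791', lon: '-73.9659' }
];
const references = [
  { woeIds: ['12589342'], text: 'Manhattan' },
  { woeIds: ['638242'], text: 'Berlin' }
];

function matchPlaces(places, references) {
  // first pass: index the places by woeid
  const byWoeid = {};
  places.forEach(function (p) { byWoeid['woeid' + p.woeId] = p; });

  // second pass: a reference can point to several woeids, hence the inner loop
  const matches = [];
  references.forEach(function (ref) {
    ref.woeIds.forEach(function (id) {
      const place = byWoeid['woeid' + id];
      if (place) {
        matches.push({ text: ref.text, lat: place.lat, lon: place.lon });
      }
    });
  });
  return matches;
}

console.log(matchPlaces(places, references));
```

Each entry in the result has everything a template with %place%, %lat% and %lon% placeholders needs.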

Making Placemaker GET it with YQL

Another thing YQL allows developers to do is to extend it with their own open tables that run JavaScript conversions on the server side. One of those is the geo.placemaker open table, which does all the things Placemaker does but on the server and offers JSON output.

The great thing about the JSON output is that it already matches up places and references for us:

{
  "query":{
    "count":"…",
    "created":"…",
    "lang":"…",
    "updated":"…",
    "uri":"…",
    "diagnostics":{
      "publiclyCallable":"…",
      "url":[
        {"execution-time":"…","content":"…"},
        {"execution-time":"…","content":"…"},
        {"execution-time":"…","content":"…"}
      ],
      "javascript":{"instructions-used":"…"},
      "user-time":"…",
      "service-time":"…",
      "build-version":"…"
    },
    "results":{
      "matches":{
        "match":[
          {
            "place":{
              "woeId":"…",
              "type":"…",
              "name":"…, Berlin, DE",
              "centroid":{"latitude":"…","longitude":"…"}
            },
            "reference":{
              "woeIds":"…",
              "start":"…",
              "end":"…",
              "isPlaintextMarker":"…",
              "text":"…",
              "type":"…",
              "xpath":""
            }
          },
          {
            "place":{
              "woeId":"…",
              "type":"…",
              "name":"…, New York, NY, US",
              "centroid":{"latitude":"…","longitude":"…"}
            },
            "reference":{
              "woeIds":"…",
              "start":"…",
              "end":"…",
              "isPlaintextMarker":"…",
              "text":"…",
              "type":"…",
              "xpath":""
            }
          }
        ]
      }
    }
  }
}

Using the open table we can easily use Placemaker in JavaScript:

function gotit(o){
  var matches = o.query.results.matches.match;
  for(var i=0,j=matches.length;i<j;i++){
    console.log('Name: ' + matches[i].place.name);
    console.log('lat: ' + matches[i].place.centroid.latitude);
    console.log('lon: ' + matches[i].place.centroid.longitude);
    console.log('Match: ' + matches[i].reference.text);
  }
}

var content = 'First we take Manhattan and then we take Berlin';
var yql = 'select * from geo.placemaker where documentContent = "' +
          content + '" and documentType="text/plain" and appid = ""';
var url = 'http://query.yahooapis.com/v1/public/yql?' +
          'format=json&callback=gotit&env=' +
          'http%3A%2F%2Fdatatables.org%2Falltables.env&q=' +
          encodeURIComponent(yql);
var s = document.createElement('script');
s.setAttribute('src',url);
document.getElementsByTagName('head')[0].appendChild(s);
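
One thing worth checking against a live response: as far as I can tell, YQL's JSON output gives you a single object instead of a one-element array when there is only one match, and a null results value when there is none. That is an assumption about YQL's behaviour, not something the sample above shows. A defensive sketch (`toArray` and `placeNames` are made-up names, and the two response objects are hand-written):

```javascript
// Sketch only: normalise the "match" value before looping over it.
// Assumption: one match arrives as an object, no match as results: null.
function toArray(match) {
  if (match === undefined || match === null) { return []; }
  return Array.isArray(match) ? match : [match];
}

function placeNames(response) {
  var results = response.query.results;
  if (!results || !results.matches) { return []; }
  return toArray(results.matches.match).map(function (m) {
    return m.place.name;
  });
}

// Hand-written stand-ins for responses with one and with no match.
var one = { query: { results: { matches: { match: { place: { name: 'Somewhere' } } } } } };
var none = { query: { results: null } };
console.log(placeNames(one));  // [ 'Somewhere' ]
console.log(placeNames(none)); // []
```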

Implementations

  • Yahoo News Map – Yahoo News Map uses the Yahoo RSS feed run through Placemaker to show news on a map and lets you navigate using the map.
  • TweetLocations – TweetLocations shows a map of your latest tweets.
  • Geo this! (Greasemonkey) – Geo This! is a Greasemonkey script for WordPress that adds a button to analyze and tag the content before submitting the blog post.
  • GeoMaker – GeoMaker is a frontend to Placemaker that turns a URL or a text into a map.
  • GeoMaker API – GeoMaker also has its own API that makes it easy to convert URLs to all kinds of handy formats.
  • JS-Placemaker – JS Placemaker is a JavaScript wrapper for Placemaker using the open YQL table.

You have the data, you have the tools, now go and make some ideas a reality

That’s all I have for you today. Check the resources coming up and have a play with Placemaker. Contact me once you’ve done something cool and I’ll be happy to tell the team about it. Here are some ideas:

Flickr knows woeid

Using YQL and open tables you can get geolocated photos from a text analysed with Placemaker:

select * from flickr.photos.info where photo_id in
(
  select id from flickr.photos.search where woe_id in
  (
    select match.place.woeId from geo.placemaker where
    documentContent = "First we take Manhattan and then we take Berlin"
    and documentType="text/plain" and appid = ""
  )
  and license=4
)
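
To run a nested statement like this, it has to be URL-encoded and sent to the YQL endpoint. Here is a sketch that only builds the URL, using the endpoint and the datatables.org environment from the JavaScript example earlier on; the `yqlUrl` name is made up and the empty appid would need to be filled with your own key.

```javascript
// Sketch only: turn the nested YQL statement into a request URL.
// The appid is left empty as in the statement above; add your own key.
const yql =
  'select * from flickr.photos.info where photo_id in (' +
    'select id from flickr.photos.search where woe_id in (' +
      'select match.place.woeId from geo.placemaker where ' +
      'documentContent = "First we take Manhattan and then we take Berlin" ' +
      'and documentType="text/plain" and appid = ""' +
    ') and license=4' +
  ')';

function yqlUrl(query) {
  return 'http://query.yahooapis.com/v1/public/yql?format=json&env=' +
         encodeURIComponent('http://datatables.org/alltables.env') +
         '&q=' + encodeURIComponent(query);
}

console.log(yqlUrl(yql));
```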

Other Yahoo Geo resources

The Yahoo Geo section of the Yahoo Developer Network has all the other Geo goodies for us. Maps, FireEagle and even the Placemaker dataset for download are all there for you to use.

The Guardian Data Store

The Guardian has a really nice blog/resource that always has new data for you to play with: the Guardian Data Store. Have a look and a play.

TTMMHTM: iPhone GPS and Atari sourcecode, reboot Britain, augmented reality tube app, cougars and sun in San Francisco

Monday, July 6th, 2009

Things that made me happy this morning: