HOW-TO write Media Info Scrapers (the complete dummies guide)

Scraper creation for dummies, a step-by-step guide.

Chapter one
First, some very important reference information. You do not need to read it all right now, but keep these links at hand...


 * Reference to scraper layout and structure: Scrapers
 * Tool to test scrapers: Scrap (Download NOW both files referenced there, scrap.exe & libcurl.dll)
 * Basic XML syntax knowledge.
 * Some info about regular expressions (RegExp), see: Regular Expression (RegEx) Tutorial
 * More info on regular expressions from wikipedia: http://en.wikipedia.org/wiki/Regex

Introduction to writing a scraper
So how does one write these mysterious XML files that we call scrapers? Well, there is nothing like an example, so we are going to start by going through a basic IMDb scraper in detail, explaining the key points along the way. Let us get off to an easy start; here is the root tag:
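A minimal sketch of such a root tag (the thumb filename here is just an example):

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<scraper name="IMDb" content="movies" thumb="imdb.gif">
    <!-- CreateSearchUrl, GetSearchResults, GetDetails sections go here -->
</scraper>
```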

As you can see, we give it three parameters: a name (used for display in the GUI), a content type and a thumb. The thumb URL is relative to wherever the XML file is stored (so relative to "special://xbmc/system/scrapers/video" in XBMC). Note! It is very important that the content type of the scraper is exactly "movies", as this is what XBMC uses to filter scrapers in the set content dialog.

Okay, that was easy, no? The first "function" called by XBMC will be CreateSearchUrl. This function has a sole purpose: given a string describing whatever we want to look up, in URL-friendly form (i.e. + instead of spaces, etc.), it should return the URL we should fetch to perform the search. Okay, many words; let's see the actual code for it:
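A sketch of this method for IMDb, matching the walkthrough that follows (the exact IMDb search URL format may have changed since this was written):

```xml
<CreateSearchUrl dest="3">
    <!-- input="$$1": run the expression on the contents of buffer 1 (the search string) -->
    <!-- the empty expression selects everything, so \1 is the whole search string -->
    <RegExp input="$$1" output="http://akas.imdb.com/find?s=tt;q=\1" dest="3">
        <expression noclean="1"/>
    </RegExp>
</CreateSearchUrl>
```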

We immediately notice a lot of funky numbers. Before we return to the example, it is best we first explain the buffers and how they work. There are two sets of buffers we need to relate to. The first one should hopefully be familiar - the regular expression selections. The second set is the scraper buffers - these are the buffers you have available to get your regular expressions to return the sought data, and they are also the means by which we 'communicate' with a scraper from outside - certain buffers hold certain input data upon a call to the scraper, and after the method has been run, the value of one of these buffers is taken as the output. You have 20 buffers to play with, numbered 1-20.

Now, back to our CreateSearchUrl method. Hopefully you have a suspicion now what those numbers are all about. Correct, they are buffer numbers. Any method in a scraper has a dest parameter. This is the number of the buffer that will hold the final result. CreateSearchUrl has one input parameter - buffer1 - whatever we want to search for.

Moving on, we are now in the body of the scraper. We see we have two kinds of tags, <RegExp> and a child <expression>. These are the only two tags we need. The <RegExp> tag has 3 parameters - input, output and dest. To take the easy one first: dest="3" means stuff the results into buffer 3. This is indeed consistent with the return param of the entire method (we are only doing a single regexp). output=".." is the replace string for the regular expression - note the usage of \1 - selection 1, not to be confused with buffer 1. Then there's the input parameter. Here we note we have two $'s in front of the number 1. This means 'the contents of buffer 1'. It simply states: 'I want to run this expression on the contents of buffer 1'. Note that you can use several buffers here, like input="$$1$$2" to run the expression on the contents of buffer 1 followed by the contents of buffer 2.

Then we have the <expression> tag. This is where the actual regular expression is given. We have one parameter - noclean. It is given the value 1, which means 'do not strip tags from selection 1 prior to inserting it into the output buffer'. You can specify more than one selection by using a comma-separated list. The default behaviour is to strip off any HTML tags so that results display fine in XBMC. NOTE: this default may be changed; if so, I will announce it loud and clear. The <expression> tag has no value - this is interpreted by the scraper as 'select it all' (.*). So, let's run through it, shall we?

We are given the search string in buffer 1; let us say it has the value "foo". Substituting it into the output, the scraper produces: http://akas.imdb.com/find?s=tt;q=foo

We have now created our search URL.

How a scraper works
In a nutshell:


 * 1) If there is a movie.nfo, use it (section NfoUrl) and then go to the last step
 * 2) Otherwise, from the file's name generate a search URL (section CreateSearchUrl) and get the results
 * 3) From the results generate a listing (section GetSearchResults) that has, for each "candidate" movie, a user-friendly denomination and one (or more) associated URLs
 * 4) Show the listing to the user so they can choose, and select the associated URL(s)
 * 5) Get the URL's content and extract from it (section GetDetails) the appropriate data for the movie to store in the video database

Each one of these four sections is written as a RegExp entry with this structure:
 * INPUT is usually the content of a buffer (in a moment we will see what that is)
 * OUTPUT is a string that is built up by the RegExp
 * DEST is the name of the buffer where OUTPUT will be stored
 * EXPRESSION is a regular expression that manipulates INPUT to extract information from it as "fields"; if EXPRESSION is empty, a field "1" is automatically created which contains all of INPUT

Here a "buffer" is just a memory section that is used for communication between each section and the rest of XBMC. There are twenty buffers named 1 to 20. To express the content of a buffer you use "$$n", where n is the number of the buffer.

The fields get extracted from the input by EXPRESSION just by selecting patterns with "(" and ")" and are named sequentially with numbers; the first one is \1, the second \2, up to a maximum of 9.

A very easy example:
 * As input, the content of buffer 1 is used
 * The output will be stored in buffer 3
 * As the expression is empty, all the input ($$1) will be stored in field \1
 * As the output is simply \1, all its content will be used for the output, that is, $$1

So, the end result is that the content of buffer 1 will be stored in buffer 3.
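In scraper XML, the example just described can be sketched as:

```xml
<!-- empty expression: field \1 automatically gets all of the input ($$1) -->
<RegExp input="$$1" output="\1" dest="3">
    <expression/>
</RegExp>
```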

If you do not know anything about regular expressions, this is the moment to make a quick study of the principles of them from the references above.

Another example; this time we use a string as input and a very simple regular expression to select part of it. There, when we apply the expression to the input, the selected pattern (.*) becomes field 1; in this case it gets assigned "The Dark Knight". The output will thus be "The title is The Dark Knight" and will be stored in buffer 3.
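A sketch of that example (the literal input string is invented for illustration):

```xml
<!-- the pattern (.*) captures "The Dark Knight" into field \1 -->
<RegExp input="Title: The Dark Knight" output="The title is \1" dest="3">
    <expression>Title: (.*)</expression>
</RegExp>
```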

The most important sections in a scraper
Now, let's have a look into the 3 "important" sections: CreateSearchUrl, GetSearchResults and GetDetails. First, there is some basic information about them we need to know.

CreateSearchUrl must generate the URL that will be used to get the listing of possible movies. To do that, you need the name of the file selected to be scraped, which is stored by XBMC in buffer 1.

GetSearchResults must generate the listing of movies (in user-ready form) and their associated URLs. The result of downloading the content of the URL generated by CreateSearchUrl is stored by XBMC in buffer 1. The listing must have this structure: each <entity> must have a <title> (the text that will be shown to the user) and at least one <url>, although there can be up to 9. You can generate as many entities as you need; they will become a listing shown to the user to choose from.
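The listing structure, sketched with invented values:

```xml
<results>
    <entity>
        <title>We Own the Night (2007)</title>
        <url>http://www.example.com/movie/12345</url>
    </entity>
    <!-- one <entity> per candidate movie -->
</results>
```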

Once the user has selected a movie, the associated URL(s) will be downloaded.

Last, GetDetails must generate the listing of detailed information about the movie in the correct format, using for that the content of the URL(s) selected from GetSearchResults. The first one will be in $$1, the second in $$2 and so on.

The structure that the listing must have is this. Notes:
 * Some fields can be missing or empty
 * <thumb> contains the URL of the image to be downloaded later
 * <genre>, <director>, <credits> and <actor> can be repeated as many times as needed
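A sketch of the details structure with invented values (the exact set of supported fields depends on the XBMC version):

```xml
<details>
    <title>We Own the Night</title>
    <year>2007</year>
    <director>James Gray</director>
    <genre>Drama</genre>
    <plot>Two brothers end up on opposite sides of the law.</plot>
    <thumb>http://www.example.com/poster.jpg</thumb>
    <actor>
        <name>Joaquin Phoenix</name>
        <role>Bobby Green</role>
    </actor>
</details>
```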

Some important details to remember:


 * When you need to use some special characters in the regular expression, do not forget to "escape" them:
 * \ → \\
 * ( → \(
 * . → \.
 * (etc.)
 * Since the scraper itself is an XML file, the characters with a meaning in XML cannot be used directly, so you must use their XML entities:
 * & → &amp;
 * < → &lt;
 * > → &gt;
 * " → &quot;
 * ' → &apos;
 * If you use non-ASCII characters in your XML to be used in the output (umlauts, ñ, etc), they must be coded with the appropriate encoding as expressed in the XML file (in our example it was iso-8859-1, as you see in the code)

Our first working scraper
Now, with all that information, let's create our first scraper. Just create a dummy.xml file with this content and study it a little, it should be fairly easy to understand with what we already know:
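A reconstruction of such a dummy scraper (all fake values are invented; the buffer numbers follow the conventions described above):

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<scraper name="dummy" content="movies" thumb="dummy.gif">
    <CreateSearchUrl dest="3">
        <!-- ignore the search string, always "search" the same page -->
        <RegExp input="$$1" output="http://www.nada.com" dest="3">
            <expression/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <!-- ignore the downloaded page in $$1, always offer one fake movie -->
        <RegExp input="$$1" output="&lt;results&gt;&lt;entity&gt;&lt;title&gt;Fake movie&lt;/title&gt;&lt;url&gt;http://www.nada.com&lt;/url&gt;&lt;/entity&gt;&lt;/results&gt;" dest="8">
            <expression/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <!-- ignore the downloaded page in $$1, always return the same fake data -->
        <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;Fake movie&lt;/title&gt;&lt;year&gt;2000&lt;/year&gt;&lt;plot&gt;A fake plot.&lt;/plot&gt;&lt;/details&gt;" dest="3">
            <expression/>
        </RegExp>
    </GetDetails>
</scraper>
```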

A really stupid scraper with no meaningful use whatsoever: whatever movie is fed to it, it will always generate the same (fake) data; also, it will download information from www.nada.com and not use it at all. But nevertheless we have our first working scraper, congratulations!

To test it in Windows, put the files scrap.exe and libcurl.dll that are referenced at Scrap, together with the dummy.xml file, in any directory and then execute, for example, this:


 * scrap dummy.xml "Hello, world"

It should execute without errors and show you each step and its output.

You can also try it in a "real" XBMC: just copy dummy.xml to XBMC\system\scrapers\video, start XBMC, choose any directory from your sources that contains a video file not yet incorporated into the library, "set content" of the directory to use "dummy" as the scraper and finally select "movie info" on the video file. All our fake data will be incorporated into the video database.

Introduction
Now that we know how to create a skeleton scraper, let's re-create a real one. I've chosen a fairly simple one, the one used to scrape the Spanish site culturalia.es (in fact the URL is http://www.culturalianet.com). First of all, we must know how the site we intend to write the scraper for works.

Open http://www.culturalianet.com. To perform a search, write "la noche es nuestra" (the Spanish title for "We Own the Night") in the "Buscar:" box at the top of the page. When you press the Buscar ("Search") button, the URL opened is:


 * http://www.culturalianet.com/bus/resu.php?texto=la+noche+es+nuestra&donde=1

CreateSearchUrl
So, very easy: our search URL will be "http://www.culturalianet.com/bus/resu.php?texto=" + (text to search) + "&donde=1"
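As a scraper section this could look like the following (a sketch, following the same pattern as the IMDb example; note that "&" must be written as &amp;amp; because the scraper itself is XML):

```xml
<CreateSearchUrl dest="3">
    <!-- \1 is the whole search string (the empty expression selects everything) -->
    <RegExp input="$$1" output="http://www.culturalianet.com/bus/resu.php?texto=\1&amp;donde=1" dest="3">
        <expression noclean="1"/>
    </RegExp>
</CreateSearchUrl>
```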

So far, so good; into field 1 goes the input (the name of the movie, already stripped by XBMC of the file extension and some common words like "divx", "ac3" and so on), and to generate the output we just write \1 at the point where we need it.

GetSearchResults
Now we must understand how the results page is formatted; for that, the "View selection source" function of Firefox is very useful. Just select the end of the header of the listing and some of the first entries and "view selection source"; this is what I get: See? We simply need to select, for each entry, the title and maybe some information and then the URL, and repeat that for all the entries in the listing. Fortunately, XBMC offers us some resources to help that we haven't seen yet: the "expression" part of RegExp can have some attributes; in this case, to repeat the application of the expression to the input as many times as there is data for it, we simply add 'repeat="yes"' as an attribute:

and now let's go for the expression. We will extract the culturalianet's ID of the article about the movie, the spanish title, the original title, the name of the director and the year of the movie. The ID we get from:

is just a string of numbers, to select it as a field we surround it with parentheses:

after that, there is the Spanish title, ending in a dot and followed by the closing </a>, so we select as our second field a string of any length (it must have at least one character) that does not contain "<":

Then there is some formatting and, surrounded by <i> and </i>, the original title (again a string of one or more characters). We jump over the formatting with [^<i>]* and select our third field:

Then there is </i> and the literal "De " followed by the director's name, up until the year of the movie, which appears surrounded by parentheses:

and our fifth and last field is the movie year, ending at (but not including) the character ")":

all put together, and exchanging "<" for "&lt;" etc. (since the scraper is an XML file), this is our <expression>:
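Putting the five fields together, the expression could look roughly like this (a sketch: the exact HTML of the results page is not reproduced here, so the glue between the fields is an assumption):

```xml
<expression repeat="yes">art=([0-9]+)[^>]*>([^&lt;]+)\.&lt;/a>[^&lt;i>]*&lt;i>([^&lt;]+)&lt;/i>[^D]*De ([^(]+) \(([^)]+)\)</expression>
```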

there, the fields will be:
 * \1 ID of the movie's article in culturalianet.com
 * \2 Spanish title
 * \3 Original title
 * \4 Director's name
 * \5 Movie's year of first exhibition

Each of our <entity> entries will have a <title> in this form (first an example, then the template with our actual fields):
 * 'Noche es nuestra, la' (We own the night) de James Gray (2007)
 * '\2' (\3) de \4 (\5)

Also, there will be a <url> generated by:
 * http://www.culturalianet.com/art/ver.php?art=\1

Like we did with our dummy scraper, we add all the necessary headings and this is the result: There are a few things there we have not seen yet. For starters, note that there are two nested RegExps; they get evaluated from the inner ones to the outer ones. Also, there is an attribute for <expression> we haven't seen yet, 'noclean="1"'; by default, XBMC will strip the selections of all HTML formatting, but here we do not want that, so we add the attribute to indicate that we do not want XBMC to clean our input before using it.
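A sketch of the assembled section (nested RegExps evaluate inner-first; the expression's HTML details are assumptions):

```xml
<GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results&gt;\1&lt;/results&gt;" dest="8">
        <!-- inner RegExp runs first: one <entity> per match goes into buffer 5 -->
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2' (\3) de \4 (\5)&lt;/title&gt;&lt;url&gt;http://www.culturalianet.com/art/ver.php?art=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
            <expression repeat="yes">art=([0-9]+)[^>]*>([^&lt;]+)\.&lt;/a>[^&lt;i>]*&lt;i>([^&lt;]+)&lt;/i>[^D]*De ([^(]+) \(([^)]+)\)</expression>
        </RegExp>
        <!-- outer expression: empty, so \1 is all of buffer 5; noclean keeps the XML tags intact -->
        <expression noclean="1"/>
    </RegExp>
</GetSearchResults>
```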

Also, and this is standard XML, you can shorten empty XML elements like <expression></expression> by writing <expression/> instead.

So, how does XBMC execute this? It goes to the inner regexp and, using input="$$1" (the content of our search URL), applies the expression to it and generates our fields:

In the previous line, for clarity, I'm using < instead of &lt;

That code generates this output to buffer 5: repeat="yes" repeats the expression as long as there is a match in the input, generating as many entities as needed, and all of them go to $$5

Then, the outer regexp gets executed. It uses as input $$5, which has just been generated; it does not modify anything (an empty <expression> means all the input goes to \1), but remember to use the noclean clause to maintain the necessary formatting. It simply takes all the entities generated and inserts them into the correct XML structure:

All output goes to buffer 8.

GetDetails
Now, XBMC will show the user the list of movies and one will be selected. The associated URL, the article page of the movie, will be downloaded and fed to buffer 1, and we need to parse it to extract the information we need.

Go now to a movie article, like http://www.culturalianet.com/art/ver.php?art=29405, and select and look at the underlying HTML code. Very much like we did when parsing the search results page, we must detect the patterns in the page that allow us to select the correct fields and then use them to build our XML structure. Some parts are fairly straightforward, like title, duration, plot or year; this expression extracts the Spanish title, the original title and the year into fields 1, 3 and 2 respectively:

and the output for that (we add an <originaltitle> clause that is not needed nor used by XBMC right now; maybe future versions will use it):
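A sketch of that expression and its output (the page's exact HTML is an assumption; note the field order: \1 Spanish title, \2 year, \3 original title):

```xml
<RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;&lt;originaltitle&gt;\3&lt;/originaltitle&gt;" dest="8">
    <expression>&lt;b>([^&lt;]+) \(([0-9]+)\)&lt;/b>[^&lt;i>]*&lt;i>([^&lt;]+)&lt;/i></expression>
</RegExp>
```

In the final scraper, a last RegExp wraps the accumulated buffer 8 in the <details> tags.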

Some data is much more difficult to obtain; there can be one or more actors, writers and directors, and in the page the structure differs depending on whether culturalia has a page for the specific artist or not (there is just the name, or the name becomes a link). The "actors" block is surrounded by "Actores:" and "Productor:", so we simply extract that block into, for example, $$7

Then we parse the $$7 buffer; in it, the name of each actor will be anything between > and < that is at least one character long:

This will be our actors' output:

the full regexp for the actors (remember that it evaluates from the inner to the outer regexp):
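A sketch of the full actors regexp (the markers "Actores:" and "Productor:" come from the page; the rest follows the conventions described around it, with the "+" append syntax explained just below):

```xml
<RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;/actor&gt;" dest="8+">
    <!-- inner RegExp runs first: copy the actors block of the page into buffer 7 -->
    <RegExp input="$$1" output="\1" dest="7">
        <expression noclean="1">Actores:(.*)Productor:</expression>
    </RegExp>
    <!-- each actor name is whatever sits between > and <, repeated for every match -->
    <expression repeat="yes">>([^&lt;]+)&lt;</expression>
</RegExp>
```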

So, to get the end result, we simply put one after another all the regexp that generate our listing. When regexps are one after another (not nested) they simply execute in order.

Something we haven't used before: when we want the output to append to an existing buffer instead of overwriting it, we simply write, for example, dest="7+"

We generate all the items into the "8" buffer and use the "7" buffer as temporary in each regexp.

For the final version of the scraper, we use some different attributes for <expression>, with these meanings:
 * repeat="yes" -> will repeat the expression as long as there are matches
 * noclean="1" -> will NOT strip HTML tags and special characters from field 1. The field can be 1 ... 9. By default, all fields are "cleaned"
 * trim="1" -> trims white space from field 1. The field can be 1 ... 9
 * clear="yes" -> if there is no match for the expression, dest will be cleared. By default, dest will keep its previous value

NfoUrl
There is a variation of the inner workings of the scraper: when there is an nfo file in the same directory, with the same name as our actual video file, it is examined. It could contain info in a format similar to the "GetDetails" result, in which case the info is directly incorporated into the video database without executing the scraper. Or it could contain just a URL; in that case, each scraper is tested against the URL by calling its NfoUrl function with the URL in $$1 (the actual URL, not its contents!). If there is a match (the dest generated is not empty), then it is assumed that 1) the scraper is the appropriate one for this video file and 2) the correct URL is the one referenced in the nfo file, so neither CreateSearchUrl nor GetSearchResults need to be executed, and GetDetails is called directly with the contents of the URL stored in $$1. One important consideration: the URL found in the movie.nfo file is passed to our NfoUrl function, but the URL that is actually fetched and passed to the GetDetails function is the one our NfoUrl function returns!

This is a possible implementation of NfoUrl:
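A possible sketch: if the URL in $$1 looks like a culturalianet article URL, return it unchanged; otherwise the destination buffer stays empty and this scraper is skipped:

```xml
<NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
        <expression noclean="1">(http://www\.culturalianet\.com/art/ver\.php\?art=[0-9]+)</expression>
    </RegExp>
</NfoUrl>
```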

Final version of the scraper
So, without further ado, this is the whole scraper. The extraction of the different fields is similar to the "actors" field we saw. This is just one of many possible ways of getting the info, and probably not the best one; it is slightly different from the original culturalia.xml scraper. One additional comment: to avoid trouble with some special characters ("é" in "Género", for example) that can get different encodings depending on your text editor and can be difficult to type, I'm using a dot instead, since a non-escaped dot means "any character" in regular expressions.

If you've been trying to view the images as our scraper originally got them, maybe you've already noticed that they do not work; you get an error! That's because culturalianet.com, like some other sites, prevents direct downloading of its images unless the user is actually browsing the site. They detect that via the "referrer" field of the HTTP connection. XBMC wisely allows the use of a parameter for URLs that simulates that "referrer" field: first, we must enclose the URL of <thumb> between <url> tags (we hadn't used them before because in some fields, like <thumb>, they are optional), and we add (between quotes) the referrer as a parameter named "spoof", like this:
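A sketch of the spoofed thumb URL (the image path is invented):

```xml
<thumb>
    <url spoof="http://www.culturalianet.com">http://www.culturalianet.com/imagenes/29405.jpg</url>
</thumb>
```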

I've written the explicit XML for clarity; you can see the actual code at line 41 of the scraper. With that simple spoof="referrer" modification, culturalia gives us our desired image and we can use it as a thumbnail.

Getting info elsewhere
Well, we've actually finished our scraper, and all the information that culturalia provides that can be imported (i.e. has a corresponding field in the XBMC movie info) has been incorporated into the video database. But even so, a lot of fields are still empty, because the needed information is not on the site! What can we do?

Well, we can obtain it from another site, and one of the best is IMDb. There were some reasons to get the info from culturalia: we had the movies with a translated title (for example, "Sleepless in Seattle" was called "Algo para recordar" in Spain), which makes identifying the movie in IMDb somewhat difficult; also, since the movie was dubbed into Spanish, the users probably do not know English, and so a plot taken from IMDb would be incomprehensible to them. But some other fields, like "studio" and "rating", are more or less language-independent, and we could also obtain the actors listing from IMDb instead of culturalia and so fill the field that is absent there.

We must so do some things to accomplish that:


 * We must perform a search for the movie in IMDb and get back the search results.
 * We must select the correct movie URL from the search results.
 * We must download the IMDb URLs.
 * We must extract the info from them and load it into the result tags.

It is very much like a scraper-in-a-scraper thing... The actual download of the URLs we build can be done with custom functions (in a moment we will see how that works). The selection of the correct movie should properly be done by the same procedure as before (we perform a search, parse the results into user-readable form and ask the user to select the movie). Unfortunately, as of this writing, XBMC does not offer the functionality to do that; we can ask the user just once. It could be, though, a little confusing for the user to have to select the correct movie again (and this time with the original title instead of the translated one). So we must select the movie ourselves... Well, since we have the correct spelling (we hope) of the title and the release year, we can be pretty specific (although in some years there have been two or more movies with exactly the same title!).

To call a custom function, in the OUTPUT clause of a regexp we insert a <url> tag with a function attribute. For example, this: <url function="MyFunction">http://www.imdb.com</url> will call the function MyFunction with the contents of the IMDb homepage in $$1

Of course we must use &quot; instead of "... and so the previous line should be written like this:
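Sketched in context (the function name MyFunction and the destination buffers are illustrative):

```xml
<!-- conceptually the output contains: <url function="MyFunction">http://www.imdb.com</url> -->
<RegExp input="$$1" output="&lt;url function=&quot;MyFunction&quot;&gt;http://www.imdb.com&lt;/url&gt;" dest="8">
    <expression/>
</RegExp>

<!-- the custom function itself, called with the fetched page in $$1 -->
<MyFunction dest="8">
    <!-- extract whatever we need from the fetched page; here we just pass it through -->
    <RegExp input="$$1" output="\1" dest="8">
        <expression noclean="1"/>
    </RegExp>
</MyFunction>
```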

Note: the following is still uncertain.

The behaviour of the custom function calling is:


 * Buffers are local.
 * $$1 contains the content fetched from the URL.
 * dest simply states a placeholder for the output, can be any buffer no matter if it is used inside or outside the function, as they are local.
 * The buffers other than 1 are empty by default.
 * If we want, we can use a copy of the contents of all the buffers (except buffer 1) as they were at the point when the function was called. For that, we insert the option clearbuffers="no" in the function definition; any manipulations we do to those contents will not be preserved when the custom function ends and returns to the point after it was called.
 * Whatever we generate as output will substitute the <url function="...">...</url> structure used to call the custom function

Scaling "Rating" Results
Scaling of the rating can be achieved via the max attribute of the <rating> tag, <rating max="foo">, where foo is equal to the maximum rating possible. For example, if scraping a site that has percentage ratings (for example Rotten Tomatoes), <rating max="100"> should be used. This will scale the rating into the normal range for XBMC, which is 0-10; thus a percentage value of 97% would be converted to 9.7. In the same way, a site which uses ratings from 0-5 could use <rating max="5">. You also need to ensure you escape all the relevant characters as usual. The max attribute is not required if the rating is already 1-10, but it will not have any adverse effects if it is included.
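For example (assuming field \1 holds the scraped rating value):

```xml
<!-- a percentage site: 97 is scaled to 9.7 -->
<rating max="100">\1</rating>

<!-- a 0-5 site: 4 is scaled to 8 -->
<rating max="5">\1</rating>
```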