Saturday, December 17, 2011

APIs vs Scraping - Cl@rity & Yperdiavgeia

Typically, there are two main mechanisms to search and retrieve data from a website: either through an Application Programming Interface, commonly known as an API (if one is available), or via screen scraping. The first is better, faster and more reliable. However, a search API is not always available, and even when one exists, it may not fully cover your needs. In such cases, web robots (also called agents) are typically used to simulate a person searching the target website or online database through a web browser, capturing the bits of interest by means of scraping techniques.
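To make the scraping side of this concrete, here is a minimal sketch in Python. The URL and the HTML structure it parses are purely hypothetical, just for illustration of the technique:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target page; a real scraper would point at the actual
# search results page of the site in question.
URL = "http://www.example.org/search?q=decisions"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Capture the "bits of interest": here, every result title and its link
# (the CSS selector is an assumption about the page's markup).
for link in soup.select("div.result a"):
    print(link.get_text(strip=True), "->", link["href"])
```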

An API that has attracted some attention over the last few months in Greece is the Opendata API offered by the "Cl@rity" program ("Διαύγεια" in Greek). Since the 1st of October 2010, all Greek Ministries have been obliged to upload their decisions and expenditure to the Internet through the Cl@rity program. Cl@rity is one of the major transparency initiatives of the Ministry of Interior, Decentralization and e-Government. Each uploaded document is digitally signed and automatically assigned a unique transaction number by the system.
The Opendata API offers a variety of search parameters such as organization, type, tag (subject), ada (the unique number assigned), signer and date. However, a lot of parameters and functionality are still missing, such as full-text search and the ability to search by criteria like the beneficiary's name, VAT registration number (ΑΦΜ in Greek), document title and other metadata fields.
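As a rough illustration of how such a query might look, here is a sketch in Python. The endpoint, parameter names and response schema below are assumptions made for the example; the exact URLs and formats are defined in the official Opendata API documentation:

```python
import requests

# Hypothetical base URL, for illustration only; consult the official
# Opendata API documentation for the real endpoints.
BASE_URL = "http://opendata.diavgeia.gov.gr/api/decisions"

params = {
    "org": "YPES",            # organization identifier (assumed value)
    "type": "B.1.3",          # decision type (assumed value)
    "from_date": "2011-12-01",
    "to_date": "2011-12-17",
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# Assumed JSON layout: a list of decisions, each with an "ada" and "subject".
for decision in response.json().get("decisions", []):
    print(decision.get("ada"), "-", decision.get("subject"))
```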

A remarkable alternative for searching effectively through the documents of Greek public organizations is yperdiavgeia.gr ("ΥπερΔιαύγεια" in Greek), a web-based platform built by Vangelis Banos, an expert in digital libraries and institutional repositories. Yperdiavgeia is a mirror of Cl@rity that is updated on a daily basis, and it provides a powerful and robust OpenSearch API which is far more usable and easier to harness. Its great advantage is that it supports full-text searching. It currently lacks support for some parameters, but these seem likely to be added soon, as the platform is under active development.
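Since OpenSearch results are typically served as RSS/Atom feeds, a full-text query can be consumed with a standard feed parser. The URL template below is an assumption for illustration; the real template is advertised in the site's OpenSearch description document:

```python
import feedparser
from urllib.parse import quote

# Full-text query, e.g. "fuel procurement" in Greek.
query = "προμήθεια καυσίμων"

# Hypothetical OpenSearch URL template for yperdiavgeia.gr.
url = f"http://www.yperdiavgeia.gr/search/rss?q={quote(query)}"

feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.title, "->", entry.link)
```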
Even though both APIs mentioned above are really remarkable (especially for communicating and exchanging data with third-party programs), there is still some room for utilizing scraping techniques and coming up with some "magic". In a previous post we described in detail an application we developed mainly for downloading a user-specified number of the latest PDF documents that a specific organization has uploaded to Cl@rity. We believe that this little utility (offering both a GUI and a command-line version) can be quite useful for many people working in the public sector and potentially save a lot of time and effort. For further information about it, please check out that post (although it is written in Greek).
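The utility itself is described in that earlier post; what follows is merely a rough sketch of the underlying idea. The listing URL, the assumption that documents are linked newest-first, and the assumption that the PDF links are absolute are all hypothetical simplifications:

```python
import os
import requests
from bs4 import BeautifulSoup

def download_latest_pdfs(listing_url, n, dest="pdfs"):
    """Scrape a (hypothetical) organization listing page and fetch
    the n most recent PDF documents linked from it."""
    os.makedirs(dest, exist_ok=True)
    page = requests.get(listing_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    # Assume the page lists documents newest-first as absolute
    # <a href="...pdf"> links.
    pdf_links = [a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].lower().endswith(".pdf")][:n]

    for href in pdf_links:
        filename = os.path.join(dest, href.rsplit("/", 1)[-1])
        with open(filename, "wb") as f:
            f.write(requests.get(href, timeout=60).content)
        print("saved", filename)

# Example call (hypothetical URL):
# download_latest_pdfs("http://www.example.org/org/decisions", 10)
```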

So, in this short post we just wanted to point out that there are quite a lot of great APIs out there, provided mostly by large organizations (e.g., firms, governments, cultural institutions and digital libraries/collections) as well as the major players of the IT industry such as Google, Amazon, etc., offering amazing features and functionality. Nevertheless, scraping the native web interface of a target site can still be useful and can sometimes yield a solution that overcomes the difficulties and/or inefficiencies of APIs and produces an innovative outcome. Moreover, numerous websites do not offer an API at all, so a scraper may be the only option when data needs to be searched, gathered or exported. Therefore, the "battle" between APIs and scraping still rages on... and we are eager to see how things will evolve. Truth be told, we love them both!
