
Deadlines:

  • 6 May: outline of sections to be fleshed out (IER to submit to UKCES on 8 May)
  • 15 May: section to be completed (IER to incorporate into the report and submit to UKCES on 22 May)

Outline of technical team section

The following is only a suggestion and can be changed. UKCES have requested that we include technical developments in phase 2a, technical issues (including server space) and implications for phase 2b. I will leave it to you to refine!

3. Accessibility and open data: Phase 2a Technical developments

This section outlines the technical developments undertaken during phase 2a of the project, highlights technical issues encountered and the solutions found, and discusses the implications for phase 2b of the project.

3.1. Activity in phase 2a

Aims

The overall aim of technical development in phase 2a of the project was threefold. The first was the development of a secure and robust infrastructure for the data, providing a secure environment which could be extended in future as new data becomes available and as the number of end users increases.

The second was the provision of modern and flexible tools to allow the database to be queried by external users.

The third was the provision of tools and spaces for documenting the process and allowing public access to the project.

Main outcome

The main outcome of the technical development under Phase 2A of the project is a data tool that can be released on the open market and which:

  • Provides the key areas of data individuals need to make decisions about learning, skills and careers, building on the data tool that was developed during the pilot phase.
  • Provides data that has been quality-assured in terms of statistical quality and confidentiality.
  • Includes data that has been cleansed and standardised.
  • Has been tested with developers and further iterations made as required.

Access to the data tool is at xxx

Platform and Database

The original pilot project was developed on a MySQL database. Although MySQL is a powerful environment, we anticipated that a more robust, industrial-strength database would be required for phases 2a and 2b of the project, especially if more automated processes for updating the data were to be developed. The main choice was between Microsoft SQL Server and Oracle Database. The decision to install Microsoft SQL Server was taken on the grounds of flexibility and price.

To host the database environment, we have installed a dedicated Windows Server 2012 machine with MS SQL Server 2012 Standard Edition. Consideration was given to where and how to host the server, and it was decided to host it with Dediserve Ltd on a cloud-based server that can be extended when needed. Cloud computing is the use of computing resources (hardware and software) that are available at a remote location and accessible over a network. It allows organisations to get their applications up and running faster, with improved manageability and less maintenance, and enables IT to adjust resources rapidly to meet fluctuating and unpredictable demand. This flexibility is particularly important for two reasons. First, it is difficult to predict the physical size of the database, given that identifying suitable data sources is itself a major topic being researched through the project; the environment allows us to add resources dynamically and to create new servers or expand existing ones. Second, it is very hard to estimate future demand on the server from external users, and a cloud solution makes it relatively easy to add capacity at short notice. We were also concerned to select a provider offering a secure solution with managed services for server maintenance and security.

At the moment we run three separate servers:

1. data.lmiforall.org.uk. This server holds the database environment. It runs Windows Server 2012 and MS SQL Server 2012 Standard Edition.

2. api.lmiforall.org.uk. This server holds the API layer. It runs Ubuntu Linux LTS.

3. collab.lmiforall.org.uk. This server holds the collaboration environment and the public website. It runs Ubuntu Linux LTS.

The separation of the different servers allows us to run dedicated environments and systems for different platforms and usages. It also provides greater security, as the database server can only be accessed through the API.

Issues and solutions

Three main issues emerged during the development process.

The first is that we underestimated the size of the data in the original specification. This was overcome through adding extra storage capacity to the server.

The second was the speed of querying the database. Due to the amount of data and the number of different databases, initial queries were sometimes slow. We have overcome this by improving the automatic indexing of data to achieve significantly faster searching; however, this also required more storage capacity.

The third issue, which resulted in discussion with the project development team, was the trade-off between two different ways of developing and querying the database for the wage data, which is derived through an iterative process and so could not be calculated entirely on the fly.

The first was to prepare dedicated datasets for the particular aspects of querying that would be facilitated through the project data tools. The second was to run the queries algorithmically and on the fly, i.e. in real time. The first method would provide a simpler database structure but would require significantly larger datasets. The second, by performing live filtering and aggregation of data, would reduce the database size and potentially result in quicker returns.

Although the second option was attractive, it could have introduced considerable stability issues in the API, especially under a high rate of simultaneous hits.

However, it proved possible to reduce the number of files, and thus the size of and overhead on the server, and to calculate a number of variables on the fly in the API. These include estimates of pay by age and estimates of replacement demand.
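
To illustrate the approach, the sketch below shows how an estimate might be derived on the fly inside the API layer rather than being stored as a dedicated pre-computed table. The figures, field names and breakdown are invented for illustration; the actual calculations for pay by age and replacement demand are more involved.

  // Sketch only: deriving an estimate "on the fly" in the API instead of
  // storing a dedicated pre-computed table. Figures and names are invented.
  case class PayRow(soc: Int, ageBand: String, meanPay: Double, weight: Double)

  object OnTheFlyEstimate {
    // Rows as they might be returned from the production namespace.
    val rows = Seq(
      PayRow(2136, "16-24", 24000, 0.2),
      PayRow(2136, "25-49", 38000, 0.6),
      PayRow(2136, "50-64", 41000, 0.2)
    )

    // Weighted mean pay for one SOC code, aggregated at request time.
    def estimatePay(soc: Int): Option[Double] = {
      val matching = rows.filter(_.soc == soc)
      if (matching.isEmpty) None
      else Some(matching.map(r => r.meanPay * r.weight).sum / matching.map(_.weight).sum)
    }

    def main(args: Array[String]): Unit =
      println(estimatePay(2136)) // approximately Some(35800.0)
  }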

ETLs

Despite the Government's commitment to open data, there remain major challenges in developing applications based on this data. Although openly published, the data are often in formats that are not suited to being read automatically by machines. Furthermore, the data can require significant cleaning or manipulation before being integrated into a database. In addition, for data that are frequently updated, for instance labour market information, the formats often change between releases.

This can require significant work in preparing data and can also delay updates. The technical solution is to develop Extract, Transform and Load (ETL) processes for database usage: extracting data from outside sources; transforming it to fit operational needs, including quality requirements; and loading it into the end database.

MS SQL Server provides a professional database environment that allows for the implementation of formal, automated ETL processes. We have implemented ETL processes for all current data sources that minimise manual tasks in the import and integration process.

The use of namespaces is important for the technical environment. Namespaces are separate sections within the database. We have introduced two separate namespaces in the MS SQL database environment: ‘raw’ and ‘production’. The ‘raw’ namespace holds tables that store the raw imported data from the various data sources. The ‘production’ namespace holds the integrated production tables.

There are scripts in place that automate the ETL process for each data source. These scripts:

a. Import the raw data from the import stream to the ‘raw’ namespace.

b. Transform and clean the data

c. Integrate the data in the ‘production’ namespace.

Each table in ‘production’ has been properly indexed to improve performance and supply the API layer with quick results.
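
The following sketch illustrates the shape of such an ETL process. It is written in Scala with plain JDBC purely for illustration (the real scripts run inside MS SQL Server), and the table names, column names and file path are hypothetical.

  // Sketch only: the real ETL scripts run inside MS SQL Server. Table and
  // column names are hypothetical; requires the Microsoft JDBC driver.
  import java.sql.DriverManager

  object EtlSketch {
    def main(args: Array[String]): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:sqlserver://data.lmiforall.org.uk;databaseName=lmi", "etl_user", "secret")
      val stmt = conn.createStatement()
      try {
        // a. Import the raw data into the 'raw' namespace.
        stmt.execute("BULK INSERT [raw].[ashe_pay] FROM 'C:\\imports\\ashe_pay.csv' " +
          "WITH (FIRSTROW = 2, FIELDTERMINATOR = ',')")
        // b. Transform and clean: drop rows failing basic validity checks.
        stmt.execute("DELETE FROM [raw].[ashe_pay] WHERE soc IS NULL OR mean_pay < 0")
        // c. Integrate the cleaned data into the 'production' namespace.
        stmt.execute("INSERT INTO [production].[pay] (soc, age_band, mean_pay) " +
          "SELECT soc, age_band, mean_pay FROM [raw].[ashe_pay]")
        // Index the production table so the API layer gets quick results.
        stmt.execute("CREATE INDEX ix_pay_soc ON [production].[pay] (soc)")
      } finally { stmt.close(); conn.close() }
    }
  }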

Issues and solutions

Although we have successfully developed ETL processes, these only minimise the manual tasks involved in the import and integration process. While data continue to be issued in different formats, and because of non-disclosure requirements, considerable human overhead remains in the preparation of data. However, the present ETLs represent a large step forward from the manual processes used in stage 1 of the project, and there remains scope to automate data management processes further.

Data security and data disclosure

The database server cannot be accessed directly through the internet. Data can only be queried through the API layer, which itself sets strict rules as to what data can be accessed. This alleviates concerns about disclosure. The API layer is only able to access the ‘production’ namespace and has no access to the ‘raw’ namespace.

Issues and solutions

We had considerable discussion around this issue in the early stages of the project, with the data team and technical team working together to find robust solutions. The issue has been overcome through the adoption of two namespaces and strict API querying restrictions.

Wiki for tracking project development

A wiki has been developed for tracking the different stages of project development. Whilst not a public-facing space, the wiki is critical to the project. Different datasets are made available at different points throughout the project. These datasets may be new or may be updates of previous releases. They will have various attached dimensions, values and coding, and are often too large for manual inspection. It is therefore critical for quality control and process control that each dataset is comprehensively described and documented. The wiki also serves for process control, with the documentation being signed off by the technical team before a dataset is downloaded and installed in the database.

The wiki also provides technical documentation and access for application developers, and was used on the hack day and the modding day. It offers a flexible and collaborative working environment, with a managed permissions environment (i.e. we can limit access to particular pages or to the wiki as a whole) which can be updated rapidly.

Issues and solutions

Prior to the development of the wiki, which is based on the open source DokuWiki platform, we trialled a number of different collaborative tools and environments. The challenge was to find a tool that offers sufficient functionality for collaborative working while requiring only a limited learning curve for non-technical partners from the data team. The wiki appears to achieve this compromise.

It took some time to develop a shared understanding of the standards and form of documentation between members of the data team and the technical team. A template was produced and subsequently amended iteratively. We are now confident that the process and documentation are robust and offer a quality-standards approach to project collaboration.

Improved API Reliability

The Application Programming Interface (API) developed for the project allows external queries to be made to the database. The results of these queries – for example, employment levels or wage rates associated with a particular occupation – can then be displayed in a web page or external application.

The LMI for All API has been rewritten from the ground up (compared to phase 1) to offer much greater robustness and performance. It is now compatible with a variety of industry-standard server containers – that is, it can be deployed in any standard Java servlet container, such as Jetty or Tomcat – allowing for more flexibility in deployment and management.

We have developed a dashboard which allows us to monitor the performance of the database in real time and identify particular bottlenecks. It also monitors the performance of the API in returning queries. Automated monitoring and statistics have been deployed to detect problems arising from usage before they become critical.
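
A minimal sketch of an automated watcher of this kind is shown below. The probe URL, thresholds and alerting mechanism are illustrative assumptions rather than the actual monitoring configuration.

  // Sketch only: an automated watcher that polls the API and raises an alert
  // if the probe fails or responds too slowly. URL and thresholds are invented.
  import scala.io.Source
  import scala.util.Try

  object ApiWatcher {
    val probeUrl = "http://api.lmiforall.org.uk/api/v1/soc/search?q=nurse" // hypothetical path

    def probe(): Boolean = Try {
      val started = System.currentTimeMillis()
      Source.fromURL(probeUrl).mkString
      System.currentTimeMillis() - started < 5000 // treat responses slower than 5s as failures
    }.getOrElse(false)

    def alert(message: String): Unit =
      // Placeholder: in production this would email the tech team.
      println(s"ALERT: $message")

    def main(args: Array[String]): Unit =
      while (true) {
        if (!probe()) alert(s"API probe failed at ${java.time.Instant.now()}")
        Thread.sleep(60000) // sample once per minute
      }
  }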

Issues and solutions

The previous version of the API was written in PHP, a widely used general-purpose language for web applications. For phase 2, we focused on key qualities of the API such as speed, reliability, robustness and ease of deployment. This was facilitated by changing the programming language and framework to Scala and Scalatra, which are intended for precisely this kind of scenario. The previous version of the API also had almost no monitoring built in. This was acceptable for a prototype intended mainly for modding days and hackers, but the phase 2 API will be facing the public, so automated monitoring has become a necessity to detect critical performance problems early and prevent service outages. System performance is now sampled in real time and visualised on a system status dashboard. Automated watchers detect critical errors and outages and alert the tech team by email if service is disrupted.
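
For illustration, the fragment below shows the general shape of a Scalatra route of the kind used in the rewritten API. The endpoint name, parameter and response are hypothetical and are not taken from the LMI for All source code.

  // Shape of a Scalatra route (hypothetical endpoint; not the actual source).
  // A ScalatraServlet can be deployed in any standard servlet container.
  import org.scalatra.ScalatraServlet

  class PayServlet extends ScalatraServlet {

    // GET /pay?soc=2136 returns a small JSON document for that SOC code.
    get("/pay") {
      contentType = "application/json"
      params.get("soc") match {
        case Some(soc) =>
          // The real API would query the production namespace via the data layer.
          s"""{"soc": $soc, "estimatedPay": 35800}"""
        case None =>
          halt(400, """{"error": "missing soc parameter"}""")
      }
    }
  }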

Improved API Documentation and Client Code Generation

The new API has been designed to be developer-friendly: it allows developers to easily explore the functionality implemented in the API and to make test calls on the fly. Many APIs are documented by hand. This can be a laborious process and requires considerable experimentation before applications can be tested and developed. The new API automatically generates documentation from the running API code. This means that the documentation and the actual functionality of the API are always in sync and up to date. Since the documentation is generated automatically, it follows a structure that is also machine-readable. Software can interpret this documentation using automated processes and generate human-readable documents, exploratory applications, or even client code.

Having a machine-discoverable API makes it very simple to automatically generate client code for a variety of programming platforms. These API client libraries significantly reduce the time it takes developers to use the API and implement the various methods it supports. They provide code for different programming languages to communicate correctly with the API layer. API client libraries are automatically generated for the LMI for All API in many programming languages (most notably Python, Ruby, JavaScript, Java/Scala, Objective-C and PHP). An API discovery webapp is provided to developers, which automatically explores the API and generates sample requests for data.
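
As an illustration of how a generated client library reduces developer effort, the sketch below shows what using such a library might look like. The class, method and field names are invented; a stub implementation stands in for the generated HTTP and JSON handling so that the example is self-contained.

  // Hypothetical shape of an auto-generated client library; a stub stands in
  // for the generated HTTP and JSON handling so the example is self-contained.
  case class SocMatch(soc: Int, title: String)

  class LmiForAllClient(baseUrl: String) {
    // A generated client would issue the HTTP request and decode the JSON.
    def socSearch(query: String): Seq[SocMatch] = Seq(SocMatch(2231, "Nurses"))
  }

  object GeneratedClientExample {
    def main(args: Array[String]): Unit = {
      val api = new LmiForAllClient("http://api.lmiforall.org.uk")
      api.socSearch("nurse").foreach(m => println(s"${m.soc}: ${m.title}"))
    }
  }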

Issues and solutions

The phase 1 version of the API also had auto-generated documentation, but it was less detailed and not machine-readable, since the source for the auto-generation was essentially a collection of hand-typed text snippets. The phase 2 documentation generator analyses the structure of the API itself and thus provides strict and detailed documentation (enhanced with human-written comments and explanations) in machine-readable form. This machine-readable form can be explored by developers using the provided API Explorer, which was well received by developers on the hack and modding days.

Tracking and API keys

Functionality has been developed to track API usage across websites and to authorise API access via per-entity access keys. This functionality is currently disabled, however, pending server upgrades and discussions with stakeholders about probable usage and potential cost-sharing agreements with high-volume users of the API.

Issues and solutions

It is common practice to limit the rate of access to public APIs to avoid overloading the server. On the hack day, we already had one team of developers essentially downloading a whole dataset through bulk queries – which is fine if you are relatively alone on the server, but that can hardly be guaranteed in the case of a public release.

The envisaged access model for the API is that queries will be limited to a rate of 30 queries per 60 seconds per IP address (i.e. per individual end user). Institutions with higher query volumes will need to apply for an API key, which raises the rate based on the number of IP addresses / computers the institution has deployed. The API could also track the originating websites that display LMI to their end users and limit rates based on this information. If a single website uses up a great volume of API query capacity (say, a large careers counselling service), this could be limited until the website agrees to a cost-sharing plan.
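
A simplified sketch of the envisaged per-IP limit (30 queries per 60 seconds) is shown below. It uses an in-memory sliding window for illustration only; a production implementation would need to share state across server processes and expire stale entries.

  // Simplified sketch of per-IP rate limiting (30 queries per 60 seconds)
  // using an in-memory sliding window.
  import scala.collection.mutable

  object RateLimiter {
    private val limit = 30
    private val windowMillis = 60000L
    private val hits = mutable.Map.empty[String, List[Long]]

    // Returns true if this request from `ip` is allowed, false if it should be rejected.
    def allow(ip: String): Boolean = synchronized {
      val now = System.currentTimeMillis()
      val recent = hits.getOrElse(ip, Nil).filter(now - _ < windowMillis)
      if (recent.size >= limit) { hits(ip) = recent; false }
      else { hits(ip) = now :: recent; true }
    }
  }

  // Usage inside a route: if (!RateLimiter.allow(request.getRemoteAddr)) halt(429)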

However, none of these methods is currently active. Keeping track of IPs and originating websites requires memory capacity that the phase 1 servers do not currently have. We recommend that for phase 2 (and after providing enough memory), a position on limits and cost sharing be agreed, depending on the volume of queries the API receives in practice. It is difficult to predict the query volume on a fully released public API, so the best way forward is a “ramp-up” model, in which system capacity is scaled along with usage.

Web site

We have developed a lightweight web site at …..

The present site is designed to provide access to information about the project and the data tools, both for careers advisers in different organisations and for application and website developers. The website uses WordPress as a content management system. This is a simple-to-use system with a strong permissions environment, and should allow both members of the data team and UKCES to edit the site if they wish, with minimal training needs.

Issues and solutions

One issue which has been discussed is website branding. The decision was that the site would include the UKCES logo but with light branding at present.

We also held some discussion over the aims of the website, its target groups and its content. Once more it was agreed that the site should be lightweight, but that we would return to this issue after the release of the data tools (see implications for phase 2b below).

One objective under Phase 2A of the project was to improve the search functionality on the database. We suggested three different ways in which this might be done:

  1. providing a semi-“naturalised” query interface (think of the Google search box or Wolfram Alpha),
  2. developing a set of pre-made parameterised queries for high-interest topics (query macros),
  3. in the case of vacancies information, improvements to data cleanliness, geo-tagging and full-text search.

We commented that it was quite unlikely that all three could be completed in time, so further discussion of priorities with UKCES would be necessary.

In the event, the selection of which approach to follow was largely determined on objective grounds. With the launch of the Universal Jobmatch service, which provides access through an API but is limited by a proprietary occupational coding system, there was little we could do to improve data cleanliness or geo-tagging for this source of data.

Equally, until the release of the API and database we do not know what the high-interest topics will be.

Therefore we have worked on the development of a semi-naturalised query interface. This would provide extra value, since many end users are already familiar with interfaces like Google. However, labour market information is insufficiently keyworded to yield many useful matches, and as such a working query interface has not yet been completed.

Issues and solutions

Improvements have been made on searching for SOC codes. Since we now have detailed textual explanations of the requirements, tasks and other descriptions for every SOC code, it is comparatively easy to run fuzzy text searches over this data, yielding better matches than the phase 1 SOC code search.

However, the other data is not so explicitly described, so breaking down a textual query into its constituent LMI dataset parts is not easy. A list of keywords, conjunctions and syntactical positions will have to be developed that maps query terms more or less accurately to actual queries on the data.
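
The sketch below illustrates the general idea of a fuzzy SOC search: each SOC code's textual description is scored by how many query terms it contains, and the best matches are returned. The codes and descriptions shown are illustrative placeholders, and a real implementation would weight titles, tasks and qualifications text differently and use better similarity measures.

  // Sketch of a fuzzy SOC search: score each description by how many query
  // terms it contains. Codes and descriptions are illustrative placeholders.
  object SocSearch {
    case class Soc(code: Int, description: String)

    val socs = Seq(
      Soc(2231, "nurses provide general and specialised nursing care for patients"),
      Soc(2136, "programmers and software development professionals design and develop software"),
      Soc(6125, "teaching assistants support teachers in the classroom")
    )

    def search(query: String, top: Int = 3): Seq[(Soc, Int)] = {
      val terms = query.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet
      socs.map { s =>
        val words = s.description.toLowerCase.split("\\W+").toSet
        (s, terms.count(words.contains)) // naive overlap score
      }.filter(_._2 > 0).sortBy(-_._2).take(top)
    }

    def main(args: Array[String]): Unit =
      println(search("software development jobs")) // SOC 2136 scores highest
  }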

Terms and conditions of service

The process of developing terms and conditions of service for the API has been somewhat complicated. This work was to be undertaken jointly between the technical development team and UKCES. There was agreement that we wished to release the tools under a licence similar to the Open Government Licence. However, since we are not the original owners of the data, we have had to seek advice on the best way of doing this.

FAQ

We have developed an initial draft of Frequently Asked Questions. This has been based on questions we have been asked by both technical developers and providers of careers advice.

In phase 2B of the project we may need to provide a more interactive means of developing the FAQ including the provision of a forum.

3.2. Implications for phase 2b

Server infrastructure

One big issue is the server and technical infrastructure required for phase 2b of the project. This is problematic for a number of reasons, and it should be noted that in phase 2a of the project we underestimated this factor.

Most web-based technical services start with a closed system before releasing a closed beta and then a more open beta, prior to open release. There are good reasons for this. One is that it allows systems to be debugged. It also allows feedback on features, design and functionality. Critically, it also allows the system to be tested under real user conditions, albeit with a controlled number of users, so that bottlenecks can be identified and systems optimised for speed and reliability. We do not have this luxury.

We have offered the API for use from the start of phase 2b of the project, and it is quite possible that the early adopters will be national careers organisations, which will bring considerable numbers of users. The advanced dashboard is of considerable assistance here, allowing us to monitor the performance of different components of the system in real time.

One way in which start-up companies overcome this issue is simply to add extra server capacity as it is needed, and the use of cloud computing greatly helps in this process. For such ventures, excess demand is often seen as a good thing: they are frequently running at a loss but are able to raise more capital if they are seen to be attracting a considerable user base.

We do not have this luxury, in that LMI for All is a public service supported by public – but limited – funding. We have therefore attempted to develop a series of scenarios to estimate the infrastructure required and to provide costings for the different phases of 2b.

  Server                             | Low                    | Medium         | High
  Public + team web and collab      | Shared light           | Shared light   | Shared light
  Development API server (Linux)    | Shared with production | Light server   | Light server
  Development DB server (Windows)   | Shared with production | Medium server  | Medium server
  Production API server (Linux)     | Light server           | Medium server  | Medium server
  Production DB server (Windows)    | Medium server          | Heavy server   | Heavy server
                                     |                        |                | Heavy server
  SLA                               | 2x                     | 2x             | 3x
  Licences: Windows 2012            | 1                      | 2              | 3
  Licences: SQL Server Standard     | 1                      | 2              | 3
  Pricing (GBP/month ex. VAT)       | 600                    | 920            | 1400

We have estimated the different requirements over time as follows:

  • June - November 2013: LOW
  • December 2013 - June 2014: MEDIUM, to coincide with the release of data tool 2
  • August 2014 - April 2015: HIGH

Access to the API is tracked and we are able to monitor the number of hits from different applications and organisations.

This allows us to build in rate limiting, meaning that we can limit the number of queries from any one source over a given time period to ensure the database does not become overloaded. In particular the use of rate limiting allows us to block bulk querying (see also: Tracking and API Keys, above).

Without accurate data on the likely number of queries it is difficult to advocate any one approach; the estimates provided are as good as can be given at the moment. It is also quite likely that server costs will continue to fall over the next two years, easing resource pressures. The most important step is to monitor usage closely, especially when large careers services come online, and to review performance and infrastructure needs regularly.

Linked data and SPARQL

In the current setup the API Layer acts as a firewall between the data users and the actual database. Data access is restricted to the queries that are run by the API layer. There is no direct data access allowed.

At present the API returns results as JSON, a text-based open standard designed for human-readable data interchange. It is language-independent, with parsers available for many languages, and is an industry standard familiar to most if not all developers. The JSON format is often used for serialising and transmitting structured data over a network connection, primarily between a server and a web application.
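
As an illustration of the interchange format, a request to the API and a possible JSON response might look like the following (the path and field names are illustrative only and may differ from the actual API):

  GET http://api.lmiforall.org.uk/api/v1/soc/search?q=nurse

  [
    { "soc": 2231, "title": "Nurses" },
    { "soc": 2232, "title": "Midwives" }
  ]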

Some developers have asked us to provide enhanced access to the database, and in future we may want to look at the possibility of providing SPARQL or RDF access to a dedicated data mart that contains only guaranteed non-disclosing information. SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases able to retrieve and manipulate data stored in Resource Description Framework format. SPARQL is favoured by linked data proponents as it allows advanced queries and the ability to query across different datasets.

It is at least technically feasible to provide most or all of the LMI for All data via the SPARQL query language. However, this is a non-trivial amount of work and would probably require additional developer capacity in the tech team. It seems prudent first to gauge the likely usage of SPARQL queries on the API before coming to a decision on this topic. SPARQL is not widely used outside the open/linked data communities and, as a very general impression, most 'regular' developers seem fairly ambivalent about it.

We also have to consider whether most of our early adopters and users of the service will come from the open data movement or will be developers offering more standard services for careers websites. Furthermore, we may be able to offer better access to linked data by extending the present API and improving indexing, rather than by investing heavily in SPARQL development.

Loosely linked data

The design of the LMI for All project was based on data integration through the SOC 4 occupational classification system. However, it became clear in phase 2a that some data were not available at SOC 4 level, or would become disclosive at this level of disaggregation. For that reason we have used various statistical and technical approaches to provide data at a higher level of aggregation and so overcome these issues.

We have also held discussions with a number of organisations which have their own APIs to data but do not use the SOC 4 classification system. In some cases this data is seen by developers and end users as extremely important in terms of linked data. One example is the Universal Jobmatch service, which provides access to employment opportunities and for which there is also an API. However, the classification system used does not match SOC 4 and appears to be a proprietary standard.

One answer to this is to use a technique called approximate string matching (often referred to colloquially as fuzzy string searching). This technique is based on finding strings that match a pattern approximately rather than exactly.
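
The sketch below shows the idea using the classic Levenshtein edit distance to map an externally supplied job title to the nearest SOC title. The titles are illustrative, and a real implementation would normalise the text and combine several similarity measures.

  // Sketch of approximate string matching: map an external job title to the
  // nearest SOC title by Levenshtein edit distance. Titles are illustrative.
  object FuzzyMatch {
    // Classic dynamic-programming edit distance.
    def distance(a: String, b: String): Int = {
      val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
        if (i == 0) j else if (j == 0) i else 0
      }
      for (i <- 1 to a.length; j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
      }
      d(a.length)(b.length)
    }

    val socTitles = Map(2231 -> "nurses", 6125 -> "teaching assistants")

    def nearestSoc(externalTitle: String): Int =
      socTitles.minBy { case (_, title) => distance(externalTitle.toLowerCase, title) }._1

    def main(args: Array[String]): Unit =
      println(nearestSoc("Nurse")) // 2231
  }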

The downside of this approach is that it is obviously not as accurate as matches based on the SOC 4 classification system. However, it allows us to access a wider range of data than would otherwise be possible and, as long as its limitations are understood by developers, it could provide much more data for innovative applications.

Web site and app store

The present website has been developed to provide ready access to the outcomes of phase 2a. At present it targets two main groups of end users – careers professionals who may be interested in using the API as a service for their clients, and technical developers interested in building web or mobile applications based on the API. The hack and modding days gave us a view of the wide range of potential applications, as well as of the potential use of LMI for All by many different end user groups.

There are a number of open issues here.

The first is to what extent we wish to drive application development.

The second is to what extent we provide assistance and help for both careers professionals and application developers.

These lead to the question of how best to develop the website, for instance through highlighting instances of effective or innovative use, and of how far we can encourage and support developers and users in sharing ideas and applications. One way to do this might be through our own application (or app) store, allowing easy access to highlighted examples of implementations of the project.

On the one hand, the development and maintenance of a vibrant website with support services for users would promote uptake; on the other hand, this would require resources which might be better spent on further data and technical development.

Database integration and architectures

To a considerable extent we anticipated some of the difficulties and limitations that we would encounter with the database. However, within the timescale and resourcing of phase 2a of the project we were unable to develop the more complex and efficient data marts that we consider to represent the best architecture for LMI for All.

However, we have been able to upgrade to MS SQL Server, which provides a professional database environment allowing the implementation of formal automated ETL processes, and we have implemented ETL processes for all current data sources that reduce manual tasks in the import and integration process.

We have also prepared the environment to allow for the introduction of separate datamarts for various functional needs. At the moment we maintain a single data mart in the form of the current ‘production’ namespace.

In phase 2b we propose to move to a more advanced infrastructure:

  • Data Integration Layer – data are imported from the external data sources into the raw data storage section in the Data Integration Layer. The ETL processes used for this initial input have basic data cleaning tasks built in, so the resulting data in the raw data storage section are error-free and ready for use. From the raw data storage section, a separate set of ETLs combines the data in a single staging area; in this step the data are re-indexed (and where necessary recalculated) to a common set of dimensions.
  • Data Marts – This layer holds a number of multidimensional data cubes or relational databases that hold precalculated query results. Depending on the measures and dimensions that are needed to feed the data requested through the API there may be a number of separate marts defined in this layer.
  • Data Access – This layer holds the API and its interface to the available Data Marts. This would also be the access point for SPARQL or RDF if a decision was taken to develop this. External applications can access the pre-calculated data in the data marts by using strictly defined API calls to the Data Access layer.

The data marts should greatly improve the efficiency of the database and overcome some of the issues encountered in phase 2a. However, as we noted in the original project application, the Open Data White Paper (HM Government, 2012) highlights how data gathered by the public sector is not always readily accessible. Quality of the data, intermittent publication and a lack of common standards are also barriers. A commitment is given to change the culture of organisations in order to bring about change: ‘This must change and one of the barriers to change is cultural’ (p. 18). As part of this process, the Government intends to adopt the Five Star Scheme as a measure of the usability of its Open Data (p. 24).

Most data providers have been keen to collaborate with us. However, access arrangements and data formats remain varied, and this will affect the extent to which we can automate ETL processes.
