About this Web Application

This interactive web application exposes information retrieved from the analysis of the Europass CV collection from June to September 2019. In particular, after data collection 353,518 CVs were considered "qualified" for statistical analysis. The information gathered was in structured and free-text form and is standardized using custom or standard (ESCO) ontologies. It is organised in four main pillars: Demographics, Occupations, Skills/Competences and Qualifications. The user can explore these pillars, retrieve statistics for the free text, and inspect and download the full standardized or raw anonymized datasets through informative codebooks.

Methodology

We followed a reproducible approach based on literate programming to support the data flow architecture (Xie 2015; Boettiger 2015). The statistical analysis of the received CVs follows a strictly ordered data flow in which data are transformed from a raw dataset into statistical information and graphics. Each step involves multiple methods and processes of data cleansing, data wrangling and information retrieval (see the following info-graphic). The tidy data standard is used to facilitate exploration and analysis of the data (Wickham 2014), and the visualization pipeline is based on Wilkinson's (2010) grammar of graphics. Parts of the analysis are inspired by Cedefop's research project on online job vacancies and skills analysis (Cedefop 2019).

Data Wrangling of Collected CVs

The initial data are stored in JSON format. Data wrangling processes following a split-apply-combine strategy (Wickham 2011) generate a set of meaningful datasets that can be used for statistical analysis, visualizations and machine learning applications. These datasets are the "backbone" of the data model on which all subsequent data flow relies. For computational efficiency the JSON data are parsed in batches and the relevant fields are extracted. For each batch a tabular dataset is computed and saved in binary form (a minimal sketch of this batch parsing and the deduplication step follows the list below). After a phase of exploratory data analysis, the initial information extraction consists of the following steps,

  • Deduplication: The process decompresses the JSON CVs and removes duplicated data by keeping the latest CV commit per unique email. Out of the initial 714,804 CVs, 392,812 were identified as unique based on their email address. Among those, 353,518 qualified for statistical analysis based on the degree of their completion.

  • Demographics-like data collection: The following fields were identified as meaningful variables concerning the unique features of each respondent.

    1. Locale

    2. Creation Date

    3. Last Update Date

    4. Address Contact: Country

    5. Address Contact: PostalCode

    6. Birthdate

    7. Sex

    8. Nationality

    9. Job Applied For

  • Work Experience data collection: Information relevant to work experience is stored in four variables. Each row of the resulting dataset corresponds to a single work experience entry in a user's CV.

    1. Recruitment Date

    2. Termination Date

    3. Job Position (categorical or free text)

    4. Employer's Name

  • Qualification data collection: Most variables related to qualifications consist of free text, and the following variables are identified as key features describing the education level of a respondent,

    1. Title

    2. Enrollment Date

    3. Graduation Date

    4. Organisation Name (free text)

    5. Organisation Address: Country

  • Skills data collection: The following fields are identified as meaningful variables concerning the skills of the respondents. All measured features of this dataset can be either categorical or free text,

    1. Mother Tongue

    2. Foreign Language

    3. Communication skills

    4. Organisational / managerial skills

    5. Computer skills

    6. Job-related skills

    7. Driving license

The four resulting datasets are linked through a unique code attached to each CV submission.
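
A minimal sketch of the batch parsing and deduplication described above is given below, using the jsonlite, data.table and fst packages from the Packages list; the file layout, field names and batch size are illustrative assumptions rather than the system's actual implementation.

    # Sketch only: parse raw JSON CVs in batches, keep a few relevant fields
    # and store each batch in binary (fst) form. Field names are assumptions.
    library(jsonlite)
    library(data.table)
    library(fst)

    json_files <- list.files("raw_cvs", pattern = "\\.json$", full.names = TRUE)
    batches    <- split(json_files, ceiling(seq_along(json_files) / 1000))
    dir.create("batches", showWarnings = FALSE)

    for (i in seq_along(batches)) {
      batch <- rbindlist(lapply(batches[[i]], function(f) {
        cv <- fromJSON(f)
        data.table(email   = cv$email,           # assumed field names
                   locale  = cv$locale,
                   updated = cv$lastUpdateDate)
      }), fill = TRUE)
      write_fst(batch, sprintf("batches/cv_batch_%03d.fst", i))
    }

    # Deduplication: keep the latest CV commit per unique email address.
    cvs <- rbindlist(lapply(list.files("batches", full.names = TRUE), read_fst))
    setorder(cvs, email, -updated)
    cvs_unique <- unique(cvs, by = "email")      # first row per email = latest commit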

Matching with the ESCO model

ESCO is a live taxonomy, and its concepts are constantly updated to reflect the current job market. In general terms, the data model of ESCO is structured on the basis of three pillars, representing a searchable database in 26 languages.

  • The occupations pillar
  • The skills/competences pillar
  • The qualifications pillar

Using this taxonomy, text mining techniques are applied to match multilingual free text with the ESCO data model. After evaluating the language of a given CV, three extensive phases extract ontologies for the three ESCO pillars (Rajaraman and Ullman 2011). In other words, free text (semi-structured information) is mapped to categorical variables (structured information) for descriptive statistics.
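
A minimal sketch of this language evaluation step, using the cld2 package from the Packages list; the fallback to the CV locale and the field names are illustrative assumptions.

    # Sketch only: detect the language of a CV's free text with cld2 and fall
    # back to the CV locale when detection is inconclusive.
    library(cld2)

    detect_cv_language <- function(free_text, locale) {
      lang <- detect_language(free_text)          # ISO 639-1 code or NA
      ifelse(is.na(lang), substr(locale, 1, 2), lang)
    }

    detect_cv_language("Responsabile della gestione dei progetti europei", "it_IT")
    # expected to return "it" for this Italian snippet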

Occupations

Firstly, the process performs all the wrangling and cleansing of the information retrieved from ESCO that is necessary to run the occupation identification algorithm. The multilingual ESCO classification model is used together with the frequency of the ESCO codes provided by the respondents to generate the necessary numerical statistics. An optimized, language-agnostic matching method is used to match free text against the ESCO vocabulary. A text mining algorithm retrieves information about the association between free text and ESCO occupations using the precalculated numerical statistics. The process can be reduced to the two steps below,

  1. Loading, processing and tidying ISCO and ESCO classification data: After data cleansing and wrangling, a frequency table is generated from the ESCO codes provided by the respondents who used the drop-down box to select their occupation. Each occupation's main label in the ESCO classification, as well as any alternative labels it has, are used to create a corpus for each ESCO code. In addition, the multilingual mappings as encoded in ESCO are digested by the system. All datasets are cleaned, transformed and weighted. The more common a word is, the less weight it gets, reflecting the fact that commonly used words encode less information. The weight is also proportional to the frequency of an ESCO code as measured by the drop-down box statistics (a minimal sketch of this weighting follows the list).

  2. Occupations Matching (Weighted): Each CV may have multiple work experience entries, and each one is treated individually. The classification is performed by mapping the free text of each job position to the ESCO model and then climbing up the ISCO hierarchy. The process consists of three phases. First, the top 10 most probable ontologies are retrieved for each free text. With these in hand, a weighted voting leads to a suggested occupation at a higher level of the hierarchy (ISCO level 3). Finally, the ESCO occupations not related to the elected higher-level ontology are discarded and the one with the highest weight is kept.
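
A minimal sketch of the corpus weighting from step 1, using dplyr and tidytext from the Packages list; the toy codes, column names and the log-scaled boost for the drop-down frequencies are illustrative assumptions.

    # Sketch only: build a per-code corpus from the preferred and alternative
    # labels, down-weight common words (tf-idf) and boost codes that were
    # selected more often in the drop-down box.
    library(dplyr)
    library(tidytext)

    esco_labels   <- tibble(esco_code = c("2512", "2512", "5120"),
                            label     = c("software developer",
                                          "application programmer",
                                          "cook"))
    dropdown_freq <- tibble(esco_code = c("2512", "5120"), n = c(120, 340))

    word_weights <- esco_labels %>%
      unnest_tokens(word, label) %>%               # one row per (code, word)
      count(esco_code, word, name = "tf") %>%
      bind_tf_idf(word, esco_code, tf) %>%         # rarer words get higher idf
      left_join(dropdown_freq, by = "esco_code") %>%
      mutate(weight = tf_idf * log1p(n))           # boost frequently chosen codes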

From a technical standpoint, the match function is used to look up free-text words in the ESCO vocabulary. The process is language agnostic and the languages used are those of the ESCO classification for occupations. The large amount of data, the multilingual nature of the problem and the high computational complexity encountered all demand scalable solutions. A low-level implementation in C is used for matching and all intensive calculations are parallelized (Hofert and Mächler 2016; Loo 2014).
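
A minimal sketch of the parallelized lookup, using base R's match() (implemented in C) together with the parallel package from the Packages list; the toy vocabulary and batch size are illustrative.

    # Sketch only: look up cleaned tokens in the ESCO vocabulary with match()
    # and spread the batches of tokens over several cores.
    library(parallel)

    esco_vocabulary <- c("cook", "chef", "nurse", "developer")    # toy vocabulary
    all_tokens      <- c("cook", "developer", "gardener", "nurse")

    match_tokens <- function(tokens, vocabulary) {
      idx <- match(tokens, vocabulary)     # vocabulary positions, NA if absent
      idx[!is.na(idx)]
    }

    token_batches <- split(all_tokens, ceiling(seq_along(all_tokens) / 2))
    hits <- mclapply(token_batches, match_tokens,
                     vocabulary = esco_vocabulary, mc.cores = 2)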

Due to the richness of the taxonomy at hand, the analysis is targeted at the highest levels of the hierarchy to increase the precision of the statistical matching. In fact, we observed that even the most frequent ESCO matches account for only ~1% of the total entries. Thus, noise and dispersion of information are expected at the lowest levels of the occupations hierarchy. For example, it is easier to distinguish an ISCO level 3 occupation (e.g. Cook) than an ESCO one (e.g. Diet cook, Fish cook, Industrial cook). The higher we climb the hierarchy (e.g. ISCO level 1), the more certain the predictions are, and thus we focus our analysis at this level.

Skills

The ESCO classification categorizes distinct skills and competences much like it does for occupations. Additionally, it identifies skills related to specific occupations. Using this information, we attempt to map the skills-related free text inserted by users to the taxonomy through the following steps,

  1. Data Collection and Cleansing: The system first digests the multilingual information concerning skills, cleans free text and keeps only relevant information as provided by the ESCO taxonomy. Specifically, the main and alternative labels of each ESCO skill are utilized to create a corpus for each distinct code in the classification. API endpoints are used to access the graph data model and retrieve the occupations-to-skills relationship. Afterwards, all necessary data transformations and numerical statistics are generated in a similar manner to occupations.

  2. Information retrieval: Firstly, linguistic skills of the respondents are identified, as they are most likely generated by a drop-down menu. The remaining skills are all treated as free text and matched to the given taxonomy following the same process as occupations. However, due to the richness of the classification model, which consists of more than 13,000 unique skills, a taxonomy match is less likely and precision is reduced. Thus, we also use a second classification model that infers suggested skills from the predicted occupations. Given the suggested occupations of a respondent, a list of skills is retrieved using the ESCO model of skills-to-occupations correspondence. Finally, to increase the precision of the matching, a committee classifier strategy is followed using the two methods of skills matching (a minimal sketch follows this list).
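
A minimal sketch of the committee step, assuming two per-CV score tables: one from the direct free-text matching and one from the occupation-inferred skills. The column names and the 60/40 weighting are illustrative assumptions.

    # Sketch only: combine the two skill classifiers and keep the strongest
    # suggestions per CV.
    library(dplyr)

    direct_match        <- tibble(cv_id = 1, skill_uri = c("s1", "s2"),
                                  score = c(0.9, 0.4))
    occupation_inferred <- tibble(cv_id = 1, skill_uri = c("s2", "s3"),
                                  score = c(0.7, 0.5))

    skills_committee <- full_join(direct_match, occupation_inferred,
                                  by = c("cv_id", "skill_uri"),
                                  suffix = c("_text", "_occ")) %>%
      mutate(across(starts_with("score"), ~ coalesce(.x, 0)),
             score = 0.6 * score_text + 0.4 * score_occ) %>%
      group_by(cv_id) %>%
      slice_max(score, n = 5) %>%          # keep the top suggestions per CV
      ungroup()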

Qualifications

The qualifications pillar is retrieved by downloading the HTML documents detailing the different qualifications as defined by ESCO in their web portal. The underlying XML nodeset is parsed and transformed to tabular format using the jsonlite library. Due to the high volume of data to be downloaded and processed, this is done in a number of batches over the qualification levels. The data are cleaned and transformed into tidy format to be compatible with the system's pipeline. ESCO exposes over 9,500 qualifications through their website. These are used to assist two distinct information retrieval operations: one for determining the equivalent education level of each CV's education-related entry, and one for determining its corresponding field.
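
A minimal sketch of this scraping step using the xml2 and jsonlite packages from the Packages list; the vector of page URLs and the XPath of the embedded JSON node are illustrative assumptions about the portal's structure, not its actual layout.

    # Sketch only: download each qualification page, pull the embedded JSON
    # payload and bind the batches into one tabular dataset.
    library(xml2)
    library(jsonlite)
    library(data.table)

    parse_qualification_page <- function(url) {
      page <- read_html(url)
      node <- xml_find_first(page, "//script[@type='application/json']")
      as.data.table(fromJSON(xml_text(node), flatten = TRUE))
    }

    # qualification_page_urls is assumed to be collected beforehand, one batch
    # per qualification level.
    qualifications <- rbindlist(
      lapply(qualification_page_urls, parse_qualification_page),
      fill = TRUE
    )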

EQF level

Each ESCO qualification corresponds to one distinct education level defined by the European Qualifications Framework. Namely,

  • Levels 1-4: Primary and secondary education.

  • Level 5: Comprehensive, specialized, factual and theoretical knowledge within a field of work or study and an awareness of the boundaries of that knowledge.

  • Level 6: Advanced knowledge of a field of work or study, involving a critical understanding of theories and principles.

  • Level 7: Highly specialized knowledge, some of which is at the forefront of knowledge in a field of work or study, as the basis for original thinking and/or research.

  • Level 8: Knowledge at the most advanced frontier of a field of work or study and at the interface between fields.

After the corpus has been brought to tidy format, we perform the following steps,

  1. Data Collection and Cleansing: The relevant fields of the scraped data, specifically the main and alternative titles of each ESCO qualification across the different locales and their equivalent EQF levels, are initially brought to tidy form. Then, after evaluating the corpus for each language, the system augments the corpus for inadequate or entirely missing languages using a translation/crawling approach. The corpus is further improved through labeled data acquired from CVs that used the drop-down box, and through prototypes collected from the definitions of each country's National Qualifications Framework. With the corpus in hand, wrangling and cleansing are performed to run the classification algorithm. Two tables with the necessary numerical statistics are generated: one with respect to EQF level and another based on each specific ESCO qualification, and all free-text qualification titles and organization names provided by the respondents are tokenised and cleaned.

  2. Information retrieval: This process matches a CV's education-related input to a particular EQF level. Each free-text token is matched against two vocabularies: one based on EQF level, and another based on specific ESCO qualifications. The process is language-agnostic and consistent with the previous free-text matching. The EQF level suggested for a given free-text entry is estimated by weighting the two classification methods. Each qualification entry provided on a CV is matched to a separate EQF level, and the respondent is assigned the highest EQF level achieved across all qualification entries (a minimal sketch of this aggregation follows the list).
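
A minimal sketch of the final aggregation, where a respondent keeps the highest EQF level suggested across their qualification entries; the table and column names are illustrative.

    # Sketch only: one suggested EQF level per qualification entry, the highest
    # level is kept per respondent.
    library(dplyr)

    qualification_entries <- tibble(cv_id     = c(1, 1, 2),
                                    eqf_level = c(4, 6, 7))

    respondent_eqf <- qualification_entries %>%
      group_by(cv_id) %>%
      summarise(eqf_level = max(eqf_level, na.rm = TRUE))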

Education Field

The ESCO taxonomy specifies one or more fields for each qualification. The fields of education and training defined by ISCED (FoET 2013) are used, and specifically the detailed fields at level 3. A single qualification may belong to one of the 80 distinct level 3 fields, or a combination of them in cases of inter-disciplinary programmes. Similar to the approach used previously, the following steps are performed,

  1. Data Collection and Cleansing: A corpus of the cleansed ESCO qualification data is created with respect to field of education and training across the different locales. After evaluating each language’s corpus, the system once again augments the ones considered to be inadequate. Each augmented corpus is then used to produce numerical statistics. Education-related free text provided by the respondents is finally cleansed and tokenized before moving on with information retrieval.

  2. Information retrieval: The education entries of each CV are mapped to the defined fields of education and training. The language-agnostic process detailed before is utilized, using the vocabularies and numerical statistics retrieved in the previous step. The number of education fields a specific qualification may belong to varies; up to three fields may be assigned to a single entry based on the degree of confidence of the algorithm's suggestions (see the sketch below).
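
A minimal sketch of this assignment, keeping at most three fields per entry ranked by the classifier's confidence; the codes, column names and the confidence cut-off are illustrative assumptions.

    # Sketch only: keep up to three fields of education and training per entry.
    library(dplyr)

    field_suggestions <- tibble(entry_id   = c(1, 1, 1, 1),
                                foet_field = c("0613", "0311", "0541", "0688"),
                                confidence = c(0.62, 0.21, 0.11, 0.06))

    assigned_fields <- field_suggestions %>%
      filter(confidence >= 0.10) %>%       # drop weak suggestions
      group_by(entry_id) %>%
      slice_max(confidence, n = 3) %>%
      ungroup()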

Collecting statistics

To increase the precision of the analysis, CVs are filtered based on the extent of their completion. Following a heuristic approach, a completion index has been defined as follows,
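
A plausible form of the index, assuming each of the eight key fields contributes equally to the score, is

    C.I. = \frac{1}{8} \sum_{k \in \text{key fields}} \mathbf{1}\{\, k \text{ is non-missing} \,\}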

where C.I. stands for the completion index and the key fields are assumed to be the locale, address, birthdate, gender, nationality, firstname, surname and email. The sampling is performed based on the completion index per locale; thus, a CV with most of the key fields missing is discarded. In this way, the strata definition is collectively exhaustive and mutually exclusive. A proportionate allocation strategy is followed: in each stratum the bottom 10% of the CVs is discarded. The resulting sample of 353,518 CVs is used for statistical analysis.
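
A minimal sketch of this per-locale filtering, assuming a table with one row per CV that already carries its completion index; the table and column names are illustrative.

    # Sketch only: within each locale stratum, discard CVs in the bottom 10% of
    # the completion index.
    library(dplyr)

    qualified_cvs <- cv_table %>%
      group_by(locale) %>%
      filter(completion_index > quantile(completion_index, 0.10)) %>%
      ungroup()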

In addition, relevant new variables are derived from existing ones to assist the statistical analysis and the illustration of information. Noteworthy examples are (two of these derivations are sketched after the list),

  • Working years are estimated as the total number of years a respondent declared staying in all occupations.

  • Employment Status is inferred from the year a user left a job; the respondent is assumed to be employed if no leave date is declared.

  • Mean and median years in a job are calculated by taking into account the years spent across all work experiences of a respondent.

  • Studied abroad is determined by the existence of at least one location of study outside the respondent's country of origin.

  • Education Level Group consists of three levels: Low, defined as EQF Levels 1-4; Mid, as Levels 5-6; and High, as Levels 7-8.

  • Age Groups are defined as 16-24, 25-49 and 50-64, in line with Online job vacancies and skills analysis (Cedefop 2019).
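
A minimal sketch for two of these derivations (working years and employment status), assuming a work-experience table with one row per job entry; column names and the 2019 reference year are illustrative.

    # Sketch only: derive working years and employment status per respondent.
    library(dplyr)

    work_experience <- tibble(cv_id      = c(1, 1, 2),
                              start_year = c(2010, 2015, 2018),
                              end_year   = c(2014, NA, NA))    # NA = no leave date

    derived <- work_experience %>%
      mutate(end_or_now = coalesce(end_year, 2019)) %>%        # collection year
      group_by(cv_id) %>%
      summarise(working_years = sum(end_or_now - start_year),
                employed      = any(is.na(end_year)))          # still in a job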

What’s next?

This report can act as a starting point for further analysis towards more useful inferences about the job market in Europe and across the globe. Comparing the data derived from our model with labour market datasets from other sources could reveal potential biases and lead to more precise conclusions. Additionally, more attentive handling of multilingual text may reveal further patterns and trends in occupations, skills and qualifications.

Packages

xml2, parallel, rmdformats, text2vec, stringr, httr, cld2, cld3, DT, jsonlite, stopwords, magrittr, scales, ggplot2, colorspace, grid, grImport2, ISOcodes, data.table, fst, dplyr, tidyr, tidytext, ggalluvial, ggrepel, igraph, ggraph, shiny, shinyjs, shinyWidgets, shiny.router.

References

Boettiger, Carl. 2015. “An Introduction to Docker for Reproducible Research.” SIGOPS Oper. Syst. Rev. 49 (1). New York, NY, USA: Association for Computing Machinery: 71–79. doi:10.1145/2723872.2723882.

Cedefop. 2019. “Online job vacancies and skills analysis.” https://www.cedefop.europa.eu/en/publications-and-resources/publications/4172.

Hofert, Marius, and Martin Mächler. 2016. “Parallel and Other Simulations in R Made Easy: An End-to-End Study.” Journal of Statistical Software, Articles 69 (4): 1–44. doi:10.18637/jss.v069.i04.

Loo, Mark P.J. van der. 2014. “The stringdist Package for Approximate String Matching.” The R Journal 6 (1): 111–22. doi:10.32614/RJ-2014-011.

Rajaraman, Anand, and Jeffrey David Ullman. 2011. “Data Mining.” In Mining of Massive Datasets, 1–17. Cambridge University Press. doi:10.1017/CBO9781139058452.002.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software, Articles 40 (1): 1–29. doi:10.18637/jss.v040.i01.

———. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. doi:10.18637/jss.v059.i10.

Wilkinson, Leland. 2010. “The Grammar of Graphics.” WIREs Computational Statistics 2 (6): 673–77. doi:10.1002/wics.118.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Chapman & Hall/CRC The R Series. Boca Raton: Chapman and Hall/CRC.