Data Wrangle Collected CVs

JSON CVs are parsed. Data wrangling scripts generate a set of meaningful data sets that can be used for statistical analysis, visualizations and machine learning applications.

Deduplication

The process identifies a large number of JSON files that refer to the same CV. The latest update of a CV for each e-mail address is kept. - Link to job
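The "keep the latest update per e-mail address" rule can be sketched as follows. This is a minimal illustration, not the job's actual code: the field names `email` and `last_update` are hypothetical, and normalising addresses to lower case is an added assumption.

```python
from datetime import datetime

def deduplicate(cvs):
    """Keep the most recently updated CV for each e-mail address."""
    latest = {}
    for cv in cvs:
        # Assumption: addresses are compared case-insensitively.
        key = cv["email"].strip().lower()
        updated = datetime.fromisoformat(cv["last_update"])
        if key not in latest or updated > latest[key][0]:
            latest[key] = (updated, cv)
    return [cv for _, cv in latest.values()]

cvs = [
    {"email": "a@example.org", "last_update": "2016-01-10", "id": 1},
    {"email": "A@example.org", "last_update": "2016-03-02", "id": 2},
    {"email": "b@example.org", "last_update": "2015-12-01", "id": 3},
]
kept = deduplicate(cvs)
```

Here the first two records share an address, so only the later one survives.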

Demographics

The process collects demographic data about the subjects, such as residence, postal code, birthdate and gender. - Link to job

Work experience

The process collects data on the subjects’ work experience, such as job title, dates of employment and employer name. - Link to job

Qualifications

The process collects data on the subjects’ qualifications, such as organisation, country of study and title. - Link to job

Skills

The process collects data on the subjects’ skills, such as linguistic, organisational and computer skills. - Link to job

Miscellaneous

The process collects data on miscellaneous topics related to computer and linguistic skills. - Link to job

Matching with ESCO model

- Occupations

ESCO/ISCO data tidying

The process performs all the wrangling and cleansing of the necessary data-sets to run the occupation identification algorithm. The multilingual ESCO classification model is used together with the frequency of the ESCO URI provided by the subjects to generate the necessary numerical statistics. - Link to job

Occupations Retrieval

An optimized language-agnostic matching method is used to identify free text words in the ESCO vocabulary. A text mining algorithm retrieves information about the association between free text and ESCO occupations using precalculated numerical statistics. - Link to job
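As a rough sketch of this kind of retrieval, the example below scores ESCO concepts by the tokens they share with a free-text entry, weighting each token by a precalculated statistic. The text does not say which statistics the job uses; an IDF-style weight is assumed here purely for illustration, and the concept identifiers and labels are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # \w+ matches Unicode word characters, so no language-specific rules are needed.
    return re.findall(r"\w+", text.lower())

def build_index(labels):
    """Precalculate an IDF-style weight per token and a token -> concept index."""
    doc_freq = Counter()
    postings = defaultdict(set)
    for concept_id, label in labels.items():
        for tok in set(tokenize(label)):
            doc_freq[tok] += 1
            postings[tok].add(concept_id)
    n = len(labels)
    idf = {tok: math.log(n / df) + 1.0 for tok, df in doc_freq.items()}
    return idf, postings

def rank(text, idf, postings):
    """Sum the weights of shared tokens and return concepts, best match first."""
    scores = Counter()
    for tok in set(tokenize(text)):
        for concept_id in postings.get(tok, ()):
            scores[concept_id] += idf[tok]
    return scores.most_common()

# Hypothetical vocabulary; real ESCO URIs and labels would be loaded instead.
labels = {
    "occ/1": "software developer",
    "occ/2": "web developer",
    "occ/3": "primary school teacher",
}
idf, postings = build_index(labels)
ranked = rank("Senior software developer", idf, postings)
```

Rare tokens such as "software" contribute more than common ones such as "developer", so `occ/1` ranks above `occ/2` for this entry.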

Headline Occupations Retrieval

The headline field in the JSON schema encodes information on the job position, the job applied for and/or the preferred job. The process applies a matching method similar to the one used for the occupations free text to these fields. - Link to job

- Skills: Data Collection and Cleansing

ESCO skills

This function loads the multilingual skills data set, cleans the free text and keeps only the relevant information as provided by ESCO. - Link to job

Occupations-Skills Model

This process calls the ESCO API to access the graph data model and retrieve the occupations-to-skills classification. - Link to job

EPAS skills

This function keeps free text descriptions and performs data/text cleansing. - Link to job

ESCO/ISCO data tidying

The process performs all the wrangling and cleansing of the necessary data-sets to run the skills identification algorithm. The multilingual ESCO classification model is used to generate the necessary numerical statistics. - Link to job

- Skills: Information retrieval

Linguistic Skills

This process retrieves information concerning the linguistic skills of the subjects. - Link to job

Free text

An optimized language-agnostic matching method is used to identify free text words in the ESCO vocabulary. A text mining algorithm retrieves information about the association between free text and ESCO skills using precalculated numerical statistics. - Link to job

Occupations to Skills

This process uses the occupations-skills ESCO model to estimate the skills of the subjects based on the predicted occupations. - Link to job

Committee classifier

This process retrieves the final estimated skills by applying a voting method to the results of the two classification methods. - Link to job
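One way such a committee could combine the two classifiers is a weighted soft vote over normalised scores. The voting rule is not specified in the text, so the function `committee_vote`, its equal weights and the `top_n` cut-off are all assumptions made for this sketch.

```python
from collections import Counter

def committee_vote(free_text_scores, occupation_scores, weights=(0.5, 0.5), top_n=3):
    """Combine two score dictionaries into one ranked list of skills."""
    combined = Counter()
    for weight, scores in zip(weights, (free_text_scores, occupation_scores)):
        # Normalise each classifier's scores so neither dominates by scale alone.
        total = sum(scores.values()) or 1.0
        for skill, score in scores.items():
            combined[skill] += weight * score / total
    return [skill for skill, _ in combined.most_common(top_n)]

# Hypothetical outputs of the free-text and occupations-to-skills steps.
free_text = {"skill/python": 3.0, "skill/sql": 1.0}
from_occupation = {"skill/python": 1.0, "skill/git": 1.0}
final = committee_vote(free_text, from_occupation, top_n=2)
```

A skill supported by both classifiers accumulates votes from each, which is why `skill/python` ends up first in this example.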

- Qualifications: Data Collection and Cleansing

Corpus creation

This process uses certain initial data to create a multilingual corpus for each EQF level using ESCO data previously scraped across the different locales. - Link to job

Corpus augmentation

This process improves the corpus for each EQF level using a translation/crawling approach and further augments it with prototypes informed by NQF, as well as labeled examples. - Link to job

Tidying EQF corpus

The process performs all the wrangling and cleansing of the necessary data-sets to run the EQF classification algorithm. The multilingual ESCO classification of qualifications is used to generate the necessary numerical statistics. - Link to job

CV free text and features

This process identifies numeric and discrete features that may assist the matching process and preprocesses the free text related to qualifications as it is provided by the subjects. - Link to job

- Qualifications: Information retrieval

Qualifications Retrieval

An optimized language-agnostic matching method is applied to the ESCO qualifications vocabulary. Using the precalculated numerical statistics, an information retrieval strategy is followed to estimate which EQF levels and which individual ESCO qualifications are most likely for each free text entry. - Link to job

Committee classifier

This process retrieves the final estimated EQF levels by utilizing previously selected numeric and discrete features to decide on a final level from the candidate levels. - Link to job

Institutions

This process retrieves the top organisations of each country and matches each organisation variable’s free-text with them. - Link to job

- Education Fields: Data Collection and Cleansing

Corpus creation

This process uses certain initial data to create a multilingual corpus for each field of education using ESCO data previously scraped across the different locales. - Link to job

Corpus augmentation

This process improves the corpus for each field of education using a translation/crawling approach. - Link to job

Tidying fields corpus

The process performs all the wrangling and cleansing of the necessary data-sets to run the education field classification algorithm. The multilingual ESCO classification of qualifications is used to generate the necessary numerical statistics. - Link to job

CV free text

The process preprocesses the free text related to qualifications as it is provided by the subjects. - Link to job

- Education Fields: Information retrieval

Education Fields Retrieval

An optimized language-agnostic matching method is applied to the ESCO qualifications vocabulary. Using the precalculated numerical statistics, an information retrieval strategy is followed to estimate which fields of education are most likely for each free text entry. - Link to job

Committee classifier

This process uses the field suggestions of the previous step to arrive at a final field prediction for each CV free text. - Link to job

Collecting statistics

Stratified Sampling

The percentage of completion of each CV is measured and a quality score is defined. Within each stratum, the worst-“performing” 10% of CVs are discarded to improve precision. - Link to job
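The filtering step can be sketched as below. The completion measure (share of non-empty fields), the stratum variable `country` and the field name `quality` are hypothetical stand-ins; the actual quality score used by the job is not described here.

```python
import math
from collections import defaultdict

def completion(cv, fields):
    """Share of the given fields that are filled in (assumed completion measure)."""
    filled = sum(1 for f in fields if cv.get(f) not in (None, "", []))
    return filled / len(fields)

def drop_worst_per_stratum(records, stratum_key, score_key, fraction=0.10):
    """Discard the lowest-scoring fraction of records within each stratum."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[stratum_key]].append(rec)
    kept = []
    for members in groups.values():
        members.sort(key=lambda r: r[score_key])
        n_drop = math.floor(len(members) * fraction)
        kept.extend(members[n_drop:])
    return kept

# Two hypothetical strata with 10 and 20 CVs.
records = [{"country": "EL", "quality": q / 10} for q in range(10)]
records += [{"country": "IT", "quality": q / 20} for q in range(20)]
kept = drop_worst_per_stratum(records, "country", "quality")
```

Cutting per stratum rather than globally keeps small strata from being wiped out when their scores are uniformly lower than those of larger ones.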

Demographic Stats

This process collects inferred features/attributes of the subjects and calculates simple statistics. - Link to job

Occupation Stats

This process collects inferred features/attributes of the subjects’ work experiences and calculates simple statistics. - Link to job

Skills Stats

This process collects inferred features/attributes of the subjects’ skills and calculates simple statistics. - Link to job

Education Stats

This process collects inferred features/attributes of the subjects’ qualifications and calculates simple statistics. - Link to job

Report data

This process precalculates statistics for the survey report. - Link to job

Dataset Collection

- Raw

Previous analysis has resulted in measurements of numeric and categorical variables and mappings of free texts to standardized taxonomies for each CV. This data is now aggregated to become anonymous and more easily shareable.

Demographics

This process creates aggregations based on the previously stratified data. It standardizes variable names of demographic data and counts number of respondents for each combination of values. - Link to job
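A minimal sketch of this kind of aggregation, counting respondents per combination of values, is shown below. The variable names and records are invented for illustration; the count column name `n` is also an assumption.

```python
from collections import Counter

def aggregate(records, variables):
    """Count records for each observed combination of the given variables."""
    counts = Counter(tuple(rec.get(v) for v in variables) for rec in records)
    # One output row per combination, with the count stored under "n".
    return [dict(zip(variables, combo), n=n) for combo, n in counts.items()]

respondents = [
    {"country": "EL", "gender": "F"},
    {"country": "EL", "gender": "F"},
    {"country": "IT", "gender": "M"},
]
rows = aggregate(respondents, ["country", "gender"])
```

Because only counts per combination are kept, the output no longer contains one row per individual.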

Occupations

This process creates aggregations based on the previously stratified data. It standardizes variable names of occupation data and counts number of responses for each combination of values. - Link to job

Skills

This process creates aggregations based on the previously stratified data. It standardizes variable names of skills data and counts number of responses for each combination of values. - Link to job

Qualifications

This process creates aggregations based on the previously stratified data. It standardizes variable names of qualification and education field data and counts number of responses for each combination of values. - Link to job

Occupations progression

This process creates aggregations regarding progression from one job to another based on the previously stratified data for occupations. It includes statistics based on the responses given to the work experience-related fields on the Europass survey. - Link to job
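Progression from one job to the next can be counted as transitions between consecutive work experiences. The sketch below assumes each work experience carries a hypothetical `occupation` label and an ISO-formatted `start` date; neither name comes from the actual schema.

```python
from collections import Counter

def transitions(histories):
    """Count (previous occupation, next occupation) pairs across all CVs."""
    counts = Counter()
    for jobs in histories:
        # ISO-formatted dates sort chronologically as plain strings.
        ordered = sorted(jobs, key=lambda j: j["start"])
        for prev, nxt in zip(ordered, ordered[1:]):
            counts[(prev["occupation"], nxt["occupation"])] += 1
    return counts

histories = [
    [{"occupation": "waiter", "start": "2010-01"},
     {"occupation": "chef", "start": "2012-06"}],
    [{"occupation": "chef", "start": "2015-03"},
     {"occupation": "waiter", "start": "2011-09"}],
]
counts = transitions(histories)
```

Both CVs here move from waiter to chef once the jobs are put in date order, so that transition is counted twice.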

Education progression

This process creates aggregations regarding progression from one education field to another based on the previously stratified data for qualifications. It includes statistics based on the responses given to the education-related fields on the Europass survey. - Link to job

- Standardized

Data is further aggregated with values of each variable mapped to standardized levels. These datasets represent the final transformation step before use with the interactive visualization tool written in Shiny.

Demographics

- Link to job

Occupations

- Link to job

Skills

- Link to job

Education Fields

- Link to job

Career Path

- Link to job

Academic Path

- Link to job

- Free text n-grams

Datasets with term frequencies of the top keywords and phrases in each CV section’s free text are derived. This data can be used to identify trends and correlations between terms.

Occupations terms

This process calculates unigrams and bigrams for the occupations free-text fields with respect to a given subset of features. - Link to job
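Counting unigrams and bigrams per feature value can be sketched as follows. The record layout, the grouping feature `locale` and the simple `\w+` tokeniser are assumptions for illustration only.

```python
import re
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequencies(records, text_field, feature, n):
    """Count n-grams of the text field separately for each value of the feature."""
    by_group = defaultdict(Counter)
    for rec in records:
        tokens = re.findall(r"\w+", rec[text_field].lower())
        by_group[rec[feature]].update(ngrams(tokens, n))
    return by_group

records = [
    {"locale": "en", "text": "Project manager for software projects"},
    {"locale": "en", "text": "Software project manager"},
    {"locale": "fr", "text": "Chef de projet"},
]
unigrams = term_frequencies(records, "text", "locale", 1)
bigrams = term_frequencies(records, "text", "locale", 2)
```

Calling `bigrams["en"].most_common(10)` would then give the top phrases for the English entries.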

Skills terms

This process calculates unigrams and bigrams for the skills free-text fields with respect to a given subset of features. - Link to job

Skills bigrams for top 70 ISCO 3

This process calculates bigrams from the skills free-text data based on the category of the skill or the occupation of the respondent. - Link to job

Qualification terms

This process calculates unigrams and bigrams for the qualification free-text fields with respect to a given subset of features. - Link to job

- Preparing free text data for Shiny

Occupations

This process creates standardized unigram and bigram frequency datasets for occupations-related text. A separate binary file is created for each locale to accommodate use with the Shiny application. - Link to job

Skills

This process creates standardized unigram and bigram frequency datasets for skills-related text. A separate binary file is created for each locale to accommodate use with the Shiny application. - Link to job

Skills for top 70 ISCO 3

This process creates standardized bigram frequency datasets for skills-related text with respect to the top 70 ISCO 3 occupations. A separate binary file is created for each locale to accommodate use with the Shiny application. - Link to job

Skill terms co-occurrence

This process creates standardized datasets representing bigrams in skills-related text in from / to / count format. A separate binary file is created for each locale to accommodate use with the Shiny application. - Link to job
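The from / to / count layout can be illustrated with a short sketch that treats each bigram as a directed edge between two terms. The tokeniser and the sample texts are assumptions; the serialisation to a binary file per locale is not shown.

```python
import re
from collections import Counter

def bigram_edges(texts):
    """Return bigrams as from / to / count rows, most frequent first."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return [{"from": a, "to": b, "count": n} for (a, b), n in counts.most_common()]

texts = ["team work and team spirit", "team work under pressure"]
edges = bigram_edges(texts)
```

An edge list of this shape can be loaded directly into most graph or network visualisation tools.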

Qualifications

This process creates standardized unigram and bigram frequency datasets for qualifications-related text. A separate binary file is created for each locale to accommodate use with the Shiny application. - Link to job

- Preparing transaction data for Shiny

Demographics for transactions

This process creates aggregations based on the previously stratified data. It includes statistics for the Europass survey respondents to be used with the calculated transactions. - Link to job

Transactions

This process translates the data inside the 3 ESCO Pillars into transactions for market basket analysis. - Link to job
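As an illustration of the transaction format, the sketch below turns each CV into a set of items drawn from the three pillars (occupations, skills, qualifications) and computes the support of item pairs, the basic quantity in market basket analysis. The record layout, the `pillar=value` item naming and the example codes are hypothetical.

```python
from collections import Counter
from itertools import combinations

PILLARS = ("occupations", "skills", "qualifications")

def to_transaction(cv):
    """Flatten one CV into a set of 'pillar=value' items."""
    items = set()
    for pillar in PILLARS:
        items.update(f"{pillar}={value}" for value in cv.get(pillar, []))
    return frozenset(items)

def pair_support(transactions):
    """Share of transactions in which each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

cvs = [
    {"occupations": ["2512"], "skills": ["python", "sql"], "qualifications": ["EQF6"]},
    {"occupations": ["2512"], "skills": ["python"], "qualifications": ["EQF7"]},
]
transactions = [to_transaction(cv) for cv in cvs]
support = pair_support(transactions)
```

Prefixing each item with its pillar keeps identical labels from different pillars apart when association rules are mined later.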