Occupations Analysis, Trends and Correlations

Abstract

Using naturally occurring labour market data aggregated from over 10M users of the Europass CV editor application, we produce a report with insights on: a) disappearing and newly emerging occupations, b) changing skill requirements for job positions, c) usual career paths, d) skills-to-occupations associations in the ESCO model compared to the collected CV data.

Introduction

The Europass CV editor is an online application that allows users to create, store and share their curriculum vitae. It is part of the Europass initiative, which aims at increasing the transparency of qualifications and the mobility of citizens in Europe. This report serves as a continuation of Cedefop’s previous Europass CV Insights Report, which provided descriptive statistics for users of the application between June and September 2019. Unlike the previous analysis, which was conducted on data gathered from an opt-in user survey of around 400K users, this project uses the backup Europass database, which includes anonymised data for all users of Europass between Q1 2017 and Q2 2020, for a total of 10M CVs.

The data processing pipeline developed in the previous phase of the project was employed to acquire two main standardized datasets. The first one encodes the users’ characteristics (eg. gender, birth year, country, latest job), and the other their work experiences (eg. recruitment and termination year). The data cleansing and standardization process is based on the ESCO taxonomy model, as well as on custom groups (eg. age groups) chosen to be comparable with other indicators of official statistics. The algorithm used to standardize multilingual free-text of work experiences is based on document vectorization and a modified nearest-neighbours classifier using the ESCO/ISCO hierarchy. See also the published classifier on CRAN, labourR.

After deduplication, we proceed to a weighting analysis by comparing the demographic characteristics of the Europass users with those of the general population as reported by Eurostat. The weighting process is performed to gain feedback on the biases of the dataset, and weighted results are compared to their unweighted counterparts, as well as to sources of official statistics. In addition, we mainly use the countries of the Euro area (EA-19) as a representative average, since this is the group with the fewest extreme weights. Users aged between 15 and 49 have been considered, while trend analysis was performed for recruitments between 2000 and 2019.

Career paths

Frequency of group change

  • Job transitions reported by Europass can be between the same or different ISCO groups. We perform a series of statistical transformations and aggregations to gain insight on the frequency of ISCO group change. Specifically, we aggregate the “from-to” sequences of subsequent job positions of all users, and measure the frequency of all transitions between ISCO groups.
  • The following barplots report on the frequency of transition from an ISCO group to a different one.

[Figure: frequency of transition from each ISCO group to a different one. Panels: ISCO 1, ISCO 2.]

  • When changing jobs, Europass users most commonly start a job that belongs to the same ISCO group as their previous one. Skilled agricultural, forestry and fishery workers and Clerical support workers are the most likely to move to a different ISCO 1 group, while Professionals and Technicians and associate professionals are the least likely.
  • More specifically, ISCO 2 groups Health professionals, Information and communications technology professionals, and Teaching professionals are particularly consistent in staying in their respective fields.
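
The aggregation of “from-to” sequences described in this subsection can be sketched as follows. This is a minimal Python illustration, not the pipeline used for the report; the input format (one chronologically ordered list of ISCO group labels per user) and the function name are assumptions made for the example.

```python
from collections import Counter

def group_change_rate(careers):
    """Share of job-to-job transitions that leave each ISCO group.

    careers: list of per-user job sequences, each a chronologically
    ordered list of ISCO group labels (hypothetical input format).
    """
    total = Counter()    # transitions starting from each group
    changed = Counter()  # of those, transitions ending in another group
    for jobs in careers:
        # Pair every job with the one that follows it ("from-to").
        for src, dst in zip(jobs, jobs[1:]):
            total[src] += 1
            if src != dst:
                changed[src] += 1
    return {group: changed[group] / total[group] for group in total}

# Toy data, for illustration only.
careers = [
    ["Professionals", "Professionals", "Managers"],
    ["Clerical support workers", "Professionals"],
    ["Professionals", "Professionals"],
]
rates = group_change_rate(careers)
```

Groups that never appear as the origin of a transition are omitted from the result, which avoids a division by zero.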

Most common transitions to different groups

  • Certain patterns of transition from one ISCO group to another are more prevalent than others in our data. By excluding transitions to the same group, we acquire correlations between occupations with respect to users’ career paths.
  • We report a sample of the 20 most common transitions between ISCO 1 groups and the top 30 between ISCO 2 groups in the form of a sankey plot. Note that these are the most common sequences of occupations, and not necessarily the most common occupations.

[Figure: Sankey plot of the most common transitions between different ISCO groups. Panels: ISCO 1, ISCO 2.]

  • When starting a job that belongs to a different ISCO 1 group than their previous one, users most commonly become Professionals or Technicians and associate professionals.
  • Professionals themselves are most likely to transition into Managers or Technicians and associate professionals.
  • Transitions between related ISCO 2 groups (eg. Business and administration professionals to Administrative and commercial managers, or Legal, social and cultural professionals to Teaching professionals) are common.
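
Ranking the transitions that cross group boundaries, as done for the Sankey plots, might look like the sketch below. It assumes the same hypothetical input as before (one ordered list of ISCO group labels per user); each returned pair and its count could serve as one link of a Sankey diagram.

```python
from collections import Counter

def top_cross_group_transitions(careers, n):
    """Return the n most frequent (from, to) pairs with from != to."""
    counts = Counter(
        (src, dst)
        for jobs in careers
        for src, dst in zip(jobs, jobs[1:])
        if src != dst  # transitions within the same group are excluded
    )
    return counts.most_common(n)

# Toy data, for illustration only.
careers = [
    ["Professionals", "Managers"],
    ["Professionals", "Professionals", "Managers"],
    ["Clerical support workers", "Professionals"],
]
top = top_cross_group_transitions(careers, 1)
```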

Work experience by age

  • Age is naturally related to a person’s years of work experience. To quantify this relationship in our sample, we have measured each Europass user’s cumulative work experience and their age at the time they started each work experience they reported.
  • In the plot that follows, the dashed line is derived from a linear model fitted on all data points of cumulative work experience and age in the dataset. Additionally, the circular points correspond to the placement of each individual ISCO group based on the mean age and mean years of work experience of the users professing it. Note that points are hoverable and the size of each circle corresponds to the sample size.

[Figure: mean years of work experience against mean age by ISCO group, with the fitted regression line. Panels: ISCO 1, ISCO 2, ISCO 3.]

  • Three main clusters of ISCO 3 groups can be noticed based on the relationship between age and years of work experience: a) the main cluster, composed mainly of Professionals and Technicians and associate professionals, whose age and work experience are close to the average, b) a dense cluster composed mainly of occupations such as Waiters and bartenders and Shop salespersons, and c) a sparser cluster composed of occupations generally professed by older individuals, such as Managing directors and chief executives and Medical doctors, but also Heavy truck and bus drivers and Refuse workers.
  • The first cluster closely follows the main sequence of occupations noted by the dashed line and the majority of occupations belong in it. These occupations gather work experience at an average rate as the age of the people professing them increases.
  • The second cluster includes many entry-level jobs more commonly selected by younger individuals. People who choose these occupations enter the job market early and tend to accumulate work experience faster than the rest.
  • The third cluster displays more variability with respect to work experience. Some of those jobs, such as Medical doctors, mandate late entry into the work force due to the years of education they require. Others, such as those in the ISCO 1 group Managers, require a lot of experience and specialization. Finally, users with jobs in the ISCO 1 group Elementary occupations have a harder time accumulating work experience, likely because these jobs are more occasional and interrupted by periods of unemployment.
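
The dashed line and the per-group points can be illustrated with an ordinary least-squares fit and per-group means. The sketch below is a plain-Python illustration under assumed inputs (tuples of group, age and cumulative years of experience); it is not the code behind the plots.

```python
def fit_line(ages, experience):
    """Ordinary least-squares fit: experience = slope * age + intercept."""
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(experience) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, experience))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def group_centroids(records):
    """records: iterable of (group, age, experience).

    Returns {group: (mean age, mean experience, sample size)}.
    """
    sums = {}
    for group, age, exp in records:
        acc = sums.setdefault(group, [0.0, 0.0, 0])
        acc[0] += age
        acc[1] += exp
        acc[2] += 1
    return {g: (a / n, e / n, n) for g, (a, e, n) in sums.items()}

# Toy data, for illustration only.
records = [
    ("Group A", 20, 1), ("Group A", 25, 4),
    ("Group B", 30, 7), ("Group B", 35, 10),
]
slope, intercept = fit_line(
    [age for _, age, _ in records],
    [exp for _, _, exp in records],
)
centroids = group_centroids(records)
```

Each centroid gives the coordinates of one circle, and its third element the sample size that would drive the circle's area.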

Deviation of observed and expected work experience

  • The relationship between age and years of work experience is different for each ISCO group. Using the regression line fitted to our data, we compute the residual of each observation and derive the mean residual by ISCO group.
  • We report the 99.9% confidence interval of the mean residual for the ISCO groups with the largest sample size. The mean age is mapped on the colour gradient.

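[Figure: mean residual of work experience by ISCO group with 99.9% confidence intervals, coloured by mean age. Panels: ISCO 1, ISCO 2, ISCO 3.]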

  • Users professing jobs that belong in the ISCO 1 groups Professionals and Technicians and associate professionals have roughly as many years of work experience as expected based on the main sequence.
  • Professions that require a lot of training and education before entering the job market include users that have less work experience than anticipated, likely due to late entry into the labour force.
  • Users in administrative and managerial jobs have more experience than anticipated, likely indicating that work experience must be accumulated before users are able to move to these occupations.
  • The observed work experience for entry-level jobs such as Service and sales workers is higher than the expected value, while the opposite is true for Elementary occupations.
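
The deviation measure used in this subsection can be sketched as the mean residual per group with a confidence interval around it. The report does not state how its intervals were computed; the normal approximation below, like the input format and function name, is an assumption made for illustration.

```python
from statistics import NormalDist, mean, stdev

def residual_ci_by_group(records, slope, intercept, level=0.999):
    """Mean residual per group with a normal-approximation interval.

    records: iterable of (group, age, experience) tuples (hypothetical).
    Returns {group: (lower, mean_residual, upper)}.
    """
    # Two-sided critical value, about 3.29 for a 99.9% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    residuals = {}
    for group, age, exp in records:
        residuals.setdefault(group, []).append(exp - (slope * age + intercept))
    result = {}
    for group, res in residuals.items():
        m = mean(res)
        # The standard error needs at least two observations.
        half = z * stdev(res) / len(res) ** 0.5 if len(res) > 1 else float("nan")
        result[group] = (m - half, m, m + half)
    return result

# Toy data, for illustration only: both users are 30 years old.
records = [("Group A", 30, 8), ("Group A", 30, 10)]
intervals = residual_ci_by_group(records, slope=0.6, intercept=-11.0)
```

A positive mean residual marks a group with more experience than the regression line predicts for its age, a negative one a group with less.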

Occupation and skills

Skills by occupation and birth year

  • The database of the Europass application did not preserve free-text of skills for privacy reasons. For that reason, this part of the report is based on an analysis of free-text of skills in anonymised data collected in a survey in Q2 2019. Note that data weighting was not employed in this case.
  • As with occupations, regression analysis was performed for unigrams and ESCO skills with respect to the birth year and the latest job of users reporting them. The subsequent time series report the 12 most upward and downward trending among the 100 most relevant unigrams of each ISCO 1 group. Relevance is measured using a custom metric based on tf-idf. The same exercise is performed for the 250 most relevant ESCO skills of each ISCO 1 group.
  • Birth year is displayed on the x-axis, while the term frequency per year, normalized between 0.0 and 1.0, is displayed on the y-axis. Results are shown for the 4 most common ISCO 1 groups.

[Figures: normalized term frequency by birth year for the most upward and downward trending unigrams and ESCO skills. Panels per ISCO 1 group: Upward (Unigram), Downward (Unigram), Upward (ESCO), Downward (ESCO). Groups shown: Professionals; Technicians and associate professionals; Service and sales workers; a fourth set of panels follows without a group title.]

  • Keywords like “photoshop”, “python” and “matlab” display a positive trend for ISCO 1 group Professionals with respect to birth year, meaning that they are more commonly reported by younger users compared to older ones, who mention keywords like “oracle”, “application”, and “security”.
  • Managerial ESCO skills (eg. project management, draft corporate emails) are more frequently observed among older users, while ESCO skills related to child care and students (eg. assist children with homework, communicate with youth) are more frequently observed among younger people.
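
The y-axis of these time series can be illustrated with a small sketch. The report does not specify how term frequency was counted or normalized, so the choices below (share of CVs mentioning the term, followed by min-max scaling) are assumptions, as are the input format and function name. The custom tf-idf relevance metric is not reproduced here.

```python
def normalized_term_frequency(docs_by_year, term):
    """Share of documents mentioning `term` per birth year, scaled to [0, 1].

    docs_by_year: {birth_year: list of token lists} (hypothetical format).
    """
    freq = {
        year: sum(term in doc for doc in docs) / len(docs)
        for year, docs in docs_by_year.items()
    }
    low, high = min(freq.values()), max(freq.values())
    span = high - low
    # Min-max scaling; a constant series is mapped to 0.0.
    return {
        year: (value - low) / span if span else 0.0
        for year, value in sorted(freq.items())
    }

# Toy data, for illustration only.
docs_by_year = {
    1980: [["oracle"], ["python"]],
    1985: [["python"], ["oracle"], ["sql"], ["security"]],
    1990: [["python"], ["python", "matlab"]],
}
series = normalized_term_frequency(docs_by_year, "python")
```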

Associations between skills and occupations

  • Without using the relationships defined by the ESCO model, we have aggregated our data so that they form “baskets” to which market basket analysis can be applied, with basket “items” being a user’s latest job (mapped to ISCO 3) and each one of the skills matched through our information retrieval algorithm.
  • Specifically, we have applied association rules mining, which defines three specific metrics to quantify the relationships between items on each basket:
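    • \(Support(A \Rightarrow B) = T(A \Rightarrow B) = \frac{\left | \{ t \in T : A \cup B \subseteq t \} \right |}{\left | T \right |}\), where \(T\) inside the fraction is the set of all baskets and \(T(A)\) denotes the support of \(A\) alone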
    • \(Confidence(A \Rightarrow B) = C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)}\)
    • \(Lift(A \Rightarrow B) = L(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A) T(B)}\)
  • Setting thresholds \(Support > 0.0001\), \(Confidence > 0.03\) and \(Lift < 20\) we report on the ISCO 3 and ESCO skill associations that display the highest lift.
  • Skill necessity for an ESCO occupation has been extrapolated to the ISCO 3 level, with skills being marked as essential or optional if they are essential or optional for at least one ESCO occupation of the respective ISCO 3 group, or undetermined otherwise.
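
The three metrics can be computed directly from their definitions. The sketch below assumes each basket is a Python set holding one occupation label and the matched skills; the labels and the function name are illustrative, and this is not the implementation used for the report.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent.

    baskets: list of sets of items (hypothetical format).
    Assumes both items occur in at least one basket.
    """
    n = len(baskets)
    support_a = sum(antecedent in b for b in baskets) / n
    support_b = sum(consequent in b for b in baskets) / n
    support_ab = sum(antecedent in b and consequent in b for b in baskets) / n
    confidence = support_ab / support_a
    lift = support_ab / (support_a * support_b)
    return support_ab, confidence, lift

# Toy data, for illustration only.
baskets = [
    {"Cooks", "skill_a"},
    {"Cooks", "skill_a"},
    {"Cooks"},
    {"Waiters and bartenders"},
]
support, confidence, lift = rule_metrics(baskets, "Cooks", "skill_a")
```

A lift above 1 means the occupation and the skill appear together more often than they would if they were independent.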

[Figure: ISCO 3 and ESCO skill associations with the highest lift. Panels: Service and sales workers, Professionals, Technicians and associate professionals, Elementary occupations, Clerical support workers, Managers.]

  • The results are compared with the ESCO taxonomy, where skills are tagged as essential or optional for the respective occupation. Note that new meaningful associations that are not encoded in the ESCO graph data model emerge when sorting association rules by lift.
  • Examples of new ISCO 3 and ESCO skill pairs include Cooks - instruct kitchen personnel, Nursing and midwifery professionals - communicate with nursing staff, and Domestic, hotel and office cleaners and helpers - conduct cleaning tasks.

Methodology

Information retrieval

  • The data processing pipeline developed for the Europass CV Insights Report has been used for information retrieval. The same general methodology was followed, with slight modifications in order to support the large volume of data available in this analysis.
  • Specifically, CVs are initially pulled from the database backup and a deduplication process is applied to them based on the email hash attached to each entry. This ensures that only the latest CV entry of each unique user is used.
  • Subsequently, data are split into around 40 batches of roughly 300,000 CVs, and each batch is processed independently until the data aggregation step. This has allowed us to reuse the same codebase, which was optimized for processing speed, despite the memory restrictions introduced by the increase in data volume.
  • In instances where live data reading was not required, the binary file output has been changed to the ‘qs’ format, which uses Facebook’s ‘zstd’ compression library and allows fast read and write times while providing compression comparable to that of the ‘rds’ format.
  • For details regarding the information retrieval methodology itself, please refer to Europass CV Insights Report: Methodology.
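
The deduplication and batching steps can be sketched as follows. This is a simplified Python illustration; the record keys `email_hash` and `updated` and both function names are hypothetical, since the database schema is not described here.

```python
def latest_cv_per_user(cvs):
    """Keep only the most recent CV for each email hash.

    cvs: iterable of dicts with the hypothetical keys
    'email_hash' and 'updated' (any comparable timestamp).
    """
    latest = {}
    for cv in cvs:
        key = cv["email_hash"]
        if key not in latest or cv["updated"] > latest[key]["updated"]:
            latest[key] = cv
    return list(latest.values())

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy data, for illustration only.
cvs = [
    {"email_hash": "a1", "updated": "2019-03-01"},
    {"email_hash": "b2", "updated": "2018-07-15"},
    {"email_hash": "a1", "updated": "2020-01-10"},
]
unique = latest_cv_per_user(cvs)
chunks = list(batches(unique, 1))
```

Each batch can then be processed on its own, with only the aggregated results kept in memory.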

Dataset characteristics

  • The data processing pipeline has resulted in two main datasets, one describing the population demographics and one describing the occupations included on the CVs of this population.
  • Before moving forward with the statistical analysis, it is important to understand and document the characteristics of this dataset in terms of the demographics described by it.
  • The charts below display the distribution of the age, country of residence, and gender as stated by users at the time of their CV creation.

[Figure: distribution of users. Panels: Age, Country of residence, Gender.]

  • It can be observed that the age distribution of users is highly skewed towards younger ages. The mean age of users is 29. Among the 47% of users that stated their year of birth, 72% are under 32 years old.
  • 95% of users stated their country of residence. Among those, 85% live in a country among the 27 member states of the European Union (EU-27). Users from Italy and Portugal make up 42% of the user base of Europass.
  • The two genders are relatively well-balanced, with male users being slightly more numerous than female ones among the 42% of users that stated their gender.
  • Given the above, we have opted to focus our analysis on users from countries in the European Union with ages between 15 and 49, for a total of 4.0M users.
  • As detailed in the weighting section, the European average chosen for the purpose of reporting is the EA-19, pertaining to countries in the Euro area.

Weighting

  • Age and country of residence strongly affect the detection and quantification of trends and correlations related to occupations, which is the goal of the analysis. Because of the imbalances documented, data weighting was deemed necessary in order to make the data more representative of the general population.
  • Weighting is a statistical technique in which data are adjusted to be brought more in line with the population being studied. While typically employed by surveys, weighting processes can also be applied to data pulled from databases as an attempt to correct data gaps. It is an important step of an analysis, as it ensures the target population is fairly and equally represented in the results.
  • The weighting procedure employed is the iterative proportional fitting procedure (IPFP), an algorithm used in many different fields such as economics and social sciences. Through IPFP, the given distribution is updated with respect to given target marginal distributions. The R implementation of IPFP in the package mipfp was utilized.
  • More specifically, we have focused on adjusting age, so that it is less skewed towards the younger population, and country of residence, so that we can derive a less biased European average. We have opted to use the official demographic statistics of Europe as published by Eurostat for our target marginal distributions.
  • A snapshot of the observed (Europass database) and target (Eurostat) distributions can be seen below.

[Figure: observed (Europass database) and target (Eurostat) distributions. Panels: Age, Country.]
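
To make the idea of iterative proportional fitting concrete, here is a minimal two-dimensional sketch in plain Python. It is not the mipfp implementation used for the report; the function name, the table layout (rows as age groups, columns as countries) and the toy numbers are assumptions.

```python
def ipf(table, row_targets, col_targets, iterations=100, tol=1e-9):
    """Rescale a 2-D table until its margins match the targets.

    table: list of rows with positive counts.
    row_targets and col_targets must have the same total.
    """
    fitted = [row[:] for row in table]
    for _ in range(iterations):
        # Step 1: scale each row to its target total.
        for i, row in enumerate(fitted):
            factor = row_targets[i] / sum(row)
            fitted[i] = [value * factor for value in row]
        # Step 2: scale each column to its target total.
        for j, target in enumerate(col_targets):
            factor = target / sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= factor
        # Stop once the row totals also hold after the column step.
        error = max(abs(sum(row) - t) for row, t in zip(fitted, row_targets))
        if error < tol:
            break
    return fitted

# Toy data, for illustration only.
observed = [[40.0, 10.0], [30.0, 20.0]]
fitted = ipf(observed, row_targets=[60.0, 40.0], col_targets=[50.0, 50.0])
```

Dividing each fitted cell by the corresponding observed cell gives a weight per cell, which is the quantity the weighting scheme works with.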

  • The weighting procedure is applied initially for each country individually with respect to age, with the goal of adjusting the age distribution of each country. Subsequently, a separate weighting scheme has been derived for each potential European average. Specifically, we have adjusted the distribution of countries and ages in order to derive EA-19 (Euro area), EU-27 and EU-28 (including the UK) averages.
  • The charts below document not only the derived weighting scheme applied to the dataset, but also the inherent imbalances present in it, as a higher weight implies the underrepresentation of a particular group, while a lower one implies overrepresentation.
  • To avoid adding bias through weighting, we have restricted the derived weights to be between 0.35 and 3, as indicated by the cyan and white lines respectively on the charts.

[Figure: derived weights with the 0.35 and 3 limits marked. Panels: EA-19, EU-27, EU-28, Countries.]

  • It can be observed that some of the 27 member states of the European Union are heavily underrepresented compared to others. Specifically, users from Denmark, Poland and France are few, while users from Sweden, the Netherlands, Germany and Ireland are also limited, especially for older age groups. Users from Southern Europe are better represented.
  • We have elected to use countries from the EA-19 as our default European average in order to limit the number of especially underrepresented countries participating in the mean.
  • Following the weighting procedure, weights are applied to the occupations dataset. More concretely, the distribution of occupations with respect to the variables measured is adjusted by multiplying the frequency of each disaggregation by its respective weight based on the country and age group observed.
  • Please note that, as weights have been limited to values between 0.35 and 3, biases with respect to country and age have not been eliminated entirely, so it is important to remember that metrics applying to the EA-19 are still subject to biases inherent to the dataset.
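
The trimming of weights and their application to occupation frequencies can be sketched as below. For simplicity the weights are taken as the ratio of target share to observed share per cell, which is an assumption made for illustration rather than the derivation used in the report; the cell labels and function names are also hypothetical.

```python
def trimmed_weights(observed, target, lower=0.35, upper=3.0):
    """Weight per cell = target share / observed share, clipped to [lower, upper].

    observed, target: {cell: count}, e.g. cell = age group (hypothetical).
    """
    observed_total = sum(observed.values())
    target_total = sum(target.values())
    weights = {}
    for cell, count in observed.items():
        raw = (target[cell] / target_total) / (count / observed_total)
        weights[cell] = min(max(raw, lower), upper)
    return weights

def weighted_frequencies(counts, weights):
    """counts: {(cell, occupation): frequency} -> weighted frequencies."""
    return {key: freq * weights[key[0]] for key, freq in counts.items()}

# Toy data, for illustration only.
observed = {"15-24": 600, "25-34": 300, "35-49": 100}
target = {"15-24": 200, "25-34": 300, "35-49": 500}
weights = trimmed_weights(observed, target)
adjusted = weighted_frequencies({("35-49", "Managers"): 10}, weights)
```

In this toy case the oldest group would need a weight of 5 to match the target, but the cap of 3 leaves it underrepresented, which is the residual bias noted above.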

Regression analysis

  • In order to identify and measure trends, we first need to define the potential time series that emerge from the data. To do this, we utilize temporal variables in the dataset along with the occupation-related variables, such as job position on the ISCO taxonomy.
  • The graphs below show the distribution of reported recruitments and terminations in the aggregated occupations dataset. The number for each year is inferred from the starting date of each reported work experience.

[Figure: distribution of reported recruitments and terminations. Panels: Recruitment Year, Termination Year.]

  • We have elected to use recruitment year to report on what percentage of total recruitments each specific ISCO occupation represents each year. This measurement is based on the labour force history of each Europass user, including past and present job positions mentioned on each CV. The years between 2000 and 2019 were selected for a total of 10.0M work experiences.
  • As an example, the graph below displays the time series of the top 5 ISCO 2 and ISCO 3 occupations with respect to recruitment year. Year is shown on the x-axis, while the percentage of total recruitments each occupation represents for that year is on the y-axis. Time series for both weighted and unweighted data are shown.

[Figure: time series of the top 5 occupations by recruitment year. Panels: ISCO 3 (Weighted), ISCO 3 (Raw), ISCO 2 (Weighted), ISCO 2 (Raw).]

  • Qualitatively, it can be observed that occupations such as Waiters and Bartenders display an overall upwards trend, while others like Administrative and specialized secretaries are trending downwards. This is generally consistent between both weighted and unweighted data, as the weighting procedure has mostly affected the relative percentage between occupations, with only slight adjustments to the distributions’ shape.
  • In order to quantify trends, we have applied regression analysis to our data by fitting generalized linear models (GLM) to each breakdown of interest. Specifically, we have utilized the R function glm (part of the stats package), which estimates the coefficients \(\beta\) through maximum likelihood estimation (MLE).
  • GLMs can be considered a generalization of ordinary linear regression that does not make the assumption that each observation necessarily comes from a normal distribution, \(y_{i} \sim \mathit{N}\left (\mu_{i}, \sigma^{2} \right )\). Equivalently, this means that the error distribution of the response variable does not have to be the normal distribution.
  • A binomial distribution was assumed for our data, with:
    • Logit serving as the link function, \(estimate \times year + intercept = \ln \left ( \frac{y }{n - y} \right )\)
    • Mean function, \(\frac{y}{n} = \frac{1}{1 + e^{-(estimate \times year + intercept)}}\)
  • The process was repeated for both weighted and unweighted data. The results presented for the context of this report refer to trend estimations made for the weighted data, a sample of which is displayed below for EA-19.

| ISCO 1 | Country | Estimate | Intercept | p-value | Deviance | Degrees of freedom |
|---|---|---|---|---|---|---|
| Armed forces occupations | EA19 | -0.0645050 | 124.83591 | 0 | 314.8306 | 18 |
| Clerical support workers | EA19 | -0.0101363 | 17.46122 | 0 | 791.1704 | 18 |
| Craft and related trades workers | EA19 | -0.0283781 | 54.40370 | 0 | 4332.7565 | 18 |
| Elementary occupations | EA19 | 0.0114642 | -26.22486 | 0 | 1213.4496 | 18 |
| Managers | EA19 | -0.0071675 | 12.26817 | 0 | 516.7734 | 18 |
| Plant and machine operators and assemblers | EA19 | -0.0126231 | 22.21469 | 0 | 3753.0582 | 18 |
| Professionals | EA19 | 0.0108941 | -22.66026 | 0 | 7412.2678 | 18 |
| Service and sales workers | EA19 | 0.0155061 | -32.89404 | 0 | 1382.9993 | 18 |
| Skilled agricultural, forestry and fishery workers | EA19 | 0.0040978 | -12.94743 | 0 | 242.5014 | 18 |
| Technicians and associate professionals | EA19 | -0.0056631 | 10.04493 | 0 | 261.7392 | 18 |

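The estimate statistic is the change in the log-odds for one unit of time (1 year): \(odds \propto \mathit{e}^{estimate \times year}\), so the odds are multiplied by \(\mathit{e}^{estimate}\) each year. For small values this is approximately a change of \(100 \times estimate\) percent per year. Here \(odds\) is defined as the ratio of the number of events that produce that outcome to the number that do not.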
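
The binomial model with logit link can be illustrated with a small maximum-likelihood fit. The sketch below uses Newton-Raphson in plain Python instead of R's glm; the function name is hypothetical, and the yearly counts are synthetic, generated from the Service and sales workers coefficients in the table purely to check that the fit recovers them.

```python
import math

def fit_binomial_logit(years, successes, totals, iterations=25):
    """Fit ln(y / (n - y)) = estimate * year + intercept by Newton-Raphson."""
    center = sum(years) / len(years)   # centre the years for numerical stability
    xs = [year - center for year in years]
    b0, b1 = 0.0, 0.0                  # intercept and slope on the centred scale
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in zip(xs, successes, totals):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))   # mean function
            w = n * p * (1.0 - p)                        # binomial variance
            g0 += y - n * p                              # score, intercept
            g1 += (y - n * p) * x                        # score, slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-12:
            break
    # Convert the intercept back to the original (uncentred) year scale.
    return b1, b0 - b1 * center

# Synthetic data, for illustration only: 100,000 recruitments per year.
years = list(range(2000, 2020))
totals = [100000.0] * len(years)
successes = [
    n / (1.0 + math.exp(-(0.0155061 * year - 32.89404)))
    for year, n in zip(years, totals)
]
estimate, intercept = fit_binomial_logit(years, successes, totals)
yearly_odds_multiplier = math.exp(estimate)
```

With an estimate of 0.0155061, the odds grow by roughly 1.56% per year, close to the approximation of \(100 \times estimate\) percent.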
