Occupations Analysis, Trends and Correlations
Abstract
Using naturally occurring labour market data aggregated from over 10M users of the Europass CV editor application, we produce a report with insights on: a) disappearing and newly emerging occupations, b) changing skill requirements for job positions, c) usual career paths, d) skills-to-occupations associations in the ESCO model compared to the collected CV data.
Introduction
The Europass CV editor is an online application that allows users to create, store and share their curriculum vitae. It is part of the Europass initiative, which aims at increasing the transparency of qualifications and the mobility of citizens in Europe. This report serves as a continuation of Cedefop’s previous Europass CV Insights Report, which provided descriptive statistics for users of the application between June and September 2019. Unlike the previous analysis, which was conducted on data gathered from an opt-in user survey of around 400K users, this project uses the backup Europass database, which includes anonymised data for all users of Europass between Q1 2017 and Q2 2020, for a total of 10M CVs.
The data processing pipeline developed in the previous phase of the project was employed to acquire two main standardized datasets. The first one encodes the users’ characteristics (e.g. gender, birth year, country, latest job), and the other their work experiences (e.g. recruitment and termination year). The data cleansing and standardization process is based on the ESCO taxonomy model, as well as custom groups (e.g. age groups) that allow comparison with other indicators of official statistics. The algorithm used to standardize multilingual free-text of work experiences is based on document vectorization and a modified nearest-neighbours classifier using the ESCO/ISCO hierarchy. See also the published classifier on CRAN, labourR.
After deduplication, we proceed to a weighting analysis by comparing the demographic characteristics of the Europass users with the general population as reported by Eurostat. The weighting process is performed to gain feedback on the biases of the dataset, and weighted results are compared to their unweighted counterparts, as well as to sources of official statistics. In addition, we mainly use the countries of the Euro area (EA-19) as a representative average, since this is the group whose weights are least extreme. Users aged between 15 and 49 have been considered, while trend analysis was performed for recruitments between 2000 and 2019.
Occupation trends
Distribution of occupations
- Users creating their CV via the Europass application report their previous and current work experiences, along with details such as the recruitment year. As detailed in the methodology section, the target population of the analysis has been determined to be users from countries in the European Union, with a European average being determined by users in the EA-19 specifically. The age range of the users considered is between 15 and 49.
- The graph that follows reports on the unweighted distribution of occupations for recruitments reported between 2000 and 2019 for users included in the analysis. Results are presented with respect to ISCO level 1, 2 and 3.
Figure: unweighted distribution of occupations for recruitments between 2000 and 2019 (panels: ISCO 1, ISCO 2, ISCO 3).
- Occupations that are part of the ISCO 1 groups Professionals and Technicians and associate professionals are the most frequently reported ones. More specifically, Business and administration, Teaching, and Legal, social and cultural professionals and associate professionals are the most common subcategories at the ISCO 2 level.
- Despite this, the specific professions reported most frequently at the ISCO 3 level are those related to service and sales, such as Waiters and bartenders and Shop salespersons. This is expected, as the Europass user base is relatively young and thus includes many people who have recently been students.
Trends in job sectors for EA19
- The regression analysis we performed has given us a measure of how job groups of the ISCO taxonomy are changing in time.
- We report on the sub-major and minor groups of the ISCO taxonomy (hereafter referred to as ISCO 2 and ISCO 3 respectively) that display the biggest odds ratio increase or decrease in recruitments for one unit of time on the interval between 2000 and 2019.
- The time series displayed show the EA19 average of the percentage of recruitments per year each ISCO group represents.
Figure: EA19 average share of recruitments per year for the ISCO groups with the strongest trends (panels: Positive, Negative and Neutral, for ISCO 2 and ISCO 3).
- The yearly distribution of recruitments per ISCO group changes throughout the time period studied, i.e. the years between 2000 and 2019. For countries in the EA-19, the presence of some occupations (e.g. ISCO 2 groups Food preparation assistants, Personal care workers, Health professionals, and Personal service workers) increases, while the presence of others decreases (e.g. ISCO 2 groups Commissioned armed forces officers, Building and related trades workers, and General and keyboard clerks).
- Recruitments for entry-level jobs such as Personal service workers and Food preparation assistants tend to be reported in more recent years. One likely explanation is that as a user’s career advances and they move to different jobs, they are less likely to include their early work experiences on their CV.
- The measured trends may indicate actual trends in the labour market or differences in the patterns of behaviour of individuals working across the different ISCO groups when it comes to CV creation.
Highest degree of deviation in trend per country
- As part of our regression analysis, we have derived estimates for how ISCO groups change in time across every country in the EA19.
- For models with \(p < 0.05\) and sample \(N > 3000\) we derive the odds ratio change \((e^{estimate} - 1) \times 100\%\).
- Subsequently, we report on the ISCO 2 groups per country that display the highest absolute deviation in odds ratio change per year from the EA19 average.
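Figure: ISCO 2 groups with the highest absolute deviation in odds ratio change per year from the EA19 average (panels by country: IT, PT, ES, RO, EL, HR, DE, FR, HU, SI).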
- Countries that rely heavily on tourism, such as Greece, Croatia, Portugal and Spain, display more positive trends in professions related to tourism, such as Personal service workers and Food preparation assistants, compared to the EA-19 average. Professions related to industry, such as Assemblers and Manufacturing labourers, have an increased presence in Germany.
- In a period of relative stability, such as the one studied in this analysis, it is anticipated that the core industries of each country will exhibit growth as opposed to decline.
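The odds ratio change used in this section, \((e^{estimate} - 1) \times 100\%\), and the model filter \(p < 0.05\), \(N > 3000\) can be illustrated with a minimal Python sketch. The function names and the numeric estimates below are hypothetical and serve only as an illustration; they are not taken from the authors' R code or from any specific country model.

```python
import math

def odds_ratio_change_pct(estimate: float) -> float:
    """Percentage change in the odds per one-year step: (e^estimate - 1) * 100."""
    return (math.exp(estimate) - 1.0) * 100.0

def is_reportable(p_value: float, n: int) -> bool:
    """Filter stated in the text: keep models with p < 0.05 and sample N > 3000."""
    return p_value < 0.05 and n > 3000

# Hypothetical slope estimates for one ISCO 2 group (illustration only).
country_estimate = 0.030
ea19_estimate = 0.012

# Deviation of a country's yearly odds ratio change from the EA19 average.
deviation = odds_ratio_change_pct(country_estimate) - odds_ratio_change_pct(ea19_estimate)
print(f"deviation from EA19: {deviation:.2f} percentage points per year")
```

Ranking groups by the absolute value of this deviation within each country gives the kind of ordering described above.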
Net hire ratio in time
- With each CV including the end date of each work experience as well as the start date, it is possible to measure the relationship between job recruitments and terminations per year. This relationship is recorded in the so-called net hire ratio, which is defined as \(Net \ Hire \ Ratio = \frac{Total\ New\ Recruitments}{Total \ Terminations}\).
- We do not intend to measure the actual net hire ratio of the market, as this would require further calibration of the dataset, but we refer to the equivalent measurement in the context of the Europass users.
- We focus our reporting on the net hire ratio with respect to the different countries and ISCO 1 groups. More specifically, changes in the net hire ratio between 2000 and 2019 are displayed for all ISCO 1 groups, and the top 12 countries with the most recruitments and terminations reported.
- The period of data collection, starting in 2016, is denoted by the dashed line, while the EA-19 average is included for comparison.
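Figure: net hire ratio between 2000 and 2019, with the start of data collection marked by a dashed line (panels: ISCO 1 (EA19), Country).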
- For the period before the data collection begins (2016), more recruitments than terminations are generally reported by Europass users, thus bringing the net hire ratio’s value over 1.0. This is anticipated, as the period studied is generally characterized by relative stability and thus growth for most countries in Europe.
- It can be observed that the net hire ratio tends to be higher for years further away from the beginning of the data collection. This effect may partially be related to the nature of the dataset, which tends to include people searching for jobs and who are thus more likely to report their more recently completed work experiences.
- For the years after the start of the data collection, the net hire ratio starts to dip below 1.0, which may once again be explained by the fact that unemployed people currently searching for a job are more likely to use the application than employed ones who have recently started a new job.
- Comparing each distinct line with the EA19 average, phenomena more local to specific countries may be observable in the countries graph. Of note is the more sudden decline of the net hire ratio in countries in the south, such as Greece, Spain and Portugal, between the late 2000s and early 2010s, which may be a reflection of the financial crisis of 2007-2008.
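As a rough illustration of the net hire ratio defined in this section, the sketch below counts recruitments by start year and terminations by end year and divides the two. The input layout (pairs of start and end year, with `None` for ongoing jobs) and the sample records are assumptions made for this example, not the structure of the Europass dataset.

```python
from collections import Counter

def net_hire_ratio_by_year(experiences):
    """experiences: iterable of (start_year, end_year or None) pairs.

    Returns {year: recruitments / terminations} for every year that has
    at least one reported termination.
    """
    hires = Counter()
    exits = Counter()
    for start_year, end_year in experiences:
        hires[start_year] += 1
        if end_year is not None:  # ongoing jobs have no termination yet
            exits[end_year] += 1
    return {year: hires[year] / exits[year] for year in sorted(exits)}

# Toy records (invented for illustration).
sample = [(2010, 2012), (2010, 2011), (2011, 2012), (2011, None), (2012, 2013)]
ratios = net_hire_ratio_by_year(sample)
for year, ratio in ratios.items():
    print(year, ratio)
```

A value above 1.0 for a year means more recruitments than terminations were reported for that year.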
Mean net hire ratio of ISCO 2 groups per country
- It can be observed that the net hire ratio differs across the different ISCO groups and countries. Focusing on ISCO 2 groups, we report on the mean net hire ratio with respect to country.
- In the graphs below, the top 10 ISCO 2 groups displaying the highest and lowest mean net hire ratio for the period between 2000 and 2016 are displayed for the 12 countries with the most recruitments and terminations in the dataset. The EA-19 average has also been included.
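Figure: ISCO 2 groups with the highest and lowest mean net hire ratio between 2000 and 2016 (panels: EA19, IT, PT, ES, RO, EL, HR, DE, FR, HU, SI, MT, AT).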
- Legal, social and cultural professionals, Health professionals and Teaching professionals are the ISCO 2 groups displaying consistent growth across most European countries for the period studied. Generally, professions requiring more specialization or more investment in education are growing.
- Building and related trades workers, Assemblers, and Electrical and electronic trades workers display a decline in growth, meaning that more terminations than new recruitments are reported for the period. Generally, most professions that display a decline in growth seem to be related to manual labour or work that is increasingly being automated.
Career paths
Frequency of group change
- Job transitions reported by Europass users can be between the same or different ISCO groups. We perform a series of statistical transformations and aggregations to gain insight on the frequency of ISCO group change. Specifically, we aggregate the “from-to” sequences of subsequent job positions of all users, and measure the frequency of all transitions between ISCO groups.
- The following barplots report on the frequency of transition from an ISCO group to a different one.
Figure: frequency of transition from an ISCO group to a different one (panels: ISCO 1, ISCO 2).
- When changing jobs, Europass users most commonly start a job that belongs to the same ISCO group as their previous one. Skilled agricultural, forestry and fishery workers and Clerical support workers are the most likely to move to a different ISCO 1 group, while Professionals and Technicians and associate professionals are the least likely.
- More specifically, ISCO 2 groups Health professionals, Information and communications technology professionals, and Teaching professionals are particularly consistent in staying in their respective fields.
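The “from-to” aggregation described in this section can be sketched as follows. This is a simplified illustration that assumes each user's jobs are already mapped to ISCO group labels and sorted chronologically; the function names and the toy careers are hypothetical.

```python
from collections import Counter

def transition_counts(careers):
    """careers: list of per-user lists of ISCO group labels, oldest job first."""
    counts = Counter()
    for jobs in careers:
        # Pair each job with the one that follows it.
        for prev_group, next_group in zip(jobs, jobs[1:]):
            counts[(prev_group, next_group)] += 1
    return counts

def change_rate(counts):
    """Share of transitions out of each group that end in a different group."""
    total = Counter()
    changed = Counter()
    for (src, dst), n in counts.items():
        total[src] += n
        if src != dst:
            changed[src] += n
    return {src: changed[src] / total[src] for src in total}

# Toy careers (invented for illustration).
careers = [
    ["Clerical support workers", "Professionals", "Professionals"],
    ["Professionals", "Managers"],
    ["Clerical support workers", "Clerical support workers"],
]
counts = transition_counts(careers)
rates = change_rate(counts)

# Most common transitions to a different group.
cross_group = Counter({pair: n for pair, n in counts.items() if pair[0] != pair[1]})
print(cross_group.most_common(2))
```

Dropping the same-group pairs, as in the last step, yields the transitions between different groups that a flow diagram would display.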
Most common transitions to different groups
- Certain patterns of transition from one ISCO group to another are more prevalent than others in our data. By excluding transitions to the same group, we acquire correlations between occupations with respect to users’ career paths.
- We report a sample of the 20 most common transitions between ISCO 1 groups and the top 30 between ISCO 2 groups in the form of a sankey plot. Note that these are the most common sequences of occupations, and not necessarily the most common occupations.
Figure: Sankey plot of the most common transitions between different ISCO groups (panels: ISCO 1, ISCO 2).
- When starting a job that belongs to a different ISCO 1 group than their previous one, users most commonly become Professionals or Technicians and associate professionals.
- Professionals themselves are most likely to transition into Managers or Technicians and associate professionals.
- Transitions between related ISCO 2 groups (eg. Business and administration professionals to Administrative and commercial managers, or Legal, social and cultural professionals to Teaching professionals) are common.
Work experience by age
- Age is naturally related to a person’s years of work experience. To quantify this relationship in our sample, we have measured each Europass user’s cumulative work experience and their age at the time they began each work experience they reported.
- In the subsequent plot, the dashed line is derived from a linear model fit on all data points of cumulative work experience and age in the dataset. Additionally, the circular points shown correspond to the placement of each individual ISCO group based on the respective mean age and mean years of work experience of the users professing it. Note that the size of each circle corresponds to the sample size.
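Figure: mean years of work experience against mean age per ISCO group, with the linear fit shown as a dashed line (panels: ISCO 1, ISCO 2, ISCO 3).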
- Three main clusters of ISCO 3 groups can be noticed based on the relationship between age and years of work experience: a) the main cluster composed mainly of Professionals and Technicians and associate professionals who have an average mean age and work experience, b) a dense cluster composed mainly of occupations such as Waiters and bartenders and Shop salespersons, and c) a more sparse cluster composed of occupations generally professed by older individuals, such as Managing directors and chief executives and Medical doctors, but also Heavy truck and bus drivers and Refuse workers.
- The first cluster closely follows the main sequence of occupations noted by the dashed line and the majority of occupations belong in it. These occupations gather work experience at an average rate as the age of the people professing them increases.
- The second cluster includes many entry-level jobs more commonly selected by younger individuals. People who choose these occupations enter the job market early and tend to accumulate work experience faster than the rest.
- The third cluster displays more variability with respect to work experience. Some of these jobs, such as Medical doctors, mandate late entry into the workforce due to the years of education required. Others, such as those in the ISCO 1 group Managers, require a lot of experience and specialization. Finally, users in jobs of the ISCO 1 group Elementary occupations accumulate work experience more slowly, likely because such jobs are more occasional and interspersed with periods of unemployment.
Deviation of observed and expected work experience
- The relationship between age and years of work experience is different for each ISCO group. Using the residuals of the regression line we fit on our data, we derive the mean residual of our data from the regression line by ISCO group.
- We report the 99.9% confidence interval of the ISCO groups with the largest sample size. The mean age is mapped on the colour gradient.
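Figure: mean residual from the regression line per ISCO group with 99.9% confidence intervals, coloured by mean age (panels: ISCO 1, ISCO 2, ISCO 3).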
- Users professing jobs that belong in the ISCO 1 groups Professionals and Technicians and associate professionals have roughly as many years of work experience as expected based on the main sequence.
- Users in professions that require a lot of training and education before entering the job market have less work experience than anticipated, likely due to late entry into the labour force.
- Users in administrative and managerial jobs have more experience than anticipated, likely indicating that work experience must be accumulated before users are able to move to these occupations.
- The observed work experience for entry-level jobs such as Service and sales workers is higher than the expected value, while the opposite is true for Elementary occupations.
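The deviation measure in this section can be sketched in a few lines: fit one regression line of work experience on age over all observations, then average the residuals per ISCO group. The example below is an approximation under stated assumptions: it uses ordinary least squares and a normal-approximation confidence interval, since the report does not say how its 99.9% intervals were computed, and the data are invented.

```python
from statistics import NormalDist, mean, stdev

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_residual_ci(ages, experience, groups, level=0.999):
    """Mean residual per group with a normal-approximation confidence interval."""
    slope, intercept = fit_line(ages, experience)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    residuals = {}
    for age, exp_years, group in zip(ages, experience, groups):
        residuals.setdefault(group, []).append(exp_years - (slope * age + intercept))
    result = {}
    for group, res in residuals.items():
        m = mean(res)
        # The interval is undefined for a single observation.
        half = z * stdev(res) / len(res) ** 0.5 if len(res) > 1 else float("nan")
        result[group] = (m, m - half, m + half)
    return result

# Toy data (invented): age, cumulative years of experience, group label.
ages = [20, 22, 30, 32, 40, 45]
experience = [2, 3, 5, 9, 12, 20]
groups = ["A", "A", "B", "B", "C", "C"]
result = mean_residual_ci(ages, experience, groups)
for group, (m, low, high) in result.items():
    print(group, round(m, 2), round(low, 2), round(high, 2))
```

A positive mean residual corresponds to a group with more work experience than the regression line predicts for its age, and a negative one to less.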
Occupation and skills
Skills by occupation and birth year
- The database of the Europass application did not preserve the free-text of skills for privacy reasons. For that reason, this part of the report is the outcome of an analysis performed on the free-text of skills in anonymised data collected in a survey in Q2 2019. Note that data weighting was not employed in this case.
- As with occupations, regression analysis was performed for unigrams and ESCO skills with respect to the birth year and the latest job of users reporting them. The subsequent time series report the 12 most upward and downward trending among the 100 most relevant unigrams of each ISCO 1 group. Relevance is measured using a custom metric based on tf-idf. The same exercise is performed for the 250 most relevant ESCO skills of each ISCO 1 group.
- Birth year is displayed on the x-axis, while the term frequency per year, normalized between 0.0 and 1.0 is displayed on the y-axis. Results for the top 4 most commonly included ISCO 1 groups are included.
Figure: normalized term frequency by birth year for the most upward and downward trending terms (panels: Upward (Unigram), Downward (Unigram), Upward (ESCO) and Downward (ESCO), for each of Professionals, Technicians and associate professionals, Service and sales workers, and Managers).
- Keywords like “photoshop”, “python” and “matlab” display a positive trend for ISCO 1 group Professionals with respect to birth year, meaning that they are more commonly reported by younger users compared to older ones, who mention keywords like “oracle”, “application”, and “security”.
- Managerial ESCO skills (eg. project management, draft corporate emails) are more frequently observed among older users, while ESCO skills related to child care and students (eg. assist children with homework, communicate with youth) are more frequently observed among younger people.
Associations between skills and occupations
- Without using the relationships defined by the ESCO model, we have aggregated our data so that they form “baskets” to which market basket analysis can be applied, with the basket “items” being a user’s latest job (mapped to ISCO 3) and each of the skills matched through our information retrieval algorithm.
- Specifically, we have applied association rules mining, which defines three specific metrics to quantify the relationships between items on each basket:
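- \(Support(A \Rightarrow B) = T(A \Rightarrow B) = \frac{\left | \{ t \in T : A \cup B \subseteq t \} \right |}{\left | T \right |}\), where \(T\) is the set of all baskets and \(T(A)\) denotes the share of baskets containing \(A\)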
- \(Confidence(A \Rightarrow B) = C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)}\)
- \(Lift(A \Rightarrow B) = L(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A) T(B)}\)
- Setting thresholds \(Support > 0.0001\), \(Confidence > 0.03\) and \(Lift < 20\) we report on the ISCO 3 and ESCO skill associations that display the highest lift.
- Skill necessity for an ESCO occupation has been extrapolated to the ISCO 3 level, with skills being marked as essential or optional if they are essential or optional for at least one ESCO occupation of the respective ISCO 3 group, or undetermined otherwise.
Figure: ISCO 3 and ESCO skill associations with the highest lift (panels: Service and sales workers, Professionals, Technicians and associate professionals, Elementary occupations, Clerical support workers, Managers).
- The results are compared with the ESCO taxonomy, where skills are tagged as essential or optional for the respective occupation. Note that new meaningful associations that are not encoded in the ESCO graph data model emerge when sorting association rules by lift.
- Examples of new ISCO 3 and ESCO skill pairs include Cooks - instruct kitchen personnel, Nursing and midwifery professionals - communicate with nursing staff, and Domestic, hotel and office cleaners and helpers - conduct cleaning tasks.
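The three association rule metrics defined in this section can be computed directly from a list of baskets. The sketch below is a minimal illustration with invented baskets; it handles single-item rules only and does not reproduce the authors' mining pipeline or thresholds.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent => consequent.

    baskets: list of sets, each holding one occupation and its matched skills.
    """
    n = len(baskets)
    count_a = sum(1 for basket in baskets if antecedent in basket)
    count_b = sum(1 for basket in baskets if consequent in basket)
    count_ab = sum(1 for basket in baskets if antecedent in basket and consequent in basket)
    support = count_ab / n
    confidence = count_ab / count_a if count_a else 0.0
    # Lift compares the joint share with what independence would predict.
    lift = support / ((count_a / n) * (count_b / n)) if count_a and count_b else 0.0
    return support, confidence, lift

# Toy baskets (invented counts; labels borrowed from the examples above).
baskets = [
    {"Cooks", "instruct kitchen personnel"},
    {"Cooks", "instruct kitchen personnel"},
    {"Cooks"},
    {"Waiters and bartenders"},
]
support, confidence, lift = rule_metrics(baskets, "Cooks", "instruct kitchen personnel")
print(support, confidence, lift)
```

A lift above 1 indicates that the skill appears together with the occupation more often than would be expected if the two were independent.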
Methodology
Information retrieval
- The data processing pipeline developed for the Europass CV Insights Report has been used for information retrieval. The same general methodology was followed, with slight modifications in order to support the large volume of data available in this analysis.
- Specifically, CVs are initially pulled from the database backup and a deduplication process is applied to them based on the email hash attached to each entry. This ensures that only the latest CV entry of each unique user is used.
- Subsequently, data are split into around 40 batches of roughly 300,000 CVs, and each batch is processed independently until the data aggregation step. This has allowed us to utilize the same codebase that was optimized for speed of processing despite the memory restrictions introduced by the increase of data volume.
- In cases where live data reading was not required, the binary file output has been changed to the ‘qs’ format, which uses Facebook’s ‘zstd’ library and allows fast read and write times while providing compression comparable to that of the ‘rds’ format.
- For details regarding the information retrieval methodology itself, please refer to Europass CV Insights Report: Methodology.
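The deduplication and batching steps described in this section can be sketched as follows. The record layout is an assumption: the field names `email_hash` and `updated_at` are hypothetical, as the report does not describe the database schema, and the batch size is reduced for the example.

```python
def latest_per_user(cvs):
    """Keep only the most recent CV for each email hash.

    cvs: iterable of dicts with the hypothetical keys 'email_hash' and
    'updated_at' (ISO date strings, which sort chronologically).
    """
    latest = {}
    for cv in cvs:
        key = cv["email_hash"]
        if key not in latest or cv["updated_at"] > latest[key]["updated_at"]:
            latest[key] = cv
    return list(latest.values())

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy records (invented for illustration).
cvs = [
    {"email_hash": "a", "updated_at": "2019-01-01"},
    {"email_hash": "a", "updated_at": "2020-03-01"},
    {"email_hash": "b", "updated_at": "2018-06-01"},
]
unique = latest_per_user(cvs)
for batch in batches(unique, 1):
    print(batch)  # each batch would be processed independently
```

Processing one batch at a time keeps peak memory bounded by the batch size rather than by the full dataset.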
Dataset characteristics
- The data processing pipeline has resulted in two main datasets, one describing the population demographics and one describing the occupations included on the CVs of this population.
- Before moving forward with the statistical analysis, it is important to understand and document the characteristics of this dataset in terms of the demographics described by it.
- The charts below display the distribution of the age, country of residence, and gender as stated by users at the time of their CV creation.
Figure: distribution of users as stated at the time of CV creation (panels: Age, Country of residence, Gender).
- It can be observed that the age distribution of users is highly skewed towards younger ages. The mean age of users is 29. Among the 47% of users that stated their year of birth, 72% are under 32 years old.
- 95% of users stated their country of residence. Among those, 85% live in a country among the 27 member states of the European Union (EU-27). Users from Italy and Portugal make up 42% of the user base of Europass.
- The two genders are relatively well-balanced, with male users being slightly more than female ones among the 42% of users that stated their gender.
- Given the above, we have opted to focus our analysis on users from countries in the European Union with ages between 15 and 49, for a total of 4.0M users.
- As detailed in the weighting section, the European average chosen for the purpose of reporting is the EA-19, pertaining to countries in the Euro area.
Weighting
- Age and country of residence are strongly related to the occupations users report, and therefore to the trends and correlations that this analysis aims to detect and quantify. Because of the imbalances documented, data weighting was deemed necessary in order to make the data more representative of the general population.
- Weighting is a statistical technique in which data are adjusted to be brought more in line with the population being studied. While typically employed by surveys, weighting processes can also be applied to data pulled from databases as an attempt to correct data gaps. It is an important step of an analysis, as it ensures the target population is fairly and equally represented in the results.
- The weighting procedure employed is the iterative proportional fitting procedure (IPFP), an algorithm used in many fields, such as economics and the social sciences. Through IPFP, the observed distribution is updated so that it matches given target marginal distributions. The R implementation of IPFP in the package mipfp was utilized.
- More specifically, we have focused on adjusting age, so that it is less skewed towards the younger population, and country of residence, so that we can derive a less biased European average. We have opted to use the official demographic statistics of Europe as published by Eurostat for our target marginal distributions.
- A snapshot of the observed (Europass database) and target (Eurostat) distributions can be seen below.
Figure: observed (Europass database) and target (Eurostat) distributions (panels: Age, Country).
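To make the idea of iterative proportional fitting concrete, the sketch below fits a small two-way table of counts to target row and column margins. It is a plain Python illustration of the general algorithm, not the mipfp implementation used by the authors, and the table and targets are invented.

```python
def ipf(table, row_targets, col_targets, iterations=100, tol=1e-9):
    """Scale a 2-D table of counts so that its margins match the targets.

    The two target margins must sum to the same total.
    """
    fitted = [row[:] for row in table]
    for _ in range(iterations):
        # Row step: rescale each row to its target total.
        for i, row in enumerate(fitted):
            row_sum = sum(row)
            if row_sum > 0:
                fitted[i] = [value * row_targets[i] / row_sum for value in row]
        # Column step: rescale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in fitted)
            if col_sum > 0:
                for row in fitted:
                    row[j] *= target / col_sum
        # Columns now match exactly; stop once the rows match as well.
        if max(abs(sum(row) - t) for row, t in zip(fitted, row_targets)) < tol:
            break
    return fitted

# Toy example: rows are two age groups, columns are two countries.
observed = [[60.0, 20.0], [10.0, 10.0]]
fitted = ipf(observed, row_targets=[50.0, 50.0], col_targets=[40.0, 60.0])

# Cell weights: how much each observed cell is scaled up or down.
weights = [[f / o for f, o in zip(f_row, o_row)] for f_row, o_row in zip(fitted, observed)]
print(weights)
```

The ratio of fitted to observed counts in each cell plays the role of a weight: values above 1 mark underrepresented cells and values below 1 overrepresented ones.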
- The weighting procedure is applied initially for each country individually with respect to age, with the goal of adjusting the age distribution of each country. Subsequently, a separate weighting scheme has been derived for each potential European average. Specifically, we have adjusted the distribution of countries and ages in order to derive EA-19 (Euro area), EU-27 and EU-28 (including the UK) averages.
- The charts below document not only the derived weighting scheme applied to the dataset, but also the inherent imbalances present in it, as a higher weight implies the underrepresentation of a particular group, while a lower one implies overrepresentation.
- To avoid adding bias through weighting, we have restricted the derived weights to be between 0.35 and 3, as indicated by the cyan and white lines respectively on the charts.
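Figure: derived weights, with the limits of 0.35 and 3 marked by the cyan and white lines (panels: EA-19, EU-27, EU-28, Countries).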
- It can be observed that some of the 27 member states of the European Union are heavily underrepresented compared to others. Specifically, users from Denmark, Poland and France are few, while users from Sweden, the Netherlands, Germany and Ireland are also limited, especially for older age groups. Users from Southern Europe are better represented.
- We have elected to use countries from the EA-19 as our default European average in order to limit the number of especially underrepresented countries participating in the mean.
- Following the weighting procedure, weights are applied to the occupations dataset. More concretely, the distribution of occupations with respect to the variables measured is adjusted by multiplying the frequency of each disaggregation with its respective weight based on the country and age group observed.
- Please note that as weighting has been limited to be between 0.35 and 3, biases with respect to country and age have not been eliminated entirely, so it is important to remember that metrics applying to the EA-19 are still subject to biases inherent to the dataset.
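The restriction of weights to the range between 0.35 and 3 and their application to the occupation counts can be illustrated as below. The data layout (dictionaries keyed by country, age group and occupation), the labels and the numbers are assumptions made for this example only.

```python
def clip_weight(weight, low=0.35, high=3.0):
    """Restrict a weight to the [low, high] range stated in the text."""
    return max(low, min(high, weight))

def weighted_shares(counts, weights):
    """Weighted share of each occupation.

    counts:  {(country, age_group, occupation): frequency}
    weights: {(country, age_group): raw weight before clipping}
    """
    totals = {}
    for (country, age_group, occupation), frequency in counts.items():
        weight = clip_weight(weights[(country, age_group)])
        totals[occupation] = totals.get(occupation, 0.0) + frequency * weight
    grand_total = sum(totals.values())
    return {occupation: value / grand_total for occupation, value in totals.items()}

# Toy data (invented): one overrepresented and one underrepresented group.
counts = {
    ("IT", "15-24", "Waiters and bartenders"): 100,
    ("DE", "35-49", "Waiters and bartenders"): 10,
    ("IT", "15-24", "Teaching professionals"): 50,
    ("DE", "35-49", "Teaching professionals"): 20,
}
weights = {("IT", "15-24"): 0.2, ("DE", "35-49"): 5.0}
shares = weighted_shares(counts, weights)
print(shares)
```

Because the raw weights 0.2 and 5.0 are clipped to 0.35 and 3, the correction is only partial, which mirrors the residual bias noted above.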
Regression analysis
- In order to identify and measure trends, we first need to define the potential time series that emerge from the data. To do this, we utilize temporal variables in the dataset along with the occupation-related variables, such as job position on the ISCO taxonomy.
- The graphs below show the distribution of reported recruitments and terminations in the aggregated occupations dataset. The number for each year is inferred from the start date (recruitments) and end date (terminations) of each reported work experience.
Figure: distribution of reported recruitments and terminations per year (panels: Recruitment Year, Termination Year).
- We have elected to use recruitment year to report on what percentage of total recruitments each specific ISCO occupation represents each year. This measurement is based on the labour force history of each Europass user, including past and present job positions mentioned on each CV. The years between 2000 and 2019 were selected for a total of 10.0M work experiences.
- As an example, the graph below displays the time series of the top 5 ISCO 2 and ISCO 3 occupations with respect to recruitment year. Year is shown on the x-axis, while the percentage of total recruitments each occupation represents for that year is on the y-axis. Time series for both weighted and unweighted data are shown.
Figure: share of total recruitments per year for the top 5 occupations (panels: ISCO 3 (Weighted), ISCO 3 (Raw), ISCO 2 (Weighted), ISCO 2 (Raw)).
- Qualitatively, it can be observed that occupations such as Waiters and Bartenders display an overall upwards trend, while others like Administrative and specialized secretaries are trending downwards. This is generally consistent between both weighted and unweighted data, as the weighting procedure has mostly affected the relative percentage between occupations, with only slight adjustments to the distributions’ shape.
- In order to quantify trends, we have applied regression analysis to our data by fitting generalized linear models (GLM) to each breakdown of interest. Specifically, we have utilized the R function glm (part of the stats package), which estimates the coefficients \(\beta\) through maximum likelihood estimation (MLE).
- GLMs can be considered a generalization of ordinary linear regression that does not assume that each observation comes from a normal distribution, \(y_{i} \sim \mathit{N}\left (\mu_{i}, \sigma^{2} \right )\). Equivalently, the error distribution of the response variable does not have to be the normal distribution.
- A binomial distribution was assumed for our data, with:
- Logit serving as the link function, \(estimate \times year + intercept = \ln \left ( \frac{y }{n - y} \right )\)
- Mean function, \(\frac{y}{n} = \frac{1}{1 + e^{-(estimate \times year + intercept)}}\)
- The process was repeated for both weighted and unweighted data. The results presented for the context of this report refer to trend estimations made for the weighted data, a sample of which is displayed below for EA-19.
ISCO 1 | Country | Estimate | Intercept | p-value | Deviance | Degrees of freedom |
---|---|---|---|---|---|---|
Armed forces occupations | EA19 | -0.0645050 | 124.83591 | 0 | 314.8306 | 18 |
Clerical support workers | EA19 | -0.0101363 | 17.46122 | 0 | 791.1704 | 18 |
Craft and related trades workers | EA19 | -0.0283781 | 54.40370 | 0 | 4332.7565 | 18 |
Elementary occupations | EA19 | 0.0114642 | -26.22486 | 0 | 1213.4496 | 18 |
Managers | EA19 | -0.0071675 | 12.26817 | 0 | 516.7734 | 18 |
Plant and machine operators and assemblers | EA19 | -0.0126231 | 22.21469 | 0 | 3753.0582 | 18 |
Professionals | EA19 | 0.0108941 | -22.66026 | 0 | 7412.2678 | 18 |
Service and sales workers | EA19 | 0.0155061 | -32.89404 | 0 | 1382.9993 | 18 |
Skilled agricultural, forestry and fishery workers | EA19 | 0.0040978 | -12.94743 | 0 | 242.5014 | 18 |
Technicians and associate professionals | EA19 | -0.0056631 | 10.04493 | 0 | 261.7392 | 18 |
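The logit link and mean function above can be made concrete with a small, self-contained sketch. It fits the same two-parameter binomial model by Newton's method (iteratively reweighted least squares), which is one standard way of obtaining the maximum likelihood estimate; it is not the R glm call used by the authors, it ignores weighting, and the data are synthetic, generated from assumed parameters so that the fit can be checked.

```python
import math

def predicted_share(estimate, intercept, year):
    """Mean function: y / n = 1 / (1 + e^-(estimate * year + intercept))."""
    return 1.0 / (1.0 + math.exp(-(estimate * year + intercept)))

def fit_binomial_logit(years, successes, totals, iterations=25):
    """Fit logit(y / n) = estimate * year + intercept by Newton's method."""
    centre = sum(years) / len(years)  # centre the year for numerical stability
    xs = [year - centre for year in years]
    b0, b1 = 0.0, 0.0  # intercept on the centred scale, and slope
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in zip(xs, successes, totals):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)  # binomial variance weight
            r = y - n * p          # observed minus expected count
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1, b0 - b1 * centre  # estimate, intercept on the original year scale

# Synthetic yearly data from assumed parameters (not values from the report).
years = list(range(2000, 2020))
totals = [100000] * len(years)
true_estimate, true_intercept = 0.0155, -32.9
successes = [n * predicted_share(true_estimate, true_intercept, yr) for yr, n in zip(years, totals)]

estimate, intercept = fit_binomial_logit(years, successes, totals)
print(estimate, intercept)
```

Given an estimate and intercept such as those in the table above, `predicted_share` returns the modelled share of recruitments that an ISCO group represents in a given year.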
References
- European Centre for the Development of Vocational Training. (2020). Europass CV Insights Report, June - September 2019.
- labourR: Classify Multilingual Labour Market Free-Text to Standardized Hierarchical Occupations. CRAN.
- Hastie, T. J. and Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
- European Centre for the Development of Vocational Training. (2019). Skills-OVATE: Skills Online Vacancy Analysis Tool for Europe