What effect does immigration have on unemployment of native-born citizens?
Over the past decade the popularity of far-right political parties with anti-immigration standpoints has risen significantly in numerous well-established democracies; examples are the German NPD, the British UKIP and the French Front National. Cesáreo Rodríguez-Aguilera summarizes the three core far-right ideas in the following list: "chauvinistic and ethnic exaltation of the nation; anti-immigrant xenophobia; and “anti-politician”, anti-establishment populism". One of their key talking points is thus opposition to immigration, partly because of its supposed impact on the labor market. An example of this is the following quote by Donald Trump: “jobs are being stolen … like candy from a baby.” However, immigration can also be necessary in some cases, such as stagnating population growth or an aging population. A negative view of immigration, if it leads to anti-immigration policies, might therefore have far-reaching long-term consequences for pension and welfare systems. We dislike political discourse that is built not on facts but merely on fearmongering. Thus in this assignment we would like to research what the relationship between immigration and native-born employment actually is.
Sources:
https://www.iemed.org/publication/the-rise-of-the-far-right-in-europe/
https://www.eyes-on-europe.eu/explaining-the-main-drivers-of-anti-immigration-attitudes-in-europe/
To answer our research question this notebook is structured as follows: we first read in and merge the relevant OECD datasets and explore them, then validate our regression methods on simulated data, and finally apply a lasso regression, a fixed effects regression and a Bayesian model to the real data before comparing the findings of the three methods and drawing a conclusion.
# importing packages
import pandas as pd
import datetime as dt
import pandas_datareader.data as web
import os
import pandasdmx as pdmx
import numpy as np
# Data
oecd = pdmx.Request("OECD")
#Reading in Employment dataset from OECD
data = oecd.data(resource_id = "MIG_NUP_RATES_GENDER", key ="AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+LVA+LTU+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA.NB.TOT.N_RATE+U_RATE+P_RATE/all?startTime=2000&endTime=2019").to_pandas()
df_employment = pd.DataFrame(data).reset_index()
#Reading in Population dataset from OECD
data = oecd.data(resource_id = "HISTPOP", key ="AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+LVA+LTU+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA.W+M+T.TOTAL/all?startTime=2000&endTime=2019").to_pandas()
df_population = pd.DataFrame(data).reset_index()
#Reading in Migration dataset from OECD
data = oecd.data(resource_id = "MIG", key ="TOT.B11.TOT.AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+LVA+LTU+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA/all?startTime=2000&endTime=2019").to_pandas()
df_migration = pd.DataFrame(data).reset_index()
#Reading in GDP per capita dataset from OECD
data = oecd.data(resource_id = "SNA_TABLE1", key ="AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+LVA+LTU+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA.B1_GE.HVPVOB/all?startTime=2000&endTime=2019").to_pandas()
df_GDPpc = pd.DataFrame(data).reset_index()
#Reading in inflation dataset from OECD
data = oecd.data(resource_id = "PRICES_CPI", key ="AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+LVA+LTU+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA.CPALTT01+CP18ALTT.GY.A/all?startTime=2000&endTime=2019").to_pandas()
df_inflation = pd.DataFrame(data).reset_index()
# Data selection and merging
#Selection of only native born unemployment
df_employment = df_employment[df_employment["RATE"] == "U_RATE"]
#selection of only Total population
df_population = df_population[df_population["SEX"] == "T"]
#Dropping irrelevant columns
df_employment.drop(["BIRTH","GENDER","RATE"], axis=1, inplace = True)
df_population.drop(["SEX","AGE"], axis=1, inplace = True)
df_migration.drop(["CO2","VAR","GEN"], axis=1, inplace = True)
df_GDPpc.drop(["TRANSACT","MEASURE"], axis=1, inplace = True)
df_inflation.drop(["SUBJECT","MEASURE","FREQUENCY"], axis=1, inplace = True)
#Renaming columns
df_employment.rename(columns = {"COUNTRY": "Country", "TIME_PERIOD": "Year", "value": "Unemployment"}, inplace = True)
df_population.rename(columns = {"LOCATION": "Country", "TIME_PERIOD": "Year", "value": "Population"}, inplace = True)
df_migration.rename(columns = {"COU": "Country", "TIME_PERIOD": "Year", "value": "Migration"}, inplace = True)
df_GDPpc.rename(columns = {"LOCATION": "Country", "TIME_PERIOD": "Year", "value": "GDPpc"}, inplace = True)
df_inflation.rename(columns = {"LOCATION": "Country", "TIME_PERIOD": "Year", "value":"Inflation"}, inplace = True)
#Combining all datasets
df_combined = df_employment.merge(df_population, how = "left", on = ["Country", "Year"]).merge(df_migration, how = "left", on = ["Country", "Year"])
df_combined = df_combined.merge(df_GDPpc, how = "left", on = ["Country", "Year"]).merge(df_inflation, how = "left", on = ["Country", "Year"])
#Exploring the main dataset
df_combined.head()
|   | Country | Year | Unemployment | Population | Migration | GDPpc | Inflation |
|---|---|---|---|---|---|---|---|
| 0 | AUS | 2000 | 6.3 | 19028802.0 | 107148.0 | 37933.185775 | 4.457435 |
| 1 | AUS | 2001 | 6.7 | 19274701.0 | 127877.0 | 38952.657236 | 4.407135 |
| 2 | AUS | 2002 | 6.3 | 19495210.0 | 119080.0 | 39709.935147 | 2.981575 |
| 3 | AUS | 2003 | 5.9 | 19720737.0 | 123411.0 | 40906.749065 | 2.732596 |
| 4 | AUS | 2004 | 5.5 | 19932722.0 | 146441.0 | 41750.715304 | 2.343255 |
#Creating the variable migration as a share of population
df_combined["Migration"] = df_combined["Migration"]/df_combined["Population"]
#Dropping the population column
df_combined.drop(["Population"], axis = 1, inplace = True)
data = df_combined
# Checking missing values
print(data.isna().sum())
print(data.shape)
Country          0
Year             0
Unemployment     0
Migration       23
GDPpc            0
Inflation        0
dtype: int64
(592, 6)
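All 23 missing values sit in the Migration column. As a minimal sketch of how they are handled later in this notebook: the lasso and fixed effects regressions simply drop the incomplete rows, while the Bayesian model masks them so that PyMC3 can impute them during sampling.
# Sketch of the two strategies used later for the 23 missing Migration values.
# (1) Listwise deletion for the lasso and fixed effects regressions:
df_complete = df_combined.dropna()
# (2) Masking for the Bayesian model: NaNs become masked entries that
#     PyMC3 treats as parameters to impute during sampling.
migration_masked = np.ma.masked_invalid(df_combined["Migration"].values)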
The fixed effects regression finds a negative relationship between GDP per capita, Inflation and Migration and the outcome variable Native Born Unemployment.
The lasso regression finds that Inflation does not influence Native Born Unemployment. It also finds a negative relationship between GDP per capita and Migration and the outcome variable Native Born Unemployment.
The Bayesian model finds a negative relationship between GDP per capita, Inflation and Migration and the outcome variable Native Born Unemployment.
The negative relationship is strongest for GDP per capita; the sizes of the negative relationships for Migration and Inflation are quite similar.
We assume that migration into a country influences the unemployment of native-born people through the labor market channel. The labor market is too complex to capture in a single measure and is therefore unobserved. We also assume that other important factors influencing the labor market are inflation and GDP per capita. Finally we assume that all these effects work relatively quickly, meaning that changes in the input variables are reflected in the outcome variable within the same year.
We visualize our view of this relationship in the DAG below.
# Import necessary packages
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(20,10))
plt.title("DAG describing relationship between migration and native born unemployment levels")
# Create 5 nodes for the content of the DAG and 3 extra nodes to use as a legend, indicating whether a variable is the relationship of interest, unobserved or a covariate.
DAG = nx.DiGraph()
variables = np.arange(0, 8).tolist()
DAG.add_nodes_from(variables)
# Add arrows between nodes
DAG.add_edges_from([(0,3), (1,3),
(2,3), (3, 4)])
# Set node colors
colors = ["lime", "mistyrose","mistyrose", "coral",
"lime",
"lime",
"coral",
"mistyrose"]
# Set node positions
pos = {0:(1, 8),
1:(1, 6), 2:(1, 4),
3:(5, 6), 4:(8, 6),5:(8, 9),
6:(8, 8.5), 7:(8, 8)}
# Set node labels
labels = {0:"Migration",
1:"Inflation", 2: "GDPpc",
3: "Labor \nMarket \nConditions", 4: "Unemployment\n Native Born", 5:"Variables of interest",6:"Unobserved",7:"Covariates"}
# Set node sizes
sizes = [10000, 10000, 10000, 10000, 10000,1000,1000,1000]
# Draw DAG
nx.draw(DAG, pos = pos, labels = labels, arrows = True,
node_shape = "s", node_size=sizes,node_color = colors)
plt.savefig("DAG.png", format="PNG")
plt.show()
Below we plot the unemployment of native-born people against migration into the relevant country in that same year. We see no obvious visual relationship, except that high migration does not coincide with high unemployment: there are no data points in the top right of the graph.
# Import necessary packages
import matplotlib.colors as colors
import matplotlib as mpl
import matplotlib.cm as mplcm
import statsmodels.api as sm
from matplotlib import pyplot as plt
# Give every country a different color
plt.style.use('seaborn-dark-palette')
NUM_COLORS = 34
cm = plt.get_cmap('tab20b')
cNorm = colors.Normalize(vmin=0, vmax=NUM_COLORS-1)
scalarMap = mplcm.ScalarMappable(norm=cNorm, cmap=cm)
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=([scalarMap.to_rgba(i) for i in range(NUM_COLORS)]))
# Create figure
fig, ax1 = plt.subplots(figsize=(10, 10))
# Plot Native Born Unemployment on the y-axis and Migration on the x-axis
for country in df_combined.Country.unique():
    mask = (df_combined.Country == country)
    ax1.scatter(df_combined[mask]['Migration'], df_combined[mask]['Unemployment'], label = country)
ax1.set_xlabel('Migration')
ax1.set_ylabel("Native Born Unemployment(%)")
plt.tight_layout()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.);
To check whether the regressions we use work correctly, we create a dataset with the same structure as the real dataset, except that the coefficients on the different determinants are known.
df_simulated = df_combined.copy()
df_simulated.head()
df_simulated.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 592 entries, 0 to 591
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Country       592 non-null    object
 1   Year          592 non-null    object
 2   Unemployment  592 non-null    float64
 3   Migration     569 non-null    float64
 4   GDPpc         592 non-null    float64
 5   Inflation     592 non-null    float64
dtypes: float64(4), object(2)
memory usage: 32.4+ KB
Here we create data for the three determinants and the target, giving us three feature columns with known coefficients on the target column. This mirrors our real data, where we have the unemployment of the native born and the three determinants used in the regression: migration as a share of the population, GDP per capita and inflation.
#creating simulated data
from sklearn import datasets
features, output, coef = datasets.make_regression(n_samples = 592, n_features = 3,
                                                  n_informative = 3, n_targets = 1,
                                                  noise = 0.0, coef = True)
#replacing columns of original dataset with simulated data
features = pd.DataFrame(features, columns=['Migration', 'GDPpc', 'Inflation'])
target = pd.DataFrame(output, columns=['Unemployment'])
#Dropping real data columns and merging the simulated data columns to df_simulated
df_simulated.drop(['Migration', 'GDPpc', 'Inflation','Unemployment'], axis = 1, inplace = True)
df_simulated = pd.concat([df_simulated,features,target], axis=1)
Below we can see that the real dataset and the simulated dataset are similar in structure, but the data points themselves differ because they are simulated randomly.
print("Simulated dataframe: \n",df_simulated.head())
print("\n\n Real dataframe: \n",df_combined.head())
Simulated dataframe:
   Country  Year  Migration     GDPpc  Inflation  Unemployment
0      AUS  2000   1.733243  0.000080   0.478075     67.662168
1      AUS  2001   0.332955 -0.834127   0.299409    -21.361405
2      AUS  2002  -1.312477 -1.925291   1.567559    -40.344644
3      AUS  2003  -0.441943 -1.607506  -0.819247   -151.702276
4      AUS  2004   1.131345 -2.052216   0.446066    -63.792617

Real dataframe:
   Country  Year  Unemployment  Migration         GDPpc  Inflation
0      AUS  2000           6.3   0.005631  37933.185775   4.457435
1      AUS  2001           6.7   0.006634  38952.657236   4.407135
2      AUS  2002           6.3   0.006108  39709.935147   2.981575
3      AUS  2003           5.9   0.006258  40906.749065   2.732596
4      AUS  2004           5.5   0.007347  41750.715304   2.343255
In the simulated-data regression we do not scale the data, so that the coefficients found by the regression can be compared directly with the coefficients used when simulating the data. We use train_test_split from sklearn to divide our data into a train and a test set. For the regression we use both the Lasso and the LassoCV function from sklearn; the LassoCV function uses cross-validation to find the optimal alpha of the model.
Below we set this optimal alpha as the alpha of the lasso regression we will do on the full dataset. The value of alpha is quite low, meaning that most of the features are important and are therefore kept in the lasso regression.
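For reference, the objective that sklearn's Lasso minimizes is

$$\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta\rVert_2^2 + \alpha\,\lVert \beta\rVert_1$$

where $n$ is the number of samples and $\alpha$ controls how strongly the coefficients are shrunk toward zero: at $\alpha = 0$ the problem reduces to ordinary least squares, while a large $\alpha$ forces unimportant coefficients to exactly zero.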
# Set the dependent variable as y and the explanatory variables as X
df_Lasso = df_simulated
y = pd.DataFrame(df_Lasso["Unemployment"])
X = df_Lasso.drop(["Unemployment", "Country", "Year"], axis = 1)
#Check data for y and X datasets
#print(y.head())
#print(X.head())
# Import and run the function to split our data in a train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
# Import lasso models and fit the model
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
# Lasso with 10 fold cross-validation
model = LassoCV(cv=10, random_state=0, max_iter=1000000)
# Fit model
model.fit(X_train, y_train)
# Print optimal alpha for our model
print(model.alpha_)
# Set the best alpha determined using cross-validation for our new lasso model
lasso_best = Lasso(alpha=model.alpha_)
lasso_best.fit(X_train, y_train)
# Store the coefficients from the lasso regression on the simulated data, to compare later with the values used to simulate the data
coef_sim = list(zip(lasso_best.coef_, X))
print(coef_sim)
# Print model R squareds
print('R squared training set', round(lasso_best.score(X_train, y_train)*100, 2))
print('R squared test set', round(lasso_best.score(X_test, y_test)*100, 2))
#comparing true coefficients with the lasso coefficients:
coef_df = pd.DataFrame(coef, columns=['True coefficient values'])
coef_sim_df = pd.DataFrame(coef_sim, columns=['Simulated data lasso coefficients','Column'])
df_coefficients = coef_df.join(coef_sim_df)
df_coefficients = df_coefficients[['Column', 'Simulated data lasso coefficients', 'True coefficient values']]
print(df_coefficients.to_string(index=False))
0.057027570442860255
[(21.825187464590677, 'Migration'), (56.60128660649159, 'GDPpc'), (62.12052127946433, 'Inflation')]
R squared training set 100.0
R squared test set 100.0
Column Simulated data lasso coefficients True coefficient values
Migration 21.825187 21.883924
GDPpc 56.601287 56.664626
Inflation 62.120521 62.181616
Above we have printed the lasso regression coefficients on the simulated dataset. The R squared values are 100, meaning that the variance of the dependent variable is fully explained by the variance of the independent variables; this is logical because the data was simulated without noise. Beneath that is the comparison of the true coefficients in the simulated data and the coefficients the lasso regression found. From this comparison we can conclude that our lasso regression method works as intended, because the lasso coefficients on the simulated data are very close to the true coefficient values used during simulation.
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
import seaborn as sns
unit_names = ['AUS', 'AUT', 'BEL', 'CAN', 'CHE', 'CHL', 'CZE', 'DEU', 'DNK',
'ESP', 'EST', 'FIN', 'FRA', 'GBR', 'GRC', 'HUN', 'IRL', 'ISL',
'ISR', 'ITA', 'LTU', 'LUX', 'LVA', 'MEX', 'NLD', 'NOR', 'NZL',
'POL', 'PRT', 'SVK', 'SVN', 'SWE', 'TUR', 'USA']
unit_col_name='Country'
time_period_col_name='Year'
#Define y and X
y_var_name = 'Unemployment'
X_var_names = ['Migration','GDPpc','Inflation']
Here again we do not scale the data, so that the coefficients that are found are the same as the coefficients that were put into the data when it was simulated.
#selecting data
df_panel = df_simulated.dropna()
#looking at the correlation
print(df_panel.corr())
#Create the dummy variables, one for each country
df_dummies = pd.get_dummies(df_panel[unit_col_name])
#Join the dummies Dataframe with the panel data set
df_panel_with_dummies = df_panel.join(df_dummies)
              Migration     GDPpc  Inflation  Unemployment
Migration      1.000000 -0.019727  -0.006861      0.236185
GDPpc         -0.019727  1.000000  -0.016705      0.647171
Inflation     -0.006861 -0.016705   1.000000      0.707814
Unemployment   0.236185  0.647171   0.707814      1.000000
In the next piece of code we construct the regression equation, leaving out one of the dummy variables to prevent multicollinearity.
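Written out, the least squares dummy variable (LSDV) specification constructed below is

$$\text{Unemployment}_{it} = \beta_0 + \beta_1\,\text{Migration}_{it} + \beta_2\,\text{GDPpc}_{it} + \beta_3\,\text{Inflation}_{it} + \sum_{c} \gamma_c D_{c} + \varepsilon_{it}$$

where $D_c$ is the dummy for country $c$ and the last country in unit_names (USA) is omitted, acting as the reference category absorbed by the intercept.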
lsdv_expr = y_var_name + ' ~ '
i = 0
for X_var_name in X_var_names:
    if i > 0:
        lsdv_expr = lsdv_expr + ' + ' + X_var_name
    else:
        lsdv_expr = lsdv_expr + X_var_name
    i = i + 1
for dummy_name in unit_names[:-1]:
    lsdv_expr = lsdv_expr + ' + ' + dummy_name
print('Regression expression for OLS with dummies=' + lsdv_expr)
lsdv_model = smf.ols(formula=lsdv_expr, data=df_panel_with_dummies)
lsdv_model_results = lsdv_model.fit()
print('===============================================================================')
print('============================== OLSR With Dummies ==============================')
print(lsdv_model_results.summary())
print('LSDV='+str(lsdv_model_results.ssr))
Regression expression for OLS with dummies=Unemployment ~ Migration + GDPpc + Inflation + AUS + AUT + BEL + CAN + CHE + CHL + CZE + DEU + DNK + ESP + EST + FIN + FRA + GBR + GRC + HUN + IRL + ISL + ISR + ITA + LTU + LUX + LVA + MEX + NLD + NOR + NZL + POL + PRT + SVK + SVN + SWE + TUR
===============================================================================
============================== OLSR With Dummies ==============================
OLS Regression Results
==============================================================================
Dep. Variable: Unemployment R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.931e+29
Date: Fri, 26 Aug 2022 Prob (F-statistic): 0.00
Time: 19:18:21 Log-Likelihood: 15685.
No. Observations: 592 AIC: -3.130e+04
Df Residuals: 555 BIC: -3.113e+04
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.137e-13 1.75e-13 0.650 0.516 -2.3e-13 4.57e-13
Migration 21.8839 3.4e-14 6.44e+14 0.000 21.884 21.884
GDPpc 56.6646 3.3e-14 1.72e+15 0.000 56.665 56.665
Inflation 62.1816 3.37e-14 1.85e+15 0.000 62.182 62.182
AUS 7.105e-14 2.48e-13 0.287 0.774 -4.16e-13 5.58e-13
AUT -9.948e-14 2.48e-13 -0.402 0.688 -5.86e-13 3.87e-13
BEL -1.066e-14 2.47e-13 -0.043 0.966 -4.96e-13 4.75e-13
CAN -3.908e-14 2.85e-13 -0.137 0.891 -5.99e-13 5.21e-13
CHE -8.527e-14 2.51e-13 -0.340 0.734 -5.78e-13 4.08e-13
CHL 3.02e-14 3.63e-13 0.083 0.934 -6.82e-13 7.43e-13
CZE -4.885e-14 2.5e-13 -0.196 0.845 -5.39e-13 4.42e-13
DEU 2.665e-14 2.48e-13 0.107 0.915 -4.61e-13 5.15e-13
DNK 3.197e-14 2.49e-13 0.129 0.898 -4.56e-13 5.2e-13
ESP 2.132e-14 2.46e-13 0.087 0.931 -4.62e-13 5.04e-13
EST 1.776e-15 2.47e-13 0.007 0.994 -4.84e-13 4.87e-13
FIN 0 2.47e-13 0 1.000 -4.85e-13 4.85e-13
FRA 6.395e-14 2.47e-13 0.259 0.796 -4.22e-13 5.5e-13
GBR 4.974e-14 2.47e-13 0.201 0.840 -4.35e-13 5.35e-13
GRC 1.421e-14 2.46e-13 0.058 0.954 -4.69e-13 4.97e-13
HUN 3.197e-14 2.47e-13 0.129 0.897 -4.54e-13 5.18e-13
IRL 8.171e-14 2.47e-13 0.331 0.741 -4.04e-13 5.67e-13
ISL 6.75e-14 2.47e-13 0.273 0.785 -4.18e-13 5.53e-13
ISR -9.237e-14 3.25e-13 -0.284 0.777 -7.32e-13 5.47e-13
ITA 3.109e-14 2.47e-13 0.126 0.900 -4.54e-13 5.16e-13
LTU 5.684e-14 5.8e-13 0.098 0.922 -1.08e-12 1.2e-12
LUX 1.066e-14 2.46e-13 0.043 0.966 -4.74e-13 4.95e-13
LVA -3.908e-14 4.82e-13 -0.081 0.935 -9.85e-13 9.07e-13
MEX -5.684e-14 2.67e-13 -0.213 0.832 -5.82e-13 4.68e-13
NLD 6.928e-14 2.47e-13 0.280 0.779 -4.16e-13 5.55e-13
NOR 8.704e-14 2.46e-13 0.353 0.724 -3.97e-13 5.71e-13
NZL 7.816e-14 2.54e-13 0.308 0.758 -4.2e-13 5.76e-13
POL -6.217e-14 2.49e-13 -0.249 0.803 -5.52e-13 4.28e-13
PRT 2.132e-14 2.47e-13 0.086 0.931 -4.63e-13 5.06e-13
SVK -7.816e-14 2.51e-13 -0.312 0.755 -5.7e-13 4.14e-13
SVN 1.066e-14 2.47e-13 0.043 0.966 -4.75e-13 4.96e-13
SWE 3.819e-14 2.46e-13 0.155 0.877 -4.45e-13 5.22e-13
TUR 5.773e-14 2.86e-13 0.202 0.840 -5.04e-13 6.2e-13
==============================================================================
Omnibus: 2.879 Durbin-Watson: 1.861
Prob(Omnibus): 0.237 Jarque-Bera (JB): 2.643
Skew: -0.094 Prob(JB): 0.267
Kurtosis: 2.733 Cond. No. 33.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
LSDV=3.3580091823871685e-22
In the above regression results the coefficients of Migration, GDP per capita and Inflation match the true coefficients that were put into the simulated data. For comparison, below are the coefficients the lasso regression found on the same simulated data:
print(coef_sim_df)
   Simulated data lasso coefficients     Column
0                          21.825187  Migration
1                          56.601287      GDPpc
2                          62.120521  Inflation
To test whether the fixed effects model is actually better than a pooled OLSR model, we perform an F-test.
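The F-statistic computed in the code below is the standard restricted-versus-unrestricted comparison, with the pooled OLSR model as the restricted model and the fixed effects (LSDV) model as the unrestricted one:

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/(k_2 - k_1)}{SSR_{unrestricted}/(N - k_2)}$$

where $k_1$ and $k_2$ are the numbers of parameters of the restricted and unrestricted models and $N$ is the total number of observations.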
#First build and fit a Pooled OLSR model on the panel data set
pooled_y=df_panel[y_var_name]
pooled_X=df_panel[X_var_names]
pooled_X = sm.add_constant(pooled_X)
pooled_olsr_model = sm.OLS(endog=pooled_y, exog=pooled_X)
pooled_olsr_model_results = pooled_olsr_model.fit()
#Setup the variables for calculating the F-test
#n=number of groups
n=len(unit_names)
#T=number of time periods per unit
T=df_panel.shape[0]/n
#N=total number of rows in the panel data set
N=n*T
#k=number of regression variables of the Pooled OLS model
k=len(X_var_names)+1
#Get the Residual Sum of Squares for the Pooled OLS model
ssr_restricted_model = pooled_olsr_model_results.ssr
#Get the Residual Sum of Squares for the Fixed Effects model
ssr_unrestricted_model = lsdv_model_results.ssr
#Get the degrees of freedom of the Pooled OLSR model
k1 = len(pooled_olsr_model_results.params)
#Get the degrees of freedom of the Fixed Effects model
k2 = len(lsdv_model_results.params)
#Calculate the F statistic
f_statistic = ((ssr_restricted_model - ssr_unrestricted_model)/ssr_unrestricted_model)*((N-k2)/(
k2-k1))
print('F-statistic for FE model='+str(f_statistic))
#Calculate the critical value at alpha=.05
alpha=0.05
f_critical_value=st.f.ppf((1.0-alpha), (k2-k1), (N-k2))
print('F test critical value at alpha=0.05='+str(f_critical_value))
F-statistic for FE model=-16.8069857309734
F test critical value at alpha=0.05=1.4575225366557338
Here the F-statistic of the fixed effects model is negative, which is an artifact of the noise-free simulated data: the fixed effects model fits it essentially perfectly (its residual sum of squares is ≈ 3.4e-22, see above), so the ratio is numerically unstable and the F-test is not informative on this dataset. We repeat the test on the real data further below.
Below we start with a lasso regression on the whole dataset. After that we run a fixed effects regression for panel data to take country fixed effects into account. We end with a Bayesian analysis.
The lasso regression starts by selecting X and y from the dataset. As we research the effect of multiple determinants on the unemployment of the native born, the y variable is native-born unemployment and X contains the other variables. We drop Country and Year, as those are descriptive columns, not measured variables.
We start with the lasso regression on the complete dataset, pooling all years together.
#For the pooled Lasso regression we delete the rows that contain NaN values
df_Lasso = df_combined.dropna()
y = pd.DataFrame(df_Lasso["Unemployment"])
X = df_Lasso.drop(["Unemployment", "Country", "Year"], axis = 1)
#Check data for y and X datasets
print(y.head())
print(X.head())
   Unemployment
0           6.3
1           6.7
2           6.3
3           5.9
4           5.5
   Migration         GDPpc  Inflation
0   0.005631  37933.185775   4.457435
1   0.006634  38952.657236   4.407135
2   0.006108  39709.935147   2.981575
3   0.006258  40906.749065   2.732596
4   0.007347  41750.715304   2.343255
Below we import the train_test_split function to split the dataset into train and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
Here we import the StandardScaler function to scale the train and test datasets. The standard scaler standardizes the data by removing the mean and scaling the result to unit variance.
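Concretely, each feature $x$ is transformed as

$$z = \frac{x - \mu_{train}}{\sigma_{train}}$$

where the mean $\mu_{train}$ and standard deviation $\sigma_{train}$ are estimated on the training set only and then reused for the test set, so that no information from the test set leaks into the scaling.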
from sklearn.preprocessing import StandardScaler
#scaling y
scaler = StandardScaler().fit(np.asarray(y_train['Unemployment']).reshape(-1,1))
y_train['Unemployment'] = scaler.transform(np.asarray(y_train['Unemployment']).reshape(-1,1))
y_test['Unemployment'] = scaler.transform(np.asarray(y_test['Unemployment']).reshape(-1,1))
#scaling X
#List of numerical columns
list_numerical = [ 'Migration',
'GDPpc',
'Inflation']
scaler = StandardScaler().fit(X_train[list_numerical])
X_train[list_numerical] = scaler.transform(X_train[list_numerical])
X_test[list_numerical] = scaler.transform(X_test[list_numerical])
The code continues with the lasso regression itself. For this we use both the Lasso and LassoCV functions from sklearn. The LassoCV function uses cross-validation to find the optimal alpha of the model.
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
# Lasso with 5 fold cross-validation
model = LassoCV(cv=5, random_state=0, max_iter=1000000)
# Fit model
model.fit(X_train, y_train)
LassoCV(cv=5, max_iter=1000000, random_state=0)
Optimal model alpha:
print(model.alpha_)
0.0729890061384064
Below we set the optimal alpha as the alpha of the lasso regression we will do on the full dataset. This alpha is not low, meaning that there are features in the data which are not important.
# Set best alpha
lasso_best = Lasso(alpha=model.alpha_)
lasso_best.fit(X_train, y_train)
Lasso(alpha=0.0729890061384064)
Lasso regression coefficients on the pooled dataset:
coef_ = list(zip(lasso_best.coef_, X))
print(coef_)
[(-0.11925072109542673, 'Migration'), (-0.28712408837860676, 'GDPpc'), (-0.0, 'Inflation')]
Here the coefficient value of Inflation is 0, meaning that with the optimal value of alpha this feature is not important for the regression.
print('R squared training set', round(lasso_best.score(X_train, y_train)*100, 2))
print('R squared test set', round(lasso_best.score(X_test, y_test)*100, 2))
R squared training set 20.64
R squared test set 16.79
The R squared values are low, meaning that the variance of the dependent variable (Unemployment of the native born) is not well explained by the variance of the independent variables (Migration, GDPpc and Inflation).
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
import seaborn as sns
unit_names = ['AUS', 'AUT', 'BEL', 'CAN', 'CHE', 'CHL', 'CZE', 'DEU', 'DNK',
'ESP', 'EST', 'FIN', 'FRA', 'GBR', 'GRC', 'HUN', 'IRL', 'ISL',
'ISR', 'ITA', 'LTU', 'LUX', 'LVA', 'MEX', 'NLD', 'NOR', 'NZL',
'POL', 'PRT', 'SVK', 'SVN', 'SWE', 'TUR', 'USA']
unit_col_name='Country'
time_period_col_name='Year'
#Define y and X
y_var_name = 'Unemployment'
X_var_names = ['Migration','GDPpc','Inflation']
#selecting data
df_panel = df_combined.dropna()
#Scaling the features and the outcome variable
from sklearn.preprocessing import StandardScaler
features_to_scale = ['Migration','GDPpc','Inflation','Unemployment']
df_panel[features_to_scale] = StandardScaler().fit_transform(df_panel[features_to_scale])
#looking at the correlation
print(df_panel.corr())
#Create the dummy variables, one for each country
df_dummies = pd.get_dummies(df_panel[unit_col_name])
#Join the dummies Dataframe with the panel data set
df_panel_with_dummies = df_panel.join(df_dummies)
              Unemployment  Migration     GDPpc  Inflation
Unemployment      1.000000  -0.382632 -0.441784  -0.050550
Migration        -0.382632   1.000000  0.738889  -0.083024
GDPpc            -0.441784   0.738889  1.000000  -0.214777
Inflation        -0.050550  -0.083024 -0.214777   1.000000
In the next piece of code we again construct the regression equation, leaving out one of the dummy variables to prevent multicollinearity.
lsdv_expr = y_var_name + ' ~ '
i = 0
for X_var_name in X_var_names:
    if i > 0:
        lsdv_expr = lsdv_expr + ' + ' + X_var_name
    else:
        lsdv_expr = lsdv_expr + X_var_name
    i = i + 1
for dummy_name in unit_names[:-1]:
    lsdv_expr = lsdv_expr + ' + ' + dummy_name
print('Regression expression for OLS with dummies=' + lsdv_expr)
lsdv_model = smf.ols(formula=lsdv_expr, data=df_panel_with_dummies)
lsdv_model_results = lsdv_model.fit()
print('===============================================================================')
print('============================== OLSR With Dummies ==============================')
print(lsdv_model_results.summary())
print('LSDV='+str(lsdv_model_results.ssr))
Regression expression for OLS with dummies=Unemployment ~ Migration + GDPpc + Inflation + AUS + AUT + BEL + CAN + CHE + CHL + CZE + DEU + DNK + ESP + EST + FIN + FRA + GBR + GRC + HUN + IRL + ISL + ISR + ITA + LTU + LUX + LVA + MEX + NLD + NOR + NZL + POL + PRT + SVK + SVN + SWE + TUR
===============================================================================
============================== OLSR With Dummies ==============================
OLS Regression Results
==============================================================================
Dep. Variable: Unemployment R-squared: 0.711
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 36.29
Date: Fri, 26 Aug 2022 Prob (F-statistic): 1.43e-119
Time: 19:18:22 Log-Likelihood: -454.59
No. Observations: 569 AIC: 983.2
Df Residuals: 532 BIC: 1144.
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.3576 0.166 2.159 0.031 0.032 0.683
Migration -0.1745 0.061 -2.853 0.004 -0.295 -0.054
GDPpc -1.0470 0.126 -8.316 0.000 -1.294 -0.800
Inflation -0.2095 0.031 -6.716 0.000 -0.271 -0.148
AUS -0.5743 0.201 -2.860 0.004 -0.969 -0.180
AUT -0.6320 0.206 -3.064 0.002 -1.037 -0.227
BEL -0.4677 0.203 -2.304 0.022 -0.866 -0.069
CAN -0.4028 0.225 -1.786 0.075 -0.846 0.040
CHE -0.0372 0.204 -0.182 0.856 -0.439 0.364
CHL -1.4256 0.361 -3.946 0.000 -2.135 -0.716
CZE -1.4376 0.249 -5.784 0.000 -1.926 -0.949
DEU -0.4057 0.211 -1.919 0.056 -0.821 0.010
DNK -0.6018 0.185 -3.246 0.001 -0.966 -0.238
ESP 1.1528 0.248 4.646 0.000 0.665 1.640
EST -1.0155 0.266 -3.824 0.000 -1.537 -0.494
FIN -0.1777 0.196 -0.907 0.365 -0.562 0.207
FRA -0.3754 0.204 -1.836 0.067 -0.777 0.026
GBR -0.8665 0.208 -4.169 0.000 -1.275 -0.458
GRC 1.1085 0.267 4.156 0.000 0.585 1.632
HUN -1.3301 0.275 -4.840 0.000 -1.870 -0.790
IRL 0.7852 0.187 4.197 0.000 0.418 1.153
ISL -0.4262 0.224 -1.899 0.058 -0.867 0.015
ISR -1.4347 0.271 -5.302 0.000 -1.966 -0.903
ITA -0.1424 0.213 -0.669 0.504 -0.561 0.276
LTU -1.0701 0.441 -2.424 0.016 -1.937 -0.203
LUX 3.2768 0.394 8.308 0.000 2.502 4.052
LVA -1.1580 0.394 -2.943 0.003 -1.931 -0.385
MEX -2.5281 0.317 -7.971 0.000 -3.151 -1.905
NLD -0.7843 0.185 -4.237 0.000 -1.148 -0.421
NOR -0.2509 0.183 -1.373 0.170 -0.610 0.108
NZL -1.0172 0.300 -3.389 0.001 -1.607 -0.428
POL -0.7752 0.287 -2.700 0.007 -1.339 -0.211
PRT -0.7019 0.256 -2.737 0.006 -1.206 -0.198
SVK -0.0708 0.267 -0.265 0.791 -0.596 0.454
SVN -1.0328 0.272 -3.796 0.000 -1.567 -0.498
SWE -0.6196 0.199 -3.113 0.002 -1.011 -0.229
TUR 0.6481 0.362 1.788 0.074 -0.064 1.360
==============================================================================
Omnibus: 39.922 Durbin-Watson: 0.365
Prob(Omnibus): 0.000 Jarque-Bera (JB): 99.234
Skew: 0.353 Prob(JB): 2.83e-22
Kurtosis: 4.921 Cond. No. 64.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
LSDV=164.65613850356738
To test whether the fixed effects model is actually better than a pooled OLSR model, we again perform an F-test.
#First build and fit a Pooled OLSR model on the panel data set
pooled_y=df_panel[y_var_name]
pooled_X=df_panel[X_var_names]
pooled_X = sm.add_constant(pooled_X)
pooled_olsr_model = sm.OLS(endog=pooled_y, exog=pooled_X)
pooled_olsr_model_results = pooled_olsr_model.fit()
#Setup the variables for calculating the F-test
#n=number of groups
n=len(unit_names)
#T=number of time periods per unit
T=df_panel.shape[0]/n
#N=total number of rows in the panel data set
N=n*T
#k=number of regression variables of the Pooled OLS model
k=len(X_var_names)+1
#Get the Residual Sum of Squares for the Pooled OLS model
ssr_restricted_model = pooled_olsr_model_results.ssr
#Get the Residual Sum of Squares for the Fixed Effects model
ssr_unrestricted_model = lsdv_model_results.ssr
#Get the degrees of freedom of the Pooled OLSR model
k1 = len(pooled_olsr_model_results.params)
#Get the degrees of freedom of the Fixed Effects model
k2 = len(lsdv_model_results.params)
#Calculate the F statistic
f_statistic = ((ssr_restricted_model - ssr_unrestricted_model)/ssr_unrestricted_model)*((N-k2)/(
k2-k1))
print('F-statistic for FE model='+str(f_statistic))
#Calculate the critical value at alpha=.05
alpha=0.05
f_critical_value=st.f.ppf((1.0-alpha), (k2-k1), (N-k2))
print('F test critical value at alpha=0.05='+str(f_critical_value))
F-statistic for FE model=27.23218251543961
F test critical value at alpha=0.05=1.4584349540049237
Here the F-statistic of the fixed effects model is higher than the F-test critical value, meaning that the fixed effects model has a better goodness of fit than the pooled OLSR model.
# Import packages
import pymc3 as pm
import numpy as np
# We define two different functions to standardize our input data.
#The first is for the variables that do not have missing values, the second one can handle missing values, which is necessary for the Migration data.
def standardize(x):
    return (x - x.mean()) / x.std()

def standardize_ma(x):
    x_ma = np.ma.masked_invalid(x)
    return (x_ma - x_ma.mean()) / x_ma.std()
# Run functions to standardize inputs
Unemployment = standardize(df_combined['Unemployment'])
GDPpc = standardize(df_combined['GDPpc'])
Inflation = standardize(df_combined['Inflation'])
Migration = standardize_ma(df_combined['Migration'])
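Written out, the model we declare below is

$$\begin{aligned} \beta_{Inflation},\, \beta_{GDPpc},\, \beta_{Migration} &\sim \mathcal{N}(0,\, 0.1^2), \qquad c \sim \mathcal{N}(0,\, 1), \qquad \sigma \sim \text{HalfNormal}(1),\\ \mu_i &= c + \beta_{Inflation}\,\text{Inflation}_i + \beta_{GDPpc}\,\text{GDPpc}_i + \beta_{Migration}\,\text{Migration}_i,\\ \text{Unemployment}_i &\sim \mathcal{N}(\mu_i,\, \sigma^2), \end{aligned}$$

with the missing Migration values treated as extra parameters with a standard normal prior, which PyMC3 imputes during sampling.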
# Declare Bayesian model
with pm.Model() as normal_missing:
    # Initialize priors on the coefficients; we set a low σ_prior to prevent overfitting.
    constant = pm.Normal('constant', mu = 0.0, sd = 1.0)
    σ_prior = 0.1
    # We define our coefficients as normally distributed with mean 0 and the aforementioned tight σ_prior
    b_inflation = pm.Normal('b_inflation', mu = 0, sd = σ_prior)
    b_GDPpc = pm.Normal('b_GDPpc', mu = 0, sd = σ_prior)
    b_Migration = pm.Normal('b_Migration', mu = 0, sd = σ_prior)
    # For Migration we use the observed values; the missing entries are initialized as also being normally distributed
    Migration = pm.Normal('Migration', mu = 0, sd = 1.0, observed = Migration)
    # Our model formula: Unemployment is explained by Inflation, GDP per capita and Migration
    μ = constant + b_inflation*Inflation + b_GDPpc*GDPpc + b_Migration*Migration
    σ = pm.HalfNormal('σ', 1)
    unemployment = pm.Normal('Unemployment', μ, σ, observed=Unemployment)
    trace_normal_missing = pm.sample(cores = 1)

# For the missing values we sample Migration from the posterior predictive distribution
with normal_missing:
    ppc_normal_missing = pm.sample_posterior_predictive(trace_normal_missing, var_names=['Migration'])
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [σ, Migration_missing, b_Migration, b_GDPpc, b_inflation, constant]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 6574 seconds.
import arviz as az
# Plot the trace plots and a summary of the model's results
data_posterior_normal_missing = az.from_pymc3(trace_normal_missing, posterior_predictive = ppc_normal_missing)
variables_unemployment = ['b_inflation','b_GDPpc', 'b_Migration']
az.plot_trace(trace_normal_missing, var_names = variables_unemployment)
az.summary(data_posterior_normal_missing.posterior ,variables_unemployment)
|   | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| b_inflation | -0.106 | 0.034 | -0.171 | -0.044 | 0.001 | 0.000 | 3861.0 | 1663.0 | 1.0 |
| b_GDPpc | -0.348 | 0.045 | -0.431 | -0.263 | 0.001 | 0.001 | 3497.0 | 1680.0 | 1.0 |
| b_Migration | -0.119 | 0.044 | -0.202 | -0.036 | 0.001 | 0.001 | 4094.0 | 1712.0 | 1.0 |
data_posterior_normal_missing
[InferenceData output truncated. The repr lists five groups:
posterior (chain: 2, draw: 1000) with constant, b_inflation, b_GDPpc, b_Migration, σ and the 23 imputed Migration_missing values;
posterior_predictive with Migration (592 observations);
log_likelihood with Migration and Unemployment;
sample_stats (energy, lp, acceptance_rate, tree_depth, step_size, n_steps, diverging, ...);
observed_data with Migration and Unemployment (592 observations each).]
import matplotlib.colors as colors
# Bayesian analysis Graph
percentiles = np.percentile(data_posterior_normal_missing.posterior_predictive.Migration,[2.5,97.5],axis=[0,1])
# Create different colors for all countries
plt.style.use('seaborn-dark-palette')
NUM_COLORS = 34
cm = plt.get_cmap('tab20b')
cNorm = colors.Normalize(vmin=0, vmax=NUM_COLORS-1)
scalarMap = mplcm.ScalarMappable(norm=cNorm, cmap=cm)
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=([scalarMap.to_rgba(i) for i in range(NUM_COLORS)]))
# Create figure
fig, ax = plt.subplots(figsize=(13,5))
# Plot all 95% percentile ranges and data points
for i, country in enumerate(df_combined.Country.unique()):
    mask = (df_combined.Country == country)
    # Use the same color for each country's interval lines and scatter points
    color = scalarMap.to_rgba(i)
    ax.vlines(Migration[mask], percentiles[0,:][mask], percentiles[1,:][mask], color = color, alpha = 0.3)
    ax.scatter(Migration[mask], Unemployment[mask], label = country, color = color)
# Set labels
ax.set_xlabel('Migration')
ax.set_ylabel("Native Born Unemployment(%)")
# Add legend
plt.tight_layout()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.);
Most of the points lie within the lines, meaning that most of the observations fall within the 95% posterior predictive interval; at the top left, however, we do have some observations that do not fit within their intervals. This could indicate that not all countries fit the model completely.
We have used three different methods to estimate our results. Two of them we tested on simulated data, where they recovered the coefficients the simulated dataset was created with. For the Bayesian analysis we looked at the trace plots and r_hat values to check that our results are valid. In the discussion we compare the findings from the different methods and reach a final conclusion.
# We once again print the results from our Lasso regression, Fixed effects regression and Bayesian model
print("Lasso regression results:")
print(coef_)
print('R squared training set', round(lasso_best.score(X_train, y_train)*100, 2))
print('R squared test set', round(lasso_best.score(X_test, y_test)*100, 2))
print("Fixed effects regression results:")
print(lsdv_model_results.summary())
Lasso regression results:
[(-0.11925072109542662, 'Migration'), (-0.2871240883786068, 'GDPpc'), (-0.0, 'Inflation')]
R squared training set 20.64
R squared test set 16.79
Fixed effects regression results:
OLS Regression Results
==============================================================================
Dep. Variable: Unemployment R-squared: 0.711
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 36.29
Date: Fri, 26 Aug 2022 Prob (F-statistic): 1.43e-119
Time: 19:57:25 Log-Likelihood: -454.59
No. Observations: 569 AIC: 983.2
Df Residuals: 532 BIC: 1144.
Df Model: 36
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.3576 0.166 2.159 0.031 0.032 0.683
Migration -0.1745 0.061 -2.853 0.004 -0.295 -0.054
GDPpc -1.0470 0.126 -8.316 0.000 -1.294 -0.800
Inflation -0.2095 0.031 -6.716 0.000 -0.271 -0.148
AUS -0.5743 0.201 -2.860 0.004 -0.969 -0.180
AUT -0.6320 0.206 -3.064 0.002 -1.037 -0.227
BEL -0.4677 0.203 -2.304 0.022 -0.866 -0.069
CAN -0.4028 0.225 -1.786 0.075 -0.846 0.040
CHE -0.0372 0.204 -0.182 0.856 -0.439 0.364
CHL -1.4256 0.361 -3.946 0.000 -2.135 -0.716
CZE -1.4376 0.249 -5.784 0.000 -1.926 -0.949
DEU -0.4057 0.211 -1.919 0.056 -0.821 0.010
DNK -0.6018 0.185 -3.246 0.001 -0.966 -0.238
ESP 1.1528 0.248 4.646 0.000 0.665 1.640
EST -1.0155 0.266 -3.824 0.000 -1.537 -0.494
FIN -0.1777 0.196 -0.907 0.365 -0.562 0.207
FRA -0.3754 0.204 -1.836 0.067 -0.777 0.026
GBR -0.8665 0.208 -4.169 0.000 -1.275 -0.458
GRC 1.1085 0.267 4.156 0.000 0.585 1.632
HUN -1.3301 0.275 -4.840 0.000 -1.870 -0.790
IRL 0.7852 0.187 4.197 0.000 0.418 1.153
ISL -0.4262 0.224 -1.899 0.058 -0.867 0.015
ISR -1.4347 0.271 -5.302 0.000 -1.966 -0.903
ITA -0.1424 0.213 -0.669 0.504 -0.561 0.276
LTU -1.0701 0.441 -2.424 0.016 -1.937 -0.203
LUX 3.2768 0.394 8.308 0.000 2.502 4.052
LVA -1.1580 0.394 -2.943 0.003 -1.931 -0.385
MEX -2.5281 0.317 -7.971 0.000 -3.151 -1.905
NLD -0.7843 0.185 -4.237 0.000 -1.148 -0.421
NOR -0.2509 0.183 -1.373 0.170 -0.610 0.108
NZL -1.0172 0.300 -3.389 0.001 -1.607 -0.428
POL -0.7752 0.287 -2.700 0.007 -1.339 -0.211
PRT -0.7019 0.256 -2.737 0.006 -1.206 -0.198
SVK -0.0708 0.267 -0.265 0.791 -0.596 0.454
SVN -1.0328 0.272 -3.796 0.000 -1.567 -0.498
SWE -0.6196 0.199 -3.113 0.002 -1.011 -0.229
TUR 0.6481 0.362 1.788 0.074 -0.064 1.360
==============================================================================
Omnibus: 39.922 Durbin-Watson: 0.365
Prob(Omnibus): 0.000 Jarque-Bera (JB): 99.234
Skew: 0.353 Prob(JB): 2.83e-22
Kurtosis: 4.921 Cond. No. 64.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The fixed effects regression finds a negative relationship between GDP per capita, Inflation and Migration and the outcome variable Native Born Unemployment. All three coefficients found are statistically significant.
The lasso regression finds that Inflation does not influence Native Born Unemployment. It also finds a negative relationship between GDP per capita and Migration and the outcome variable Native Born Unemployment. However, the R-squared values are quite low, indicating that the variance of the dependent variable (Unemployment of the native born) is not well explained by the variance of the independent variables (Migration, GDPpc and Inflation).
The Bayesian model finds a negative relationship between GDP per capita, Inflation and Migration and the outcome variable Native Born Unemployment.
The negative relationship is strongest for GDP per capita, which makes sense, as a high GDP per capita indicates a strong economy, which can lead to low unemployment.
For the Bayesian model and the lasso regression the effect sizes of Migration are almost exactly equal at -0.119.
Overall we can conclude that while our coefficients are not exactly equal across the different methods, they all have the same negative sign. The negative relationship between Inflation and Native Born Unemployment is surprising, as inflation is not normally thought to have a positive effect; however, very low inflation can indicate economic stagnation, which can weaken the labor market and thus increase Native Born Unemployment.
Finally, we do find that more migration is correlated with lower Native Born Unemployment, but this does not necessarily imply that more migration decreases unemployment. Another factor could both attract immigrants and lower Native Born Unemployment, for example a country's economy doing well. To conclude: we do not find definitive evidence to disprove the far-right talking points mentioned in the introduction, but we certainly do not find evidence that proves their point either.
DAG code: https://mungingdata.com/python/dag-directed-acyclic-graph-networkx/
Bayesian analysis code: https://github.com/janboone/msc_datascience/blob/master/hacking_part_2.ipynb
Lasso regression code: https://www.kirenz.com/post/2019-08-12-python-lasso-regression-auto/#lasso-regression
Fixed effects regression: https://timeseriesreasoning.com/contents/the-fixed-effects-regression-model-for-panel-data-sets/