Statistical Test: Checking for Gender Bias in University Admission List

In the spirit of showing up, here's another thing I did:

I carried out a data analysis task to check for traces of gender bias in a university's admissions list. This project excites me because it addresses an issue strongly tied to one of my interests: data ethics. To learn more about data bias and ethics as they relate to gender, Caroline Criado Perez has written a very informative book, *Invisible Women*.

In this writing, I try to present a simple description of the steps I followed in carrying out the task. As the writing progresses, you may come across advanced statistical jargon; don't fret, I'll try to keep it simple and point to resources that can help you understand it. I have also included the key findings of my analysis. Finally, you can find the code for the project here. Now let's get right into it!

A data analysis/mining workflow is rarely a straight line: datasets differ in their characteristics and may require different stages, and you may have to loop back to a previous step of the process. It's still good practice to tidy up the workflow at the end of your analysis.

(Tip: always keep a copy of your original dataset in a separate file in case of a Titanic-style sink; that way, you always have the raw materials to rebuild.)

The Dataset

The dataset contained a university's admissions list from a secondary source. The table below shows the dataset's features, descriptions, and values.

[Image: table of the dataset's features, descriptions, and values]

Assumptions

Stating the assumptions you made improves the integrity and clarity of your analysis. Many people leave their assumptions out when reporting an analysis; don't do that. The assumptions I made are listed below.

  1. The dataset is a representative sample (across all features) of the entire admissions list.

  2. The school admits based only on SAT scores (most schools have other admission requirements, but since our data provides only SAT scores, we need this assumption to move forward).

The Problem/Task

A key skill every good data analyst must have is the ability to translate problems into a statistical perspective. This helps clarify which statistical methods and tools to apply in solving the problem.

From a layman's view: Determine if there are traces of gender bias in a university's admission list.

From a statistical view: Determine if gender is a significant predictor of getting admitted into a particular university.

"Significant" in the statement above could be ambiguous. This leads us to our next consideration: p-values.

Before we talk about P-values, let's talk about hypotheses. In stats, a hypothesis is a formal claim about a population. There are two types of hypotheses in statistics.

  • Null Hypothesis or H0: It states that there is no relationship between two variables/entities being considered.

  • Alternative Hypothesis or H1: It states that there is a relationship between two variables/entities being considered.

In this context, the null and alternative hypotheses are the following.

H0: There is no relationship between gender and university admission.

H1: There is a relationship between gender and university admission.

(It gets a bit tricky here, but I'll keep it simple as I promised)

In stats, a key idea is that we can never prove a claim to be true, since we cannot observe the claim at all times (including in the future). We can, however, show that a claim is not always true if we find evidence against it in present or past observations. Therefore, our tests are usually constructed to disprove a claim. Statistical tests follow the court system's "innocent until proven guilty": the null hypothesis is held to be true until it is disproved.

Ei incumbit probatio qui dicit, non qui negat (the burden of proof is on the one who declares, not on one who denies)

In statistical tests, we aim to reject the null hypothesis (in this context, we want to reject the hypothesis that states there is no association between gender and university admission). To reject a null hypothesis, we need some evidence that backs the alternative hypothesis. With enough evidence, we reject the null hypothesis.

(But how much evidence is enough??)

Another key idea in statistics is that any outcome could be a result of chance. With this knowledge, we can also say that the evidence gotten against the null hypothesis could be a result of chance.

Consider a statistical test to check whether a coin is fair. In 100 throws, we would expect an outcome of around 50 heads and 50 tails. However, note that even with a fair coin we can get an outcome of 95 heads and 5 tails; the probability of obtaining an outcome at least that extreme from a fair coin is just very low. This probability is known as the **p-value**.

In simple terms, the p-value is the probability of obtaining evidence at least as strong as ours against the null hypothesis purely by chance, assuming the null hypothesis is actually true. A smaller p-value signifies stronger evidence against the null hypothesis.
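
To make this concrete, here's a small illustration (not part of the original analysis) of how we could compute that probability for the coin example with SciPy:

```python
from scipy.stats import binom

# Probability of seeing 95 or more heads in 100 flips of a fair coin.
# This is the kind of "how surprising is this outcome if the null
# hypothesis (a fair coin) is true?" question that a p-value answers.
p_extreme = binom.sf(94, n=100, p=0.5)  # P(X >= 95) for X ~ Binomial(100, 0.5)
print(f"P(at least 95 heads in 100 fair flips) = {p_extreme:.2e}")
```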

In testing whether gender is related to admission here, I would be checking the p-value of gender as a predictor of admission, i.e., the probability of seeing an association this strong in our sample if gender were actually unrelated to university admission.

After obtaining the p-value, the null hypothesis can be rejected or accepted based on a threshold chosen by the statistician. In statistics, we often use a threshold of 0.05, i.e.

  • p-value > 0.05: Accept the null hypothesis, reject the alternative hypothesis

  • p-value < 0.05: Reject the null hypothesis, accept the alternative hypothesis

Honestly, this last concept can be tricky. It took me some time to understand and digest it. I have left some links at the end of this article to help you understand or solidify your understanding of statistical significance and p-values.

Another way to reconstruct the problem statement is to ask, "Given the same SAT scores, how do the odds of one gender getting admitted compare to the odds of the other gender getting admitted?"

I have left a link to an article that describes why statisticians prefer to consider odds instead of probabilities. One of the reasons is that odds let us apply the logistic model to the problem.

In stats, the logistic model (or logit model) models the probability of one event (out of two alternatives) taking place by modeling the log-odds (the logarithm of the odds) of the event as a linear combination of one or more predictors (independent variables). I have left links to some materials that explain the logistic model.
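
In our setting, with admission as the outcome and SAT score and gender as predictors, the log-odds form of the model looks like this (a sketch; the coefficients β are what the model estimates from the data):

```latex
\log\left(\frac{p_{\text{admitted}}}{1 - p_{\text{admitted}}}\right)
  = \beta_0 + \beta_1 \cdot \text{SAT} + \beta_2 \cdot \text{gender}
```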

Now, to the actual process.

The analysis was done in the Python programming language using some data-handling and statistical libraries (Pandas, Matplotlib, Seaborn, SciPy, Statsmodels). After importing the data, these were the stages of the entire process.
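
For reference, the setup and data import looked roughly like this (the file name is a placeholder, not the one from the actual project):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Load the admissions list into a DataFrame (file name is illustrative)
df = pd.read_csv("admissions.csv")
```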

Stage 1: Exploratory Data Analysis

When carrying out statistical tests, I start with EDA on the dataset. Here's why:

  • It helps me know the data better: its content, structure, and flaws.

  • It's the first and easiest way to extract insights from the data.

At this stage, I did the following:

  1. Viewed the first 10 rows of the dataset to have a summarised view of its content.

  2. Checked the data types of each feature in the dataset.

  3. Obtained a descriptive summary for all features in the dataset, including categorical features.

  4. Checked for the presence of null values, duplicate entries, and outliers.

  5. Plotted bar charts and probability distribution curves to understand the distribution of each feature in the dataset.

I took my EDA a step further by filtering the data by gender and comparing findings between the two genders.
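
A rough sketch of these EDA steps in pandas/seaborn is shown below (the column names 'SAT', 'Gender', and 'Admitted' are assumptions, not necessarily the exact names in the dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# df is the DataFrame loaded earlier
print(df.head(10))                     # first 10 rows
print(df.dtypes)                       # data type of each feature
print(df.describe(include="all"))      # summary stats, categorical features included
print(df.isnull().sum())               # null values per column
print(df.duplicated().sum())           # duplicate rows

# Distribution of SAT scores and the class balance of the outcome
sns.histplot(df["SAT"], kde=True)
plt.show()
print(df["Admitted"].value_counts(normalize=True))

# Repeat the comparison per gender
print(df.groupby("Gender")["SAT"].describe())
print(pd.crosstab(df["Gender"], df["Admitted"], normalize="index"))
```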

These were my findings from the EDA:

  • The dataset contains 168 records of students (90 male and 78 female). The data was clean with no inconsistent or missing values.

  • The SAT scores follow a roughly normal distribution (with a skewness of 0.008). This signifies that few applicants did exceptionally well or exceptionally poorly, i.e., the test was neither too simple nor too difficult.

  • 94 (56%) students were admitted into the university while 74 (44%) students were not.

  • 81% of female applicants were admitted while 34% of male applicants were admitted.

  • More male applicants had lower scores than female applicants.

  • All applicants with scores above 1800 were admitted.

  • Some applicants with scores between 1500 and 1800 were admitted while others weren't.

The Jupyter notebook at the end of the writing contains tables and charts showing these results.

Stage 2: Data Preprocessing

At this stage, I encoded the categorical features using the label encoding technique. For the gender feature, I mapped males to 1 and females to 0, while for the admitted feature, I mapped admitted to 1 and not admitted to 0.
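
A minimal sketch of that encoding with pandas (the original string values 'Male'/'Female' and 'Admitted'/'Not admitted' are assumptions; adjust them to whatever the dataset actually contains):

```python
# Label-encode the two categorical features as 0/1 integers
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["Admitted"] = df["Admitted"].map({"Admitted": 1, "Not admitted": 0})
```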

Stage 3: Check for Quasi-Complete Separation

Separation in a logistic regression problem occurs when a predictor (independent variable) perfectly, or almost perfectly, splits the outcome variable into its two classes. Consider the table and plot below showing the heights of job applicants and whether they were employed or not (1 signifies employed while 0 signifies not employed).

[Image: table of applicant heights and employment outcomes]

[Image: scatter plot of Employment vs Height]

From the table, we see that all applicants shorter than 172 cm were not employed. This is a scenario where quasi-complete separation occurs. If height were the fair and only criterion for employment, we could infer that the process was without bias, since all applicants above a particular height were employed and all below it were not.

Following our assumption that SAT scores were the only criterion for admission, we would expect to see quasi-complete separation between admission and SAT scores, i.e., all students with scores below a particular threshold would not be admitted while all students with scores above it would be. This is what our scatter plot looked like for the 'SAT' and 'admitted' features.

[Image: Admission vs SAT scatter plot 1]

The graph shows something close to quasi-complete separation, but if we look closely, there's an overlap at scores between 1625 and 1700. This means some of the students with scores in this range were admitted while others weren't.

[Image: Admission vs SAT scatter plot 2]
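
Plots like the two above can be reproduced with a short snippet, assuming the encoded columns from the preprocessing step:

```python
import matplotlib.pyplot as plt

# Each point is one applicant: SAT score on the x-axis,
# admission outcome (1 = admitted, 0 = not admitted) on the y-axis
plt.scatter(df["SAT"], df["Admitted"], alpha=0.5)
plt.xlabel("SAT score")
plt.ylabel("Admitted")
plt.title("Admission vs SAT")
plt.show()
```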

Stage 4: Further Exploratory Data Analysis

After finding an interesting spot at scores between 1625 and 1700, I filtered my data down to that portion to see if I could extract any insights. The analysis showed the following.

For students who scored between 1625 and 1700, the odds of a female getting admitted were 8 times the odds of a male getting admitted. Note that this result holds only for scores in this range.
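
Here's a sketch of how that odds ratio can be computed from the filtered slice (column names and 0/1 encoding as assumed above):

```python
# Restrict to the overlapping score range found in Stage 3
overlap = df[(df["SAT"] >= 1625) & (df["SAT"] <= 1700)]

# Admission rate per gender within this slice (females coded as 0, males as 1)
rates = overlap.groupby("Gender")["Admitted"].mean()

# Convert admission probabilities to odds, then compare the two genders
odds = rates / (1 - rates)
female_vs_male = odds.loc[0] / odds.loc[1]
print(f"Odds ratio (female vs male), scores 1625-1700: {female_vs_male:.1f}")
```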

Stage 5: Model Fitting

One way to know if an independent variable (e.g., gender) is relevant to understanding an outcome variable (university admission) is to compare the results obtained from two statistical models: the first without the feature under consideration and the second with it. If the second model has stronger explanatory power than the first, that's a sign that the independent variable is relevant to understanding the outcome variable.

For logit models, we compare the pseudo R-squared values obtained from the two models. The model with the higher pseudo R-squared has stronger explanatory power.

*Note: a pseudo R-squared can only be interpreted when compared to the pseudo R-squared of another similar model predicting the same outcome.* I built two logit models: one with just SAT score as a predictor of admission, and another with SAT score and gender as predictors of admission. The results are shown below.
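
Here's a sketch of how the two models can be fitted and compared with statsmodels (column names as assumed earlier; McFadden's pseudo R-squared is available on the fitted results as `prsquared`):

```python
import statsmodels.api as sm

# Model 1: SAT score as the only predictor of admission
X1 = sm.add_constant(df[["SAT"]])
model_sat = sm.Logit(df["Admitted"], X1).fit()

# Model 2: SAT score and gender as predictors of admission
X2 = sm.add_constant(df[["SAT", "Gender"]])
model_sat_gender = sm.Logit(df["Admitted"], X2).fit()

# Compare explanatory power via McFadden's pseudo R-squared
print(f"Pseudo R-squared (SAT only):       {model_sat.prsquared:.3f}")
print(f"Pseudo R-squared (SAT and gender): {model_sat_gender.prsquared:.3f}")

# Full report (coefficients, p-values) for the second model
print(model_sat_gender.summary())
```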

[Image: logit model 1 results (SAT only)]

[Image: logit model 2 results (SAT and gender)]

The pseudo R-squared value for the second model was higher than that of the first, signifying stronger explanatory power.

Finally, from our results, we see that the gender feature had a p-value of 0.02. This means that if there were truly no relationship between gender and admission, the probability of our results arising purely by chance would be 0.02. At this point, we can choose to reject or not reject the null hypothesis based on the threshold chosen. Using a 0.05 threshold, we reject the null hypothesis.

The snapshot below interprets the report of the second logit model.

[Image: interpretation of the second logit model's report]
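
The headline number in that interpretation can be recovered by exponentiating the gender coefficient from the second model; a sketch, continuing from the models above:

```python
import numpy as np

# With males coded as 1 and females as 0, exp(coefficient) gives the odds of a
# male being admitted relative to a female with the same SAT score; its
# reciprocal is the female-to-male odds ratio quoted in the conclusion.
male_vs_female = np.exp(model_sat_gender.params["Gender"])
female_vs_male = 1 / male_vs_female
print(f"Female-to-male odds ratio: {female_vs_male:.1f}")
```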

Conclusion
From the results of the analysis, we see that female applicants had 7 times the odds of male applicants of getting admitted into this university. Adjustments like this are often made to reduce male dominance in male-dominated fields and to encourage female participation in those fields. The tools and knowledge applied in this analysis can be used to solve similar problems in other contexts, such as examining whether gender plays a significant role in a customer buying a product.

View the Jupyter notebook containing the analysis implementation

References
Cover image by Ellii
Data365 Data Science Bootcamp
Why statisticians like odds
Statistical significance and p-values
What are p-values
What are logit models