PISA test, analysis of school performance.

PISA test data

For a recent, great Meetup of the NYC Data Wranglers we explored three very different data sets. While I worked with a small team on predicting oil prices from historical OPEC data during the Meetup itself, on the train there I did some exploratory data analysis on a data set measuring the influence of external factors on kids' school performance. Here is a small summary of my findings.

The data was collected by the “Programme for International Student Assessment” (PISA test) in order to identify potential policies that help improve school performance. Here I analyze PISA test data reporting students' performance in reading, science and math for 68 countries (34 OECD members, 34 non-OECD members). The performance for each country is reported for groups classified by i) class size, ii) age of father, iii) age of mother, iv) public or private school or v) teacher morale.

Using this data I want to test the following hypotheses:

1) Performance in reading, science and math are highly correlated.
2) Students in OECD member countries outperform non-member countries.
3) The smaller the class, the higher the performance.
4) Private schools outperform public schools.
5) Teacher morale is an important contributor to performance.
6) The older the parents, the higher the performance.

All data and code used to perform this analysis can be downloaded from my github repository. While my narrative sometimes suggests causality, I am well aware that the data presented here can only measure correlation not causation.

Correlation of performance metrics

I hypothesize that systematic differences in students' performance outweigh topic-specific differences. To test this hypothesis, I combined all five tables, such that each group in each country contributes three performance measurements. I then scatter-plotted the performance measurements for reading, math and science pairwise (Figure 1, lower triangular panels). The diagonal appearance of all pairwise scatterplots indicates a high correlation between performance measurements. In agreement, a Pearson correlation test yields correlation coefficients > 0.95 for all pairs (Figure 1, upper triangular panels). This confirms that all three measurements are highly correlated.
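
I did the original analysis in R; purely to make the check concrete, here is a minimal Python sketch of the pairwise correlations (file and column names are made up):

    import pandas as pd
    from scipy.stats import pearsonr

    # Combined table: one row per country/group, hypothetical column names.
    df = pd.read_csv("pisa_combined.csv")
    for a, b in [("Reading", "Math"), ("Reading", "Science"), ("Math", "Science")]:
        r, p = pearsonr(df[a], df[b])   # Pearson correlation coefficient and p-value
        print(f"{a} vs {b}: r = {r:.3f} (p = {p:.2g})")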

Figure 1. School performance measures are highly correlated. Scatter plots of performance measures (lower triangular panels) for OECD member (green) and non-member states (red). Densities of performance values (diagonal panels) and pairwise correlation coefficients between performance measures (upper triangular panels). Boxplots of performance values for countries grouped by OECD membership (right column). School performance is measured in arbitrary units and scaled such that the mean for each performance measure is 500 and the standard deviation 100.

Performance of OECD member and non-member countries

The Organisation for Economic Co-operation and Development (OECD) acts as a forum to compare policy approaches, identify good practices and coordinate their realization in member countries. The educational performance quantified here can thus be used as a measure of the OECD's own performance. I used distinct colors for OECD member (green) and non-member states (red) in Figure 1. The distributions of all three performance measures differ significantly between these two groups of countries (Figure 1, diagonal panels), as measured by Kolmogorov-Smirnov tests for reading, math and science. Interestingly, there are minor peaks within the distributions of both classes which overlap with the major peak of the other. This may be due to a confounding variable, and thus the correlation between OECD membership and school performance may not be direct. An obvious candidate for such a confounding variable is the wealth of the country.
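
Again not my original code, but a short Python sketch of the two-sample Kolmogorov-Smirnov comparison, assuming a membership column named OECD:

    import pandas as pd
    from scipy.stats import ks_2samp

    df = pd.read_csv("pisa_combined.csv")       # assumed file and column names
    member = df[df["OECD"] == "member"]
    non_member = df[df["OECD"] == "non-member"]
    for subject in ["Reading", "Math", "Science"]:
        stat, p = ks_2samp(member[subject], non_member[subject])
        print(f"{subject}: KS statistic = {stat:.3f}, p = {p:.2g}")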

School performance and class size

I hypothesized that a higher teacher-to-student ratio improves school performance. To test this, I visualized the distribution of performance (average of math, science and reading) using boxplots for each class-size group, separately for OECD member and non-member states (Figure 2, left panel). Interestingly, my hypothesis “the smaller the class, the better the performance” is wrong: for OECD non-members, performance is only decreased by very large classes (> 45 students). Similarly, in OECD member countries performance drops for classes with too many students. Surprisingly, performance also decreases significantly for classes with very few students, resulting in an apparent optimal class size of 31-35 students.

Figure 2. External factors affect school performance as measured by the PISA test. Influence of class size (left), school form (middle) and teacher morale (right) on school performance. Each panel is split for OECD member (right) and non-member states (left). Performance distributions are represented by boxplots for each class. School performance is measured in arbitrary units and scaled such that the mean for each performance measure is 500 and the standard deviation 100.

School form, teacher morale and school performance

Using the same visualization technique, I test whether private schools outperform public schools and whether teacher morale is highly correlated with performance. Both hypotheses hold true (Figure 2, middle and right panels).

School performance and parental age

Finally, I wanted to address whether the age of the parents is correlated with the school performance of their kids. The performances were grouped by age of the mother and the father, respectively, and represented with boxplots (Figure 3). Note that the age of the investigated students is ~15 years; parents in the first group (<36 years) therefore became parents in their teens. As hypothesized, very young parents correlate with decreased school performance of their kids. While parental age correlates positively with performance over the entire age range for non-OECD countries, there is an optimal age (46-50 years) for OECD countries. The pattern is similar for maternal and paternal age in both OECD member and non-member countries. Importantly, this correlation is likely not causal, but a secondary effect reflecting the tendency toward late parenthood in specific groups of society.

Figure 3. Parental age affects school performance as measured by the PISA test. Influence of maternal (left) and paternal age (right) on school performance. Each panel is split for OECD member (right) and non-member states (left). Performance distributions are represented by boxplots for each class. School performance is measured in arbitrary units and scaled such that the mean for each performance measure is 500 and the standard deviation 100.

Summary and interpretation

This exploratory analysis of the PISA test data allowed me to make the following observations concerning school performance:

1) Performance in reading, math and science is highly correlated.
2) Performance is significantly higher in OECD member than in non-member countries. My gut feeling tells me that this is very likely a confounding effect, potentially measuring the wealth of the country. This hypothesis could easily be tested using, for example, GDP as a predictor.
3) There seems to be an optimal class size. Again, this has to be taken with a grain of salt, and tests for indirect correlations have to be performed. If this correlation holds and can be extended to causation, it would be the most interesting result of this analysis, because it would provide an inexpensive way to improve educational performance.
4) Private schools outperform public schools and
5) teacher morale correlates positively with performance. While these results are expected, the latter may provide a relatively inexpensive lever to increase school performance.
6) The older the parents, the better the kids perform. Again, this effect is very likely to be secondary and is not exploitable to improve education.

GDELT: MySQL database

GDELT

The Global Database of Events, Language and Tone (GDELT) is a project that aims to catalogue events and their reception in an exhaustive and unbiased fashion. To this end, worldwide media coverage is classified with respect to the involved parties, the nature of the event, its location and the tone of the coverage itself. In total, 58 features are used to characterize an event.

I’m really excited about the sheer mass of information logged by GDELT: the database covers approximately a quarter billion events from 1979 to 2013. While this makes GDELT a great potential resource for answering political, economic and social questions, it shares the burden of all big data: how can we store the data efficiently while keeping it fast and easy to access?

Here I describe my MySQL implementation, designed for efficient database access without losing too much of the intuitive feel for the GDELT data. While the scale of the data may suggest a NoSQL solution, a MySQL database may allow easier interaction with the data for non-programmers. In the first section I describe the design process. In the second section I introduce java code to easily populate the database with data. In the third section I test the database by retrieving some useful information. Specifically, I ask i) how the number of events changes over time and ii) whether the feature QuadClass is a good indicator of conflict.

Database design

The goal of the database implementation is to provide easy access to the data by allowing fast responses to intuitive database queries.

The data provided by GDELT is formatted as one very large table (relation) with one simple constraint: every field contains only one value. Thus, the input data satisfies first normal form (1NF). Importantly, the data is highly redundant. Redundancy is problematic for at least two reasons: it makes the database susceptible to anomalies and it significantly increases the database footprint in terms of storage and access time. While anomalies will play a minor role in the GDELT database (no data modification after insertion is expected), the storage and latency effects of redundant data will be large.

To eliminate these problems, I want to transform the database to satisfy third normal form (3NF). While this can be done in an automated fashion, I want to use this transformation to keep the data as intuitive as possible for the user. To this end, I divided the GDELT fields into 6 subtables: Date, Actor, Event, Media, Geo and Source. All subtables except Source are further split into sub-subtables, eliminating redundancy while keeping usability. The primary keys of the subtables are foreign keys in the parent table, enforcing database integrity. While the vast majority of tables satisfy 3NF, I deliberately decided to keep some redundancy to maximize user friendliness. For the database design I used MySQL Workbench. Figure 1 visualizes the design, and the design file can be downloaded here.

Figure 1: GDELT MySQL Database design. Boxes indicate tables. Primary key dependencies are visualized by arrows between tables. Tables are grouped by color coding.

The design removes redundancy and makes searches significantly faster. Take for instance Actors: in the original data, each row contains all information for both actors involved in an event. In the new design we store unique sets of actor features in the Actor table and reference these entries from the main table. The resulting Actor table is significantly smaller than the main table and can be searched in milliseconds. Events for identified actors can then be quickly retrieved from the main table, taking advantage of foreign keys indexed in B-trees.
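
To make the idea concrete, here is a toy version of the actor normalization in Python/SQL. The layout is heavily simplified and the table and column names are placeholders, not my actual design from Figure 1:

    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="localhost", user="myUser",
                                   password="myPassword", database="gdelt")
    cur = conn.cursor()

    # Unique actor feature sets live in their own table ...
    cur.execute("""
        CREATE TABLE IF NOT EXISTS Actor (
            ActorId INT AUTO_INCREMENT PRIMARY KEY,
            ActorCode VARCHAR(16),
            ActorName VARCHAR(255),
            ActorCountryCode VARCHAR(8),
            UNIQUE KEY uq_actor (ActorCode, ActorName, ActorCountryCode)
        )""")
    # ... and are referenced from the main event table by their primary key.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS Event (
            GlobalEventId BIGINT PRIMARY KEY,
            Actor1Id INT, Actor2Id INT,
            FOREIGN KEY (Actor1Id) REFERENCES Actor(ActorId),
            FOREIGN KEY (Actor2Id) REFERENCES Actor(ActorId)
        )""")
    conn.commit()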

Database realization

Having decided on a database design, we can now create the database and populate it with data. Creation is very simple: MySQL Workbench allows forward engineering from the design to MySQL code that creates all relations (download here). Pasting this code into the MySQL command line sets up the schema.

Populating the database is a little bit more involved, mainly due to interconnected relations. To enable simple population, I wrote code to load the data taking advantage of the recursive nature of the data insertion process:
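
The actual loader is the Java class described below; purely for illustration, here is a minimal Python sketch of the idea — insert child tables first, cache already inserted value sets, and reuse their generated keys — with made-up table and column names:

    def insert_unique(cur, cache, table, columns, values):
        """Insert a unique value set only once and return its primary key."""
        key = (table, tuple(values))
        if key in cache:                          # already inserted: reuse the id
            return cache[key]
        placeholders = ", ".join(["%s"] * len(values))
        cur.execute(f"INSERT INTO {table} ({', '.join(columns)}) "
                    f"VALUES ({placeholders})", values)
        cache[key] = cur.lastrowid                # remember the generated key
        return cache[key]

    def insert_event(cur, cache, row):
        # Children first: their primary keys become foreign keys of the event row.
        # Column positions in `row` are illustrative, not the real GDELT layout.
        actor1_id = insert_unique(cur, cache, "Actor",
                                  ["ActorCode", "ActorName"], [row[5], row[6]])
        date_id = insert_unique(cur, cache, "Date", ["EventDate"], [row[1]])
        cur.execute("INSERT INTO Event (GlobalEventId, DateId, Actor1Id) "
                    "VALUES (%s, %s, %s)", (row[0], date_id, actor1_id))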

In the actual code, this recursion is implemented in the java class Table.java. All tables extend this class, implementing the recursion specific to the respective data and table subsets. To avoid attempts to insert redundant data, already inserted value sets are cached in hashes.

Population of the database occurs in two steps. The first step loads tables which are not expected to change much. The majority of these static tables are provided by the GDELT project; I had to remove some duplicates and added another table for the QuadClass. All tables can be downloaded here. These tables are loaded into the database via the class DatabaseInitiation.java, which also loads an example set into the database to test that everything runs smoothly. To set up the database, export DatabaseInitiation.java as an executable jar and run it from the command line:

Substitute 172.29.13.226:3306 with your MySQL server IP:Port, myUser and myPassword with your user credentials and /PATH/TO/staticTables/ with the directory you stored the static tables in.

The second step, the actual data import, is performed by DataImport.java. To load data into the database, export this class as an executable jar and run it from the command line:

Substitute 172.29.13.226:3306 with your MySQL server IP:Port, myUser and myPassword with your user credentials and 20131107.export.CSV with the GDELT data file you want to load.

The main disadvantage of the MySQL implementation lies in the database population: for every line of data, the database checks for each subtable whether it already contains the respective entries. I limit the number of actual MySQL queries by hashing unique data sets at each subtable level. Advantages of the normalized MySQL implementation are the small resulting footprint of 13 GB (dump w/o indices) and its fast query results (see below). Unfortunately, I don’t have space to host a database dump here, but I’m happy to upload it to a server. Just drop me a line.

Testing the database

Mean number of events per day

Now that the database is populated, let us test its performance by finding out some general features of the data. Let’s start by reporting the number of events for each day; a sketch of the queries is shown below.

The first subtable, d, holds the mapping between DateId and date information; the second subtable, c, maps DateId to an event count. Joining the subtables d and c on DateId results in a table holding the date information together with the number of events reported. Considering that the entire database had to be traversed, the result was retrieved fairly quickly (2 min 33.73 sec). We can then use the newly created table EventCountPerDate to retrieve summary statistics for every year and save them to a file.
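
I don't reproduce my original SQL here; the Python sketch below only approximates it, with table and column names (Date, EventDate, Event, DateId) guessed from the design in Figure 1:

    import csv
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="myUser",
                                   password="myPassword", database="gdelt")
    cur = conn.cursor()

    # Join date information (d) with per-day event counts (c) on DateId.
    cur.execute("""
        CREATE TABLE EventCountPerDate AS
        SELECT d.DateId, d.EventDate, c.EventCount
        FROM `Date` d
        JOIN (SELECT DateId, COUNT(*) AS EventCount
              FROM Event GROUP BY DateId) c
          ON d.DateId = c.DateId""")

    # Per-year summary statistics, written to a local CSV file.
    cur.execute("""
        SELECT YEAR(EventDate), AVG(EventCount), STDDEV(EventCount)
        FROM EventCountPerDate GROUP BY YEAR(EventDate)""")
    with open("eventCountPerYear.csv", "w", newline="") as f:
        csv.writer(f).writerows(cur.fetchall())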

Visualization (script) of the data shows that reported events increase exponentially from ~1,000 events per day in 1980 to almost 100,000 events per day in the 2010s (Figure 2, note the logarithmic y-axis).

Figure 2: Number of events per day (Log10) plotted versus year. Mean (red line) and standard deviation (black dashed line) are shown.

QuadClass, an indicator for conflict?

For a slightly more sophisticated analysis, let us concentrate on the feature QuadClass (see GDELT data format). QuadClasses classify events into four groups: verbal and material cooperation (1 and 2) as well as verbal and material conflict (3 and 4). In theory, events with certain QuadClasses should be enriched during wars (i.e. material conflict) and others during peace periods (i.e. material cooperation). To test this hypothesis, I used the Global Peace Index to identify the most and least peaceful countries within the last 10 years. According to this index, the most peaceful country is Iceland, the least peaceful one Somalia. If GDELT's QuadClass classification is an unbiased estimator of conflict, there should be a significant difference in QuadClass counts between these two countries. Using the MySQL database we can easily retrieve the number of events in each year for each QuadClass.

The queries sketched below show how to perform this analysis using Iceland as an example. We first retrieve its CountryId, then create a new relation which reports DateId and QuadClass for all events in which Iceland (CountryId = 122) acted as Actor1, and finally summarize events by QuadClass and year and write the result to disk.
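
Again, this is not my original SQL but a hedged approximation; the table and column names (Country, CountryName, Actor.CountryId, Date.EventDate) are assumptions in the spirit of the design above:

    import csv
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="myUser",
                                   password="myPassword", database="gdelt")
    cur = conn.cursor()

    # 1) Look up Iceland's CountryId in an (assumed) country lookup table.
    cur.execute("SELECT CountryId FROM Country WHERE CountryName = 'Iceland'")
    country_id = cur.fetchone()[0]                # 122 in my database

    # 2) All events with Iceland as Actor1, keeping DateId and QuadClass.
    cur.execute("""
        CREATE TABLE IcelandActor1 AS
        SELECT e.DateId, e.QuadClass
        FROM Event e
        JOIN Actor a ON e.Actor1Id = a.ActorId
        WHERE a.CountryId = %s""", (country_id,))

    # 3) Count events per year and QuadClass and write the result to disk.
    cur.execute("""
        SELECT YEAR(d.EventDate), i.QuadClass, COUNT(*)
        FROM IcelandActor1 i JOIN `Date` d ON i.DateId = d.DateId
        GROUP BY YEAR(d.EventDate), i.QuadClass""")
    with open("iceland_actor1_quadclass.csv", "w", newline="") as f:
        csv.writer(f).writerows(cur.fetchall())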

In order not to bias the analysis towards aggressors or defenders, the analogous analysis is performed for Actor2 and for both data sets combined. The same procedure is followed for Somalia (CountryId 222), and the QuadClass counts per year are visualized (script for all remaining plots) for both countries (Figure 3).

Figure 3A: Iceland, event count per QuadClass (Log10) versus year.

Figure 3B: Somalia, event count per QuadClass (Log10) versus year.

The differences in i) the total number of events per year and ii) their distinct change over time make a direct comparison difficult. Both problems can be addressed by normalizing the QuadClass counts by the total number of events for each year and plotting the fraction of total events for each QuadClass (Figure 4).
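
In pandas, for example, this normalization is essentially a one-liner (the file follows the sketch above; column names are placeholders):

    import pandas as pd

    # One row per (year, QuadClass) with the number of events, e.g. the CSV
    # written in the sketch above.
    counts = pd.read_csv("iceland_actor1_quadclass.csv",
                         names=["Year", "QuadClass", "n"])
    totals = counts.groupby("Year")["n"].transform("sum")
    counts["fraction"] = counts["n"] / totals    # fraction of that year's events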

Figure 4A: Iceland, fractions of events per QuadClass (Log10) per year.

Figure 4B: Somalia, fractions of events per QuadClass (Log10) per year.

Obviously one QuadClass, "Verbal Cooperation", dominates, characterizing almost 100% of all events. Interestingly, the other three QuadClasses not only have similar values, but are also highly correlated. While this behavior could be expected for the related QuadClasses "Verbal Conflict" and "Material Conflict", the correlation with "Material Cooperation" is very surprising. A simple but unrewarding explanation would be that the three minor QuadClasses represent background noise and the observed changes are the result of fluctuations in the major QuadClass "Verbal Cooperation". In agreement, the minor QuadClasses of all countries I checked follow a similar pattern: approximately constant values until 2005, followed by a decrease. This behavior is anti-correlated with the sudden increase in total events (see Figure 2). The anti-correlation could suggest that the decrease in the minor QuadClasses is explained by an increase in total event coverage, which would give more weight to less spectacular news.

Despite this worrying correlation, can we still use QuadClass as an indicator of conflict? To boost sensitivity, let us collapse the two conflict subclasses and the two cooperation subclasses into one each. Figure 5 shows how the QuadClasses representing conflict and cooperation vary over time.

Figure 5A: Iceland, fractions of events per simplified QuadClass (Log10) per year.

Figure 5B: Somalia, fractions of events per simplified QuadClass (Log10) per year.

While the fractions of conflict-related events look visually similar (due to the log scale), they are significantly enriched in Somalia compared to Iceland (0.131 vs. 0.068). Assuming that the Global Peace Index and GDELT agree on the identity of the most extreme countries with respect to conflict, this 2-fold difference represents the expected dynamic range. Combined with a low signal-to-noise ratio, the observed dynamic range is too small to use the feature QuadClass as an independent, robust indicator of conflict.

Conclusion

The MySQL implementation described here provides an efficient and fast interface to interact with the GDELT data. Its disadvantages lie in the slow population of the database and the general problem of relational databases that the design has to anticipate future uses. These disadvantages could be avoided by using a non-relational solution like Hadoop, but that would complicate the interaction with the data.

A quick analysis of the GDELT data shows that it is indeed a large database of events with potential importance. Robust analysis of the data will be affected by significant biases, which will have to be taken care of with careful normalization. Due to the small dynamic range and small signal-to-noise ratio, the QuadClass feature alone does not seem to be a robust indicator of conflict.

Kaggle: Africa Soil Property Prediction Challenge

Kaggle Africa Soil Property Prediction Challenge

The Kaggle Africa Soil Property Prediction Challenge asked for the best model to predict 5 soil parameters (Ca, P, pH, SOC and Sand) from infrared spectroscopy data and some additional features. The competition posed a series of interesting challenges, including regression of multi-parameter targets, unbalanced training and test sets, a significant overfitting problem and a non-representative set used for the preliminary derivation of the leaderboard position.

Data preprocessing

All preprocessing presented below was performed in R and my script can be downloaded here.

The training data contained ~3-fold more features than samples, immediately indicating a potential overfitting problem. My data preprocessing therefore focused on dimensionality reduction without losing crucial information.

But before starting with feature reduction, I performed a normalization which significantly improved downstream prediction performance: all spectra have a characteristic decay from low to high wavenumbers (see Figure 1A). I quantified the decay by fitting an exponential function to the data and subtracted this decay from each spectrum, resulting in more evenly distributed intensities over the wavenumber range. This normalization collapsed the spectral data into more similar traces, enabling better comparison (see Figure 1B).
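
My preprocessing was done in R (script linked above); a rough Python sketch of the decay subtraction, assuming a simple exponential model, could look like this:

    import numpy as np
    from scipy.optimize import curve_fit

    def decay(x, a, b, c):
        """Assumed exponential model of the spectral baseline."""
        return a * np.exp(-b * x) + c

    def subtract_decay(wavenumbers, intensities):
        """Fit the decay to one spectrum and return the flattened trace."""
        # Rescale wavenumbers to [0, 1] to keep the exponential well behaved.
        x = (wavenumbers - wavenumbers.min()) / (wavenumbers.max() - wavenumbers.min())
        params, _ = curve_fit(decay, x, intensities, p0=(1.0, 1.0, 0.0), maxfev=10000)
        return intensities - decay(x, *params)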

Figure 1A: Raw spectral data. Relative intensities vs wavenumber are plotted for 10 examples. Sample IDs are given in the legend.

Figure 1B: Decay normalized spectral data. Relative intensities vs wavenumber are plotted.

Visualization of all spectra in one heat map identifies peaks common to all spectra as well as spectrum-specific peaks (Figure 2A). Clustering identifies distinct groups of similar spectra. To characterize the spectral data better, I added a few further features, including the characteristic decay described above as well as summary statistics like median, mean and standard deviation.

To reduce dimensionality, I decomposed the data into its principal components (PCA). Principal components were ordered by their contribution to the variance. The fraction of variance explained by each of the first 20 principal components is plotted in Figure 2B. The fast decay indicates that the data can be well approximated by a few principal components. In other words, the information contained in > 3000 features can be compressed into a few principal components while losing little information, the prerequisite for successful dimensionality reduction. The number of principal components used for prediction was determined during model selection (see below).
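
A minimal scikit-learn sketch of this step (my preprocessing was in R; the feature matrix and file name are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # X: samples x features matrix of normalized spectra plus added features.
    X = np.load("features.npy")                     # assumed file
    pca = PCA(n_components=20)
    scores = pca.fit_transform(StandardScaler().fit_transform(X))
    print(pca.explained_variance_ratio_)            # should decay quickly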

Figure 2A: Visualization of decay normalized spectral data. Columns correspond to wavenumber and color to intensity (see legend). Each row corresponds to one spectrum. Rows are ordered by clustering.

Figure 2B: Principal component analysis. The fraction of total variance of the data explained by each principal component is shown. Principal components are ordered by contribution.

Balancing training and test data sets

During preprocessing and first machine learning runs I noticed that the features of test and training data follow distinct distributions. Unbalanced training and test sets pose a significant problem by i) forcing extrapolation to unobserved ranges and/or ii) giving too much weight to outlier values.

Since generating more training data is impossible, training and test sets were matched by resampling. To find the training samples best representing the test data, I performed a nearest neighbor search in principal component space, returning the n most similar training samples for each test sample. Only training samples which were returned for at least one test sample were kept. Matching was performed in python using scikit-learn and can be found here.
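
The core of the matching script reduces to a nearest-neighbor query; a sketch of it (n and the file names are placeholders):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Training and test samples projected into principal component space.
    pc_train = np.load("pc_train.npy")              # assumed files
    pc_test = np.load("pc_test.npy")

    n = 5                                           # neighbors per test sample
    nn = NearestNeighbors(n_neighbors=n).fit(pc_train)
    _, idx = nn.kneighbors(pc_test)                 # most similar training samples
    keep = np.unique(idx.ravel())                   # returned for at least one test sample
    pc_train_matched = pc_train[keep]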

While this downsampling technique generates a training set more similar to the test set, the sample reduction itself can have critical negative effects on prediction performance and has to be assessed carefully.

Learning: Deep neural networks

To predict the soil properties from the preprocessed data, I used deep neural networks implemented in the H2O framework. Specifically, I used an ensemble of 20 deep learning networks, each containing 100 neurons in each of its 2 hidden layers. For each target value, a grid search was performed to identify the optimal regularization parameters. Model performance was quantified by 20-fold cross validation on the training data, calculating the root-mean-square error over all target values. The code to interact with the remote H2O server can be found here and is a slight modification of a script generated by Arno Candel.
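
My actual script talks to the (then current) H2O interface and trains the full ensemble with a grid search; just to give a flavor, here is a much simplified sketch using today's h2o Python package, with a single network and 20-fold cross validation instead of the ensemble, and placeholder file and column names:

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()                                         # or point it at a remote server
    train = h2o.import_file("train_preprocessed.csv")  # assumed file: PCs plus targets
    predictors = [c for c in train.columns if c.startswith("PC")]

    model = H2ODeepLearningEstimator(hidden=[100, 100],  # two hidden layers, 100 neurons each
                                     l1=1e-5, l2=1e-5,   # regularization, normally grid-searched
                                     nfolds=20)          # 20-fold cross validation
    model.train(x=predictors, y="Ca", training_frame=train)
    print(model.model_performance(xval=True).rmse())     # cross-validation RMSE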

Figure 3: Cross-validation error (RMSE) vs number of principal components. Data is shown for unprocessed (grey) decay-normalized (orange) and decay-normalized and test-set-matched (blue) data.

To identify the optimal number of principal components, I plotted the cross-validation error as a function of the number of principal components for the unprocessed (Raw), the decay-subtracted (Decay Normalized) and the decay-subtracted and test-set-matched data (Decay Normalized and Matched) (Figure 3). All three curves follow a typical pattern: the error decreases as important features are added and then increases again, indicating overfitting. The minimal cross-validation error was achieved using the first 64 principal components of the decay-normalized, test-matched training data set (RMSE = 0.36). The winning model was then used for prediction on the test data.

Conclusion

Like many other competitors, I was surprised by the huge discrepancy between the cross-validation error and the error on the actual test data. This boils down to one simple diagnosis: overfitting on the training set. While I tried to counteract this by dimensionality reduction, model averaging (an ensemble of networks) and matching of training and test data, this was clearly not sufficient. A more effective (and time-consuming) strategy would have been to train distinct learning algorithms and merge their individual predictions.

Neuroph: GraphML export

Neuroph is an awesome java neural network framework for machine learning. While I use the framework’s API, it also provides a graphical user interface which allows visualization of neural networks. Unfortunately, the visualization fails for networks with a large number of nodes. Here I introduce a GraphML export functionality for Neuroph, allowing visualization in Gephi.

In my work I use neural networks with several hundred input and hidden nodes. I would like to use interactive visualization and some graph theory algorithms to perform exploratory analysis. For such analyses I normally use Gephi and the R package iGraph, respectively. Both of these tools can import network architectures saved in the GraphML format, which uses XML syntax to describe structural properties of graphs.

I contributed a package to the Neuroph project, neuroph.contrib.graphml, which allows exporting a trained neural network as a GraphML file, specifying the network topology as well as node labels and learned weights. Export is performed with a few lines of code, as shown in Example.java in the repository.

Visualization of the exported network allows easy identification of important network nodes. Let's define important nodes as those connected by edges with large absolute weight. The GraphML file is imported into Gephi, the network is visualized in a space-filling way and edges are colored by weight:

Neural network. Nodes are represented as circles, edges as green lines connecting nodes. Edges are colored according to weight.

While this visualization is aesthetically appealing, the edges with large weight are difficult to spot due to the number of edges. This is where the interactive potential of Gephi comes into play. We can selectively show edges with large positive weight (left) or large negative weight (right):

Edges with large positive weight.

Edges with large negative weight.

We can thus easily identify nodes with high contributing potential. While this example only shows the start of an exploratory analysis, I hope it demonstrates how quick and interactive such an analysis becomes with the GraphML export functionality for Neuroph.

Kaggle: Titanic Competition

Introduction

I’d like to share my approach to the Kaggle Titanic competition. The objective of this competition is to use machine learning to predict which passengers are likely to survive the Titanic accident. The training data contains information (gender, age, class etc.) on surviving and deceased passengers. We use this data to train a classifier which incorporates useful features to predict passenger survival.

My approach can be divided into two parts: pre-processing and classification. I spent most of my time on preprocessing the data to maximize the use of the available information, and this focus is reflected in the sizes of the respective sections. I use a combination of R (with ggplot2) and Java (with the machine learning library weka). You’ll find the code I wrote on my github page.

Pre-processing - Parsing

The first task seems trivial: Parsing. Let’s take a look at one row:

32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C

We see good and bad things. Bad things first: i) the name string looks non-trivial to parse and ii) some of the features have missing values. The good thing is that there is a lot of information hidden within single features. Take for example the Name feature: the titles can tell us about the social standing and profession of a passenger. Does a Mister have the same chance to survive as a Master or a Reverend? Surnames can indicate families and therefore grouping. Another interesting feature is the Cabin feature (B78). Is the deck (B) relevant for survival? Do passengers with more than one cabin have distinct survival chances?

I wrote a simple parser in java to i) replace missing values with the weka standard ‘?’ and ii) extract more information from several features. Specifically, I split the original ‘Name’ feature into ‘Surname’ and ‘Title’. I split the ‘Ticket’ feature into its leading characters (‘TicketId’) and its numeric value (’TicketNr’). I further split the ‘Cabin’ feature into the number of cabins the passenger booked (‘CabinCount’), the deck the cabins are located on (‘CabinDeck’) and the cabin number (‘CabinNr’). I also keep the original ‘Cabin’ as a potential grouping feature. The new data contains the following features:

PassengerId,Survived,Pclass,Surname,Title,Sex,Age,SibSp,Parch,TicketId,TicketNr,Fare,CabinCount,CabinDeck,CabinNr,Cabin,Embarked

You can download the newly formatted training and test data from my github page.
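
The parser itself is written in Java; as an illustration only, a rough Python sketch of the same kind of feature extraction could look like this (the splitting rules are a simplified guess at the logic, not a translation of my parser):

    import csv

    def parse_passenger(row):
        """Derive Surname, Title, Ticket and Cabin features from one row."""
        surname, rest = row["Name"].split(",", 1)
        title = rest.strip().split(" ")[0].rstrip(".")      # e.g. Mr, Mrs, Master
        ticket = row["Ticket"].rsplit(" ", 1)
        cabins = row["Cabin"].split() if row["Cabin"] else []
        return {
            "Surname": surname.strip(),
            "Title": title,
            "TicketId": ticket[0] if len(ticket) > 1 else "?",
            "TicketNr": ticket[-1] if ticket[-1].isdigit() else "?",
            "CabinCount": len(cabins) if cabins else "?",
            "CabinDeck": cabins[0][0] if cabins else "?",
            "CabinNr": cabins[0][1:] if cabins and len(cabins[0]) > 1 else "?",
        }

    with open("train.csv", newline="") as f:
        passengers = [dict(row, **parse_passenger(row)) for row in csv.DictReader(f)]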

Pre-processing - Data exploration

Let’s get a feeling for the data! For this section I wrote my code in R due to its strength in visualization and statistics. Before we focus on features, let’s calculate the fraction of passengers that survived: 0.38 (342 of 891). This is a bit higher than expected from a totally unbiased sample (roughly one third of all people aboard survived, see wikipedia). Maybe the nice people at Kaggle gave us training data in which both classes are almost equally represented. We have two feature types, nominal and numerical. Let’s look at both types independently and try to identify those useful for our classification problem.

Feature selection - Nominal

We want to identify nominal features for which a given value provides additional information on the survival probability compared to the base rate. Take for example the feature "Sex". Is the survival probability for male and female passengers distinct? Interestingly, 74% of all female passengers in our training data survived, compared to only 19% of the male passengers. This looks interesting, but can we quantify how likely such a separation could have occurred by chance alone? We can model the chance of survival, neglecting any features, as a Bernoulli trial for each passenger. This simulates a coin flip with a biased coin such that the probability of success equals the population average. In the training data we observe 342 survivors out of 891 trials, therefore a base survival rate of p = 0.38. The probability to get k survivors given n trials assuming the base rate is given by the binomial distribution P(k | n, p) = (n choose k) * p^k * (1-p)^(n-k).

How does this probability distribution help us to identify useful nominal features? We use it to generate a null hypothesis, stating that the observed number of survived passengers for a given class (e.g. female) is generated by the base survival rate. Importantly, we can use the binomial test to derive a p-value for k survivors out of n passengers given our null hypothesis. The p-value reflects the probability to observe a result at least as extreme as the one observed, given the null hypothesis. Let us come back to our example: 233 out of 314 female passengers survived. Figure 1 shows the probability distribution of k survivors given n = 314 passengers and the base survival rate of p = 0.38. The vertical line indicates the observed number of survivors (k = 233).

Figure 1. Probability to observe k survivors given n=314 female passengers and a population survival rate of p=0.38. Vertical line indicates observed number of survivors.

The p-value of the observed number of survivors given the null hypothesis is the sum of all probabilities for k ≥ 233 (right of the line) and is practically zero. We can almost certainly reject the null hypothesis and therefore identify the feature "Sex" as important for classification.
Thus, a p-value in conjunction with a threshold gives us a statistical handle to select features. Importantly, we have to correct the p-values to take into account that we test one hypothesis for each feature class. The multiple testing problem becomes obvious when you notice that you accept a small probability of incorrectly rejecting the null hypothesis for each individual test. Testing a sequence of hypotheses makes the occurrence of at least one such false positive more likely. If you choose a p-value threshold of 0.05 and test 50 hypotheses, the probability to falsely reject the null hypothesis at least once is 1 - 0.95^50 ≈ 0.92. There are a variety of methods to correct for multiple hypothesis testing. I chose Bonferroni correction, which minimizes the number of false positive results at the cost of accepting a significant number of false negatives.
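
In Python, the binomial test and the Bonferroni threshold look like this (using the female-passenger numbers from above; binomtest requires scipy >= 1.7):

    from scipy.stats import binomtest

    n_tests = 50                     # e.g. the number of feature classes tested
    alpha = 0.05 / n_tests           # Bonferroni-corrected significance threshold

    # 233 of 314 female passengers survived; base survival rate p = 0.38.
    result = binomtest(k=233, n=314, p=0.38, alternative="greater")
    print(result.pvalue, result.pvalue < alpha)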

Keeping only features that contain at least one class with a corrected p-value < 0.01 yields five features (Pclass, Title, Sex, CabinDeck and Embarked). While some of these features seem intuitive (Pclass, Sex), others are more surprising (Embarked). Figure 2 shows the class-specific survival fractions.

Figure 2. Class-specific survival fractions for features with significant difference to the survival base rate. Horizontal lines indicate base rate survival. Width of bars scale with instances per class.

The grey horizontal line indicates the average survival rate over the entire training population (base rate). The width of each bar scales with the number of instances for each class of a nominal feature. The distance from the average survival rate in combination with the width of the bar gives us a visual intuition of the significance of the difference.

Feature selection - Numerical

Let us now turn to the numerical features (PassengerId, Age, SibSp, Parch, TicketNr, Fare, CabinCount, CabinNr). Again, we seek to identify features whose values separate survived from deceased passengers. Figure 3 shows the empirical cumulative distribution of values for each feature, separated by survival.

Figure 3. Empirical cumulative distribution of values for each feature for survived (green) and deceased (red) passengers. Values are normalized between 0 and 1 for all features.

We can immediately see that some numerical features seem indistinguishable between the two classes of passengers (e.g. PassengerId) whereas others seem distinct (e.g. Fare). Can we use a statistical test to quantify this subjective difference? The Kolmogorov-Smirnov test allows us to assign a probability that two sets are drawn from the same distribution. Importantly, this test is non-parametric and therefore works without knowing the type of the underlying distribution. The Kolmogorov-Smirnov-derived, Bonferroni-corrected p-values identify 4 interesting features with p-values < 0.01 (SibSp, Parch, TicketNr, Fare). Interestingly, we cannot reject the null hypothesis for the feature Age. Visual inspection indicates that age may play a predictive role only for small values. Let us nevertheless include Age in our set of features.

Classification

The data exploration phase identified a subset of interesting features. This does not mean that all of them are useful (i.e. they might be redundant) or that these are the only features with useful information. It simply states that these features are characterized by significantly different values for survived and non-survived passengers.

Let’s use these features as a starting point to train our classifier. For classification we use Java, making extensive use of the weka library. You can find the main class here. We first load our training and test data into the Instances class, the weka-specific data representation. Before starting classification, we have to define the type (nominal or numerical) of each feature. This is necessary because the input format (csv) does not contain type information. To make our life easy, we match training and test data by i) adding the missing 'Survived' feature to the test data (without assigning values) and ii) merging the training and test classes for each feature. These steps yield identical formats for both data sets and prepare the data for efficient filtering and classification using the weka framework. We then specify the classification feature (Survived) and normalize all numeric features between 0 and 1. The latter ensures that all numeric features have identical weight at the beginning of training.

For classification we use the built-in sequential minimal optimization algorithm for support vector machines (SMO class). We can estimate the classification performance on unseen data using the cross-validation functionality of the Evaluation class. Leave-one-out cross validation predicts a classification error of 0.178. We then train our classifier on the entire training data and classify all test instances. Submission to Kaggle yields a score of 0.780. As mentioned above, we might have missed useful features in our subset. Let’s see if we can improve that score by adding features back to the subset we defined in the data exploration phase, using a greedy forward selection: we want to maximize prediction performance while keeping the number of features minimal. This approach adds the nominal features Surname and Cabin to our feature space. Cross validation of the classifier trained on the extended feature space predicts an error of 0.144. Submission to Kaggle scores 0.804.
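
The classifier above is weka's SMO class driven from Java; for readers who prefer Python, roughly the same pipeline (imputation, one-hot encoding of the nominal features, [0, 1] scaling of the numeric ones, a linear SVM, cross validation) can be sketched with scikit-learn; the file name and feature choices here are placeholders:

    import pandas as pd
    from sklearn.compose import make_column_transformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
    from sklearn.svm import SVC

    train = pd.read_csv("train_parsed.csv")              # output of the parsing step
    nominal = ["Pclass", "Title", "Sex", "CabinDeck", "Embarked"]
    numeric = ["Age", "SibSp", "Parch", "Fare"]

    pre = make_column_transformer(
        (make_pipeline(SimpleImputer(strategy="most_frequent"),
                       OneHotEncoder(handle_unknown="ignore")), nominal),
        (make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler()), numeric))
    clf = make_pipeline(pre, SVC(kernel="linear"))
    scores = cross_val_score(clf, train[nominal + numeric], train["Survived"], cv=10)
    print(scores.mean())                                  # cross-validated accuracy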

Summary

I focused my approach on increasing the amount of useful information by i) extracting information from the given features and ii) identifying features which differ significantly between survived and deceased passengers. For the latter I employed p-value generating statistical tests, and feature selection was performed using a significance threshold. A friend of mine, Hugo Bowne-Anderson, is discussing an alternative approach using a correlation measure to rank features (I will link the post when it is online). I then trained a support vector machine on a subset of the feature space, correctly predicting the outcome for 80.4% of the passengers as judged by the Kaggle score.