Java ML - Titanic
Weka (Waikato Environment for Knowledge Analysis)
- Java-based machine learning toolkit with:
- A large number of built-in algorithms for classification, regression, clustering, etc.
- Useful for teaching, rapid prototyping, and data analysis
Weka in the Titanic example
Build a passenger profile and run the heuristic model.
Tablesaw vs. Smile
| Aspect | Tablesaw | Smile |
|---|---|---|
| Primary focus | Data manipulation and exploratory data analysis (EDA), similar to pandas in Python | Machine learning, statistics, and data analysis library with ML models and algorithms |
| DataFrame support | Yes, Tablesaw provides a rich DataFrame API for tabular data manipulation | Yes, Smile provides a DataFrame, but it is geared more toward ML workflows |
| ML Algorithms | Minimal or no built-in ML algorithms; mainly for data wrangling and analysis | Extensive ML support: classification, regression, clustering, dimensionality reduction, etc. |
| Data types support | Supports various column types (numeric, categorical, date, etc.) with convenient API | Supports different types but with a focus on numeric data for ML |
| Data visualization | Limited built-in support, but can export or integrate with Java plotting libs | Very limited visualization; focus is on ML and stats |
| Performance | Efficient for in-memory tabular data; good for typical data wrangling tasks | Highly optimized for numerical computation and ML tasks |
| Missing value handling | Good support for missing data in tables | Supports missing data but less focus on data cleaning than Tablesaw |
| API complexity | Simple and intuitive for data manipulation and EDA | More complex, with many ML-related classes and utilities |
| Community and documentation | Growing, focused on data manipulation | Mature, with focus on ML and statistics |
| Integration | Easy integration with Java projects for ETL, data manipulation | Great for projects requiring ML algorithms and predictive modeling |
Tablesaw
Select the steps in the CORRECT execution order to avoid the iceberg!
Titanic Data Project Overview
This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.
1. TitanicPreprocess.java — Data Cleaning & Preparation
Prepares the raw Titanic dataset for analysis and modeling:
- Loads the raw dataset from a CSV file.
- Adds a new "Alone" column to indicate whether a passenger was traveling alone.
- Removes irrelevant columns: PassengerId, Name, Ticket, and Cabin.
- Encodes categorical variables:
  - Sex: male → 1, female → 0
  - Embarked: C → 1, Q → 2, S → 3
- Fills missing values with the median of each column.
- Saves the cleaned dataset to titanic_cleaned.csv.
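The encoding and median-fill steps listed above can be sketched without any library. This is a minimal illustration in plain Java, not the code of TitanicPreprocess.java: the class and method names (PreprocessSketch, encodeSex, encodeEmbarked, median, fillMissingWithMedian) are hypothetical, and missing values are assumed to be represented as NaN.

```java
import java.util.Arrays;

final class PreprocessSketch {

    // Sex: male -> 1, female -> 0 (mapping taken from the list above).
    static int encodeSex(String sex) {
        switch (sex) {
            case "male": return 1;
            case "female": return 0;
            default: throw new IllegalArgumentException("Unknown sex: " + sex);
        }
    }

    // Embarked: C -> 1, Q -> 2, S -> 3 (mapping taken from the list above).
    static int encodeEmbarked(String port) {
        switch (port) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default: throw new IllegalArgumentException("Unknown port: " + port);
        }
    }

    // Median of the non-missing values; NaN marks a missing value (assumption).
    static double median(double[] values) {
        double[] present = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (present.length == 0) {
            throw new IllegalArgumentException("Column has no non-missing values");
        }
        int mid = present.length / 2;
        return present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
    }

    // Returns a copy of the column with every NaN replaced by the column median.
    static double[] fillMissingWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        // Toy values, not taken from the Titanic dataset.
        double[] age = {22.0, Double.NaN, 38.0, 26.0};
        System.out.println(Arrays.toString(fillMissingWithMedian(age)));
        System.out.println(encodeSex("female") + " " + encodeEmbarked("S"));
    }
}
```

The median is computed only over the values that are present, so the fill value is not distorted by the gaps it is meant to replace.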
2. TitanicAnalysis.java — Exploratory Data Analysis (EDA)
Performs visual and statistical analysis on the Titanic dataset:
- Loads the raw dataset from a CSV file.
- Adds the "Alone" column.
- Splits data into subsets of survivors and non-survivors.
- Calculates and visualizes:
- Survival rate by gender
- Survival rate based on fare price
- Comparison of survival for passengers traveling alone vs. with family
- Age distribution among survivors and non-survivors
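A grouped survival rate such as "survival rate by gender" boils down to counting survivors and totals per group. The following is a library-free sketch of that calculation, not the code of TitanicAnalysis.java; the class name SurvivalRateSketch, the method survivalRateByGroup, and the toy rows are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SurvivalRateSketch {

    // groups[i] is the group label of row i; survived[i] is 1 for a survivor, 0 otherwise.
    static Map<String, Double> survivalRateByGroup(String[] groups, int[] survived) {
        if (groups.length != survived.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        // Per group: counts[0] = survivors, counts[1] = total passengers.
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((group, c) -> rates.put(group, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Toy rows, not taken from the Titanic dataset.
        String[] sex = {"male", "female", "female", "male"};
        int[] survived = {0, 1, 1, 1};
        System.out.println(survivalRateByGroup(sex, survived));
    }
}
```

The same method would work for the alone vs. with-family comparison by passing "alone" / "family" labels instead of gender.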
3. TitanicML.java — Machine Learning Modeling
Builds and evaluates ML models using the cleaned dataset:
- Loads and processes the data (similar to TitanicPreprocess.java).
- Normalizes numerical features to a 0–1 range.
- Converts Tablesaw tables to Weka instances.
- Trains two models:
- Decision Tree (J48)
- Logistic Regression
- Evaluates the models using cross-validation.
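Normalizing a feature to a 0–1 range is usually done with min-max scaling, where each value v becomes (v - min) / (max - min). The sketch below shows that formula in plain Java; it is an illustration under that assumption, not the code of TitanicML.java, and the names MinMaxSketch and normalize are hypothetical. A constant column is mapped to all zeros to avoid dividing by zero, which is a design choice of this sketch.

```java
final class MinMaxSketch {

    // Rescales a column so that its minimum maps to 0.0 and its maximum to 1.0.
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has range 0; map it to 0.0 instead of dividing by zero.
            scaled[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Toy fares, not taken from the Titanic dataset.
        double[] fare = {10.0, 20.0, 30.0};
        System.out.println(java.util.Arrays.toString(normalize(fare)));
    }
}
```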
Recommended Execution Order
To ensure a smooth workflow, run the scripts in this order:
- TitanicPreprocess.java: Cleans the dataset and generates titanic_cleaned.csv.
- TitanicAnalysis.java: Performs EDA and generates insights and visualizations.
- TitanicML.java: Runs machine learning models on the cleaned dataset.
1. Tablesaw: Data Analysis & Preprocessing
Tablesaw is a Java library for data manipulation, cleaning, and visualization—similar to pandas in Python.
Where is Tablesaw used?
- TitanicPreprocess.java: Cleans and transforms the raw Titanic data.
- TitanicAnalysis.java: Performs exploratory data analysis and visualization.
- TitanicML.java: Prepares data for machine learning.
Example: Loading Data with Tablesaw
import tech.tablesaw.api.Table;
import java.io.InputStream;
InputStream inputStream = TitanicAnalysis.class.getResourceAsStream("/data/titanic.csv");
Table titanic = Table.read().csv(inputStream);
Purpose:
Loads the Titanic CSV file into a Tablesaw Table object for further processing.
Example: Data Cleaning & Feature Engineering
import tech.tablesaw.api.BooleanColumn;
import tech.tablesaw.api.NumericColumn;

// SibSp = siblings/spouses aboard, Parch = parents/children aboard
NumericColumn<?> sibSpColumn = titanic.numberColumn("SibSp");
NumericColumn<?> parchColumn = titanic.numberColumn("Parch");
BooleanColumn aloneColumn = BooleanColumn.create("Alone", titanic.rowCount());
for (int i = 0; i < titanic.rowCount(); i++) {
    // A passenger is alone when both family counts are zero
    boolean isAlone = sibSpColumn.getDouble(i) == 0 && parchColumn.getDouble(i) == 0;
    aloneColumn.set(i, isAlone);
}
titanic.addColumns(aloneColumn);
Purpose: Adds a new column “Alone” to indicate if a passenger was traveling alone.
Example: Data Visualization
import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.api.Histogram;
Plot.show(Histogram.create("Fare Distribution", titanic.numberColumn("Fare")));
Purpose:
Plots a histogram of the “Fare” column for visual analysis.
2. Weka: Machine Learning
Weka is a Java library for machine learning, providing algorithms and evaluation tools.
Where is Weka used?
- TitanicML.java: Converts cleaned data into a format Weka understands, trains models, and evaluates them.
Example: Converting Tablesaw Table to Weka Instances
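import java.util.ArrayList;
import java.util.List;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Row;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;
import weka.core.*;

private static Instances convertTableToWeka(Table table) {
    // One Weka attribute per column: nominal for string columns, numeric otherwise
    List<Attribute> attributes = new ArrayList<>();
    for (Column<?> col : table.columns()) {
        if (col.type().equals(ColumnType.STRING)) {
            List<String> classValues = new ArrayList<>();
            table.stringColumn(col.name()).unique().forEach(classValues::add);
            attributes.add(new Attribute(col.name(), classValues));
        } else {
            attributes.add(new Attribute(col.name()));
        }
    }
    Instances data = new Instances("Titanic", new ArrayList<>(attributes), table.rowCount());
    for (Row row : table) {
        double[] values = new double[table.columnCount()];
        for (int i = 0; i < table.columnCount(); i++) {
            Column<?> col = table.column(i);
            if (col.type() == ColumnType.INTEGER) {
                values[i] = row.getInt(i);
            } else if (col.type() == ColumnType.DOUBLE) {
                values[i] = row.getDouble(i);
            } else if (col.type() == ColumnType.BOOLEAN) {
                // Boolean columns such as "Alone" become 1.0 / 0.0 instead of defaulting to 0.0
                values[i] = row.getBoolean(i) ? 1.0 : 0.0;
            } else if (col.type() == ColumnType.STRING) {
                // Nominal values are stored as the index of their label
                values[i] = attributes.get(i).indexOfValue(row.getString(i));
            }
        }
        data.add(new DenseInstance(1.0, values));
    }
    // The caller still has to mark the target column with data.setClassIndex(...)
    return data;
}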
Purpose: Converts a Tablesaw Table into Weka Instances, which is the format Weka uses for machine learning.
Example: Training and Evaluating Models with Weka
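import weka.classifiers.trees.J48;
import weka.classifiers.functions.Logistic;

// data is the Instances object produced by convertTableToWeka; its class
// attribute must already be set (data.setClassIndex(...)) before training
J48 tree = new J48();
tree.buildClassifier(data);
Logistic logistic = new Logistic();
logistic.buildClassifier(data);
// Evaluate with cross-validation (evaluateModel is a project helper not shown here)
evaluateModel(tree, data, "Decision Tree");
evaluateModel(logistic, data, "Logistic Regression");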
Purpose:
Trains a Decision Tree (J48) and Logistic Regression model on the Titanic data. Evaluates models using cross-validation.
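Cross-validation splits the rows into k folds and uses each fold once as the test set while training on the rest. In Weka this is normally handled by the Evaluation class (its crossValidateModel method), so the project does not need to split rows by hand. The plain-Java sketch below only illustrates the splitting idea; it is not Weka's implementation, and the names KFoldSketch and folds are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KFoldSketch {

    // Shuffles row indices 0..n-1 with a fixed seed and deals them into k folds.
    static List<List<Integer>> folds(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("k must be between 2 and n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Round-robin assignment keeps fold sizes within one row of each other.
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 5, 42L);
        for (int f = 0; f < folds.size(); f++) {
            // Fold f is the test set; all other folds together form the training set.
            System.out.println("fold " + f + " test rows: " + folds.get(f));
        }
    }
}
```

Fixing the seed makes the split reproducible, so two models evaluated on the same folds can be compared fairly.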