Weka (Waikato Environment for Knowledge Analysis)

Weka is a Java-based machine learning toolkit that:
- Ships a large number of built-in algorithms for classification, regression, clustering, and more
- Is useful for teaching, rapid prototyping, and data analysis

Weka in the Titanic Example

Tablesaw vs. Smile

Aspect	Tablesaw	Smile
Primary focus	Data manipulation and exploratory data analysis (EDA), similar to pandas in Python	Machine learning, statistics, and data analysis library with ML models and algorithms
DataFrame support	Yes, Tablesaw provides a rich DataFrame API for tabular data manipulation	Yes, Smile provides DataFrame, but often more focused on ML workflows
ML Algorithms	Minimal or no built-in ML algorithms; mainly for data wrangling and analysis	Extensive ML support: classification, regression, clustering, dimensionality reduction, etc.
Data types support	Supports various column types (numeric, categorical, date, etc.) with convenient API	Supports different types but with a focus on numeric data for ML
Data visualization	Limited built-in support, but can export or integrate with Java plotting libs	Very limited visualization; focus is on ML and stats
Performance	Efficient for in-memory tabular data; good for typical data wrangling tasks	Highly optimized for numerical computation and ML tasks
Missing value handling	Good support for missing data in tables	Supports missing data but less focus on data cleaning than Tablesaw
API complexity	Simple and intuitive for data manipulation and EDA	More complex, with many ML-related classes and utilities
Community and documentation	Growing, focused on data manipulation	Mature, with focus on ML and statistics
Integration	Easy integration with Java projects for ETL, data manipulation	Great for projects requiring ML algorithms and predictive modeling

Tablesaw

Titanic Data Project Overview

This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.

1. `TitanicPreprocess.java` — Data Cleaning & Preparation

Prepares the raw Titanic dataset for analysis and modeling:

Loads the raw dataset from a CSV file.
Adds a new "Alone" column to indicate whether a passenger was traveling alone.
Removes irrelevant columns: PassengerId, Name, Ticket, and Cabin.
Encodes categorical variables:
- Sex: male → 1, female → 0
- Embarked: C → 1, Q → 2, S → 3
Fills missing values with the median of each column.
Saves the cleaned dataset to titanic_cleaned.csv.
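
The encoding and median-fill steps above can be sketched in plain Java, without Tablesaw, to show what each one computes. This is an illustration under assumptions, not the code in `TitanicPreprocess.java`: the class name `PreprocessSketch`, the use of `NaN` to mark a missing value, and the sample ages are all made up for the example.

```java
import java.util.Arrays;
import java.util.Map;

final class PreprocessSketch {
    // Category codes exactly as listed in the steps above.
    static final Map<String, Integer> SEX = Map.of("male", 1, "female", 0);
    static final Map<String, Integer> EMBARKED = Map.of("C", 1, "Q", 2, "S", 3);

    /** Median of the non-missing (non-NaN) values; NaN if every value is missing. */
    static double median(double[] values) {
        double[] present = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (present.length == 0) {
            return Double.NaN;
        }
        int mid = present.length / 2;
        return present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
    }

    /** Returns a copy of the column in which every NaN is replaced by the column median. */
    static double[] fillWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        // Made-up ages; NaN stands in for a missing entry (an assumption of this sketch).
        double[] ages = {22.0, Double.NaN, 38.0, 26.0};
        System.out.println(Arrays.toString(fillWithMedian(ages)));
        System.out.println(SEX.get("female") + " " + EMBARKED.get("S"));
    }
}
```

The median is computed from the observed values only, so the fill value is not skewed by the missing entries themselves.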

2. `TitanicAnalysis.java` — Exploratory Data Analysis (EDA)

Performs visual and statistical analysis on the Titanic dataset:

Loads the raw dataset from a CSV file.
Adds the "Alone" column.
Splits data into subsets of survivors and non-survivors.
Calculates and visualizes:
- Survival rate by gender
- Survival rate based on fare price
- Comparison of survival for passengers traveling alone vs. with family
- Age distribution among survivors and non-survivors
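
A survival rate for a group is the number of survivors in that group divided by the group's size. The sketch below shows that calculation in plain Java; it is not taken from `TitanicAnalysis.java`, and the class name `SurvivalRateSketch`, the method `rateByGroup`, and the four sample rows are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SurvivalRateSketch {
    /**
     * Fraction of survivors per group.
     * groups[i] is passenger i's group label; survived[i] is 1 (survived) or 0 (did not).
     */
    static Map<String, Double> rateByGroup(String[] groups, int[] survived) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // label -> {survivors, total}
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((label, c) -> rates.put(label, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only, not real Titanic records.
        String[] sex = {"male", "female", "female", "male"};
        int[] survived = {0, 1, 1, 1};
        System.out.println(rateByGroup(sex, survived));
    }
}
```

The same function covers the "alone vs. with family" comparison if the group labels are taken from the "Alone" column instead of "Sex".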

3. `TitanicML.java` — Machine Learning Modeling

Builds and evaluates ML models using the cleaned dataset:

Loads and processes the data (similar to TitanicPreprocess.java).
Normalizes numerical features to a 0–1 range.
Converts Tablesaw tables to Weka instances.
Trains two models:
- Decision Tree (J48)
- Logistic Regression
Evaluates the models using cross-validation.
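
Normalizing to a 0–1 range is usually done with min-max scaling, x' = (x - min) / (max - min). The following is a minimal plain-Java sketch of that formula, assuming this is the scaling `TitanicML.java` uses (the source does not say); `MinMaxSketch` and the sample fares are hypothetical.

```java
import java.util.Arrays;

final class MinMaxSketch {
    /** Rescales values linearly so the minimum maps to 0 and the maximum maps to 1. */
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has no spread; map it to 0 rather than divide by zero.
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fares = {10.0, 20.0, 30.0}; // made-up values
        System.out.println(Arrays.toString(normalize(fares)));
    }
}
```

Scaling matters mostly for models that are sensitive to feature magnitude, such as logistic regression; a decision tree splits on thresholds and is largely unaffected by it.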

Recommended Execution Order

To ensure a smooth workflow, run the scripts in this order:

1. TitanicPreprocess.java: cleans the dataset and generates titanic_cleaned.csv.
2. TitanicAnalysis.java: performs EDA and generates insights and visualizations.
3. TitanicML.java: runs machine learning models on the cleaned dataset.

1. Tablesaw: Data Analysis & Preprocessing

Tablesaw is a Java library for data manipulation, cleaning, and visualization—similar to pandas in Python.

Where is Tablesaw used?
- TitanicPreprocess.java: Cleans and transforms the raw Titanic data.
- TitanicAnalysis.java: Performs exploratory data analysis and visualization.
- TitanicML.java: Prepares data for machine learning.

Example: Loading Data with Tablesaw

import tech.tablesaw.api.Table;
import java.io.InputStream;

InputStream inputStream = TitanicAnalysis.class.getResourceAsStream("/data/titanic.csv");
Table titanic = Table.read().csv(inputStream);

Purpose:

Loads the Titanic CSV file into a Tablesaw Table object for further processing.

Example: Data Cleaning & Feature Engineering

import tech.tablesaw.api.BooleanColumn;
import tech.tablesaw.api.NumericColumn;

// SibSp = siblings/spouses aboard, Parch = parents/children aboard
NumericColumn<?> sibSpColumn = titanic.numberColumn("SibSp");
NumericColumn<?> parchColumn = titanic.numberColumn("Parch");
BooleanColumn aloneColumn = BooleanColumn.create("Alone", titanic.rowCount());
for (int i = 0; i < titanic.rowCount(); i++) {
    // A passenger is alone when both family counts are zero.
    boolean isAlone = ((Number) sibSpColumn.get(i)).doubleValue() == 0
            && ((Number) parchColumn.get(i)).doubleValue() == 0;
    aloneColumn.set(i, isAlone);
}
titanic.addColumns(aloneColumn);

Purpose: Adds a new column “Alone” to indicate if a passenger was traveling alone.

Example: Data Visualization

import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.api.Histogram;

Plot.show(Histogram.create("Fare Distribution", titanic.numberColumn("Fare")));

Purpose:

Plots a histogram of the “Fare” column for visual analysis.

2. Weka: Machine Learning

Weka is a Java library for machine learning, providing algorithms and evaluation tools.

Where is Weka used? TitanicML.java: Converts cleaned data into a format Weka understands, trains models, and evaluates them.

Example: Converting Tablesaw Table to Weka Instances

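import java.util.ArrayList;
import java.util.List;

import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Row;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;
import weka.core.*;

private static Instances convertTableToWeka(Table table) {
    // Build one Weka attribute per column: nominal for strings, numeric otherwise.
    List<Attribute> attributes = new ArrayList<>();
    for (Column<?> col : table.columns()) {
        if (col.type().equals(ColumnType.STRING)) {
            List<String> classValues = new ArrayList<>();
            table.stringColumn(col.name()).unique().forEach(classValues::add);
            attributes.add(new Attribute(col.name(), classValues));
        } else {
            attributes.add(new Attribute(col.name()));
        }
    }
    Instances data = new Instances("Titanic", new ArrayList<>(attributes), table.rowCount());
    // Copy each row into a dense array of doubles, Weka's internal representation.
    for (Row row : table) {
        double[] values = new double[table.columnCount()];
        for (int i = 0; i < table.columnCount(); i++) {
            Column<?> col = table.column(i);
            if (col.type() == ColumnType.INTEGER) {
                values[i] = row.getInt(i);
            } else if (col.type() == ColumnType.DOUBLE) {
                values[i] = row.getDouble(i);
            } else if (col.type() == ColumnType.STRING) {
                // Nominal values are stored as the index of the label.
                values[i] = attributes.get(i).indexOfValue(row.getString(i));
            }
            // Any other column type (e.g. BOOLEAN) is left at the default 0.0.
        }
        data.add(new DenseInstance(1.0, values));
    }
    // The class attribute is not set here; call data.setClassIndex(...) before training.
    return data;
}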

Purpose: Converts a Tablesaw Table into Weka Instances, which is the format Weka uses for machine learning.

Example: Training and Evaluating Models with Weka

import weka.classifiers.trees.J48;
import weka.classifiers.functions.Logistic;

J48 tree = new J48();
tree.buildClassifier(data);

Logistic logistic = new Logistic();
logistic.buildClassifier(data);

// Evaluate with cross-validation; evaluateModel is a project helper (not shown here), not a Weka API
evaluateModel(tree, data, "Decision Tree");
evaluateModel(logistic, data, "Logistic Regression");

Purpose:

Trains a Decision Tree (J48) and Logistic Regression model on the Titanic data. Evaluates models using cross-validation.
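
To make the cross-validation step concrete, here is a small plain-Java sketch of k-fold evaluation. It does not use Weka and is not the project's `evaluateModel` helper, whose body the source does not show. The class `CrossValidationSketch`, the majority-label stand-in model, and the fold assignment by `i % k` are assumptions chosen to keep the example short. In Weka itself this kind of evaluation is typically run through `weka.classifiers.Evaluation.crossValidateModel`, which also shuffles the data.

```java
import java.util.ArrayList;
import java.util.List;

final class CrossValidationSketch {
    /** Stand-in "model": the most frequent label (0 or 1) among the training rows. */
    static int majorityLabel(List<Integer> trainLabels) {
        int ones = 0;
        for (int y : trainLabels) {
            ones += y;
        }
        return 2 * ones > trainLabels.size() ? 1 : 0;
    }

    /** k-fold cross-validated accuracy; row i is held out in fold i % k. */
    static double crossValidate(int[] labels, int k) {
        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> train = new ArrayList<>();
            List<Integer> test = new ArrayList<>();
            for (int i = 0; i < labels.length; i++) {
                (i % k == fold ? test : train).add(labels[i]);
            }
            // Fit on the other k-1 folds only, then score on the held-out fold.
            int prediction = majorityLabel(train);
            for (int y : test) {
                if (y == prediction) {
                    correct++;
                }
            }
        }
        // Every row is tested exactly once, so divide by the total row count.
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] survived = {0, 0, 0, 1, 0, 0}; // made-up labels
        System.out.println(crossValidate(survived, 3));
    }
}
```

Because each row is scored by a model that never saw it during training, the resulting accuracy is a fairer estimate than accuracy measured on the training data.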

Popcorn Hack

- Run the Titanic code on your own computer

- Use Tablesaw to visualize the class distribution (first, second, third class) of the Titanic data

Let's Look at a Titanic Example
