Weka (Waikato Environment for Knowledge Analysis)

  • Java-based machine learning toolkit
    • Includes a large number of built-in algorithms for classification, regression, clustering, etc.
    • Useful for teaching, rapid prototyping, and data analysis

Weka in the Titanic Example

Tablesaw vs. Smile

| Aspect | Tablesaw | Smile |
|---|---|---|
| Primary focus | Data manipulation and exploratory data analysis (EDA), similar to pandas in Python | Machine learning, statistics, and data analysis library with ML models and algorithms |
| DataFrame support | Yes, Tablesaw provides a rich DataFrame API for tabular data manipulation | Yes, Smile provides DataFrame, but often more focused on ML workflows |
| ML algorithms | Minimal or no built-in ML algorithms; mainly for data wrangling and analysis | Extensive ML support: classification, regression, clustering, dimensionality reduction, etc. |
| Data types support | Supports various column types (numeric, categorical, date, etc.) with convenient API | Supports different types but with a focus on numeric data for ML |
| Data visualization | Limited built-in support, but can export or integrate with Java plotting libs | Very limited visualization; focus is on ML and stats |
| Performance | Efficient for in-memory tabular data; good for typical data wrangling tasks | Highly optimized for numerical computation and ML tasks |
| Missing value handling | Good support for missing data in tables | Supports missing data but less focus on data cleaning than Tablesaw |
| API complexity | Simple and intuitive for data manipulation and EDA | More complex, with many ML-related classes and utilities |
| Community and documentation | Growing, focused on data manipulation | Mature, with focus on ML and statistics |
| Integration | Easy integration with Java projects for ETL, data manipulation | Great for projects requiring ML algorithms and predictive modeling |

Tablesaw

Titanic Data Project Overview

This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.


1. TitanicPreprocess.java: Data Cleaning & Preparation

Prepares the raw Titanic dataset for analysis and modeling:

  • Loads the raw dataset from a CSV file.
  • Adds a new "Alone" column to indicate whether a passenger was traveling alone.
  • Removes irrelevant columns: PassengerId, Name, Ticket, and Cabin.
  • Encodes categorical variables:
    • Sex: male → 1, female → 0
    • Embarked: C → 1, Q → 2, S → 3
  • Fills missing values with the median of each column.
  • Saves the cleaned dataset to titanic_cleaned.csv.
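
The encoding and median-fill steps above can be sketched in plain Java, without Tablesaw. This is an illustration only, not the code in TitanicPreprocess.java: the class name PreprocessSketch, its method names, and the use of NaN to mark a missing value are assumptions made for this example, and the ages in main are toy values.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the project.
class PreprocessSketch {

    // Sex: male -> 1, female -> 0 (mapping from the list above)
    static int encodeSex(String sex) {
        switch (sex) {
            case "male": return 1;
            case "female": return 0;
            default: throw new IllegalArgumentException("Unknown sex: " + sex);
        }
    }

    // Embarked: C -> 1, Q -> 2, S -> 3 (mapping from the list above)
    static int encodeEmbarked(String port) {
        switch (port) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default: throw new IllegalArgumentException("Unknown port: " + port);
        }
    }

    // Median of the non-missing values; missing entries are assumed to be NaN.
    static double median(double[] values) {
        double[] present = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (present.length == 0) {
            throw new IllegalArgumentException("No non-missing values");
        }
        int mid = present.length / 2;
        return present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
    }

    // Returns a copy in which every NaN is replaced by the column median.
    static double[] fillMissingWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        double[] ages = {22.0, Double.NaN, 38.0, 26.0}; // toy values, not real data
        System.out.println(Arrays.toString(fillMissingWithMedian(ages)));
        System.out.println(encodeSex("female") + " " + encodeEmbarked("S"));
    }
}
```

The median is computed over the non-missing values only, so the filled-in value is not skewed by the gaps it is meant to replace.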

2. TitanicAnalysis.java: Exploratory Data Analysis (EDA)

Performs visual and statistical analysis on the Titanic dataset:

  • Loads the raw dataset from a CSV file.
  • Adds the "Alone" column.
  • Splits data into subsets of survivors and non-survivors.
  • Calculates and visualizes:
    • Survival rate by gender
    • Survival rate based on fare price
    • Comparison of survival for passengers traveling alone vs. with family
    • Age distribution among survivors and non-survivors
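
A survival rate by group, such as the survival rate by gender above, is the number of survivors in a group divided by the group's size. The sketch below shows that calculation in plain Java. The class SurvivalRateSketch and its method are hypothetical names for this example, and the rows in main are made up rather than taken from the dataset.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the project.
class SurvivalRateSketch {

    // Returns, for each group label, the fraction of rows with survived == 1.
    static Map<String, Double> survivalRateBy(List<String> groups, List<Integer> survived) {
        if (groups.size() != survived.size()) {
            throw new IllegalArgumentException("Length mismatch");
        }
        // label -> {survivor count, total count}
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < groups.size(); i++) {
            int[] c = counts.computeIfAbsent(groups.get(i), k -> new int[2]);
            c[0] += survived.get(i);
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((label, c) -> rates.put(label, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Toy rows, made up for illustration
        List<String> sex = Arrays.asList("male", "female", "female", "male");
        List<Integer> survived = Arrays.asList(0, 1, 1, 1);
        System.out.println(survivalRateBy(sex, survived));
    }
}
```

The same method works for the alone vs. with-family comparison by passing a different list of group labels.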

3. TitanicML.java: Machine Learning Modeling

Builds and evaluates ML models using the cleaned dataset:

  • Loads and processes the data (similar to TitanicPreprocess.java).
  • Normalizes numerical features to a 0–1 range.
  • Converts Tablesaw tables to Weka instances.
  • Trains two models:
    • Decision Tree (J48)
    • Logistic Regression
  • Evaluates the models using cross-validation.
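
Normalizing to a 0–1 range is usually done with min-max scaling: each value x becomes (x - min) / (max - min). The sketch below assumes that is the method meant here, since the source does not show the normalization code. MinMaxSketch is a hypothetical name, and mapping a constant column to zeros is a choice made for this example.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the project.
class MinMaxSketch {

    // Rescales values linearly so the smallest becomes 0 and the largest becomes 1.
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double range = max - min;
        // A constant column has no spread; map it to zeros to avoid dividing by zero.
        return Arrays.stream(values)
                .map(v -> range == 0 ? 0.0 : (v - min) / range)
                .toArray();
    }

    public static void main(String[] args) {
        double[] fares = {10.0, 20.0, 30.0}; // toy values, not real data
        System.out.println(Arrays.toString(normalize(fares)));
    }
}
```

Scaling features to a common range keeps a column with large values, such as Fare, from dominating columns with small values in models that are sensitive to feature scale.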

To ensure a smooth workflow, run the scripts in this order:

  1. TitanicPreprocess.java
    Cleans the dataset and generates titanic_cleaned.csv.

  2. TitanicAnalysis.java
    Performs EDA and generates insights and visualizations.

  3. TitanicML.java
    Runs machine learning models on the cleaned dataset.

1. Tablesaw: Data Analysis & Preprocessing

Tablesaw is a Java library for data manipulation, cleaning, and visualization—similar to pandas in Python.

Where is Tablesaw used?

  • TitanicPreprocess.java: Cleans and transforms the raw Titanic data.
  • TitanicAnalysis.java: Performs exploratory data analysis and visualization.
  • TitanicML.java: Prepares data for machine learning.

Example: Loading Data with Tablesaw

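import tech.tablesaw.api.Table;
import java.io.InputStream;

// Open titanic.csv from the classpath; getResourceAsStream returns null if the file is not found
InputStream inputStream = TitanicAnalysis.class.getResourceAsStream("/data/titanic.csv");
// Parse the CSV into a Table; Tablesaw infers the column types from the data
Table titanic = Table.read().csv(inputStream);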

Purpose:

Loads the Titanic CSV file into a Tablesaw Table object for further processing.

Example: Data Cleaning & Feature Engineering

import tech.tablesaw.api.BooleanColumn;
import tech.tablesaw.api.NumericColumn;

// SibSp = siblings/spouses aboard, Parch = parents/children aboard
NumericColumn<?> sibSpColumn = titanic.numberColumn("SibSp");
NumericColumn<?> parchColumn = titanic.numberColumn("Parch");
BooleanColumn aloneColumn = BooleanColumn.create("Alone", titanic.rowCount());
for (int i = 0; i < titanic.rowCount(); i++) {
    // A passenger is alone when they have no relatives aboard
    boolean isAlone = sibSpColumn.getDouble(i) == 0 && parchColumn.getDouble(i) == 0;
    aloneColumn.set(i, isAlone);
}
titanic.addColumns(aloneColumn);

Purpose: Adds a new column “Alone” to indicate if a passenger was traveling alone.

Example: Data Visualization

import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.api.Histogram;

Plot.show(Histogram.create("Fare Distribution", titanic.numberColumn("Fare")));

Purpose:

Plots a histogram of the “Fare” column for visual analysis.

2. Weka: Machine Learning

Weka is a Java library for machine learning, providing algorithms and evaluation tools.

Where is Weka used?

  • TitanicML.java: Converts cleaned data into a format Weka understands, trains models, and evaluates them.

Example: Converting Tablesaw Table to Weka Instances

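import java.util.ArrayList;
import java.util.List;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Row;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;
import weka.core.*;

private static Instances convertTableToWeka(Table table) {
    // Build one Weka attribute per column: nominal for string columns, numeric otherwise
    List<Attribute> attributes = new ArrayList<>();
    for (Column<?> col : table.columns()) {
        if (col.type().equals(ColumnType.STRING)) {
            // A nominal attribute needs the list of its distinct values
            List<String> classValues = new ArrayList<>();
            table.stringColumn(col.name()).unique().forEach(classValues::add);
            attributes.add(new Attribute(col.name(), classValues));
        } else {
            attributes.add(new Attribute(col.name()));
        }
    }
    Instances data = new Instances("Titanic", new ArrayList<>(attributes), table.rowCount());
    // Copy each row into a dense instance; Weka stores every value as a double
    for (Row row : table) {
        double[] values = new double[table.columnCount()];
        for (int i = 0; i < table.columnCount(); i++) {
            Column<?> col = table.column(i);
            if (col.type() == ColumnType.INTEGER) {
                values[i] = row.getInt(i);
            } else if (col.type() == ColumnType.DOUBLE) {
                values[i] = row.getDouble(i);
            } else if (col.type() == ColumnType.STRING) {
                // Nominal values are stored as the index of their label
                values[i] = attributes.get(i).indexOfValue(row.getString(i));
            }
            // Any other column type (e.g. boolean) is not handled here and stays 0.0
        }
        data.add(new DenseInstance(1.0, values));
    }
    return data;
}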

Purpose: Converts a Tablesaw Table into Weka Instances, which is the format Weka uses for machine learning.

Example: Training and Evaluating Models with Weka

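import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.J48;

// Weka must know which attribute is the label before training. "Survived" is
// assumed here; skip this line if the class index is already set elsewhere.
// J48 and Logistic both expect the class attribute to be nominal.
data.setClassIndex(data.attribute("Survived").index());

J48 tree = new J48();
tree.buildClassifier(data);

Logistic logistic = new Logistic();
logistic.buildClassifier(data);

// Evaluate with cross-validation (evaluateModel is a helper method not shown here)
evaluateModel(tree, data, "Decision Tree");
evaluateModel(logistic, data, "Logistic Regression");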

Purpose:

Trains a Decision Tree (J48) and Logistic Regression model on the Titanic data. Evaluates models using cross-validation.
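
In k-fold cross-validation, the rows are split into k groups (folds). Each fold is held out once as the test set while the model is trained on the remaining folds, and the k scores are averaged. Weka does the splitting itself, so the plain-Java sketch below only illustrates the idea and is not Weka's implementation. KFoldSketch, its method, and the seed are hypothetical names and values for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper class for illustration; not part of the project or of Weka.
class KFoldSketch {

    // Shuffles row indices 0..n-1 and deals them into k folds of near-equal size.
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 5, 42L);
        for (int f = 0; f < folds.size(); f++) {
            // Fold f is held out for testing; the other folds form the training set
            System.out.println("Fold " + f + " test rows: " + folds.get(f));
        }
    }
}
```

Because every row is used for testing exactly once, the averaged score is less dependent on one lucky or unlucky train/test split.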

Popcorn Hack

- Run the Titanic code on your own computer

- Use Tablesaw to visualize the class distribution (first, second, third class) of the Titanic data

Let's Look at a Titanic Example