WEKA

What Is WEKA?

WEKA stands for Waikato Environment for Knowledge Analysis. It was developed at the University of Waikato in New Zealand.
WEKA is a simple and powerful open-source tool, written in Java, that helps computers learn from data.
It contains a collection of visualization tools and algorithms for data analysis and predictive modeling (essentially a machine learning toolkit that you can use through a graphical interface or from Java code).

Imagine you’re trying to decide if a picture is of a cat or a dog. You look at enough pictures and eventually, you start to notice patterns. WEKA does the same thing: it looks at many labeled examples and learns the patterns that separate them.

Why Is WEKA Useful?

Coding all the math and logic for machine learning yourself would take a lot of time and expertise. WEKA saves you that work.
It has built-in tools for data processing and modeling, so you don’t need deep experience in coding or statistics.
It is written in Java, so you can integrate machine learning directly into your Java programs.

What Can WEKA Do?

Classification – This helps you sort data into categories, like whether an email is spam or not.
Regression – WEKA can predict continuous values, like predicting house prices or grades.
Clustering – It can group similar data together, even if you haven’t told it what those groups are.
Association – WEKA can find patterns in data, like which products tend to be bought together.

All of these are data mining tasks. WEKA also includes filters that clean and prepare data before modeling.

How Does WEKA Actually Work?

Load your data. This is usually a table, such as a CSV or ARFF file of grades or passenger records.
Preprocess your data. Handle missing values, normalize numeric columns, and get the data ready for learning.
Choose your learning type. Do you want to sort things into categories, predict numbers, or find patterns?
Select an algorithm. WEKA has lots of built-in algorithms, like decision trees or k-means clustering.
Evaluate the model. Test how well it performs by checking accuracy and other measures.
Use your model. Once you’re happy, use it to make predictions on new data.

Titanic Data Project Overview

Let’s explore Tablesaw and Weka further using the Titanic dataset. The project is written in Java and has three main components: data preprocessing, analysis, and machine learning modeling.

The Titanic files can be found in this directory.

1. `TitanicPreprocess.java` — Data Cleaning & Preparation

Prepares the raw Titanic dataset for analysis and modeling:

Loads the raw dataset from a CSV file.
Adds a new "Alone" column to indicate whether a passenger was traveling alone.
Removes irrelevant columns: PassengerId, Name, Ticket, and Cabin.
Encodes categorical variables:
- Sex: male → 1, female → 0
- Embarked: C → 1, Q → 2, S → 3
Fills missing values with the median of each column.
Saves the cleaned dataset to titanic_cleaned.csv.
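To make the encoding and median-fill steps above concrete, here is a minimal plain-Java sketch. It does not use Tablesaw and is not the code from TitanicPreprocess.java. The class name `PreprocessSketch`, its method names, and the sample ages are made up for illustration, and missing values are assumed to be represented as `NaN`.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the Titanic project files.
class PreprocessSketch {

    // Encodes Sex as described above: male -> 1, female -> 0.
    // Assumption: any value other than "male" (including null) is encoded as 0.
    static int encodeSex(String sex) {
        return "male".equalsIgnoreCase(sex) ? 1 : 0;
    }

    // Encodes Embarked: C -> 1, Q -> 2, S -> 3.
    // Assumption: anything else is treated as missing and returned as NaN.
    static double encodeEmbarked(String port) {
        if (port == null) return Double.NaN;
        switch (port.trim().toUpperCase()) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default:  return Double.NaN;
        }
    }

    // Median of the non-missing values (NaN marks a missing value).
    static double median(double[] values) {
        double[] present = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (present.length == 0) return Double.NaN;
        int mid = present.length / 2;
        return present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
    }

    // Returns a copy of the column with every NaN replaced by the column median.
    static double[] fillMissingWithMedian(double[] values) {
        double med = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? med : v).toArray();
    }

    public static void main(String[] args) {
        double[] ages = {22.0, Double.NaN, 38.0, 26.0, Double.NaN, 35.0}; // made-up sample
        System.out.println(Arrays.toString(fillMissingWithMedian(ages)));
        System.out.println(encodeSex("female") + " " + encodeEmbarked("Q"));
    }
}
```

The median is used rather than the mean because it is less sensitive to extreme values, such as a few very high fares.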

2. `TitanicAnalysis.java` — Exploratory Data Analysis (EDA)

Performs visual and statistical analysis on the Titanic dataset:

Loads the raw dataset from a CSV file.
Adds the "Alone" column.
Splits data into subsets of survivors and non-survivors.
Calculates and visualizes:
- Survival rate by gender
- Survival rate based on fare price
- Comparison of survival for passengers traveling alone vs. with family
- Age distribution among survivors and non-survivors
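A statistic like "survival rate by gender" is just the number of survivors in each group divided by the group size. The sketch below shows that calculation in plain Java, without Tablesaw. The class `SurvivalRateSketch`, its method, and the five sample rows are hypothetical and only illustrate the idea; they are not taken from TitanicAnalysis.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the Titanic project files.
class SurvivalRateSketch {

    // Returns the fraction of survivors for each distinct group label.
    // groups[i] is the label for passenger i; survived[i] is 1 (survived) or 0 (did not).
    static Map<String, Double> survivalRateByGroup(String[] groups, int[] survived) {
        if (groups.length != survived.length) {
            throw new IllegalArgumentException("groups and survived must have the same length");
        }
        // label -> {number of survivors, total passengers in the group}
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((label, c) -> rates.put(label, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        String[] sex = {"male", "female", "female", "male", "female"}; // made-up rows
        int[] survived = {0, 1, 1, 0, 0};
        System.out.println(survivalRateByGroup(sex, survived));
    }
}
```

The same method works for the alone vs. with-family comparison by passing labels such as "alone" and "family" instead of gender.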

3. `TitanicML.java` — Machine Learning Modeling

Builds and evaluates ML models using the cleaned dataset:

Loads and processes the data (similar to TitanicPreprocess.java).
Normalizes numerical features to a 0–1 range.
Converts Tablesaw tables to Weka instances.
Trains two models:
- Decision Tree (J48)
- Logistic Regression
Evaluates the models using cross-validation.
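Normalizing to a 0–1 range, as in the list above, is usually done with min-max scaling: each value x becomes (x - min) / (max - min). Here is a small plain-Java sketch of that formula. The class `MinMaxSketch` and the sample fares are hypothetical, and how TitanicML.java handles a constant column is not stated in this lesson, so mapping it to zeros is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of the Titanic project files.
class MinMaxSketch {

    // Rescales values linearly so the smallest maps to 0 and the largest to 1.
    // Assumption: a constant column (max == min) is mapped to all zeros
    // to avoid dividing by zero.
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fares = {10.0, 20.0, 30.0}; // made-up sample
        System.out.println(Arrays.toString(normalize(fares)));
    }
}
```

Scaling matters most for models such as logistic regression, where a column with large values (like Fare) could otherwise dominate columns with small values (like Parch).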

Recommended Execution Order

To ensure a smooth workflow, run the programs in this order:

1. TitanicPreprocess.java: Cleans the dataset and generates titanic_cleaned.csv.
2. TitanicAnalysis.java: Performs EDA and generates insights and visualizations.
3. TitanicML.java: Runs machine learning models on the cleaned dataset.

1. Tablesaw: Data Analysis & Preprocessing

Tablesaw is a Java library for data manipulation, cleaning, and visualization—similar to pandas in Python.

Where is Tablesaw used?

- TitanicPreprocess.java: Cleans and transforms the raw Titanic data.
- TitanicAnalysis.java: Performs exploratory data analysis and visualization.
- TitanicML.java: Prepares data for machine learning.

Example: Loading Data with Tablesaw

import tech.tablesaw.api.Table;
import java.io.InputStream;

InputStream inputStream = TitanicAnalysis.class.getResourceAsStream("/data/titanic.csv");
Table titanic = Table.read().csv(inputStream);

Purpose:

Loads the Titanic CSV file into a Tablesaw Table object for further processing.

Example: Data Cleaning & Feature Engineering

import tech.tablesaw.api.BooleanColumn;
import tech.tablesaw.api.NumericColumn;

// SibSp = siblings/spouses aboard, Parch = parents/children aboard
NumericColumn<?> sibSpColumn = titanic.numberColumn("SibSp");
NumericColumn<?> parchColumn = titanic.numberColumn("Parch");
BooleanColumn aloneColumn = BooleanColumn.create("Alone", titanic.rowCount());
for (int i = 0; i < titanic.rowCount(); i++) {
    // A passenger is alone when both family counts are zero
    boolean isAlone = sibSpColumn.getDouble(i) == 0 && parchColumn.getDouble(i) == 0;
    aloneColumn.set(i, isAlone);
}
titanic.addColumns(aloneColumn);

Purpose: Adds a new column “Alone” to indicate if a passenger was traveling alone.

Example: Data Visualization

import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.api.Histogram;

Plot.show(Histogram.create("Fare Distribution", titanic.numberColumn("Fare")));

Purpose:

Plots a histogram of the “Fare” column for visual analysis.

2. Weka: Machine Learning

Weka is a Java library for machine learning, providing algorithms and evaluation tools.

Where is Weka used?

- TitanicML.java: Converts cleaned data into a format Weka understands, trains models, and evaluates them.

Example: Converting Tablesaw Table to Weka Instances

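import java.util.ArrayList;
import java.util.List;

import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Row;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;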

private static Instances convertTableToWeka(Table table) {
    List<Attribute> attributes = new ArrayList<>();
    for (Column<?> col : table.columns()) {
        if (col.type().equals(ColumnType.STRING)) {
            List<String> classValues = new ArrayList<>();
            table.stringColumn(col.name()).unique().forEach(classValues::add);
            attributes.add(new Attribute(col.name(), classValues));
        } else {
            attributes.add(new Attribute(col.name()));
        }
    }
    Instances data = new Instances("Titanic", new ArrayList<>(attributes), table.rowCount());
    for (Row row : table) {
        double[] values = new double[table.columnCount()];
        for (int i = 0; i < table.columnCount(); i++) {
            Column<?> col = table.column(i);
            if (col.type() == ColumnType.INTEGER) {
                values[i] = row.getInt(i);
            } else if (col.type() == ColumnType.DOUBLE) {
                values[i] = row.getDouble(i);
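            } else if (col.type() == ColumnType.STRING) {
                // Nominal values are stored as the index of the label in the attribute's value list
                values[i] = attributes.get(i).indexOfValue(row.getString(i));
            } else if (col.type() == ColumnType.BOOLEAN) {
                // Boolean columns such as "Alone": true -> 1, false -> 0
                values[i] = row.getBoolean(i) ? 1.0 : 0.0;
            }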
        }
        data.add(new DenseInstance(1.0, values));
    }
    return data;
}

Purpose: Converts a Tablesaw Table into Weka Instances, which is the format Weka uses for machine learning.
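Weka stores every row as an array of doubles, so a text (nominal) value is saved as the position of its label in the attribute's list of allowed values. That is what the `indexOfValue` call in the conversion code looks up. The plain-Java sketch below mimics that idea without Weka. The class `NominalIndexSketch`, its methods, and the sample labels are hypothetical and only illustrate the mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; it imitates the idea, not Weka's internals.
class NominalIndexSketch {

    // Collects the distinct labels in first-seen order.
    // This plays the role of a nominal attribute's list of allowed values.
    static List<String> distinctValues(String[] column) {
        List<String> values = new ArrayList<>();
        for (String label : column) {
            if (!values.contains(label)) {
                values.add(label);
            }
        }
        return values;
    }

    // Encodes each label as its position in the value list (-1 if the label is unknown).
    static double[] encode(String[] column, List<String> values) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = values.indexOf(column[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] labels = {"S", "C", "S", "Q"}; // made-up sample
        List<String> values = distinctValues(labels);
        System.out.println(values + " -> " + Arrays.toString(encode(labels, values)));
    }
}
```

The numbers are only identifiers for the labels. Weka treats them as categories, not as quantities, because the attribute is declared nominal.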

Example: Training and Evaluating Models with Weka

import weka.classifiers.trees.J48;
import weka.classifiers.functions.Logistic;

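// Before training, the class attribute must be set on the data (data.setClassIndex(...)),
// and it must be nominal: both J48 and Logistic predict categories, not numbers.
J48 tree = new J48();
tree.buildClassifier(data);

Logistic logistic = new Logistic();
logistic.buildClassifier(data);

// Evaluate with cross-validation (evaluateModel is a project helper that is not shown here)
evaluateModel(tree, data, "Decision Tree");
evaluateModel(logistic, data, "Logistic Regression");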

Purpose:

Trains a Decision Tree (J48) and Logistic Regression model on the Titanic data. Evaluates models using cross-validation.
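Cross-validation splits the data into k parts (folds), trains on k - 1 of them, tests on the part left out, and repeats until every fold has been the test set once. The plain-Java sketch below shows that loop with a deliberately trivial "model" that always predicts the most common label in the training rows. It is not the project's `evaluateModel` helper, and it skips the shuffling and stratification that Weka's own cross-validation applies. The class `CrossValidationSketch` and the sample labels are hypothetical.

```java
// Hypothetical helper class for illustration; not part of the Titanic project files.
class CrossValidationSketch {

    // "Trains" the baseline: finds the majority label (0 or 1) among the rows
    // that are NOT in the held-out fold. Ties go to 0.
    static int majorityLabel(int[] labels, int k, int heldOutFold) {
        int ones = 0;
        int total = 0;
        for (int i = 0; i < labels.length; i++) {
            if (i % k == heldOutFold) continue; // skip the test rows
            ones += labels[i];
            total++;
        }
        return 2 * ones > total ? 1 : 0;
    }

    // k-fold cross-validated accuracy of the majority-label baseline.
    // Row i belongs to fold i % k; each fold is the test set exactly once.
    static double crossValidatedAccuracy(int[] labels, int k) {
        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            int prediction = majorityLabel(labels, k, fold);  // train on the other folds
            for (int i = fold; i < labels.length; i += k) {    // test on the held-out fold
                if (labels[i] == prediction) correct++;
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] survived = {0, 0, 1, 0, 1, 0}; // made-up labels
        System.out.println(crossValidatedAccuracy(survived, 3));
    }
}
```

A baseline like this is a useful yardstick: a decision tree or logistic regression model is only interesting if its cross-validated accuracy beats it.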

Popcorn Hack

- Run the Titanic code on your own computer

- Use Tablesaw to visualize the class distribution (first, second, third class) of the Titanic data
