Weka
What Is WEKA?
- WEKA stands for Waikato Environment for Knowledge Analysis. It was developed at the University of Waikato in New Zealand.
- WEKA is a simple and powerful tool that helps computers learn from data
- Contains a collection of visualization tools and algorithms for data analysis and predictive modeling (essentially a machine learning toolkit, with visual tools for exploring data and results)
Imagine you’re trying to decide if a picture is of a cat or a dog. You look at enough pictures and eventually, you start to notice patterns. WEKA does the same thing with data: it looks at many examples and picks up on the patterns.
Why Is WEKA Useful?
- Coding all the math and logic for machine learning would take a lot of time and expertise. WEKA saves you all that work.
- It has built-in tools for data processing and modeling, so you don’t need to be experienced in coding or stats.
- It works directly with Java, so you can easily integrate machine learning into your Java programs.
What Can WEKA Do?
- Classification – This helps you sort data into categories, like whether an email is spam or not.
- Regression – WEKA can predict continuous values, like predicting house prices or grades.
- Clustering – It can group similar data together, even if you haven’t told it what those groups are.
- Association – WEKA can find patterns in data, like which products tend to be bought together.
All of these are data mining tasks. WEKA also includes tools to clean and prepare data before modeling.
How Does WEKA Actually Work?
- Load your data. This could be a spreadsheet of grades, images of animals, or anything else.
- Preprocess your data. Remove any missing info, normalize it, and get it ready for learning.
- Choose your learning type. Do you want to sort things into categories, predict numbers, or find patterns?
- Select an algorithm. WEKA has lots of built-in algorithms, like decision trees or k-means clustering.
- Evaluate the model. Test how well it’s doing by checking accuracy and other measures.
- Use your model. Once you’re happy, use it to make predictions or sort new data.
Titanic Data Project Overview
Let’s explore Tablesaw and Weka more by visiting the Titanic dataset. This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.
The Titanic files can be found in this directory.
1. TitanicPreprocess.java: Data Cleaning & Preparation
Prepares the raw Titanic dataset for analysis and modeling:
- Loads the raw dataset from a CSV file.
- Adds a new "Alone" column to indicate whether a passenger was traveling alone.
- Removes irrelevant columns: PassengerId, Name, Ticket, and Cabin.
- Encodes categorical variables:
  - Sex: male → 1, female → 0
  - Embarked: C → 1, Q → 2, S → 3
- Fills missing values with the median of each column.
- Saves the cleaned dataset to titanic_cleaned.csv.
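The encoding and median-fill steps in the list above can be sketched in plain Java. This is a minimal illustration, not the code from TitanicPreprocess.java: the class name `TitanicCleaningSketch`, the method names, and the use of `NaN` to mark a missing value are all assumptions made for the example.

```java
import java.util.Arrays;

class TitanicCleaningSketch {

    // Sex encoding from the list above: male -> 1, female -> 0.
    static int encodeSex(String sex) {
        switch (sex) {
            case "male": return 1;
            case "female": return 0;
            default: throw new IllegalArgumentException("Unexpected Sex value: " + sex);
        }
    }

    // Embarked encoding from the list above: C -> 1, Q -> 2, S -> 3.
    // Missing ports would need to be handled before calling this method.
    static int encodeEmbarked(String port) {
        switch (port) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default: throw new IllegalArgumentException("Unexpected Embarked value: " + port);
        }
    }

    // Replaces NaN entries (assumed here to mark missing values)
    // with the median of the values that are present.
    static double[] fillWithMedian(double[] column) {
        double[] present = Arrays.stream(column)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (present.length == 0) {
            return column.clone(); // nothing to compute a median from
        }
        int mid = present.length / 2;
        double median = present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
        return Arrays.stream(column)
                .map(v -> Double.isNaN(v) ? median : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Made-up ages for illustration only.
        double[] ages = {22.0, Double.NaN, 38.0, 26.0};
        System.out.println(Arrays.toString(fillWithMedian(ages))); // [22.0, 26.0, 38.0, 26.0]
        System.out.println(encodeSex("female") + " " + encodeEmbarked("S")); // 0 3
    }
}
```

The median is used rather than the mean because it is less affected by extreme values, which matters for skewed columns such as fares.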
2. TitanicAnalysis.java: Exploratory Data Analysis (EDA)
Performs visual and statistical analysis on the Titanic dataset:
- Loads the raw dataset from a CSV file.
- Adds the "Alone" column.
- Splits data into subsets of survivors and non-survivors.
- Calculates and visualizes:
  - Survival rate by gender
  - Survival rate based on fare price
  - Comparison of survival for passengers traveling alone vs. with family
  - Age distribution among survivors and non-survivors
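A survival rate by group is the number of survivors in the group divided by the group's size. The plain-Java sketch below shows that calculation. The class and method names are hypothetical, and the four rows are made up for illustration; they are not real Titanic records, and the project itself does this work with Tablesaw.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SurvivalRateSketch {

    // Returns survivors / total for each distinct group label.
    // survived[i] is expected to be 1 (survived) or 0 (did not survive).
    static Map<String, Double> survivalRateByGroup(String[] groups, int[] survived) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // label -> {survivors, total}
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((label, c) -> rates.put(label, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        String[] sex = {"male", "female", "female", "male"};
        int[] survived = {0, 1, 1, 1};
        System.out.println(survivalRateByGroup(sex, survived)); // {male=0.5, female=1.0}
    }
}
```

The same method works for the alone vs. with-family comparison by passing "alone" and "family" labels instead of gender.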
3. TitanicML.java: Machine Learning Modeling
Builds and evaluates ML models using the cleaned dataset:
- Loads and processes the data (similar to TitanicPreprocess.java).
- Normalizes numerical features to a 0–1 range.
- Converts Tablesaw tables to Weka instances.
- Trains two models:
  - Decision Tree (J48)
  - Logistic Regression
- Evaluates the models using cross-validation.
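Scaling a feature to a 0–1 range is commonly done with min-max normalization: subtract the column minimum, then divide by the range (max minus min). The sketch below assumes that is the method meant here; the source does not show TitanicML.java's actual normalization code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class MinMaxSketch {

    // Maps each value to (value - min) / (max - min), so the smallest
    // value becomes 0.0 and the largest becomes 1.0.
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            scaled[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Made-up fares for illustration only.
        double[] fares = {10.0, 20.0, 30.0};
        System.out.println(Arrays.toString(normalize(fares))); // [0.0, 0.5, 1.0]
    }
}
```

Normalizing puts features such as age and fare on the same scale, so a large-valued column does not dominate models that are sensitive to feature magnitude.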
Recommended Execution Order
To ensure a smooth workflow, run the scripts in this order:
- TitanicPreprocess.java: Cleans the dataset and generates titanic_cleaned.csv.
- TitanicAnalysis.java: Performs EDA and generates insights and visualizations.
- TitanicML.java: Runs machine learning models on the cleaned dataset.
1. Tablesaw: Data Analysis & Preprocessing
Tablesaw is a Java library for data manipulation, cleaning, and visualization—similar to pandas in Python.
Where is Tablesaw used?
- TitanicPreprocess.java: Cleans and transforms the raw Titanic data.
- TitanicAnalysis.java: Performs exploratory data analysis and visualization.
- TitanicML.java: Prepares data for machine learning.
Example: Loading Data with Tablesaw
import tech.tablesaw.api.Table;
import java.io.InputStream;
InputStream inputStream = TitanicAnalysis.class.getResourceAsStream("/data/titanic.csv");
Table titanic = Table.read().csv(inputStream);
Purpose:
Loads the Titanic CSV file into a Tablesaw Table object for further processing.
Example: Data Cleaning & Feature Engineering
import tech.tablesaw.api.BooleanColumn;
import tech.tablesaw.api.NumericColumn;

// SibSp = siblings/spouses aboard, Parch = parents/children aboard
NumericColumn<?> sibSpColumn = titanic.numberColumn("SibSp");
NumericColumn<?> parchColumn = titanic.numberColumn("Parch");
BooleanColumn aloneColumn = BooleanColumn.create("Alone", titanic.rowCount());
for (int i = 0; i < titanic.rowCount(); i++) {
    // A passenger is alone when both counts are zero
    boolean isAlone = ((Number) sibSpColumn.get(i)).doubleValue() == 0
            && ((Number) parchColumn.get(i)).doubleValue() == 0;
    aloneColumn.set(i, isAlone);
}
titanic.addColumns(aloneColumn);
Purpose: Adds a new column “Alone” to indicate if a passenger was traveling alone.
Example: Data Visualization
import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.api.Histogram;
Plot.show(Histogram.create("Fare Distribution", titanic.numberColumn("Fare")));
Purpose:
Plots a histogram of the “Fare” column for visual analysis.
2. Weka: Machine Learning
Weka is a Java library for machine learning, providing algorithms and evaluation tools.
Where is Weka used?
- TitanicML.java: Converts cleaned data into a format Weka understands, trains models, and evaluates them.
Example: Converting Tablesaw Table to Weka Instances
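import java.util.ArrayList;
import java.util.List;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Row;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;
import weka.core.*;

private static Instances convertTableToWeka(Table table) {
    // Build one Weka attribute per Tablesaw column
    List<Attribute> attributes = new ArrayList<>();
    for (Column<?> col : table.columns()) {
        if (col.type() == ColumnType.STRING) {
            // String columns become nominal attributes with their distinct values as labels
            List<String> classValues = new ArrayList<>();
            table.stringColumn(col.name()).unique().forEach(classValues::add);
            attributes.add(new Attribute(col.name(), classValues));
        } else {
            // Every other column becomes a numeric attribute
            attributes.add(new Attribute(col.name()));
        }
    }
    Instances data = new Instances("Titanic", new ArrayList<>(attributes), table.rowCount());
    // Copy each row into a Weka instance
    for (Row row : table) {
        double[] values = new double[table.columnCount()];
        for (int i = 0; i < table.columnCount(); i++) {
            Column<?> col = table.column(i);
            if (col.type() == ColumnType.INTEGER) {
                values[i] = row.getInt(i);
            } else if (col.type() == ColumnType.DOUBLE) {
                values[i] = row.getDouble(i);
            } else if (col.type() == ColumnType.STRING) {
                // Nominal values are stored as the index of their label
                values[i] = attributes.get(i).indexOfValue(row.getString(i));
            }
            // Note: any other column type (e.g. BOOLEAN) is left at the default 0.0
        }
        data.add(new DenseInstance(1.0, values));
    }
    // Note: the caller still has to set the class attribute with data.setClassIndex(...)
    return data;
}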
Purpose: Converts a Tablesaw Table into Weka Instances, which is the format Weka uses for machine learning.
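In the conversion above, each string value is stored as a number: the index of its label in the attribute's list of values. The plain-Java sketch below shows the same idea without any Weka classes. The class and method names are hypothetical, and the port codes are sample values for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class NominalIndexSketch {

    // Collects distinct labels in first-seen order, like the value list
    // handed to a nominal attribute.
    static List<String> distinctLabels(String[] column) {
        return new ArrayList<>(new LinkedHashSet<>(Arrays.asList(column)));
    }

    // Replaces each label with its position in the label list.
    static double[] encode(String[] column, List<String> labels) {
        double[] encoded = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            encoded[i] = labels.indexOf(column[i]);
        }
        return encoded;
    }

    public static void main(String[] args) {
        String[] embarked = {"S", "C", "S", "Q"};
        List<String> labels = distinctLabels(embarked);
        System.out.println(labels);                                    // [S, C, Q]
        System.out.println(Arrays.toString(encode(embarked, labels))); // [0.0, 1.0, 0.0, 2.0]
    }
}
```

The indices are only identifiers for categories. A model that treats the attribute as nominal does not read 2.0 as "larger than" 1.0.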
Example: Training and Evaluating Models with Weka
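import weka.classifiers.trees.J48;
import weka.classifiers.functions.Logistic;

// data must have its class attribute set (data.setClassIndex(...)) before training,
// and both J48 and Logistic require that class attribute to be nominal
J48 tree = new J48();
tree.buildClassifier(data);
Logistic logistic = new Logistic();
logistic.buildClassifier(data);
// Evaluate with cross-validation
// (evaluateModel is a helper that is not shown in this example)
evaluateModel(tree, data, "Decision Tree");
evaluateModel(logistic, data, "Logistic Regression");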
Purpose:
Trains a Decision Tree (J48) and Logistic Regression model on the Titanic data. Evaluates models using cross-validation.
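The `evaluateModel` helper is not shown above, so the sketch below only illustrates what k-fold cross-validation does with the rows. The data is shuffled and dealt into k folds, and each fold is held out once as the test set while the model trains on the rest. The class name, method name, and seed are hypothetical. In Weka itself this job is handled by the `weka.classifiers.Evaluation` class through its `crossValidateModel` method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class KFoldSketch {

    // Shuffles the row indices 0..rowCount-1 and deals them into k folds.
    // Fold f is the test set in round f; all other folds form the training set.
    static List<List<Integer>> folds(int rowCount, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed)); // fixed seed keeps runs repeatable

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < indices.size(); i++) {
            folds.get(i % k).add(indices.get(i)); // deal round-robin so sizes differ by at most 1
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 5, 1L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("Round " + f + " tests on rows " + folds.get(f));
        }
    }
}
```

Because every row is used for testing exactly once, the averaged score is a steadier estimate of how the model handles unseen passengers than a single train/test split.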