Java ML - Weka
Categories: Java Spring
Weka (Waikato Environment for Knowledge Analysis)
- Java-based machine learning toolkit with a large number of built-in algorithms for classification, regression, clustering, etc.
- Useful for teaching, rapid prototyping, and data analysis
Weka in the Titanic example
Tablesaw vs. Smile
Aspect | Tablesaw | Smile |
---|---|---|
Primary focus | Data manipulation and exploratory data analysis (EDA), similar to pandas in Python | Machine learning, statistics, and data analysis library with ML models and algorithms |
DataFrame support | Yes, Tablesaw provides a rich DataFrame API for tabular data manipulation | Yes, Smile provides DataFrame, but often more focused on ML workflows |
ML Algorithms | Minimal or no built-in ML algorithms; mainly for data wrangling and analysis | Extensive ML support: classification, regression, clustering, dimensionality reduction, etc. |
Data types support | Supports various column types (numeric, categorical, date, etc.) with convenient API | Supports different types but with a focus on numeric data for ML |
Data visualization | Limited built-in support, but can export or integrate with Java plotting libs | Very limited visualization; focus is on ML and stats |
Performance | Efficient for in-memory tabular data; good for typical data wrangling tasks | Highly optimized for numerical computation and ML tasks |
Missing value handling | Good support for missing data in tables | Supports missing data but less focus on data cleaning than Tablesaw |
API complexity | Simple and intuitive for data manipulation and EDA | More complex, with many ML-related classes and utilities |
Community and documentation | Growing, focused on data manipulation | Mature, with focus on ML and statistics |
Integration | Easy integration with Java projects for ETL, data manipulation | Great for projects requiring ML algorithms and predictive modeling |
Tablesaw
Titanic Data Project Overview
This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.
1. TitanicPreprocess.java: Data Cleaning & Preparation
Prepares the raw Titanic dataset for analysis and modeling:
- Loads the raw dataset from a CSV file.
- Adds a new "Alone" column to indicate whether a passenger was traveling alone.
- Removes irrelevant columns: PassengerId, Name, Ticket, and Cabin.
- Encodes categorical variables:
  - Sex: male → 1, female → 0
  - Embarked: C → 1, Q → 2, S → 3
- Fills missing values with the median of each column.
- Saves the cleaned dataset to titanic_cleaned.csv.
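The encoding and median-fill steps above can be sketched in plain Java, independent of Tablesaw. This is a minimal illustration, not the project's code: the class and method names are hypothetical, and the use of NaN to mark a missing value is an assumption of the sketch.

```java
import java.util.Arrays;

// Hypothetical helper class; names are illustrative and not taken from the project.
final class PreprocessSketch {

    // Sex: male -> 1, female -> 0 (the mapping listed above).
    static int encodeSex(String sex) {
        switch (sex) {
            case "male": return 1;
            case "female": return 0;
            default: throw new IllegalArgumentException("Unknown sex: " + sex);
        }
    }

    // Embarked: C -> 1, Q -> 2, S -> 3 (the mapping listed above).
    static int encodeEmbarked(String port) {
        switch (port) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default: throw new IllegalArgumentException("Unknown port: " + port);
        }
    }

    // Median of the non-missing values; NaN marks a missing value in this sketch.
    static double median(double[] values) {
        double[] present = Arrays.stream(values).filter(v -> !Double.isNaN(v)).sorted().toArray();
        if (present.length == 0) {
            throw new IllegalArgumentException("No non-missing values");
        }
        int mid = present.length / 2;
        return present.length % 2 == 1 ? present[mid] : (present[mid - 1] + present[mid]) / 2.0;
    }

    // Returns a copy in which every NaN is replaced by the column median.
    static double[] fillWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        double[] ages = {22.0, Double.NaN, 38.0, 26.0}; // made-up sample values
        System.out.println(Arrays.toString(fillWithMedian(ages)));
        System.out.println(encodeSex("female") + " " + encodeEmbarked("S"));
    }
}
```

The encoders reject unknown labels instead of guessing a code, so rows with a missing Sex or Embarked value would need to be handled before encoding.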
2. TitanicAnalysis.java: Exploratory Data Analysis (EDA)
Performs visual and statistical analysis on the Titanic dataset:
- Loads the raw dataset from a CSV file.
- Adds the "Alone" column.
- Splits data into subsets of survivors and non-survivors.
- Calculates and visualizes:
  - Survival rate by gender
  - Survival rate based on fare price
  - Comparison of survival for passengers traveling alone vs. with family
  - Age distribution among survivors and non-survivors
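A survival rate per group is the number of survivors in the group divided by the group size. The plain-Java sketch below shows that calculation for any grouping label (gender, alone vs. with family, a fare bucket). The class name, method name, and sample rows are hypothetical and made up for illustration; the project itself works on Tablesaw tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; not taken from the project.
final class SurvivalRateSketch {

    // groups[i] is the group label of passenger i; survived[i] is 1 for a survivor, 0 otherwise.
    // Returns the fraction of survivors per label, in order of first appearance.
    static Map<String, Double> survivalRateByGroup(String[] groups, int[] survived) {
        if (groups.length != survived.length) {
            throw new IllegalArgumentException("Length mismatch");
        }
        Map<String, int[]> counts = new LinkedHashMap<>(); // label -> {survivors, total}
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((label, c) -> rates.put(label, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Made-up toy rows, not values from the Titanic dataset.
        String[] sex = {"male", "female", "female", "male"};
        int[] survived = {0, 1, 1, 1};
        System.out.println(survivalRateByGroup(sex, survived));
    }
}
```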
3. TitanicML.java: Machine Learning Modeling
Builds and evaluates ML models using the cleaned dataset:
- Loads and processes the data (similar to TitanicPreprocess.java).
- Normalizes numerical features to a 0–1 range.
- Converts Tablesaw tables to Weka instances.
- Trains two models:
  - Decision Tree (J48)
  - Logistic Regression
- Evaluates the models using cross-validation.
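Scaling a feature to a 0–1 range is commonly done with min-max normalization, (x - min) / (max - min). The sketch below assumes that is the method meant here; the source does not show the project's actual normalization code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper class; not taken from the project.
final class NormalizeSketch {

    // Rescales values to [0, 1] via (x - min) / (max - min).
    // A constant column maps to all zeros to avoid division by zero (a choice made for this sketch).
    static double[] minMaxNormalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] fares = {10.0, 20.0, 30.0}; // made-up sample values
        System.out.println(Arrays.toString(minMaxNormalize(fares)));
    }
}
```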
Recommended Execution Order
To ensure a smooth workflow, run the scripts in this order:
- TitanicPreprocess.java: Cleans the dataset and generates titanic_cleaned.csv.
- TitanicAnalysis.java: Performs EDA and generates insights and visualizations.
- TitanicML.java: Runs machine learning models on the cleaned dataset.
1. Tablesaw: Data Analysis & Preprocessing
Tablesaw is a Java library for data manipulation, cleaning, and visualization—similar to pandas in Python.
Where is Tablesaw used?
- TitanicPreprocess.java: Cleans and transforms the raw Titanic data.
- TitanicAnalysis.java: Performs exploratory data analysis and visualization.
- TitanicML.java: Prepares data for machine learning.
Example: Loading Data with Tablesaw
import tech.tablesaw.api.Table;
import java.io.InputStream;

// Read the CSV from the classpath; getResourceAsStream returns null if the resource is missing
InputStream inputStream = TitanicAnalysis.class.getResourceAsStream("/data/titanic.csv");
Table titanic = Table.read().csv(inputStream);
Purpose: Loads the Titanic CSV file into a Tablesaw Table object for further processing.
Example: Data Cleaning & Feature Engineering
import tech.tablesaw.api.BooleanColumn;
import tech.tablesaw.api.NumericColumn;

NumericColumn<?> sibSpColumn = titanic.numberColumn("SibSp");
NumericColumn<?> parchColumn = titanic.numberColumn("Parch");
// One Boolean per row: true when the passenger has no siblings/spouses and no parents/children aboard
BooleanColumn aloneColumn = BooleanColumn.create("Alone", titanic.rowCount());
for (int i = 0; i < titanic.rowCount(); i++) {
    boolean isAlone = sibSpColumn.getDouble(i) == 0 && parchColumn.getDouble(i) == 0;
    aloneColumn.set(i, isAlone);
}
titanic.addColumns(aloneColumn);
Purpose: Adds a new column “Alone” to indicate if a passenger was traveling alone.
Example: Data Visualization
import tech.tablesaw.plotly.Plot;
import tech.tablesaw.plotly.api.Histogram;
Plot.show(Histogram.create("Fare Distribution", titanic.numberColumn("Fare")));
Purpose: Plots a histogram of the “Fare” column for visual analysis.
2. Weka: Machine Learning
Weka is a Java library for machine learning, providing algorithms and evaluation tools.
Where is Weka used?
- TitanicML.java: Converts cleaned data into a format Weka understands, trains models, and evaluates them.
Example: Converting Tablesaw Table to Weka Instances
import java.util.ArrayList;
import java.util.List;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Row;
import tech.tablesaw.api.Table;
import tech.tablesaw.columns.Column;
import weka.core.*;

private static Instances convertTableToWeka(Table table) {
    // Build one Weka attribute per column: nominal for strings, numeric otherwise
    List<Attribute> attributes = new ArrayList<>();
    for (Column<?> col : table.columns()) {
        if (col.type().equals(ColumnType.STRING)) {
            List<String> classValues = new ArrayList<>();
            table.stringColumn(col.name()).unique().forEach(classValues::add);
            attributes.add(new Attribute(col.name(), classValues));
        } else {
            attributes.add(new Attribute(col.name()));
        }
    }
    Instances data = new Instances("Titanic", new ArrayList<>(attributes), table.rowCount());
    // Copy each row into a dense instance; string values are stored as nominal indices
    for (Row row : table) {
        double[] values = new double[table.columnCount()];
        for (int i = 0; i < table.columnCount(); i++) {
            Column<?> col = table.column(i);
            if (col.type() == ColumnType.INTEGER) {
                values[i] = row.getInt(i);
            } else if (col.type() == ColumnType.DOUBLE) {
                values[i] = row.getDouble(i);
            } else if (col.type() == ColumnType.STRING) {
                values[i] = attributes.get(i).indexOfValue(row.getString(i));
            }
            // Other column types (e.g. BOOLEAN) are not handled here and stay 0.0
        }
        data.add(new DenseInstance(1.0, values));
    }
    // The class index is not set here; the caller must set it before training
    return data;
}
Purpose: Converts a Tablesaw Table into Weka Instances, which is the format Weka uses for machine learning.
Example: Training and Evaluating Models with Weka
import weka.classifiers.trees.J48;
import weka.classifiers.functions.Logistic;

// Assumes the class index of data is already set; J48 and Logistic both need a nominal class attribute
J48 tree = new J48();
tree.buildClassifier(data);
Logistic logistic = new Logistic();
logistic.buildClassifier(data);
// Evaluate with cross-validation (evaluateModel is a helper whose body is not shown here)
evaluateModel(tree, data, "Decision Tree");
evaluateModel(logistic, data, "Logistic Regression");
Purpose: Trains a Decision Tree (J48) and Logistic Regression model on the Titanic data. Evaluates models using cross-validation.
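The body of evaluateModel is not shown, so the following plain-Java sketch only illustrates what k-fold cross-validation does: shuffle the rows, split them into k folds, train on k-1 folds, and score on the held-out fold. A majority-class baseline stands in for a real classifier, and all names are hypothetical. With Weka, the usual route is weka.classifiers.Evaluation.crossValidateModel; whether the project's helper uses it is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper class; not taken from the project.
final class CrossValidationSketch {

    // Shuffles row indices 0..n-1 and deals them into k folds of near-equal size.
    static List<List<Integer>> kFolds(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("Need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    // Cross-validated accuracy of a majority-class baseline on binary (0/1) labels:
    // for each fold, "train" by finding the most common label in the other folds,
    // then count correct predictions on the held-out fold.
    static double majorityBaselineAccuracy(int[] labels, int k, long seed) {
        List<List<Integer>> folds = kFolds(labels.length, k, seed);
        int correct = 0;
        for (List<Integer> test : folds) {
            boolean[] heldOut = new boolean[labels.length];
            for (int i : test) {
                heldOut[i] = true;
            }
            int ones = 0;
            int trainSize = labels.length - test.size();
            for (int i = 0; i < labels.length; i++) {
                if (!heldOut[i]) {
                    ones += labels[i];
                }
            }
            int prediction = 2 * ones > trainSize ? 1 : 0;
            for (int i : test) {
                if (labels[i] == prediction) {
                    correct++;
                }
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] survived = {0, 0, 1, 0, 1, 0, 0, 1, 0, 0}; // made-up labels
        System.out.println(majorityBaselineAccuracy(survived, 5, 1L));
    }
}
```

A real classifier should beat this baseline; if it does not, the features or the preprocessing are worth revisiting before comparing models.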