Tablesaw
Breadcrumb: /javaml/introTablesaw is a Java Library for data manipulation and analysis.
Tablesaw Introduction
This guide introduces Tablesaw for data prep and cleaning in Java, using the Titanic dataset as an example.
Why use Tablesaw?
- When you have structured data (basically just things where you know the format of the data beforehand), libraries like tablesaw make it super easy to work with the data. In this case, we have structured data because we have a consistent set of properties of the titanic passengers every time.
Goals
- Understand Tablesaw as a Java-based “smart spreadsheet.”
- Clean messy data: fill blanks, drop useless columns, encode categories.
- Normalize data for modeling.
- Convert Tablesaw
Table
to WekaInstances
. - Sketch model building (Decision Tree, Logistic Regression).
1. Tablesaw = Spreadsheet with Superpowers
- Think of Tablesaw as a programmable Excel sheet.
- Rows = passengers, columns = features.
- Lets you filter, transform, and visualize with Java code.
2. Titanic Data = Messy Kitchen
Issues:
- Missing values (e.g., Age) → fill with median.
- Useless columns (e.g., Name, Ticket) → drop them.
- Text columns (e.g., Sex, Embarked) → convert to numbers.
- Uneven scales (e.g., Age vs. Fare) → normalize to [0,1].
Sample Code:
table.doubleColumn("Age").setMissingTo(table.numberColumn("Age").median());
table = table.removeColumns("Name", "Ticket", "Cabin");
3. Cleaned Data Snapshot
Pclass | Sex | Age_norm | Fare_norm | Embarked | Survived |
---|---|---|---|---|---|
3 | 1 | 0.27 | 0.014 | 0 | 0 |
1 | 0 | 0.47 | 0.143 | 1 | 1 |
Categories: Sex (male=1, female=0), Embarked (S=0, C=1, Q=2) Normalize: (value - min) / (max - min)
4. From Tablesaw to Weka
Think of this as “packing your data into a suitcase” Weka understands.
Steps: Define Attributes (numeric or nominal).
Create Instances and set Survived as the class.
Loop through rows to add instances.
Instances data = new Instances("Titanic", attributes, table.rowCount());
for (Row row : table) {
double[] vals = {row.getDouble("Pclass"), ..., ...};
data.add(new DenseInstance(1.0, vals));
}
5. Modeling with Weka
J48 Tree: Splits based on conditions (e.g., Pclass ≤ 2).
Logistic Regression: Combines features into a prediction formula.
Cross-validation: Slice data into 10 parts, train/test each.
Homework - Part 1
Look at the Titanic ML code for Java and Python. Where do you see similarities in code flow and structure?