
Task Type Clarification

Primary Task: Binary Classification

This dataset is designed for binary classification of cerebral palsy presence.

Target Variable

  • Variable name: has_cp
  • Type: Boolean (True/False)
  • Classes:
    • False (0): No cerebral palsy
    • True (1): Cerebral palsy diagnosed

Example

import pandas as pd

df = pd.read_csv('africa_cp_train_5000.csv')

# Target variable
y = df['has_cp']  # Boolean: True/False

# Class distribution
print(y.value_counts())
# Output:
# False    4500  (90%)
# True      500  (10%)
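With a 90/10 class split, accuracy on its own can mislead. The following is a minimal sketch that rebuilds the labels from the class counts printed above (it does not read the CSV) and scores a trivial baseline that always predicts False:

```python
# Labels reconstructed from the class counts shown above (not read from the CSV)
y_true = [False] * 4500 + [True] * 500

# Trivial baseline: always predict the majority class (no CP)
y_pred = [False] * len(y_true)

# Accuracy: share of predictions that match the label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Sensitivity: share of true CP cases that the baseline flags
true_positives = sum(t and p for t, p in zip(y_true, y_pred))
sensitivity = true_positives / sum(y_true)

print(f"Baseline accuracy:    {accuracy:.2f}")
print(f"Baseline sensitivity: {sensitivity:.2f}")
```

The baseline reaches 90% accuracy while detecting no CP cases, which is why sensitivity and class weighting matter for this target.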

Secondary Task: Risk Probability Prediction

The dataset also supports probabilistic risk prediction through the generated probability scores.

Probability Score

  • Variable name: cp_probability_score
  • Type: Float (0.0 to 1.0)
  • Interpretation: Calculated probability of CP based on risk factors
  • Use case: Risk stratification, triage, early warning systems

Example

# Use probability scores for risk stratification
df['risk_category'] = pd.cut(
    df['cp_probability_score'],
    bins=[0, 0.1, 0.3, 0.5, 1.0],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True  # keep scores of exactly 0.0 in the 'Low' bin
)

print(df.groupby('risk_category', observed=False)['has_cp'].mean())
# Shows actual CP rate by risk category

Task Categories on Hugging Face

When uploading to Hugging Face, use these tags:

Primary:

  • task_categories: tabular-classification
  • task_ids: binary-classification

Secondary:

  • task_ids: health-classification
  • task_ids: medical-diagnosis

Model Training Approaches

1. Binary Classification (Recommended)

Predict presence/absence of CP:

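from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features (feature_cols: your list of predictor columns; leave out
# has_cp and cp_probability_score so the target does not leak in)
X = df[feature_cols]
y = df['has_cp']  # Binary target

# Hold out a stratified test set (split size is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train classifier
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Predict class
y_pred = model.predict(X_test)  # True/False

# Predict probability
y_prob = model.predict_proba(X_test)[:, 1]  # 0.0 to 1.0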

Evaluation Metrics:

  • Accuracy
  • Sensitivity (Recall) - crucial for medical screening
  • Specificity
  • AUC-ROC
  • Precision-Recall AUC
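As a minimal sketch of how the two screening metrics above are derived from confusion-matrix counts, the helper below (a hypothetical name, not part of the dataset tooling) computes sensitivity and specificity on toy labels. In practice, scikit-learn's `recall_score`, `roc_auc_score`, and `average_precision_score` cover sensitivity and the two AUC metrics.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = CP, 0 = no CP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          # CP, flagged
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      # CP, missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  # no CP, cleared
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      # no CP, flagged
    sensitivity = tp / (tp + fn)  # share of CP cases that were flagged
    specificity = tn / (tn + fp)  # share of non-CP cases correctly cleared
    return sensitivity, specificity


# Toy labels for illustration only (not taken from the dataset)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true_demo, y_pred_demo)
print(f"Sensitivity: {sens:.2f}, Specificity: {spec:.2f}")
```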

2. Risk Score Prediction (Alternative)

Predict continuous probability score:

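from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Features (feature_cols: your list of predictor columns, excluding both targets)
X = df[feature_cols]
y = df['cp_probability_score']  # Continuous target (0-1)

# Hold out a test set (split size is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train regressor
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict risk score
risk_scores = model.predict(X_test)  # 0.0 to 1.0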

Evaluation Metrics:

  • Mean Absolute Error (MAE)
  • Root Mean Squared Error (RMSE)
  • R² Score
  • Calibration metrics
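The first three metrics above can be sketched in plain Python to show what each one measures. The function name and the toy scores are illustrative assumptions; scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` compute the same quantities.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length lists of scores."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Toy probability scores for illustration only (not taken from the dataset)
scores_true = [0.10, 0.20, 0.30, 0.40]
scores_pred = [0.10, 0.30, 0.30, 0.30]

mae, rmse, r2 = regression_metrics(scores_true, scores_pred)
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.2f}")
```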

Note: This approach is less common because the probability scores are already provided with the dataset. The primary use case remains binary classification.


Clinical Application Context

Screening/Triage Use Case (Classification)

Goal: Identify which infants need referral to specialist

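# Binary decision for a single patient
# patient_features: 2D input with one row; threshold: chosen cutoff (see below)
if model.predict_proba(patient_features)[0][1] > threshold:
    action = "REFER to pediatric neurologist"
else:
    action = "Continue routine monitoring"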

Optimal Threshold Selection:

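from sklearn.metrics import precision_recall_curve

# Find the highest threshold that still gives at least 90% sensitivity
precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
# recalls has one more entry than thresholds, so drop its last element;
# thresholds ascend while recall falls, so take the last qualifying one
optimal_threshold = thresholds[recalls[:-1] >= 0.90][-1]

# Apply threshold
y_pred = (y_prob >= optimal_threshold).astype(int)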
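The selection rule itself, the highest cutoff that still meets a sensitivity target, can be sketched without any library to make the logic explicit. The function name, toy scores, and the 80% target below are illustrative assumptions, not values from the dataset.

```python
def highest_threshold_for_recall(y_true, y_prob, target_recall=0.90):
    """Return the largest score cutoff whose recall is >= target_recall."""
    n_pos = sum(y_true)
    best = None
    for thr in sorted(set(y_prob)):
        # Positives still flagged when everything scoring >= thr is referred
        tp = sum(1 for t, p in zip(y_true, y_prob) if t and p >= thr)
        if tp / n_pos >= target_recall:
            best = thr  # candidates ascend, so the last qualifying one wins
    return best


# Toy data for illustration only: 5 CP cases among 9 infants
y_true_demo = [0, 0, 1, 0, 1, 0, 1, 1, 1]
y_prob_demo = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

cutoff = highest_threshold_for_recall(y_true_demo, y_prob_demo, target_recall=0.80)
print(f"Chosen cutoff: {cutoff}")
```

Raising the cutoff past this point would refer fewer infants but drop sensitivity below the target, which is the trade-off a screening threshold has to settle.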

Risk Stratification (Probability Scoring)

Goal: Prioritize limited resources to highest-risk infants

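# Continuous risk scoring
# patients: DataFrame of patient records; patients_features: its feature columns
patient_risks = model.predict_proba(patients_features)[:, 1]
patients = patients.assign(risk=patient_risks)  # store scores for sorting

# Prioritize by risk
high_risk_patients = patients[patients['risk'] > 0.3]
priority_list = high_risk_patients.sort_values('risk', ascending=False)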

Key Differences Summary

| Aspect | Binary Classification | Probability Prediction |
|---|---|---|
| Target | has_cp (boolean) | cp_probability_score (float) |
| Output | Class label (0/1) | Probability (0.0-1.0) |
| Loss Function | Binary cross-entropy | Mean squared error |
| Primary Metric | AUC-ROC, Sensitivity | MAE, Calibration |
| Clinical Use | Decision (refer/not) | Risk stratification |
| Recommended | ✅ Yes | ⚠️ Alternative |

Why Binary Classification is Primary

  1. Clinical Decision Making: Doctors need yes/no decisions for referral
  2. Standard Medical AI: Most diagnostic AI uses classification
  3. Interpretability: Clearer for non-technical stakeholders
  4. Evaluation: Standard medical AI metrics (sensitivity/specificity)
  5. Deployment: Simpler to implement in clinical workflows

However, models should output probability scores for:

  • Confidence estimation
  • Risk stratification
  • Threshold tuning
  • Calibration assessment
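Calibration assessment, the last item above, checks whether predicted probabilities match observed CP rates. The following is a minimal sketch on toy values; the function names and data are illustrative assumptions. scikit-learn offers `brier_score_loss` (in `sklearn.metrics`) and `calibration_curve` (in `sklearn.calibration`) for the same purpose.

```python
def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def reliability_table(y_true, y_prob, n_bins=5):
    """Per equal-width bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((t, p))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(t for t, _ in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Toy outcomes and predicted probabilities for illustration only
y_true_demo = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob_demo = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]

brier = brier_score(y_true_demo, y_prob_demo)
rows = reliability_table(y_true_demo, y_prob_demo)
print(f"Brier score: {brier:.3f}")
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.2f} -> observed {observed:.2f} (n={count})")
```

A well-calibrated model shows observed rates close to the mean predicted probability in every bin; large gaps suggest the probabilities should be recalibrated before they drive referral thresholds.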

Recommended Workflow

# 1. Train binary classifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)

model.fit(X_train, y_train)

# 2. Predict both class AND probability
y_pred_class = model.predict(X_test)        # Binary: 0 or 1
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Probability: 0.0-1.0

# 3. Evaluate classification performance
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred_class))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_prob):.3f}")

# 4. Use probabilities for risk stratification
high_risk = X_test[y_pred_prob > 0.5]
medium_risk = X_test[(y_pred_prob > 0.2) & (y_pred_prob <= 0.5)]
low_risk = X_test[y_pred_prob <= 0.2]

Hugging Face Dataset Card YAML

Add this to the top of README.md:

---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
task_ids:
- binary-classification
- medical-diagnosis
pretty_name: African Cerebral Palsy Synthetic Dataset
size_categories:
- 10K<n<100K
tags:
- cerebral-palsy
- medical
- healthcare
- africa
- synthetic-data
- pediatrics
- binary-classification
---

Summary

Primary: Binary Classification (has_cp: True/False)
Secondary: Probability scores available for risk assessment
Clinical Goal: Screen/triage infants for specialist referral
Model Output: Class prediction + confidence probability
Evaluation: Sensitivity ≥90%, AUC-ROC ≥0.90

In short: this is a binary classification task, not a regression task.