Task Type Clarification
Primary Task: Binary Classification
This dataset is designed for binary classification of cerebral palsy presence.
Target Variable
- Variable name: has_cp
- Type: Boolean (True/False)
- Classes:
  - False (0): No cerebral palsy
  - True (1): Cerebral palsy diagnosed
Example
import pandas as pd
df = pd.read_csv('africa_cp_train_5000.csv')
# Target variable
y = df['has_cp'] # Boolean: True/False
# Class distribution
print(y.value_counts())
# Output:
# False 4500 (90%)
# True 500 (10%)
Secondary Task: Risk Probability Prediction
The dataset also supports probabilistic risk prediction through the generated probability scores.
Probability Score
- Variable name: cp_probability_score
- Type: Float (0.0 to 1.0)
- Interpretation: Calculated probability of CP based on risk factors
- Use case: Risk stratification, triage, early warning systems
Example
# Use probability scores for risk stratification
df['risk_category'] = pd.cut(
    df['cp_probability_score'],
    bins=[0, 0.1, 0.3, 0.5, 1.0],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True  # keep scores of exactly 0.0 in the 'Low' bin
)
# Shows actual CP rate by risk category
print(df.groupby('risk_category', observed=True)['has_cp'].mean())
Task Categories on Hugging Face
When uploading to Hugging Face, use these tags:
Primary:
task_categories: tabular-classification
task_ids: binary-classification
Secondary:
task_ids: health-classification
task_ids: medical-diagnosis
Model Training Approaches
1. Binary Classification (Recommended)
Predict presence/absence of CP:
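from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Features: feature_cols is your list of input columns
# (it should exclude has_cp and cp_probability_score to avoid leaking the target)
X = df[feature_cols]
y = df['has_cp']  # Binary target
# Hold out a stratified test set so the class ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Train classifier
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
# Predict class
y_pred = model.predict(X_test)  # True/False
# Predict probability
y_prob = model.predict_proba(X_test)[:, 1]  # 0.0 to 1.0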
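The `class_weight='balanced'` setting used above reweights each class by `n_samples / (n_classes * class_count)`. As a rough illustration, the sketch below works this out for a 90/10 split like the one shown earlier. The label array is synthetic and only stands in for `has_cp`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels mirroring a 4500 / 500 class split (assumption for illustration)
y = np.array([0] * 4500 + [1] * 500)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
counts = np.bincount(y)
manual_weights = len(y) / (len(counts) * counts)

# scikit-learn's helper computes the same quantity
sk_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

print(manual_weights)  # weight for class 0, then class 1
print(sk_weights)
```

With these counts the minority class gets a weight of 5.0 and the majority class about 0.56, so each CP case counts roughly nine times as much as a non-CP case during training.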
Evaluation Metrics:
- Accuracy
- Sensitivity (Recall) - crucial for medical screening
- Specificity
- AUC-ROC
- Precision-Recall AUC
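As a minimal sketch of how the metrics listed above could be computed with scikit-learn: the labels and probabilities below are made up for illustration, and average precision is used here as one common summary of the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)

# Toy labels and predicted probabilities (made up for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.60, 0.25, 0.10, 0.80, 0.40])
y_pred = (y_prob >= 0.5).astype(int)

# Confusion matrix entries for the positive class (label 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)  # recall on CP cases
specificity = tn / (tn + fp)  # recall on non-CP cases
auc_roc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"AUC-ROC:     {auc_roc:.3f}")
print(f"PR AUC:      {pr_auc:.3f}")
```

In this toy case accuracy is 0.80 even though half of the positive cases are missed (sensitivity 0.50), which is why accuracy alone can mislead on an imbalanced target.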
2. Risk Score Prediction (Alternative)
Predict continuous probability score:
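from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Features: feature_cols is your list of input columns
X = df[feature_cols]
y = df['cp_probability_score']  # Continuous target (0-1)
# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train regressor
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Predict risk score
risk_scores = model.predict(X_test)  # 0.0 to 1.0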
Evaluation Metrics:
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score
- Calibration metrics
Note: This approach is less common since the probability scores are already generated. The primary use case is binary classification.
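For the regression variant, the error metrics listed above could be computed roughly as follows. The score arrays are invented for illustration and stand in for true and predicted `cp_probability_score` values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented true and predicted risk scores (illustration only)
score_true = np.array([0.05, 0.10, 0.30, 0.60, 0.90])
score_pred = np.array([0.10, 0.10, 0.20, 0.50, 0.80])

mae = mean_absolute_error(score_true, score_pred)
# RMSE taken as the square root of MSE
rmse = np.sqrt(mean_squared_error(score_true, score_pred))
r2 = r2_score(score_true, score_pred)

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R^2:  {r2:.3f}")
```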
Clinical Application Context
Screening/Triage Use Case (Classification)
Goal: Identify which infants need referral to specialist
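# Binary decision: refer when the predicted risk exceeds the chosen threshold
# (patient_features is a single-row 2D array or DataFrame)
risk = model.predict_proba(patient_features)[0, 1]
if risk > threshold:
    action = "REFER to pediatric neurologist"
else:
    action = "Continue routine monitoring"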
Optimal Threshold Selection:
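from sklearn.metrics import precision_recall_curve
# Find the highest threshold that still reaches 90% sensitivity
precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
# recalls has one more entry than thresholds, so drop its final point before masking;
# thresholds are ascending, so the last match is the highest qualifying threshold
optimal_threshold = thresholds[recalls[:-1] >= 0.90][-1]
# Apply threshold
y_pred = (y_prob >= optimal_threshold).astype(int)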
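One detail worth checking when selecting a threshold this way: `precision_recall_curve` returns one more recall value than thresholds, and thresholds are sorted in ascending order while recall falls as the threshold rises. The sketch below runs the selection end to end on synthetic scores (an assumption for illustration, not the dataset's actual model output) and confirms the achieved sensitivity.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Synthetic scores: positives tend to score higher than negatives (illustration only)
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
y_prob = np.concatenate([rng.beta(2, 5, size=900), rng.beta(5, 2, size=100)])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)

# Drop the final recall point (recall = 0, no matching threshold) so that
# recalls[i] is the recall obtained when predicting y_prob >= thresholds[i].
meets_target = recalls[:-1] >= 0.90

# The last qualifying threshold is the highest one that still reaches the target.
optimal_threshold = thresholds[meets_target][-1]

y_pred = (y_prob >= optimal_threshold).astype(int)
achieved_recall = recall_score(y_true, y_pred)

print(f"Threshold: {optimal_threshold:.3f}")
print(f"Sensitivity at threshold: {achieved_recall:.3f}")
```

Choosing the highest qualifying threshold keeps the sensitivity target while referring as few infants as possible. Taking the first qualifying threshold instead would pick the lowest cutoff and flag nearly everyone.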
Risk Stratification (Probability Scoring)
Goal: Prioritize limited resources to highest-risk infants
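# Continuous risk scoring
patient_risks = model.predict_proba(patients_features)[:, 1]
# Attach the scores to the patient records (patients is a DataFrame with one row per patient)
patients = patients.assign(risk=patient_risks)
# Prioritize by risk
high_risk_patients = patients[patients['risk'] > 0.3]
priority_list = high_risk_patients.sort_values('risk', ascending=False)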
Key Differences Summary
| Aspect | Binary Classification | Probability Prediction |
|---|---|---|
| Target | has_cp (boolean) | cp_probability_score (float) |
| Output | Class label (0/1) | Probability (0.0-1.0) |
| Loss Function | Binary cross-entropy | Mean squared error |
| Primary Metric | AUC-ROC, Sensitivity | MAE, Calibration |
| Clinical Use | Decision (refer/not) | Risk stratification |
| Recommended | ✅ Yes | ⚠️ Alternative |
Why Binary Classification is Primary
- Clinical Decision Making: Doctors need yes/no decisions for referral
- Standard Medical AI: Most diagnostic AI uses classification
- Interpretability: Clearer for non-technical stakeholders
- Evaluation: Standard medical AI metrics (sensitivity/specificity)
- Deployment: Simpler to implement in clinical workflows
However, models should output probability scores for:
- Confidence estimation
- Risk stratification
- Threshold tuning
- Calibration assessment
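Calibration assessment, the last item above, asks whether predicted probabilities match observed rates. A minimal sketch using scikit-learn's `calibration_curve` and `brier_score_loss` follows. The data is simulated so that it is well calibrated by construction, which is an assumption for illustration and says nothing about any real model.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Simulated predicted probabilities
y_prob = rng.uniform(0.0, 1.0, size=5000)
# Labels drawn so that P(y = 1) equals the predicted probability
y_true = (rng.uniform(0.0, 1.0, size=5000) < y_prob).astype(int)

# Observed positive rate vs. mean predicted probability in each of 5 bins
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
brier = brier_score_loss(y_true, y_prob)

for predicted, observed in zip(mean_pred, frac_pos):
    print(f"mean predicted {predicted:.2f} -> observed rate {observed:.2f}")
print(f"Brier score: {brier:.3f}")
```

For a well-calibrated model the two columns track each other closely. Large gaps in the high-probability bins would suggest the risk categories overstate or understate actual CP rates.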
Recommended Workflow
# 1. Train binary classifier
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
n_estimators=100,
class_weight='balanced', # Handle imbalance
random_state=42
)
model.fit(X_train, y_train)
# 2. Predict both class AND probability
y_pred_class = model.predict(X_test) # Binary: 0 or 1
y_pred_prob = model.predict_proba(X_test)[:, 1] # Probability: 0.0-1.0
# 3. Evaluate classification performance
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_test, y_pred_class))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_prob):.3f}")
# 4. Use probabilities for risk stratification
high_risk = X_test[y_pred_prob > 0.5]
medium_risk = X_test[(y_pred_prob > 0.2) & (y_pred_prob <= 0.5)]
low_risk = X_test[y_pred_prob <= 0.2]
Hugging Face Dataset Card YAML
Add this to the top of README.md:
---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
- medical
task_ids:
- binary-classification
- medical-diagnosis
pretty_name: African Cerebral Palsy Synthetic Dataset
size_categories:
- 10K<n<100K
tags:
- cerebral-palsy
- medical
- healthcare
- africa
- synthetic-data
- pediatrics
- binary-classification
---
Summary
✅ Primary: Binary Classification (has_cp: True/False)
✅ Secondary: Probability scores available for risk assessment
✅ Clinical Goal: Screen/triage infants for specialist referral
✅ Model Output: Class prediction + confidence probability
✅ Evaluation: Sensitivity ≥90%, AUC-ROC ≥0.90
Answer to your question: This is a binary CLASSIFICATION task, not a regression task. The probability scores support risk stratification as a secondary use.