# Task Type Clarification

## Primary Task: Binary Classification

This dataset is designed for **binary classification** of cerebral palsy presence.

### Target Variable
- **Variable name**: `has_cp`
- **Type**: Boolean (True/False)
- **Classes**: 
  - `False` (0): No cerebral palsy
  - `True` (1): Cerebral palsy diagnosed

### Example
```python
import pandas as pd

df = pd.read_csv('africa_cp_train_5000.csv')

# Target variable
y = df['has_cp']  # Boolean: True/False

# Class distribution
print(y.value_counts())
# Expected output (percentages added here as annotations; pandas prints counts only):
# False    4500  (90%)
# True      500  (10%)
```
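
Relative class frequencies can be read off directly with `normalize=True`. The sketch below uses a toy Series as a stand-in for `df['has_cp']`; its 9:1 split is made up for illustration and is not read from the dataset file.

```python
import pandas as pd

# Toy stand-in for df['has_cp']; the 9:1 split is illustrative only
y = pd.Series([False] * 9 + [True] * 1, name="has_cp")

counts = y.value_counts()                # absolute counts per class
shares = y.value_counts(normalize=True)  # fractions that sum to 1.0

print(counts)
print(shares)
```

Checking the shares before training is a quick way to confirm that the imbalance handling used later (for example `class_weight='balanced'`) is actually needed.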

---

## Secondary Task: Risk Probability Prediction

The dataset also supports **probabilistic risk prediction** through the generated probability scores.

### Probability Score
- **Variable name**: `cp_probability_score`
- **Type**: Float (0.0 to 1.0)
- **Interpretation**: Calculated probability of CP based on risk factors
- **Use case**: Risk stratification, triage, early warning systems

### Example
```python
# Use probability scores for risk stratification
df['risk_category'] = pd.cut(
    df['cp_probability_score'],
    bins=[0, 0.1, 0.3, 0.5, 1.0],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True  # keep scores of exactly 0.0 in 'Low'
)

print(df.groupby('risk_category', observed=False)['has_cp'].mean())
# Shows actual CP rate by risk category
```
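
Bin edges matter at the boundaries: `pd.cut` uses right-closed intervals by default, so a score of exactly 0.0 falls outside the first bin unless `include_lowest=True` is passed. A small sketch on toy values (the scores and labels are invented for illustration):

```python
import pandas as pd

# Toy scores and labels; values are illustrative, not from the dataset
toy = pd.DataFrame({
    "cp_probability_score": [0.0, 0.05, 0.2, 0.4, 0.9, 1.0],
    "has_cp": [False, False, False, True, True, True],
})

# include_lowest=True keeps a score of exactly 0.0 in the 'Low' bin;
# without it, that row would be assigned NaN
toy["risk_category"] = pd.cut(
    toy["cp_probability_score"],
    bins=[0, 0.1, 0.3, 0.5, 1.0],
    labels=["Low", "Medium", "High", "Very High"],
    include_lowest=True,
)

# observed=False keeps categories that happen to be empty in the result
rates = toy.groupby("risk_category", observed=False)["has_cp"].mean()
print(rates)
```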

---

## Task Categories on Hugging Face

When uploading to Hugging Face, use these tags:

**Primary:**
- `task_categories: tabular-classification`
- `task_ids: binary-classification`

**Secondary:**
- `task_ids: health-classification`
- `task_ids: medical-diagnosis`

---

## Model Training Approaches

### 1. Binary Classification (Recommended)

**Predict presence/absence of CP:**

```python
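from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features: feature_cols is your list of model-ready feature columns.
# It must exclude `has_cp` and `cp_probability_score` to avoid target leakage.
X = df[feature_cols]
y = df['has_cp']  # Binary target

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train classifier on the training split only
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Predict class
y_pred = model.predict(X_test)  # True/False

# Predict probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]  # 0.0 to 1.0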
```

**Evaluation Metrics:**
- Accuracy (misleading on its own when classes are imbalanced, so report it alongside the metrics below)
- Sensitivity (Recall) - crucial for medical screening
- Specificity
- AUC-ROC
- Precision-Recall AUC
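
The metrics above can be computed with scikit-learn. The sketch below uses made-up labels and probabilities (not values from this dataset) and a 0.5 cutoff chosen only for illustration; `average_precision_score` is used here as one common summary of the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.8, 0.35, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

# confusion_matrix rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # recall on the CP-positive class
specificity = tn / (tn + fp)  # recall on the CP-negative class
roc_auc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"AUC-ROC:     {roc_auc:.3f}")
print(f"PR-AUC:      {pr_auc:.3f}")
```

Sensitivity and specificity depend on the chosen cutoff, while AUC-ROC and PR-AUC are computed from the probabilities and do not.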

### 2. Risk Score Prediction (Alternative)

**Predict continuous probability score:**

```python
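from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Features (feature_cols must exclude `has_cp` and `cp_probability_score`)
X = df[feature_cols]
y = df['cp_probability_score']  # Continuous target (0-1)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train regressor on the training split only
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict risk score
risk_scores = model.predict(X_test)  # 0.0 to 1.0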
```

**Evaluation Metrics:**
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score
- Calibration metrics

**Note**: This approach is less common since the probability scores are already generated. The primary use case is binary classification.
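
For reference, the regression metrics listed above can be computed as in the sketch below. The true and predicted scores are invented for illustration; RMSE is taken as the square root of `mean_squared_error` so the snippet does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy true vs. predicted risk scores (illustrative only)
score_true = np.array([0.05, 0.10, 0.30, 0.60, 0.90])
score_pred = np.array([0.10, 0.10, 0.20, 0.70, 0.80])

mae = mean_absolute_error(score_true, score_pred)
rmse = np.sqrt(mean_squared_error(score_true, score_pred))
r2 = r2_score(score_true, score_pred)

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```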

---

## Clinical Application Context

### Screening/Triage Use Case (Classification)

**Goal**: Identify which infants need referral to specialist

```python
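# Binary decision
# patient_features: 2D array or DataFrame holding one row of features
# threshold: decision cutoff chosen on validation data (see below)
p_cp = model.predict_proba(patient_features)[0, 1]  # P(has_cp = True)
if p_cp >= threshold:
    action = "REFER to pediatric neurologist"
else:
    action = "Continue routine monitoring"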
```

**Optimal Threshold Selection:**
```python
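from sklearn.metrics import precision_recall_curve

# Find the highest threshold that still reaches 90% sensitivity.
# `recalls` has one more element than `thresholds`, so drop its last entry;
# thresholds are sorted in increasing order while recall decreases.
precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
optimal_threshold = thresholds[recalls[:-1] >= 0.90][-1]

# Apply threshold
y_pred = (y_prob >= optimal_threshold).astype(int)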
```

### Risk Stratification (Probability Scoring)

**Goal**: Prioritize limited resources to highest-risk infants

```python
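# Continuous risk scoring
patient_risks = model.predict_proba(patients_features)[:, 1]

# Attach scores to the patient table
# (`patients` is a DataFrame aligned row-by-row with `patients_features`)
patients = patients.assign(risk=patient_risks)

# Prioritize by risk
high_risk_patients = patients[patients['risk'] > 0.3]
priority_list = high_risk_patients.sort_values('risk', ascending=False)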
```
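
When the number of referral slots is fixed, a top-k selection can stand in for a fixed risk cutoff. This is a minimal sketch under stated assumptions: the `patients` table, its `patient_id` and `risk` columns, and the `capacity` value are all hypothetical.

```python
import pandas as pd

# Hypothetical patient table with model-derived risk scores
patients = pd.DataFrame({
    "patient_id": ["A", "B", "C", "D", "E"],
    "risk": [0.12, 0.55, 0.31, 0.80, 0.05],
})

capacity = 2  # hypothetical number of specialist slots available

# nlargest returns the `capacity` highest-risk rows, sorted by descending risk
priority_list = patients.nlargest(capacity, "risk")
print(priority_list)
```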

---

## Key Differences Summary

| Aspect | Binary Classification | Probability Prediction |
|--------|---------------------|---------------------|
| **Target** | `has_cp` (boolean) | `cp_probability_score` (float) |
| **Output** | Class label (0/1) | Probability (0.0-1.0) |
| **Typical Loss** | Binary cross-entropy (log loss) | Mean squared error |
| **Primary Metric** | AUC-ROC, Sensitivity | MAE, Calibration |
| **Clinical Use** | Decision (refer/not) | Risk stratification |
| **Recommended** | ✅ Yes | ⚠️ Alternative |

---

## Why Binary Classification is Primary

1. **Clinical Decision Making**: Doctors need yes/no decisions for referral
2. **Standard Medical AI**: Most diagnostic AI uses classification
3. **Interpretability**: Clearer for non-technical stakeholders
4. **Evaluation**: Standard medical AI metrics (sensitivity/specificity)
5. **Deployment**: Simpler to implement in clinical workflows

**However**, models should output **probability scores** for:
- Confidence estimation
- Risk stratification
- Threshold tuning
- Calibration assessment
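
Calibration assessment, the last item above, can be sketched with the Brier score and a reliability curve. The labels and probabilities below are invented for illustration, and `n_bins=2` is chosen only because the toy sample is tiny.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Toy labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.8, 0.35, 0.9])

# Brier score: mean squared gap between probability and outcome (lower is better)
brier = brier_score_loss(y_true, y_prob)

# Per bin: observed fraction of positives vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=2)

print(f"Brier score: {brier:.4f}")
for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

A well-calibrated model has observed fractions close to the mean predicted probability in every bin.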

---

## Recommended Workflow

```python
# Assumes X_train, X_test, y_train, y_test come from a stratified train/test split
# 1. Train binary classifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)

model.fit(X_train, y_train)

# 2. Predict both class AND probability
y_pred_class = model.predict(X_test)        # Binary: 0 or 1
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Probability: 0.0-1.0

# 3. Evaluate classification performance
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred_class))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_prob):.3f}")

# 4. Use probabilities for risk stratification
high_risk = X_test[y_pred_prob > 0.5]
medium_risk = X_test[(y_pred_prob > 0.2) & (y_pred_prob <= 0.5)]
low_risk = X_test[y_pred_prob <= 0.2]
```

---

## Hugging Face Dataset Card YAML

Add this to the top of README.md:

```yaml
---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
task_ids:
- binary-classification
- medical-diagnosis
pretty_name: African Cerebral Palsy Synthetic Dataset
size_categories:
- 10K<n<100K
tags:
- cerebral-palsy
- medical
- healthcare
- africa
- synthetic-data
- pediatrics
- binary-classification
---
```

---

## Summary

✅ **Primary**: Binary Classification (has_cp: True/False)  
✅ **Secondary**: Probability scores available for risk assessment  
✅ **Clinical Goal**: Screen/triage infants for specialist referral  
✅ **Model Output**: Class prediction + confidence probability  
✅ **Evaluation targets**: Sensitivity ≥90%, AUC-ROC ≥0.90  

**Bottom line: the primary task is binary CLASSIFICATION, not regression.**