# Task Type Clarification

## Primary Task: Binary Classification

This dataset is designed for **binary classification** of cerebral palsy presence.

### Target Variable

- **Variable name**: `has_cp`
- **Type**: Boolean (True/False)
- **Classes**:
  - `False` (0): No cerebral palsy
  - `True` (1): Cerebral palsy diagnosed

### Example

```python
import pandas as pd

df = pd.read_csv('africa_cp_train_5000.csv')

# Target variable
y = df['has_cp']  # Boolean: True/False

# Class distribution
print(y.value_counts())
# Output:
# False    4500  (90%)
# True      500  (10%)
```

---

## Secondary Task: Risk Probability Prediction

The dataset also supports **probabilistic risk prediction** through the generated probability scores.

### Probability Score

- **Variable name**: `cp_probability_score`
- **Type**: Float (0.0 to 1.0)
- **Interpretation**: Calculated probability of CP based on risk factors
- **Use case**: Risk stratification, triage, early warning systems

### Example

```python
# Use probability scores for risk stratification
df['risk_category'] = pd.cut(
    df['cp_probability_score'],
    bins=[0, 0.1, 0.3, 0.5, 1.0],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True,  # keep scores of exactly 0.0 in the 'Low' bin
)

print(df.groupby('risk_category')['has_cp'].mean())
# Shows actual CP rate by risk category
```

---

## Task Categories on Hugging Face

When uploading to Hugging Face, use these tags:

**Primary:**

- `task_categories: tabular-classification`
- `task_ids: binary-classification`

**Secondary:**

- `task_ids: health-classification`
- `task_ids: medical-diagnosis`

---

## Model Training Approaches

### 1. Binary Classification (Recommended)

**Predict presence/absence of CP:**

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features: every column except the two targets (keeping the score as a
# feature would leak the label). Assumes the remaining columns are numeric
# or already encoded.
feature_cols = [c for c in df.columns if c not in ('has_cp', 'cp_probability_score')]
X = df[feature_cols]
y = df['has_cp']  # Binary target

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train classifier
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train, y_train)

# Predict class
y_pred = model.predict(X_test)  # True/False

# Predict probability
y_prob = model.predict_proba(X_test)[:, 1]  # 0.0 to 1.0
```

**Evaluation Metrics:**

- Accuracy (interpret with care: with a 90/10 class split, always predicting `False` already scores 90%)
- Sensitivity (Recall) - crucial for medical screening
- Specificity
- AUC-ROC
- Precision-Recall AUC

### 2. Risk Score Prediction (Alternative)

**Predict continuous probability score:**

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Features (same feature_cols as above; must exclude both target columns)
X = df[feature_cols]
y = df['cp_probability_score']  # Continuous target (0-1)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train regressor
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predict risk score
risk_scores = model.predict(X_test)  # 0.0 to 1.0
```

**Evaluation Metrics:**

- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score
- Calibration metrics

**Note**: This approach is less common because the probability scores already ship with the dataset. The primary use case is binary classification.
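The evaluation metrics listed for the two approaches can be computed by hand to see what each one measures. The following is a minimal, dependency-free sketch, not the dataset authors' evaluation code: the helper names `screening_metrics` and `regression_errors` are hypothetical, and the labels and scores are made-up values that only mirror the 90/10 class balance shown earlier.

```python
import math


def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # CP cases flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # non-CP cases cleared
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms
    fn = sum(1 for t, p in pairs if t and not p)      # missed CP cases
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def regression_errors(y_true, y_pred):
    """MAE and RMSE for continuous risk scores."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Illustrative labels with the same 90/10 balance as the training file
y_true = [False] * 4500 + [True] * 500

# A trivial "model" that never predicts CP
always_negative = [False] * 5000
baseline = screening_metrics(y_true, always_negative)
print(baseline)
# {'accuracy': 0.9, 'sensitivity': 0.0, 'specificity': 1.0}

# Made-up risk scores for the regression metrics
scores_true = [0.10, 0.40, 0.80]
scores_pred = [0.20, 0.40, 0.60]
reg = regression_errors(scores_true, scores_pred)
print(reg)
```

The baseline shows why accuracy alone is a poor screening metric here: a model that misses every CP case still reaches 90% accuracy, while its sensitivity is zero.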
---

## Clinical Application Context

### Screening/Triage Use Case (Classification)

**Goal**: Identify which infants need referral to a specialist

```python
# Binary decision for one patient.
# patient_features: single-row feature frame; threshold: chosen cut-off
# (see "Optimal Threshold Selection").
if model.predict_proba(patient_features)[0][1] > threshold:
    action = "REFER to pediatric neurologist"
else:
    action = "Continue routine monitoring"
```

**Optimal Threshold Selection:**

```python
from sklearn.metrics import precision_recall_curve

# Find the highest threshold that still gives at least 90% sensitivity.
# recalls has one more entry than thresholds, so drop its last element;
# thresholds are increasing and recall is non-increasing, so take the
# last threshold that satisfies the constraint.
precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
optimal_threshold = thresholds[recalls[:-1] >= 0.90][-1]

# Apply threshold
y_pred = (y_prob >= optimal_threshold).astype(int)
```

### Risk Stratification (Probability Scoring)

**Goal**: Prioritize limited resources for the highest-risk infants

```python
# Continuous risk scoring
patient_risks = model.predict_proba(patients_features)[:, 1]

# Attach the scores so they can be filtered and sorted on
patients = patients.assign(risk=patient_risks)

# Prioritize by risk
high_risk_patients = patients[patients['risk'] > 0.3]
priority_list = high_risk_patients.sort_values('risk', ascending=False)
```

---

## Key Differences Summary

| Aspect | Binary Classification | Probability Prediction |
|--------|-----------------------|------------------------|
| **Target** | `has_cp` (boolean) | `cp_probability_score` (float) |
| **Output** | Class label (0/1) | Probability (0.0-1.0) |
| **Loss Function** | Binary cross-entropy | Mean squared error |
| **Primary Metric** | AUC-ROC, Sensitivity | MAE, Calibration |
| **Clinical Use** | Decision (refer/not) | Risk stratification |
| **Recommended** | ✅ Yes | ⚠️ Alternative |

---

## Why Binary Classification is Primary

1. **Clinical Decision Making**: Doctors need yes/no decisions for referral
2. **Standard Medical AI**: Most diagnostic AI systems are framed as classification
3. **Interpretability**: Clearer for non-technical stakeholders
4. **Evaluation**: Standard medical AI metrics (sensitivity/specificity)
5. **Deployment**: Simpler to implement in clinical workflows

**However**, models should output **probability scores** for:

- Confidence estimation
- Risk stratification
- Threshold tuning
- Calibration assessment

---

## Recommended Workflow

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# X_train, X_test, y_train, y_test are assumed to come from a stratified
# train/test split of the features and the has_cp target.

# 1. Train binary classifier
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)
model.fit(X_train, y_train)

# 2. Predict both class AND probability
y_pred_class = model.predict(X_test)             # Class label: True/False
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Probability: 0.0-1.0

# 3. Evaluate classification performance
print(classification_report(y_test, y_pred_class))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_prob):.3f}")

# 4. Use probabilities for risk stratification
high_risk = X_test[y_pred_prob > 0.5]
medium_risk = X_test[(y_pred_prob > 0.2) & (y_pred_prob <= 0.5)]
low_risk = X_test[y_pred_prob <= 0.2]
```

---

## Hugging Face Dataset Card YAML

Add this to the top of README.md:

```yaml
---
license: cc-by-nc-4.0
task_categories:
- tabular-classification
- medical
task_ids:
- binary-classification
- medical-diagnosis
pretty_name: African Cerebral Palsy Synthetic Dataset
size_categories:
- 10K