Google Advanced Data Analytics · Courses 3–6 · End-to-End Pipeline

TikTok Claims vs Opinions
Full Analytics Pipeline

Dataset: ~19,000 TikTok videos
Target: Claim vs Opinion classification
Best Model: Random Forest (~100% accuracy)
Courses: EDA → Stats → Regression → ML
EDA · Hypothesis Testing · Logistic Regression · Random Forest · XGBoost · Feature Engineering · Classification
Course 3: EDA · Visualisation
Course 4: Hypothesis Testing
Course 5: Regression · Logistic
Course 6: ML Models · RF · XGBoost
TikTok Videos: ~19K
Best Accuracy (RF): ~100%
Hypothesis p-value: 2.6 × 10⁻¹²⁰
Courses Covered: 4

Core Finding: Engagement metrics predict claim vs opinion almost perfectly

Videos that make claims generate dramatically higher engagement than opinion videos — more views, likes, shares, and downloads. This engagement pattern is so consistent that a Random Forest model classifies claim vs opinion videos with near-perfect accuracy (~100%). The model's most predictive features were all engagement-related: video view count, share count, download count — not video content or text.

Course 3
Exploratory Data Analysis & Visualisation

Analysed the distribution of claims vs opinions, explored engagement metric distributions, identified outliers, and examined missing data patterns. Key finding: claim videos consistently show higher engagement across all metrics.

Figure: Avg Engagement, Claims vs Opinions. Claims drive significantly higher engagement across all metrics.
Figure: Video Duration Distribution. Claims and opinions have similar duration profiles.
Course 4
Statistical Hypothesis Testing

Two-sample t-test comparing mean video view counts between verified and unverified accounts. Significance level: 5%.

❌ Null Hypothesis (Rejected)
There is no difference in mean view count between videos posted by verified and unverified accounts.
✅ Alternative Hypothesis (Supported)
There is a difference in mean view count between videos posted by verified and unverified accounts.
Result: t-statistic = 25.50, p-value = 2.6 × 10⁻¹²⁰, far below the 5% significance level, so the null hypothesis is rejected. Unverified accounts have a significantly higher mean view count than verified accounts, which suggests behavioural differences, possibly clickbait content or bot-inflated views.
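The test described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: it assumes Welch's unequal-variance form of the two-sample t-test (the text does not say which variant was used), approximates the p-value with a normal distribution (reasonable only for large samples), and runs on synthetic numbers rather than the TikTok data. The function name `welch_t_test` is hypothetical.

```python
import math
import random
import statistics


def welch_t_test(a, b):
    """Return (t, df, p) for a two-sided Welch two-sample t-test.

    p uses a normal approximation to the t distribution, which is
    reasonable only when both samples are large.
    """
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se_sq = var_a / n_a + var_b / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_sq)
    # Welch-Satterthwaite degrees of freedom
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    # Two-sided tail probability under N(0, 1): 2 * (1 - Phi(|t|))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p


# Synthetic view counts in arbitrary units (NOT the TikTok data).
rng = random.Random(0)
unverified = [rng.gauss(120, 30) for _ in range(500)]
verified = [rng.gauss(100, 30) for _ in range(500)]

t_stat, df, p_value = welch_t_test(unverified, verified)
alpha = 0.05  # 5% significance level, as in the text
print(f"t = {t_stat:.2f}, df = {df:.0f}, p = {p_value:.3g}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same Welch test and returns a p-value from the exact t distribution.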
Course 5
Logistic Regression — Predicting Verified Status

Built a logistic regression model to predict whether a TikTok account is verified, as an intermediate step toward the final claim classification model. Addressed class imbalance by upsampling and resolved multicollinearity by dropping video_like_count, which correlated strongly with view count (r = 0.86).
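The two preparation steps named above, upsampling and a correlation check, might look like the following sketch. It is an assumption-laden illustration on toy rows, not the project's code: the helper names `upsample_minority` and `pearson_r` are hypothetical, and the toy values are invented purely to exercise the functions.

```python
import math
import random


def upsample_minority(rows, label_key, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    target = max(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        # k is 0 for the majority class, so nothing is added to it
        extra = rng.choices(group, k=target - len(group))
        balanced.extend(group + extra)
    return balanced


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy rows (NOT the TikTok data): 2 verified, 8 not verified.
rows = [
    {"verified": i < 2, "views": 100 + 10 * i, "likes": 40 + 5 * i + (i % 3)}
    for i in range(10)
]
balanced = upsample_minority(rows, "verified")
counts = {flag: sum(r["verified"] == flag for r in balanced) for flag in (True, False)}
r = pearson_r([row["views"] for row in rows], [row["likes"] for row in rows])
print(counts, round(r, 3))
```

A pair of features with a correlation this close to 1 carries nearly the same information, which is the usual reason for keeping only one of them in a logistic regression.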

Class           Precision   Recall   F1     Support
Verified        74%         46%      57%    4,459
Not Verified    61%         84%      71%    4,483
Overall Accuracy: 65% (support: 8,942)
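As a consistency check on the report above, F1 and overall accuracy can be recomputed from the rounded precision, recall, and support values. This sketch uses only the figures in the table; the helper name `f1_score` is defined locally and is not a library call.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded per-class values from the logistic regression report.
report = {
    "Verified": {"precision": 0.74, "recall": 0.46, "support": 4459},
    "Not Verified": {"precision": 0.61, "recall": 0.84, "support": 4483},
}

for name, row in report.items():
    print(name, round(f1_score(row["precision"], row["recall"]), 2))

# Accuracy = correctly classified rows / all rows. Each class contributes
# roughly recall * support correct predictions.
total = sum(row["support"] for row in report.values())
correct = sum(row["recall"] * row["support"] for row in report.values())
accuracy = correct / total
print("accuracy", round(accuracy, 2))
```

The recomputed values (0.57, 0.71, and 0.65) agree with the reported F1 scores and overall accuracy.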
Key insight: holding the other features constant, each additional second of video duration is associated with a +0.009 increase in the log-odds of verified status. At 65% accuracy the model's performance is modest but acceptable, since logistic regression was an intermediate step and ML classification was the final goal.
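A log-odds coefficient is easier to read once exponentiated into a multiplier on the odds. The short sketch below converts the 0.009 coefficient quoted above; the function name `odds_multiplier` and the ten-second example are illustrative choices, not part of the original analysis.

```python
import math

COEF_DURATION = 0.009  # log-odds change per extra second, from the text


def odds_multiplier(coef, delta=1.0):
    """Multiplicative change in odds for a `delta`-unit increase in the feature."""
    return math.exp(coef * delta)


per_second = odds_multiplier(COEF_DURATION)           # about 1.009
per_ten_seconds = odds_multiplier(COEF_DURATION, 10)  # about 1.094
print(f"{(per_second - 1) * 100:.2f}% higher odds per second")
print(f"{(per_ten_seconds - 1) * 100:.1f}% higher odds per 10 seconds")
```

So the effect is small per second (roughly 0.9% higher odds) but compounds to roughly 9.4% higher odds over ten seconds.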
Course 6
ML Classification — Random Forest & XGBoost

Built and tuned Random Forest and XGBoost models using GridSearchCV to classify videos as claims or opinions. Both models achieved near-perfect performance — engagement metrics alone are sufficient to identify claim videos.
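A tuning loop of the kind described above could be sketched as follows with scikit-learn. Everything specific here is an assumption: the data is a synthetic stand-in generated by `make_classification`, the parameter grid is hypothetical (the values actually searched are not stated), and refitting on recall is one plausible reading of the selection criterion rather than the project's confirmed setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engagement features (NOT the TikTok data).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; the values searched in the project are not stated.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=["recall", "f1"],  # track both metrics during cross-validation
    refit="recall",            # refit the best-recall configuration on all training data
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refit model on data it has not seen.
y_pred = search.predict(X_test)
test_recall = recall_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(search.best_params_, round(test_recall, 3), round(test_f1, 3))
```

Scoring on a held-out test split, as done here, keeps the reported metrics separate from the folds used to pick hyperparameters.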

Figure: Model Performance Comparison. Precision, Recall, F1 across models (%).
Figure: Feature Importance, Random Forest. Engagement metrics dominate predictions.
Model                 Precision   Recall    F1        Accuracy   Champion?
Random Forest ★       ~100%       ~100%     ~100%     ~100%      ✅ Champion
XGBoost               99%         99%       99%       99%        Close runner-up
Logistic Regression   61–74%      46–84%    57–71%    65%        Intermediate step (target: verified status)
Methodology & Skills Demonstrated
Framework
Google PACE (Plan → Analyze → Construct → Execute) applied across all 4 courses as a continuous case study building toward the final classification model.
Data Preparation
Class imbalance handled via upsampling. Multicollinearity detected and resolved (dropped video_like_count). Text length feature engineered from transcription data.
Model Tuning
GridSearchCV used for hyperparameter tuning on both Random Forest and XGBoost. Champion model selected based on F1 and recall — prioritising detection of claim videos over false alarms.
Business Context
TikTok's moderation team needs to prioritise user reports for claim-based content. A model that reliably flags claim videos reduces the backlog and allows human reviewers to focus where it matters.