Google Advanced Data Analytics · Courses 3–6 · End-to-End Pipeline
TikTok Claims vs Opinions
Full Analytics Pipeline
EDA · Hypothesis Testing · Logistic Regression · Random Forest · XGBoost · Feature Engineering · Classification
Core Finding: Engagement metrics near-perfectly separate claims from opinions
Videos that make claims generate dramatically higher engagement than opinion videos — more views, likes, shares, and downloads. This engagement pattern is so consistent that a Random Forest model classifies claim vs opinion videos with near-perfect accuracy (~100%). The model's most predictive features were all engagement-related: video view count, share count, download count — not video content or text.
Analysed the distribution of claims vs opinions, explored engagement metric distributions, identified outliers, and examined missing data patterns. Key finding: claim videos consistently show higher engagement across all metrics.
Avg Engagement — Claims vs Opinions
Claims drive significantly higher engagement across all metrics
Video Duration Distribution
Claims and opinions have similar duration profiles
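The engagement comparison behind these charts comes down to a groupby-and-aggregate step. A minimal sketch with made-up rows that mimic the dataset's schema (the column names are assumptions, not the project's exact ones):

```python
import pandas as pd

# Made-up rows standing in for the TikTok dataset (column names assumed)
df = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "video_view_count": [500_000, 320_000, 5_000, 2_500],
    "video_share_count": [25_000, 12_000, 120, 80],
})

# Mean engagement per class -- the grouping that drives the bar chart above
summary = df.groupby("claim_status")[["video_view_count", "video_share_count"]].mean()
```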
Two-sample t-test comparing mean video view counts between verified and unverified accounts. Significance level: 5%.
❌ Null Hypothesis (Rejected)
There is no difference in number of views between videos posted by verified vs unverified accounts.
✅ Alternative Hypothesis (Supported)
There is a statistically significant difference in mean view counts between verified and unverified accounts.
Result: t-statistic = 25.50, p-value ≈ 2.6 × 10⁻¹²⁰. The null hypothesis was rejected at the 5% significance level: unverified accounts have significantly higher mean view counts than verified accounts, suggesting behavioural differences such as clickbait content or bot-inflated views.
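The test itself is a one-liner with SciPy. This sketch uses synthetic view counts (the distributions are illustrative assumptions, not the project's data), and `equal_var=False` selects Welch's variant, which does not assume equal variances; whether the project used that variant is an assumption on my part:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic view counts standing in for the real data (illustrative only)
verified = rng.normal(80_000, 20_000, size=1_000)
not_verified = rng.normal(250_000, 60_000, size=1_000)

# Two-sample t-test at the 5% significance level (Welch's variant)
t_stat, p_value = stats.ttest_ind(not_verified, verified, equal_var=False)
reject_null = p_value < 0.05
```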
Built a logistic regression model to predict whether a TikTok account is verified, as an intermediate step toward the final claim classification model. Addressed class imbalance via upsampling and removed multicollinear features (video_like_count, r=0.86 with view count).
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Verified | 74% | 46% | 57% | 4,459 |
| Not Verified | 61% | 84% | 71% | 4,483 |
| Overall accuracy | | | 65% | 8,942 |
Key insight: each additional second of video duration is associated with a +0.009 increase in the log-odds of verified status. Performance was acceptable for an intermediate step; the final goal was the machine-learning claim classifier.
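The two data-preparation moves described above, upsampling the minority class and then fitting the classifier, can be sketched as follows. The frame below is a tiny synthetic stand-in; the column names (`video_duration_sec`, `verified_status`) are assumptions about the dataset schema:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Tiny synthetic stand-in for the real dataset (column names assumed)
df = pd.DataFrame({
    "video_duration_sec": [12, 24, 36, 48, 55, 60, 15, 42],
    "verified_status": ["not_verified"] * 6 + ["verified"] * 2,
})

# Upsample the minority class so both classes are equally represented
majority = df[df["verified_status"] == "not_verified"]
minority = df[df["verified_status"] == "verified"]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])

# Fit the model; each coefficient is the change in log-odds of
# verified status per one-unit increase in that feature
X = balanced[["video_duration_sec"]]
y = (balanced["verified_status"] == "verified").astype(int)
clf = LogisticRegression().fit(X, y)
```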
Built and tuned Random Forest and XGBoost models using GridSearchCV to classify videos as claims or opinions. Both models achieved near-perfect performance — engagement metrics alone are sufficient to identify claim videos.
Model Performance Comparison
Precision, Recall, F1 across models (%)
Feature Importance — Random Forest
Engagement metrics dominate predictions
| Model | Precision | Recall | F1 | Accuracy | Champion? |
|---|---|---|---|---|---|
| Random Forest ★ | ~100% | ~100% | ~100% | ~100% | ✅ Champion |
| XGBoost | 99% | 99% | 99% | 99% | Close runner-up |
| Logistic Regression | 61–74% | 46–84% | 57–71% | 65% | Intermediate step |
Methodology & Skills Demonstrated
Framework
Google PACE (Plan → Analyze → Construct → Execute) applied across all 4 courses as a continuous case study building toward the final classification model.
Data Preparation
Class imbalance handled via upsampling. Multicollinearity detected and resolved (dropped video_like_count). Text length feature engineered from transcription data.
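The text-length feature mentioned above is a one-line derivation from the transcription column. A minimal sketch, where the column name `video_transcription_text` is an assumption about the dataset schema and the example strings are invented:

```python
import pandas as pd

# Hypothetical transcription column (name assumed from the dataset schema)
df = pd.DataFrame({
    "video_transcription_text": [
        "someone claimed that drone deliveries are already common",
        "in my opinion this trend is overrated",
    ],
})

# Character count of each transcription as a numeric feature
df["text_length"] = df["video_transcription_text"].str.len()
```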
Model Tuning
GridSearchCV used for hyperparameter tuning on both Random Forest and XGBoost. Champion model selected on F1 and recall, prioritising catching claim videos over minimising false alarms.
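The tuning step above can be sketched with scikit-learn's GridSearchCV. The feature matrix here is synthetic and the grid values are illustrative assumptions (the project's actual grid is not shown); scoring on recall reflects the stated priority of catching claim videos:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engagement-feature matrix (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Small illustrative grid -- the project's actual grid values are assumed
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # prioritise recall: catch as many claim videos as possible
    cv=3,
)
search.fit(X, y)
best_model = search.best_estimator_
```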
Business Context
TikTok's moderation team needs to prioritise user reports for claim-based content. A model that reliably flags claim videos reduces the backlog and allows human reviewers to focus where it matters.