This was a group project for CS3244 Machine Learning at NUS.
Only part of the project is included here.
Scope of Project:
The viewing experience is often ruined or diminished by spoilers. Spoilers detract from the thrill and genuine emotional investment of an audience, which may leave viewers unwilling to watch spoiled movies and lead to revenue loss for the film industry. A model that detects movie spoilers in text across the internet might help preserve this emotional investment and the relationship between viewers, movies and film-makers.
Dataset:
IMDb Movie Reviews Dataset
Data Understanding (Exploratory Data Analysis):
- Do certain phrases contribute to a spoiler tag?
- Do certain users (reviewers) post spoilers more frequently?
- Is there a correlation between review length and spoiler classification?
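
The last two questions above can be explored with simple group-by statistics. The following is a minimal sketch on toy data, not the project's actual analysis; the field names `user_id`, `review_text` and `is_spoiler` are assumptions about how the reviews are stored.

```python
from collections import defaultdict
from statistics import mean

# Toy records standing in for real reviews; the field names are assumptions.
reviews = [
    {"user_id": "u1", "review_text": "The twist is that the hero dies in the end", "is_spoiler": True},
    {"user_id": "u1", "review_text": "Great acting", "is_spoiler": False},
    {"user_id": "u2", "review_text": "Loved the soundtrack and the pacing", "is_spoiler": False},
    {"user_id": "u1", "review_text": "The killer turns out to be the brother all along", "is_spoiler": True},
]

def mean_length_by_label(rows):
    """Average review length in words, grouped by spoiler label."""
    lengths = defaultdict(list)
    for row in rows:
        lengths[row["is_spoiler"]].append(len(row["review_text"].split()))
    return {label: mean(values) for label, values in lengths.items()}

def spoiler_rate_by_user(rows):
    """Fraction of each user's reviews that are tagged as spoilers."""
    counts = defaultdict(lambda: [0, 0])  # user -> [spoiler count, total count]
    for row in rows:
        counts[row["user_id"]][0] += int(row["is_spoiler"])
        counts[row["user_id"]][1] += 1
    return {user: spoilers / total for user, (spoilers, total) in counts.items()}

length_stats = mean_length_by_label(reviews)
user_rates = spoiler_rate_by_user(reviews)
```

A large gap between the two averages in `length_stats`, or a few users with very high rates in `user_rates`, would suggest that review length or reviewer identity carries signal for the classifier.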
Word Embeddings:
GloVe
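
As an illustration of how pre-trained GloVe vectors can be turned into features, here is a minimal sketch. It assumes the plain-text GloVe format (one token per line followed by its vector components) and represents a review as the average of its word vectors. Averaging is one simple option and an assumption here, not necessarily what the project did; the toy two-dimensional vectors are made up for the example.

```python
def parse_glove_lines(lines):
    """Parse GloVe's plain-text format: a token followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def average_embedding(text, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional vectors (hypothetical values, real GloVe files use 50+ dimensions).
toy_lines = ["hero 1.0 0.0", "dies 0.0 1.0", "movie 0.5 0.5"]
toy_vectors = parse_glove_lines(toy_lines)

# "the" is not in the toy vocabulary, so only "hero" and "dies" are averaged.
review_vector = average_embedding("The hero dies", toy_vectors, dim=2)
```

With a real GloVe file, `parse_glove_lines` could be given the open file object instead of `toy_lines`.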
Models:
1. Linear models : SVM, Naive Bayes, Logistic Regression
2. Neural Networks : Convolutional Neural Network (CNN), Long Short-Term Memory Network (LSTM)
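
To make the first group of models concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is written from scratch on bag-of-words counts so that it stays self-contained; the class name `NaiveBayes` and the toy reviews are hypothetical, and the project itself may well have used a library implementation and GloVe features instead.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        # Vocabulary is the union of the tokens seen in any class.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)  # log prior
            for token in text.lower().split():
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                smoothed = (self.word_counts[c][token] + 1) / (total_words + len(self.vocab))
                score += math.log(smoothed)
            scores[c] = score
        return max(scores, key=scores.get)

# Toy training data: 1 = spoiler, 0 = not a spoiler.
texts = [
    "the hero dies at the end",
    "the killer is the brother",
    "great acting and music",
    "beautiful visuals and music",
]
labels = [1, 1, 0, 0]
model = NaiveBayes().fit(texts, labels)

spoiler_prediction = model.predict("the hero is the killer")
clean_prediction = model.predict("great music")
```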
Model Evaluation:
Metric Choice:
In the context of this project, a false negative is more harmful than a false positive. Allowing spoilers to slip through the net means viewers are more likely to read them, which diminishes their viewing experience.
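
When false negatives are the costlier error, recall on the spoiler class is the natural metric to watch, since it measures the fraction of true spoilers that the model catches. The text above does not say which metric was finally chosen, so treating recall as the target is an assumption in this minimal sketch; the labels are toy values.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 1 = spoiler as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def recall(y_true, y_pred):
    """Fraction of actual spoilers that were flagged: tp / (tp + fn)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def precision(y_true, y_pred):
    """Fraction of flagged reviews that really are spoilers: tp / (tp + fp)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0

# Toy labels: one spoiler is missed (false negative), two clean reviews are flagged.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

spoiler_recall = recall(y_true, y_pred)
spoiler_precision = precision(y_true, y_pred)
```

A model tuned for recall accepts more false positives (lower precision) in exchange for fewer missed spoilers.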
Learning Reflection:
1. Dealing with Imbalanced Datasets :
This was the first project where I had to handle an imbalanced dataset. It was interesting to find out how to apply resampling methods such as oversampling and undersampling.
2. Metric Choice :
I learned to choose evaluation metrics that fit the context of the project.
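
The resampling methods mentioned in the first reflection can be sketched as follows. This is a minimal illustration of random oversampling and random undersampling on toy data, not the project's code; the function names and the `seed` parameter are hypothetical.

```python
import random

def group_by_label(texts, labels):
    """Collect the texts belonging to each label."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    return groups

def random_oversample(texts, labels, seed=0):
    """Duplicate random minority examples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(texts, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extra)
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]

def random_undersample(texts, labels, seed=0):
    """Drop random majority examples until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = group_by_label(texts, labels)
    target = min(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        pairs.extend((text, label) for text in rng.sample(items, target))
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]

# Toy imbalanced data: six non-spoilers (0) and two spoilers (1).
texts = [f"review {i}" for i in range(8)]
labels = [0] * 6 + [1] * 2

over_texts, over_labels = random_oversample(texts, labels)
under_texts, under_labels = random_undersample(texts, labels)
```

Resampling is normally applied to the training split only, so that the validation and test sets keep the original class distribution.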