
How I Used Machine Learning to Predict YouTube Trending Videos (And Score 95/100 In My Master's Degree)

  • Writer: Mike Stevenson
  • Jun 8
  • 18 min read

I’m incredibly proud to share that I recently achieved a distinction (95/100) in the Machine Learning module of my Master’s in Business Analytics at the University of Bath, my highest mark to date in higher education. However, the reason for this blog is that this result meant much more to me than just another step towards the overall goal of achieving a distinction that I set for myself nearly a year and a half ago.


To set the scene for why this module meant so much to me: firstly, an incredible level of effort and personal sacrifice went into making this grade happen. I had originally booked the week I completed this assignment off work to take some time away from the pressure of day-to-day life and to celebrate my 29th birthday. In complete transparency, the reason for wanting the break is not that mentally I am a four-year-old who still needs a big birthday party, but more because I seriously needed a break from the screen and some breathing room from the pressure of working full-time during the week and completing my Master’s part-time during evenings and weekends. That has been my daily existence for the last year and a half. What I really wanted and needed was a few days away in Barcelona, a place I’ve tried and failed to visit for over half a decade. However, once again, that break never happened.


Instead of going away, I spent the entire week, from the morning of Friday the 19th of April to the early hours of Monday the 28th, glued to my laptop for an average of nine hours a day of screen time. I spent each day writing my code and drafting my report, then trained and ran my tuned models overnight whilst I slept, ready for the next sprint. However, as mad as this sounds, I did this not because I had to, but because I wanted to.


As you will soon come to see, and if you couldn’t tell already, this module genuinely mattered to me. And for that reason, just as I had once committed myself to playing professional basketball at the expense of everything else in my life, I once again committed myself to the goal of doing everything I possibly could to ensure I left nothing to chance. I wanted to ensure this was the best piece of work I had ever submitted, and at least this sprint was only for a week.


The primary reason both this module and assignment were so important to me was that it was the first and best opportunity to formally take the ideas I’ve been exploring over the past few years around the growing space and need to intersect behavioural science, consumer psychology, data science, and machine learning within social media, and build something practical rather than theoretical. It gave me the chance to build the physical evidence that supported and illustrated the concepts that had previously been floating around in my mind and my countless unpublished and drafted blogs.


This assignment also directly aligns with the direction I want to head professionally and provided the perfect springboard for building on the ideas I recently shared in my first public talk, titled Designing Smarter Strategies, Machine Learning for Impactful Social Media Strategy that I delivered in early April, just a couple of weeks before I sat down and glued myself to my desk for a whole week to complete this assignment.


The reason this assignment was so closely associated with my talk is that within it, I shared a holistic perspective on how behavioural science, data science, consumer psychology, and creative strategy all interact. I outlined how every single social media post causes ripples, and those ripples hold fascinating behavioural insights and patterns that, if captured and labelled correctly across three overarching dimensions, content, text, and time, can help us begin to truly understand, on a more scientific level, how people really think, feel, and behave in response to what we create and publish.


I expanded on this idea further by explaining how, if these ripples and all the subsequent variables are labelled and strategically structured within a dataset large enough to model, they become far more valuable than the surface-level dashboards and vanity metrics still used as the primary way of measuring social media performance. To me, this has never been good enough. By using the same tools as everyone else, we’re only adding to the noise, hoping that, at the end of each month, we see more greens next to the usual metrics compared to the month before.


Essentially, what I outlined was a more scientific approach to content creation and distribution. One that allows us to build the foundations of something much more strategic, something more accurate, something more valuable, and even something that, over time, gives us a level of foresight we’ve never had access to before. This foresight would create the basis for a more predictive approach to content engineering and planning in which, as our datasets grow through this approach, more meaningful patterns would begin to emerge, enabling much smarter decisions and more intentional creative outputs, all within a system designed to nudge audiences, in a measured and observable way, toward outcomes that actually matter. This, to me, is the solution to finally moving beyond the global obsession with likes and other vanity metrics as the sole indicators of success.


Taking a full week to dive deeply into this project gave me the space, the time constraints, and the dataset to begin testing the surface-level components of these ideas. It let me scratch the surface of what it might truly mean to shift from isolated, intuition-led content production to a world of predictive, behaviourally informed content sequences, sequences that actually move the needle in both the digital and physical world in an observable and measurable way. Sequences that could ultimately redefine the discipline of social media and its role within broader marketing and business operations.


However, to scale back these ideas from the usual delusions of grandeur all strategists are prone to, I have to return to the first step in the journey: discussing the actual objectives of this assignment, and whether it was even possible to predict which variables might cause a YouTube video to trend. As much as I want to be at the destination I have in mind, I’ve learned the value of setting the goal and reverse-engineering the steps required to make it real. In the same way every step must be taken to reach a dream, the same applies when building a wall; each brick is a necessary element in turning the vision of a house into reality. In this case, this assignment was me laying my first brick towards physically exploring how the ripples from each post we share could eventually be structured into smarter, adaptive content ecosystems that allow us to measure social media’s direct impact on nudging customer experiences across both our digital and physical worlds.


To end this elongated introduction, I’ll close with the same sentiment I used to finish my public talk. I truly believe that the future of creating and designing the most impactful, meaningful, and valuable social media content lies in its role within much larger, interconnected omnichannel content ecosystems. Ecosystems that continuously analyse consumer, audience, and follower behaviour in real time, across both digital and physical spaces, through AI and other supporting data science methods. At that point, success on social media will have long since moved beyond the current practice of writing a copy-and-paste strategy, propped up by basic monthly content calendars and measured by top-level reports focused on the usual vanity metrics.


Over an increasing time horizon, these omnichannel ecosystems will continue to learn, evolve, and adapt. Behavioural insights won’t just shape creative direction, they’ll inform business decisions at every level. Strategy and content distribution across platforms will become predictive, engineered around the patterns found in the broader ripple effects that connect each channel. Success won’t be judged by siloed dashboards or monthly reports focused on what happened in the last month. Instead, it will be measured through a unified, real-time view of what should happen next, driven by optimised and predictive outputs that reflect business seasonality, live audience behaviour, and real-time demand.


But before we can get there, we first need to explore whether it was even possible to predict how many days it might take for a YouTube video to trend.


Choosing the Dataset: Where Strategy Meets Data Science


For my assignment, I was given the choice between four publicly available datasets. Out of those options, I went straight for the UK YouTube Trending Videos dataset. As I touched on previously, this dataset was not only closely aligned with my recent public talk but also the kind of work I am beginning to carry out day-to-day in my current full-time role.


The dataset contained over 36,000 records of legitimate UK trending videos across a 12-month period between 2017 and 2018. It included all the standard raw metrics available when extracting data from YouTube, such as publish time, video title, description, tags, views, likes, dislikes, comments, and category ID. In other words, it contained a strong combination of metadata and engagement variables to work with.


I should note here why I have not been able to create any significant machine learning models to explore my ideas and theories previously. Most of my current work revolves around building strategies that help lay the foundations for deeper levels of analysis. Therefore, at my point of entry, most datasets I come across in my professional work are far smaller. Additionally, my work primarily focuses on B2B clients who post only a couple of times per week on LinkedIn, a platform well-known for being incredibly guarded in terms of what data you can extract. For example, LinkedIn only allows you to export twelve months of retrospective data and doesn’t retain anything beyond that point. This means that at the point my work begins, I’m usually working with datasets of anywhere from as few as 50 posts to, at most, a few hundred, depending on the client.


This scale is entirely reasonable in real-world settings, where content creation is shaped by practical constraints, things like budgets for social media management and the internal capacity to actually produce and publish content. Even when clients increase their posting frequency from the start of our relationship, the effects can only be measured moving forward. That’s because the content itself has usually been refined at the same time, visually, tonally, and in how it’s being distributed. So, to make sure any comparisons are without bias, especially if we’re aiming to compare performance on a year-on-year basis, it takes time to build up a dataset that’s large enough and clean enough to work with. This is then just to allow for meaningful seasonal comparisons, never mind the scale you’d need to support proper machine learning modelling of any kind.


These limitations are further complicated by the variability in each client’s platform size and follower base, which significantly influences engagement. This makes comparisons across clients, and even between a single client’s different platforms, far more complex. It becomes even more pronounced when clients operate in very different industries, where some of their platforms consistently perform at a much higher level than others, even if they are sharing the same content.


The reason I bring all of this up is to provide context: the dataset in my assignment gave me access to a much broader sample, at the scale required to experiment with machine learning models, without the constraints I would typically encounter when modelling from a single account.


So, with the dataset downloaded and the freedom to define my own predictive question, I began mapping out the key exploratory questions that would guide my modelling process. This became the starting point for everything that followed.


From Exploration to Prediction: Asking the Right Questions


Before building any machine learning models, I knew I had to ask better questions than just whether we could predict what causes a video to trend. There are so many nuances, reasons, and relationships between variables for each channel and category that might explain why one video trends faster than another. For this reason, I carried out a full investigation into the dataset, identified what variables were actually available, and set some exploratory questions to better understand what I was working with. These included:

  • What is the distribution of time taken for videos to trend?

  • Do certain video categories trend faster than others?

  • Are engagement metrics (e.g., views, likes, comments) associated with faster trending?

  • Does metadata (e.g., publish hour, title length, number of tags) also have a noticeable influence on trend speed?

These were simple, strategic questions based on the kind of real-world thinking that content strategists, creators, and marketers constantly engage in without always having the tools to properly answer them.


From there, I moved to the core predictive question that would frame the rest of my project.


Can we predict how many days it will take for a video to trend using video metadata and engagement metrics?


To guide the modelling process, I also framed three operational hypotheses and began shaping the basis for the feature engineering I would later need to account for variation between creators and categories. These hypotheses were:

  1. Engagement metrics (e.g., views, likes, comments) will have the most significant impact on how quickly a video trends.

  2. Engagement data relative to specific channels (e.g., average views, like to view ratio, dislike to view ratio) will add more predictive power than baseline engagement metrics alone.

  3. A channel’s average lag time before a video trends will strongly influence how quickly its next video trends.

These hypotheses weren’t just there to guide and improve the accuracy of the future models. They were designed to test and reinforce the theory that success on social media isn’t always a fluke, but the result of patterns: patterns that standard raw social media data isn’t labelled well enough to capture, and that currently can only be spotted by those who understand the underlying madness of social media and can perceive the countless variables most people can’t see, understand, or even value.


Therefore, as the results of this project later suggest, if those variables can be labelled sufficiently, they can help us identify more objective patterns that we can actually learn from. This, to me, is the beginning of the foundations for engineering and building smarter systems that respond and adapt in real time to our specific audiences, based on large-scale consumer behaviour and engagement.

With those questions in place, it was time to explore the data and begin building my models.


Building the Machine Learning Models: From Raw Data to Real Insights


As you might imagine, this was the stage that took me the best part of a full week to complete.


On my first run through the assignment, I started with a smaller sample of around 3,000 videos. I removed any duplicates where a video had trended more than once. The goal was to test whether I could predict the initial signals that caused a video to trend in the first place, before any later cases where that same video may have trended again due to algorithmic shifts or sociocultural influences that I couldn’t control for.
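
For context, that de-duplication step is straightforward in R with dplyr, keeping only each video’s first appearance in the trending list (the column names here are illustrative rather than my exact assignment code):

library(dplyr)

first_trend <- videos %>%
  arrange(trending_date) %>%             # earliest trending appearance first
  distinct(video_id, .keep_all = TRUE)   # keep one row per video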


After several days spent completing the full assignment using that smaller sample, I made the call to go back and rewrite all of my code to run the models on the full dataset. I excluded only the entries removed during initial cleaning to better reflect what actually causes a video to trend, even if it does so more than once.


As I touched on previously, I had originally been overly cautious, concerned that social or contextual factors might be driving certain videos to trend a second or third time, and that those influences wouldn’t be measurable within the dataset itself. However, with a dataset of this size, removing nearly 30,000 entries felt like omitting an immense amount of valuable insight. That said, this change came with a massive cost.


Running the full dataset through the tuned models, especially the Random Forest model, and exporting the final output to an R Markdown HTML file took eight hours each time I made any updates and needed to see the results. For that reason, I had to let the models run overnight and just hope my laptop wouldn’t die or go to sleep and break the sequence. I also had to rewrite sections of my report during the day to reflect the updated results, which left me right up against the deadline.


In the final 48 hours, I was juggling formatting the HTML report, finalising my results, and making last-minute adjustments without undoing days of work. The models finally finished running and exported successfully at around 5 a.m. on the 28th, giving me just enough time to wake up, submit my assignment, and head straight back to the office after a lovely, relaxing week of annual leave… Not.


For those who are interested, here is a top-level breakdown of my workflow and what I built:

  • Data exploration: I started by checking for missing data and examining the distribution of engagement metrics.

  • Data cleaning and preparation: I handled missing values, removed near-zero and low-variance predictors, controlled for outliers, used the publish time and hour variables to create a ‘publish_lag_days’ variable, and summarised each channel’s historical performance into new variables such as ‘avg_views’, ‘avg_lag’, and ‘total_videos’.

  • Feature engineering: I engineered new features such as ‘like_ratio’ (likes divided by views), ‘comment_ratio’ (comments divided by views), ‘dislike_ratio’ (dislikes divided by views), and ‘likes_per_comment’ (likes divided by comments), as well as several channel-level engagement metrics to capture historical performance. I also log-transformed engagement metrics to stabilise variance. A simplified sketch of these cleaning and feature-engineering steps appears after this list.

  • Stratified train/test split: I split the data into training and testing sets using a stratified random split based on ‘category_id’. I chose to split on this variable because certain categories were much more common than others, and an unbalanced split could have biased the model training or evaluation process. I then visualised the category distribution and views across the train and test samples to make sure they were comparable for reliable modelling.

  • Exploratory analysis: I created visualisations to inspect and explicitly answer my four exploratory questions, using a bar graph of the distribution of days until trending, the average time to trend by category, and two separate correlation matrices for the engagement and metadata variables.

  • Baseline modelling: I ran multiple versions of a linear regression model using different subsets of predictors (engagement only, metadata only, category only, and a fully combined version) to see which types of variables contributed most to prediction accuracy. Every model beyond this point used the combined feature set, with supporting visualisations of each model’s accuracy as well as a separate visualisation of feature importance, which indicated that engineered features such as ‘avg_lag’, ‘avg_views’, and ‘like_ratio’ were more powerful within the random forest and GBM models than the standard engagement metrics.

  • Stacked ensemble (baseline): Combining models into a baseline stacked ensemble allowed me to reduce the weaknesses of any individual algorithm and stabilise predictive performance.

  • Model tuning: The next stage was an intensive and incredibly time-consuming one, in which I tuned both the random forest and GBM models using grid search and cross-validation (a simplified sketch of the tuning and ensembling steps also appears after this list). These were not quick runs, with the random forest model having to be left running overnight to complete.

  • Final stacked ensemble: With the best models selected and tuned, I created a refined stacked ensemble to maximise predictive accuracy and generalisability.
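
To make the cleaning, feature-engineering, and splitting steps above a little more concrete, here is a heavily simplified sketch of what that stage can look like in R using dplyr and caret. The column names (publish_time, trending_date, channel_title, views, likes, dislikes, comment_count, category_id) follow the standard YouTube trending export rather than my exact assignment code, and the sketch assumes both date columns have already been parsed during cleaning.

library(dplyr)
library(caret)

videos <- videos %>%
  mutate(
    # Target: how many days the video took to trend after being published
    publish_lag_days  = as.numeric(as.Date(trending_date) - as.Date(publish_time)),
    publish_hour      = as.integer(format(as.POSIXct(publish_time), "%H")),
    # Engagement ratios
    like_ratio        = likes / views,
    comment_ratio     = comment_count / views,
    dislike_ratio     = dislikes / views,
    likes_per_comment = likes / pmax(comment_count, 1),
    # Log-transform raw engagement metrics to stabilise variance
    log_views         = log1p(views),
    log_likes         = log1p(likes)
  ) %>%
  group_by(channel_title) %>%
  mutate(
    # Channel-level historical performance
    avg_views    = mean(views),
    avg_lag      = mean(publish_lag_days),
    total_videos = n()
  ) %>%
  ungroup()

# Stratified train/test split on category_id so rarer categories appear
# in similar proportions in both samples
set.seed(2024)
train_idx <- createDataPartition(as.factor(videos$category_id), p = 0.8, list = FALSE)
train <- videos[train_idx, ]
test  <- videos[-train_idx, ]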
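
The tuning and stacking stages can be sketched in a similarly simplified way, here assuming the caret and caretEnsemble packages and the features created above. The grids below are deliberately tiny for illustration; the real grids were far larger, which is what kept my laptop running overnight.

library(caret)
library(caretEnsemble)

# Keep only the target and the numeric/engineered predictors for modelling
model_cols <- c("publish_lag_days", "log_views", "log_likes", "like_ratio",
                "comment_ratio", "dislike_ratio", "likes_per_comment",
                "avg_views", "avg_lag", "total_videos", "publish_hour")
train_m <- train[, model_cols]

ctrl <- trainControl(method = "cv", number = 5, savePredictions = "final")

# Grid-searched random forest
rf_fit <- train(publish_lag_days ~ ., data = train_m, method = "rf",
                trControl = ctrl,
                tuneGrid = expand.grid(mtry = c(3, 5, 7)))

# Grid-searched gradient boosting machine (GBM)
gbm_fit <- train(publish_lag_days ~ ., data = train_m, method = "gbm",
                 trControl = ctrl, verbose = FALSE,
                 tuneGrid = expand.grid(n.trees = c(500, 1000),
                                        interaction.depth = c(3, 5),
                                        shrinkage = 0.05,
                                        n.minobsinnode = 10))

# Stacked ensemble: train the base learners on shared folds, then combine
# their out-of-fold predictions with a simple meta-model
base_models <- caretList(publish_lag_days ~ ., data = train_m,
                         trControl = ctrl,
                         methodList = c("lm", "rf", "gbm"))
stacked <- caretStack(base_models, method = "glm", trControl = ctrl)

test_pred <- predict(stacked, newdata = test[, model_cols])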


The best-performing model achieved:


Final model performance: R² = 0.941, RMSE = 0.221


But, what does that actually mean?


An R² of 0.941 suggests that my final model was able to explain over 94% of the variance in how many days it takes for a video to trend. This is an incredibly high level of predictive accuracy, especially in a field as chaotic and context-dependent as social media. In other words, it means that we can model and forecast trend speed with a surprising amount of precision.


The RMSE (Root Mean Square Error) of 0.221 indicates that, on average, the model’s predictions of the number of days it takes for a video to trend were off by just over a fifth of a day, or roughly five hours. That’s a level of precision most social strategists and content creators don’t typically have access to, which is exactly why this kind of work matters.
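
For anyone less familiar with these two metrics, both can be computed directly from a model’s predictions on the test set. A quick base R illustration, where y is the observed number of days until trending and y_hat is the model’s prediction:

rss  <- sum((y - y_hat)^2)          # Residual sum of squares: variation the model misses
tss  <- sum((y - mean(y))^2)        # Total sum of squares: overall variation in the target
r2   <- 1 - rss / tss               # Share of variation explained (0.941 in my final model)
rmse <- sqrt(mean((y - y_hat)^2))   # Typical size of a single prediction error, in days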

However, while the numbers were strong, the more important takeaway is that the process shows that if we treat raw social media data as being just as valuable as any behavioural, marketing, business, or even financial dataset, by cleaning it properly, engineering the right features, and modelling it with the appropriate structure, we really can build intelligent systems that learn, adapt, and guide content strategy over time. Systems that go far beyond what we are able to intuitively sense or perceive as individuals.


The results of this project, therefore, reinforce some of the ideas I shared earlier, particularly the belief that social media success isn’t always a fluke. There is a science within the madness, within the chaos. The data often holds patterns waiting to be recognised. Once those patterns are uncovered, it becomes possible to start engineering and predicting what should come next as part of a sequence of compounding, meaningful engagement.


From Prediction to Possibility: What This Taught Me


This project wasn’t just the most technically demanding part of my Master’s so far; it was also the most fulfilling. It gave me the space to explore something I’ve been conceptualising and working towards for a long time: my belief that the future of content strategy will be shaped by systems that don’t simply respond, as we currently do, on a month-to-month basis by interrogating the usual vanity metrics, but instead adapt intelligently in real time to the behaviours, engagement levels, and preferences of our evolving audiences.


Therefore, rather than treating each social media post as an isolated moment in time, we need to start seeing our content as a series of connected dominoes, with the subsequent engagement made up of the collective sequence of decisions taken by our audiences. When modelled in this way, as part of one of these systems, the success (or failure) of each post builds and compounds upon the last post in the sequence, with each new post creating new ripples, shaping engagement, and influencing future outcomes. That’s how machine learning models approach complex games like chess or Go: not by trying to predict the future with absolute certainty, but by constantly learning from everything that came before, recognising patterns, and selecting the most strategic next move based on the current layout of the board.


This process also reinforced something I’ve been arguing in both theory and practice for a while now, that social media data is another form of behavioural data, and it deserves to be treated with the same level of curiosity and strategic thinking as any other dataset used to inform business or customer decisions. When it’s labelled and structured correctly, it really does reveal patterns that help us understand not just what content performs well, but why it performs well, and what should come next.


This perspective isn’t about replacing the creativity of social media managers or designers with code. It’s about giving creative teams better tools. Tools that reduce guesswork, shorten feedback loops, and help identify the necessary ingredients so we can define the canvas and colour beautifully within it. Ultimately, this whole process is about ensuring creative teams are set up for success, with less margin for error, where each post can be shared at the perfect time, to the perfect people, in the perfect format, creating more reliable results that actually matter.


This project gave me a chance to finally enter the world where these ideas can be executed in practice, and although it involved some of the longest and most intense hours I’ve ever spent on a single assignment, it also helped me take a meaningful step toward the kind of future I want to help shape.


Looking Ahead: Why This Matters and What’s Next


At surface level, this may have been a simple academic exercise and an entry-level way to explore a dataset and build machine learning models, but beneath that, it represented something far more personal and foundational to the direction I want to take in my career.

Throughout most of my professional career, I have found that what excites me isn’t just the technical side of influencing behaviour through analysing data, or the creative side of storytelling; it’s the space where those two things meet. Where behavioural science, consumer psychology, data science, and machine learning intersect. Where data doesn’t constrain creativity but guides it with more clarity and confidence, and, when done well, has measurable impacts on people’s behaviour in both digital and physical spaces.


What this project showed me is that this intersection is no longer hypothetical. Systems are already being built towards different objectives, but looking at how these techniques work can give us a glimpse of what is possible within our own fields. In the world of marketing and social media, it’s only a matter of time before the use of AI goes beyond writing captions for posts and generating images. We now have the opportunity to curate tools and techniques to build systems that don’t just chase the usual vanity metrics but truly understand what those metrics mean and the impact they have across our different platforms, marketing channels, and even digital and physical spaces.


I truly believe that, once designed, these systems will learn from every piece of content, adapt with every new data point (digital and physical), and evolve with every shift in audience behaviour, so it will become vital to be ahead of the curve rather than behind it. Once these systems exist, and there are individuals who understand both the value of creativity and the techniques of data science, it will become very hard to get by in a creative domain if you only sit within one of the two.


For this reason, envisioning where the world is heading is what makes me want to do everything I can to be one of the ones, if not steering the ship, then at least on it. That way, it will be easier to qualify the necessity of designers and creative thinkers when organisations try to go lean by optimising for quantifiable, data-informed outputs. Although both have value within every organisation, the businesses that succeed will be those that can balance the two and work within their intersection, rather than those that swing like a pendulum between overvaluing data and overvaluing creativity, a cyclical spluttering that negatively impacts growth over longer periods of time.


Therefore, this is the kind of work I want to be doing more of. Helping brands and businesses create smarter content ecosystems through an omnichannel approach that attempts to make sense of the complexity of human behaviour by turning data into insight, insight into decisions, and decisions into compounding engagement over time.


So yes, the project resulted in a 95, and I’m proud of that, but more importantly, it validated the direction I’m moving in. It sharpened my thinking, strengthened my conviction, and reminded me that the ideas I’ve been writing and obsessing about for the last 10 years can be built, tested, and applied in practice, not just talked about in theory.


If you're exploring similar ideas or want to start moving in that direction, I’d love to connect.
