How I Built a Lightweight Web Analytics Tracker

I have used many different trackers in my career and always wondered: how do they work under the hood? What would it take to build a simple web analytics tracker myself?

Using server-side Google Tag Manager, we can now send data directly to BigQuery. This is a great solution and makes sense for most businesses and website owners. My site, however, doesn’t get much traffic, so keeping a Cloud Run instance constantly running to gather a small amount of data seemed wasteful in my case.

To address my requirements, and also as a learning experience, I decided to set up a lightweight and privacy-friendly tracking solution using Google Cloud Platform and Google Tag Manager. It’s simple, serverless, and gives full control over the data being collected.

What I built

At a high level, this is how the setup works:

  1. A JavaScript snippet runs in the browser, triggered via Google Tag Manager (GTM).
  2. That snippet sends tracking hits to an API Gateway.
  3. The API Gateway passes the data to a Cloud Run Function.
  4. The function parses the payload and writes it to a BigQuery table for storage and analysis.

There’s no large client-side library, and no external dependencies beyond GCP. It’s lean, fast, and transparent.


Step-by-step breakdown

1. JavaScript tracking snippet in GTM

I wrote a custom HTML tag in GTM containing a small JavaScript function. Similar to other trackers, this is loaded on initial page load. For each event, I then have other tags that call this function and pass the parameters I want to add to the event.

By default the tracker collects basic data like:

  • Page URL
  • Referrer
  • Timestamp
  • GA cookie id

Using the GA cookie id meant I did not need to set any cookies beyond those already present. The script can easily be adjusted to add its own cookie identifier for the user, and even another to identify the session. However, for simplicity, I have not included them here.

I also created a flexible schema. Similar to the GA4 schema, the event parameters variable is a record object that can accept any input. This keeps things simple and flexible for my purpose.

Here’s a simplified version of the script:
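A sketch along these lines follows; the endpoint URL, the cookie name, and the myTracker global are placeholders rather than the exact values used:

```javascript
// Simplified tracker sketch; ENDPOINT and the myTracker name are placeholders.
(function () {
  var ENDPOINT = 'https://your-gateway-host/collect'; // placeholder URL

  // Read a cookie value by name; returns '' when absent (or outside a browser).
  function getCookie(name) {
    var cookies = typeof document !== 'undefined' ? document.cookie : '';
    var match = (cookies || '').match(new RegExp('(?:^|; )' + name + '=([^;]*)'));
    return match ? decodeURIComponent(match[1]) : '';
  }

  // Build the event payload. event_params is a free-form record,
  // mirroring the flexible GA4-style schema described above.
  function buildPayload(eventName, eventParams) {
    return {
      event_name: eventName,
      page_url: typeof location !== 'undefined' ? location.href : '',
      referrer: typeof document !== 'undefined' ? document.referrer : '',
      timestamp: new Date().toISOString(),
      client_id: getCookie('_ga'), // reuse the GA cookie id
      event_params: eventParams || {}
    };
  }

  // Send the hit; sendBeacon survives page unloads, fetch is the fallback.
  function track(eventName, eventParams) {
    var body = JSON.stringify(buildPayload(eventName, eventParams));
    if (typeof navigator !== 'undefined' && navigator.sendBeacon) {
      navigator.sendBeacon(ENDPOINT, body);
    } else if (typeof fetch !== 'undefined') {
      fetch(ENDPOINT, { method: 'POST', body: body, keepalive: true });
    }
  }

  // Expose the functions so other GTM tags can call them.
  var root = typeof window !== 'undefined' ? window : globalThis;
  root.myTracker = { track: track, buildPayload: buildPayload };

  // Record the initial page view when running in a browser.
  if (typeof document !== 'undefined') {
    track('page_view', {});
  }
})();
```

Other GTM event tags can then call `window.myTracker.track('event_name', { ... })` with whatever parameters that event needs.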

The request is sent as a POST to the public-facing API Gateway URL.


2. GCP API Gateway

The API Gateway acts as a secure public endpoint for the Cloud Run service. It abstracts the backend and can handle authentication, basic request validation, and quotas if needed.

In my case, I simply applied a rate limit to the requests coming through, and set it up to forward the POST request directly to the Cloud Run instance.
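An API Gateway config is an OpenAPI 2.0 spec. A minimal sketch of that setup might look like the following; the title, backend address, and quota values are placeholders, and the quota block assumes the Cloud Endpoints-style x-google-management extension:

```yaml
swagger: '2.0'
info:
  title: analytics-gateway   # placeholder name
  version: '1.0.0'
schemes:
  - https
x-google-management:
  metrics:
    - name: collect-requests
      displayName: Collect requests
      valueType: INT64
      metricKind: DELTA
  quota:
    limits:
      - name: collect-limit
        metric: collect-requests
        unit: 1/min/{project}
        values:
          STANDARD: 600   # placeholder rate limit
paths:
  /collect:
    post:
      operationId: collect
      # Forward the POST body to the Cloud Run service (placeholder URL).
      x-google-backend:
        address: https://my-tracker-xxxxx.a.run.app
      x-google-quota:
        metricCosts:
          collect-requests: 1
      responses:
        '204':
          description: Event accepted
```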


3. Cloud Run Function

The Cloud Run service is a simple Python function that parses the request and writes it to BigQuery. It’s containerised and stateless, scaling automatically depending on traffic.

Since my website sometimes goes for days without any visits, I can set the minimum number of instances to 0. This results in cold starts, which add latency to the first request after an idle period. However, this setup is not business critical, so these limitations and risks are acceptable in my case.

The function:

  • Parses and validates the incoming JSON
  • Inserts the row into BigQuery via the Python BigQuery client library
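The core of that function can be sketched as follows. This is a transport-agnostic outline rather than the exact production code: the field names are assumptions, and insert_fn stands in for the BigQuery client’s insert_rows_json call so the parsing and validation logic can be shown on its own.

```python
# Hypothetical sketch of the Cloud Run function's request handling.
# Field names are placeholders; insert_fn stands in for
# bigquery.Client().insert_rows_json(table, rows).
import datetime
import json

REQUIRED_FIELDS = {"event_name", "page_url", "timestamp"}

def parse_and_validate(raw_body):
    """Parse the request body; return (row, None) on success or (None, error)."""
    try:
        payload = json.loads(raw_body)
    except (TypeError, ValueError):
        return None, "body is not valid JSON"
    if not isinstance(payload, dict):
        return None, "payload must be a JSON object"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return None, "missing fields: " + ", ".join(sorted(missing))
    # Keep event_params as a JSON string so the BigQuery column stays flexible.
    row = {
        "event_name": payload["event_name"],
        "page_url": payload["page_url"],
        "referrer": payload.get("referrer", ""),
        "timestamp": payload["timestamp"],
        "client_id": payload.get("client_id", ""),
        "event_params": json.dumps(payload.get("event_params", {})),
        "received_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return row, None

def handle(raw_body, insert_fn):
    """Validate the hit and insert it; return an HTTP (status, body) pair."""
    row, error = parse_and_validate(raw_body)
    if error:
        return 400, error
    insert_errors = insert_fn([row])  # insert_rows_json returns a list of errors
    if insert_errors:
        return 500, "insert failed"
    return 204, ""
```

In the deployed service this logic would sit behind a Flask route (or the Functions Framework), with the BigQuery client’s insert method passed as insert_fn.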

4. BigQuery for storage and analysis

Once the data is in BigQuery, I can query it with SQL, build reports, or plug it into a dashboarding tool like Looker Studio. Because I control the schema, I can keep it minimal or expand it as needed.


Why this setup?

This approach gives me:

  • Full control over what data is collected and stored
  • No external tracking libraries, making it faster and more transparent
  • Reduced Cost since I can control whether to have instances always running or not
  • Customisability – add events, metrics, or enrich data however I like

It’s ideal for small projects, internal tools, or privacy-conscious websites where basic usage data is enough.

Potential Improvements

This tracker is adequate for simple use cases, but there are many ways it can be improved.

  • Setting up a load balancer in front of the Cloud Run function
    • This has multiple advantages: any cookies set will be first-party on my own domain, and Google Cloud Armor can be attached to the load balancer to make the setup safer.
  • Add the tracker’s own cookie for identifying users
    • Adding a user ID cookie would make the setup more independent, since it currently relies on the GA4 cookie.
  • Add a session identifier to the setup and schema
    • Adding a session identifier cookie and sending that data to BigQuery would make the analysis process much easier.
    • Doing so has its own limitations, but that discussion is outside the scope of this article.
  • Adding browser information to the schema of the setup
    • Currently, the browser, device, screen size, and other attributes are not gathered. Adding these can be very useful when trying to understand user behaviour.
  • Incorporating UTM parameters in the event schema
    • Parsing UTM parameters client-side and sending them with each event reduces the work needed to analyse the data after it is loaded.

This setup is adequate for small sites with minimal volumes of traffic. As things scale, it may need to be reviewed and replaced with a more robust architecture.

