How to Get Accurate Per-File Download Statistics from Your CDN



Most CDN dashboards show you total traffic, total requests, and average cache hit ratio. But what if your business depends on understanding exactly how many times each file is downloaded?

If you serve 100,000+ assets, aggregated metrics are not enough. You need precise, per-file visibility to optimize performance, control costs, and make data-driven decisions.


Understanding exactly how often each asset is delivered through your CDN is critical for capacity planning, cost control, monetization, and security. However, most CDN dashboards provide only aggregated metrics, which are insufficient for large-scale infrastructures serving tens or hundreds of thousands of files.

This article explains how to obtain accurate per-file download statistics, the limitations of standard CDN analytics, and how to design a scalable logging and analytics pipeline.

Why Per-File CDN Analytics Matters

Modern content delivery environments operate at massive scale. According to Cisco's traffic forecasts, video accounted for over 82% of all internet traffic by 2022, and large media platforms routinely serve millions of requests per hour.

For infrastructures serving 100K+ assets, aggregated metrics such as total bandwidth or request count are not enough. You need per-object visibility to:

  • Identify hot and cold content
  • Optimize cache efficiency
  • Detect abuse or scraping
  • Calculate ROI per asset (ads, subscriptions, licensing)
  • Plan storage and tiering strategies

Aggregated CDN Metrics

Most CDN providers expose analytics such as:

  • Total requests
  • Bandwidth usage
  • Cache hit ratio
  • Geographic distribution

However, these are aggregated metrics, not per-object statistics.

Key Limitation

Standard CDN dashboards do not track individual file downloads at scale due to performance and storage constraints.

For example:

  • A platform serving 100,000 images
  • With 10 million requests/day

Tracking per-file counters in real time would require:

  • High-cardinality storage
  • Distributed aggregation
  • Significant compute overhead

This is why most CDNs provide:

  • Sampling
  • Aggregation
  • Delayed reporting

The Only Reliable Source

To achieve accurate per-file statistics, you must rely on raw access logs generated at the edge.

These logs typically include:

  • Timestamp
  • Client IP
  • Request URI (file path)
  • HTTP status code
  • Bytes transferred
  • Cache status (HIT/MISS)
  • User-Agent

Example (Simplified Log Entry)

2026-03-30T12:00:01Z GET /images/product123.jpg 200 53214 HIT

From this, you can derive:

  • Exact download count per file
  • Bandwidth per asset
  • Cache efficiency per object
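As a minimal sketch, assuming the simplified six-field format shown above (timestamp, method, URI, status, bytes, cache status), a few lines of Python can turn raw log lines into per-file counters:

```python
from collections import Counter, defaultdict

def parse_line(line):
    """Parse one simplified edge-log entry:
    <timestamp> <method> <uri> <status> <bytes> <cache_status>"""
    ts, method, uri, status, size, cache = line.split()
    return {"uri": uri, "status": int(status), "bytes": int(size), "cache": cache}

downloads = Counter()        # per-file download counts
bandwidth = defaultdict(int) # per-file bytes delivered

log_lines = [
    "2026-03-30T12:00:01Z GET /images/product123.jpg 200 53214 HIT",
    "2026-03-30T12:00:02Z GET /images/product123.jpg 200 53214 MISS",
    "2026-03-30T12:00:03Z GET /images/banner.png 404 0 MISS",
]

for line in log_lines:
    entry = parse_line(line)
    if entry["status"] == 200:   # count only successful deliveries
        downloads[entry["uri"]] += 1
        bandwidth[entry["uri"]] += entry["bytes"]

print(downloads["/images/product123.jpg"])  # 2
print(bandwidth["/images/product123.jpg"])  # 106428
```

Real CDN log formats vary by provider (JSON lines, W3C extended, tab-separated), so the `parse_line` split logic would need adapting, but the counting approach is the same.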

Log-Based Analytics Architecture

To process per-file statistics at scale, you need a distributed pipeline.

| Layer | Technology Options | Purpose |
| --- | --- | --- |
| Log collection | CDN logs, NGINX logs | Capture raw request data |
| Transport | Kafka / Fluentd | Stream logs in real time |
| Storage | S3-compatible / object storage | Store raw logs |
| Processing | ClickHouse / Apache Spark | Aggregate per-file metrics |
| Visualization | Grafana / custom dashboards | Query and display insights |

Why High-Cardinality Data Is Challenging

Per-file analytics introduces high-cardinality datasets, where every unique file path becomes a distinct value of the grouping dimension.

According to Cloudflare, high-cardinality observability data is one of the main challenges in modern distributed systems due to:

  • Explosive growth in unique keys (URLs)
  • Increased memory consumption
  • Query performance degradation

Real-World Example

| Metric | Value |
| --- | --- |
| Total files | 100,000 |
| Daily requests | 10,000,000 |
| Unique file keys | 100,000 |
| Log entries/day | 10M+ |

This scale requires:

  • Columnar databases (e.g., ClickHouse)
  • Efficient indexing
  • Partitioning strategies
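One common partitioning approach is to split data by day and then by a hash bucket of the file path, so no single partition has to hold all 100K+ unique keys. A hedged sketch (the bucket count of 16 is an arbitrary illustration, not a recommendation):

```python
import hashlib
from datetime import datetime

def partition_key(timestamp: str, uri: str, buckets: int = 16) -> str:
    """Derive a storage partition for a log entry: one partition per
    day, sub-split into hash buckets of the file path."""
    day = datetime.fromisoformat(timestamp.rstrip("Z")).date().isoformat()
    # MD5 used only for cheap, stable bucketing -- not for security
    bucket = int(hashlib.md5(uri.encode()).hexdigest(), 16) % buckets
    return f"{day}/bucket={bucket:02d}"

print(partition_key("2026-03-30T12:00:01Z", "/images/product123.jpg"))
```

In a columnar store such as ClickHouse, the same idea maps onto a `PARTITION BY` expression on the date plus an `ORDER BY` key on the file path, which keeps per-file queries scanning only a small slice of the data.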

Processing Logs for Per-File Statistics

Once logs are collected, you can compute accurate metrics.

Example Aggregation Query (Conceptual)

  • Group by: request_uri
  • Count: total requests
  • Sum: total bytes
  • Filter: status = 200
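The conceptual query above can be sketched in plain Python (assuming log entries already parsed into dicts with `uri`, `status`, `bytes`, and `cache` fields):

```python
from collections import defaultdict

def aggregate(entries):
    """Aggregate parsed log entries into per-file stats:
    downloads, total bytes, and cache hit ratio."""
    stats = defaultdict(lambda: {"downloads": 0, "bytes": 0, "hits": 0})
    for e in entries:
        if e["status"] != 200:              # filter: status = 200
            continue
        s = stats[e["uri"]]                 # group by: request_uri
        s["downloads"] += 1                 # count: total requests
        s["bytes"] += e["bytes"]            # sum: total bytes
        s["hits"] += e["cache"] == "HIT"
    for s in stats.values():
        s["hit_ratio"] = s["hits"] / s["downloads"]
    return dict(stats)

entries = [
    {"uri": "/img/a.jpg", "status": 200, "bytes": 53214, "cache": "HIT"},
    {"uri": "/img/a.jpg", "status": 200, "bytes": 53214, "cache": "MISS"},
    {"uri": "/img/a.jpg", "status": 404, "bytes": 0, "cache": "MISS"},
]

result = aggregate(entries)
print(result["/img/a.jpg"]["downloads"])  # 2
print(result["/img/a.jpg"]["hit_ratio"])  # 0.5
```

At production volumes the same GROUP BY logic would run inside ClickHouse or Spark rather than in-process Python, but the semantics are identical.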

Output Example

| File | Downloads | Bandwidth | Cache Hit Ratio |
| --- | --- | --- | --- |
| /img/a.jpg | 120,000 | 6.4 GB | 98% |
| /img/b.jpg | 12,000 | 0.8 GB | 85% |
| /img/c.jpg | 500 | 20 MB | 40% |

This enables:

  • Identifying top-performing assets
  • Detecting inefficient caching
  • Removing unused files

Real-Time vs Batch Processing

Batch Processing (Most Common)

  • Logs processed every 5–15 minutes
  • Lower infrastructure cost
  • Suitable for most use cases

Real-Time Processing

  • Stream processing (Kafka + Flink)
  • Sub-second visibility
  • Higher cost and complexity

In practice, batch processing is sufficient for the vast majority of workloads.

CDN Provider Limitations

Even enterprise CDNs:

  • Do not expose full raw logs by default
  • Charge extra for log delivery
  • Limit retention (e.g., 3–7 days)

This makes external log storage mandatory for long-term analytics.

Accuracy Considerations

To ensure accurate per-file statistics:

1. Filter Only Successful Requests

Exclude:

  • 4xx errors
  • 5xx errors

2. Handle Cache Hits and Misses

Both should be counted: a cache HIT is still a delivery to the end user, so counting only origin MISSes would dramatically undercount downloads.

3. Normalize URLs

Remove:

  • Query parameters (if irrelevant)
  • Tracking tokens

4. Deduplicate Edge Cases

Client retries and partial (HTTP 206 range) downloads may inflate counts; consider deduplicating requests for the same file from the same client within a short time window.
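URL normalization (point 3) is where most counting errors originate in practice. A hedged sketch using only the standard library; the tracking-parameter list here is hypothetical and would need tailoring to your traffic:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical tracking parameters to strip; adjust to your own traffic.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "token", "sig"}

def normalize_uri(uri: str, keep_query: bool = False) -> str:
    """Normalize a request URI so all variants of the same file
    collapse into one counter key."""
    parts = urlsplit(uri)
    if not keep_query:
        return parts.path                 # drop the query string entirely
    # Keep only functionally meaningful parameters (e.g. image width)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_PARAMS]
    return parts.path + ("?" + urlencode(query) if query else "")

print(normalize_uri("/img/a.jpg?utm_source=mail&token=abc"))         # /img/a.jpg
print(normalize_uri("/img/a.jpg?w=300&token=abc", keep_query=True))  # /img/a.jpg?w=300
```

Whether to keep the query string depends on whether it changes the delivered content (image resizing parameters do; tracking tokens do not).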

Performance Impact

Proper logging has minimal impact when:

  • Logs are streamed asynchronously
  • Edge nodes are not blocked
  • Compression is used

Modern CDNs are designed to handle logging pipelines without degrading delivery performance.

Key Takeaways

  • Per-file download statistics are not available via standard CDN dashboards
  • Raw logs are the only reliable source of truth
  • High-cardinality data requires specialized storage (e.g., ClickHouse)
  • Batch processing is sufficient for most infrastructures
  • Accurate analytics requires normalization and filtering

If your project serves tens of thousands of assets or processes millions of requests daily, standard CDN analytics are not enough.

At Advanced Hosting, we design and deploy:

  • Log-based CDN analytics pipelines
  • High-performance storage (ClickHouse, object storage)
  • Custom dashboards for per-file visibility
  • Scalable infrastructure for high-load environments

Frequently Asked Questions

Do CDNs provide per-file download statistics out of the box?

No. Most CDN providers only offer aggregated analytics such as total requests, bandwidth usage, and cache hit ratios. Per-file visibility requires access to raw logs and external processing.

What is the most accurate way to track downloads for each file?

The most accurate method is analyzing raw edge logs. These logs record every request, including the exact file path, allowing precise counting of downloads per asset.

Can I track downloads for 100,000+ files without performance issues?

Yes, but only with the right architecture. High-cardinality datasets require:

  • Columnar databases (e.g., ClickHouse)
  • Efficient partitioning
  • Scalable ingestion pipelines (Kafka, Fluentd)

Without this, performance and query speed will degrade.

How often should CDN logs be processed?

For most use cases:

  • Batch processing (5–15 minutes) is sufficient
  • Real-time processing is only needed for fraud detection or live monitoring

Do cache hits and misses affect download statistics?

Yes. To get accurate numbers:

  • Count both cache hits and misses
  • Filter only successful responses (HTTP 200)

This ensures you measure actual content delivery, not just origin fetches.

Can CDN logs be stored long-term?

Not by default. Most CDN providers retain logs for only 3 to 7 days. For long-term analytics, logs must be exported to external storage such as S3-compatible systems.

How large can CDN log data grow?

Very quickly. For example:

  • 10 million requests/day ≈ several GBs of logs daily
  • At scale, this becomes terabytes per month

Efficient storage and compression are essential.
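A quick back-of-the-envelope estimate makes the growth concrete. The average line size of ~200 bytes is an assumption (uncompressed access-log lines typically run 150-300 bytes):

```python
def daily_log_bytes(requests_per_day: int, avg_line_bytes: int = 200) -> int:
    """Rough log-volume estimate; avg_line_bytes is an assumed
    average uncompressed line size."""
    return requests_per_day * avg_line_bytes

for rpd in (10_000_000, 500_000_000):
    gb_day = daily_log_bytes(rpd) / 1e9
    print(f"{rpd:>11,} req/day -> {gb_day:,.0f} GB/day, {gb_day * 30 / 1000:.2f} TB/month")
```

At 10M requests/day this works out to about 2 GB of raw logs daily; platforms in the hundreds of millions of requests per day cross into terabytes per month, which is why compression (gzip/zstd typically shrinks access logs 5-10x) matters so much.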

What tools are best for analyzing CDN logs?

Common stack:

  • Ingestion: Kafka, Fluentd
  • Storage: Object storage (S3-compatible)
  • Analytics: ClickHouse, Apache Spark
  • Visualization: Grafana

Is it possible to track unique downloads (not just total hits)?

Yes, but it requires additional processing:

  • IP + User-Agent fingerprinting
  • Cookie-based tracking
  • Session reconstruction

Note that this introduces complexity and potential privacy considerations.
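A minimal sketch of the fingerprinting approach, assuming parsed log fields for client IP and User-Agent. Hashing avoids storing raw IPs, but IP + User-Agent is a coarse signal (shared NATs, identical browsers), so treat the resulting counts as approximate:

```python
import hashlib
from collections import defaultdict

def fingerprint(ip: str, user_agent: str) -> str:
    """Coarse client fingerprint from IP + User-Agent.
    Hashed so raw IPs are not retained; still a privacy-relevant signal."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()[:16]

unique_clients = defaultdict(set)  # uri -> set of client fingerprints

requests = [
    ("/img/a.jpg", "203.0.113.7", "Mozilla/5.0"),
    ("/img/a.jpg", "203.0.113.7", "Mozilla/5.0"),   # repeat download, same client
    ("/img/a.jpg", "198.51.100.2", "Mozilla/5.0"),  # different client
]

for uri, ip, ua in requests:
    unique_clients[uri].add(fingerprint(ip, ua))

print(len(unique_clients["/img/a.jpg"]))  # 2 unique clients vs 3 total hits
```

At scale, exact sets become expensive for 100K+ files; probabilistic structures such as HyperLogLog (built into ClickHouse as `uniqHLL12`-style aggregates) are the usual trade-off.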

How do I avoid inflated download counts?

To improve accuracy:

  • Exclude retries and partial requests
  • Normalize URLs (remove query strings if irrelevant)
  • Filter out bots if necessary

Does enabling logging affect CDN performance?

No, if implemented correctly. Modern CDNs use asynchronous log streaming, which does not impact content delivery latency.

When should I invest in advanced CDN analytics?

You should consider it when:

  • You serve 50K–100K+ assets
  • You process millions of requests daily
  • You need cost attribution or monetization per asset
  • You require detailed usage insights for optimization

In infrastructure, two terms appear everywhere yet remain widely misunderstood: Dedicated Server and Bare Metal Server. To some, they mean the same thing. To others, even long-standing Fortune 500 companies like IBM, they mean something different. Providers put out definitions of their own, and they’re not always aligned with how the technology actually works. The […]