How to Get Accurate Per-File Download Statistics from Your CDN



Most CDN dashboards show you total traffic, total requests, and average cache hit ratio. But what if your business depends on understanding exactly how many times each file is downloaded?

If you serve 100,000+ assets, aggregated metrics are not enough. You need precise, per-file visibility to optimize performance, control costs, and make data-driven decisions.


Understanding exactly how often each asset is delivered through your CDN is critical for capacity planning, cost control, monetization, and security. However, most CDN dashboards provide only aggregated metrics, which are insufficient for large-scale infrastructures serving tens or hundreds of thousands of files.

This article explains how to obtain accurate per-file download statistics, the limitations of standard CDN analytics, and how to design a scalable logging and analytics pipeline.

Why Per-File CDN Analytics Matters

Modern content delivery environments operate at massive scale. According to Cisco's traffic forecasts, video accounted for over 82% of all internet traffic by 2022, and large media platforms routinely serve millions of requests per hour.

For infrastructures serving 100K+ assets, aggregated metrics such as total bandwidth or request count are not enough. You need per-object visibility to:

  • Identify hot and cold content
  • Optimize cache efficiency
  • Detect abuse or scraping
  • Calculate ROI per asset (ads, subscriptions, licensing)
  • Plan storage and tiering strategies

Aggregated CDN Metrics

Most CDN providers expose analytics such as:

  • Total requests
  • Bandwidth usage
  • Cache hit ratio
  • Geographic distribution

However, these are aggregated metrics, not per-object statistics.

Key Limitation

Standard CDN dashboards do not track individual file downloads at scale due to performance and storage constraints.

For example:

  • A platform serving 100,000 images
  • With 10 million requests/day

Tracking per-file counters in real time would require:

  • High-cardinality storage
  • Distributed aggregation
  • Significant compute overhead

This is why most CDNs provide:

  • Sampling
  • Aggregation
  • Delayed reporting

The Only Reliable Source

To achieve accurate per-file statistics, you must rely on raw access logs generated at the edge.

These logs typically include:

  • Timestamp
  • Client IP
  • Request URI (file path)
  • HTTP status code
  • Bytes transferred
  • Cache status (HIT/MISS)
  • User-Agent

Example (Simplified Log Entry)

2026-03-30T12:00:01Z GET /images/product123.jpg 200 53214 HIT

From this, you can derive:

  • Exact download count per file
  • Bandwidth per asset
  • Cache efficiency per object
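As a minimal sketch, assuming the simplified six-field format shown above (timestamp, method, URI, status, bytes, cache status), a few lines of Python can turn raw log lines into per-file counters:

```python
from collections import Counter, defaultdict

def parse_line(line):
    """Parse one simplified edge-log entry:
    <timestamp> <method> <uri> <status> <bytes> <cache_status>"""
    ts, method, uri, status, size, cache = line.split()
    return {"uri": uri, "status": int(status), "bytes": int(size), "cache": cache}

downloads = Counter()        # per-file download counts
bandwidth = defaultdict(int) # per-file bytes delivered

log_lines = [
    "2026-03-30T12:00:01Z GET /images/product123.jpg 200 53214 HIT",
    "2026-03-30T12:00:02Z GET /images/product123.jpg 200 53214 MISS",
    "2026-03-30T12:00:03Z GET /images/banner.png 404 0 MISS",
]

for line in log_lines:
    entry = parse_line(line)
    if entry["status"] == 200:   # count only successful deliveries
        downloads[entry["uri"]] += 1
        bandwidth[entry["uri"]] += entry["bytes"]

print(downloads["/images/product123.jpg"])  # 2
print(bandwidth["/images/product123.jpg"])  # 106428
```

Real CDN log formats vary by provider (JSON lines, W3C extended, tab-separated), so the `parse_line` split logic would need adapting, but the counting approach is the same.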

Log-Based Analytics Architecture

To process per-file statistics at scale, you need a distributed pipeline.

| Layer | Technology Options | Purpose |
| --- | --- | --- |
| Log collection | CDN logs, NGINX logs | Capture raw request data |
| Transport | Kafka / Fluentd | Stream logs in real time |
| Storage | S3-compatible / object storage | Store raw logs |
| Processing | ClickHouse / Apache Spark | Aggregate per-file metrics |
| Visualization | Grafana / custom dashboards | Query and display insights |

Why High-Cardinality Data Is Challenging

Per-file analytics introduces high-cardinality datasets, where every unique file path becomes a distinct value of the grouping dimension.

According to Cloudflare, high-cardinality observability data is one of the main challenges in modern distributed systems due to:

  • Explosive growth in unique keys (URLs)
  • Increased memory consumption
  • Query performance degradation

Real-World Example

| Metric | Value |
| --- | --- |
| Total files | 100,000 |
| Daily requests | 10,000,000 |
| Unique file keys | 100,000 |
| Log entries/day | 10M+ |

This scale requires:

  • Columnar databases (e.g., ClickHouse)
  • Efficient indexing
  • Partitioning strategies
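One common partitioning approach is to split data by day and then by a hash bucket of the file path, so no single partition has to hold all 100K+ unique keys. A hedged sketch (the bucket count of 16 is an arbitrary illustration, not a recommendation):

```python
import hashlib
from datetime import datetime

def partition_key(timestamp: str, uri: str, buckets: int = 16) -> str:
    """Derive a storage partition for a log entry: one partition per
    day, sub-split into hash buckets of the file path."""
    day = datetime.fromisoformat(timestamp.rstrip("Z")).date().isoformat()
    # MD5 used only for cheap, stable bucketing -- not for security
    bucket = int(hashlib.md5(uri.encode()).hexdigest(), 16) % buckets
    return f"{day}/bucket={bucket:02d}"

print(partition_key("2026-03-30T12:00:01Z", "/images/product123.jpg"))
```

In a columnar store such as ClickHouse, the same idea maps onto a `PARTITION BY` expression on the date plus an `ORDER BY` key on the file path, which keeps per-file queries scanning only a small slice of the data.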

Processing Logs for Per-File Statistics

Once logs are collected, you can compute accurate metrics.

Example Aggregation Query (Conceptual)

  • Group by: request_uri
  • Count: total requests
  • Sum: total bytes
  • Filter: status = 200
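The conceptual query above can be sketched in plain Python (assuming log entries already parsed into dicts with `uri`, `status`, `bytes`, and `cache` fields):

```python
from collections import defaultdict

def aggregate(entries):
    """Aggregate parsed log entries into per-file stats:
    downloads, total bytes, and cache hit ratio."""
    stats = defaultdict(lambda: {"downloads": 0, "bytes": 0, "hits": 0})
    for e in entries:
        if e["status"] != 200:              # filter: status = 200
            continue
        s = stats[e["uri"]]                 # group by: request_uri
        s["downloads"] += 1                 # count: total requests
        s["bytes"] += e["bytes"]            # sum: total bytes
        s["hits"] += e["cache"] == "HIT"
    for s in stats.values():
        s["hit_ratio"] = s["hits"] / s["downloads"]
    return dict(stats)

entries = [
    {"uri": "/img/a.jpg", "status": 200, "bytes": 53214, "cache": "HIT"},
    {"uri": "/img/a.jpg", "status": 200, "bytes": 53214, "cache": "MISS"},
    {"uri": "/img/a.jpg", "status": 404, "bytes": 0, "cache": "MISS"},
]

result = aggregate(entries)
print(result["/img/a.jpg"]["downloads"])  # 2
print(result["/img/a.jpg"]["hit_ratio"])  # 0.5
```

At production volumes the same GROUP BY logic would run inside ClickHouse or Spark rather than in-process Python, but the semantics are identical.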

Output Example

| File | Downloads | Bandwidth | Cache Hit Ratio |
| --- | --- | --- | --- |
| /img/a.jpg | 120,000 | 6.4 GB | 98% |
| /img/b.jpg | 12,000 | 0.8 GB | 85% |
| /img/c.jpg | 500 | 20 MB | 40% |

This enables:

  • Identifying top-performing assets
  • Detecting inefficient caching
  • Removing unused files

Real-Time vs Batch Processing

Batch Processing (Most Common)

  • Logs processed every 5–15 minutes
  • Lower infrastructure cost
  • Suitable for most use cases

Real-Time Processing

  • Stream processing (Kafka + Flink)
  • Sub-second visibility
  • Higher cost and complexity

In practice, batch processing is sufficient for the vast majority of workloads.

CDN Provider Limitations

Even enterprise CDNs:

  • Do not expose full raw logs by default
  • Charge extra for log delivery
  • Limit retention (e.g., 3–7 days)

This makes external log storage mandatory for long-term analytics.

Accuracy Considerations

To ensure accurate per-file statistics:

1. Filter Only Successful Requests

Exclude:

  • 4xx errors
  • 5xx errors

2. Handle Cache Hits and Misses

Both should be counted: a cache HIT is still a delivery to the end user, so counting only origin MISSes would dramatically undercount downloads.

3. Normalize URLs

Remove:

  • Query parameters (if irrelevant)
  • Tracking tokens

4. Deduplicate Edge Cases

Client retries and partial (HTTP 206 range) downloads may inflate counts; consider deduplicating requests for the same file from the same client within a short time window.
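URL normalization (point 3) is where most counting errors originate in practice. A hedged sketch using only the standard library; the tracking-parameter list here is hypothetical and would need tailoring to your traffic:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical tracking parameters to strip; adjust to your own traffic.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "token", "sig"}

def normalize_uri(uri: str, keep_query: bool = False) -> str:
    """Normalize a request URI so all variants of the same file
    collapse into one counter key."""
    parts = urlsplit(uri)
    if not keep_query:
        return parts.path                 # drop the query string entirely
    # Keep only functionally meaningful parameters (e.g. image width)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_PARAMS]
    return parts.path + ("?" + urlencode(query) if query else "")

print(normalize_uri("/img/a.jpg?utm_source=mail&token=abc"))         # /img/a.jpg
print(normalize_uri("/img/a.jpg?w=300&token=abc", keep_query=True))  # /img/a.jpg?w=300
```

Whether to keep the query string depends on whether it changes the delivered content (image resizing parameters do; tracking tokens do not).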

Performance Impact

Proper logging has minimal impact when:

  • Logs are streamed asynchronously
  • Edge nodes are not blocked
  • Compression is used

Modern CDNs are designed to handle logging pipelines without degrading delivery performance.

Key Takeaways

  • Per-file download statistics are not available via standard CDN dashboards
  • Raw logs are the only reliable source of truth
  • High-cardinality data requires specialized storage (e.g., ClickHouse)
  • Batch processing is sufficient for most infrastructures
  • Accurate analytics requires normalization and filtering

If your project serves tens of thousands of assets or processes millions of requests daily, standard CDN analytics are not enough.

At Advanced Hosting, we design and deploy:

  • Log-based CDN analytics pipelines
  • High-performance storage (ClickHouse, object storage)
  • Custom dashboards for per-file visibility
  • Scalable infrastructure for high-load environments

Frequently Asked Questions

Do CDNs provide per-file download statistics out of the box?

No. Most CDN providers only offer aggregated analytics such as total requests, bandwidth usage, and cache hit ratios. Per-file visibility requires access to raw logs and external processing.

What is the most accurate way to track downloads for each file?

The most accurate method is analyzing raw edge logs. These logs record every request, including the exact file path, allowing precise counting of downloads per asset.

Can I track downloads for 100,000+ files without performance issues?

Yes, but only with the right architecture. High-cardinality datasets require:

  • Columnar databases (e.g., ClickHouse)
  • Efficient partitioning
  • Scalable ingestion pipelines (Kafka, Fluentd)

Without this, performance and query speed will degrade.

How often should CDN logs be processed?

For most use cases:

  • Batch processing (5–15 minutes) is sufficient
  • Real-time processing is only needed for fraud detection or live monitoring

Do cache hits and misses affect download statistics?

Yes. To get accurate numbers:

  • Count both cache hits and misses
  • Filter only successful responses (HTTP 200)

This ensures you measure actual content delivery, not just origin fetches.

Can CDN logs be stored long-term?

Not by default. Most CDN providers retain logs for only 3 to 7 days. For long-term analytics, logs must be exported to external storage such as S3-compatible systems.

How large can CDN log data grow?

Very quickly. For example:

  • 10 million requests/day ≈ several GBs of logs daily
  • At scale, this becomes terabytes per month

Efficient storage and compression are essential.
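A quick back-of-the-envelope estimate makes the growth concrete. The average line size of ~200 bytes is an assumption (uncompressed access-log lines typically run 150-300 bytes):

```python
def daily_log_bytes(requests_per_day: int, avg_line_bytes: int = 200) -> int:
    """Rough log-volume estimate; avg_line_bytes is an assumed
    average uncompressed line size."""
    return requests_per_day * avg_line_bytes

for rpd in (10_000_000, 500_000_000):
    gb_day = daily_log_bytes(rpd) / 1e9
    print(f"{rpd:>11,} req/day -> {gb_day:,.0f} GB/day, {gb_day * 30 / 1000:.2f} TB/month")
```

At 10M requests/day this works out to about 2 GB of raw logs daily; platforms in the hundreds of millions of requests per day cross into terabytes per month, which is why compression (gzip/zstd typically shrinks access logs 5-10x) matters so much.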

What tools are best for analyzing CDN logs?

Common stack:

  • Ingestion: Kafka, Fluentd
  • Storage: Object storage (S3-compatible)
  • Analytics: ClickHouse, Apache Spark
  • Visualization: Grafana

Is it possible to track unique downloads (not just total hits)?

Yes, but it requires additional processing:

  • IP + User-Agent fingerprinting
  • Cookie-based tracking
  • Session reconstruction

Note that this introduces complexity and potential privacy considerations.
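A minimal sketch of the fingerprinting approach, assuming parsed log fields for client IP and User-Agent. Hashing avoids storing raw IPs, but IP + User-Agent is a coarse signal (shared NATs, identical browsers), so treat the resulting counts as approximate:

```python
import hashlib
from collections import defaultdict

def fingerprint(ip: str, user_agent: str) -> str:
    """Coarse client fingerprint from IP + User-Agent.
    Hashed so raw IPs are not retained; still a privacy-relevant signal."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()[:16]

unique_clients = defaultdict(set)  # uri -> set of client fingerprints

requests = [
    ("/img/a.jpg", "203.0.113.7", "Mozilla/5.0"),
    ("/img/a.jpg", "203.0.113.7", "Mozilla/5.0"),   # repeat download, same client
    ("/img/a.jpg", "198.51.100.2", "Mozilla/5.0"),  # different client
]

for uri, ip, ua in requests:
    unique_clients[uri].add(fingerprint(ip, ua))

print(len(unique_clients["/img/a.jpg"]))  # 2 unique clients vs 3 total hits
```

At scale, exact sets become expensive for 100K+ files; probabilistic structures such as HyperLogLog (built into ClickHouse as `uniqHLL12`-style aggregates) are the usual trade-off.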

How do I avoid inflated download counts?

To improve accuracy:

  • Exclude retries and partial requests
  • Normalize URLs (remove query strings if irrelevant)
  • Filter out bots if necessary

Does enabling logging affect CDN performance?

No, if implemented correctly. Modern CDNs use asynchronous log streaming, which does not impact content delivery latency.

When should I invest in advanced CDN analytics?

You should consider it when:

  • You serve 50K–100K+ assets
  • You process millions of requests daily
  • You need cost attribution or monetization per asset
  • You require detailed usage insights for optimization

In infrastructure, two terms appear everywhere yet remain widely misunderstood: Dedicated Server and Bare Metal Server. To some, they mean the same thing. To others, even long-standing Fortune 500 companies like IBM, they mean something different. Providers put out definitions of their own, and they’re not always aligned with how the technology actually works. The […]