How to Get Accurate Per-File Download Statistics from Your CDN
Most CDN dashboards show you total traffic, total requests, and average cache hit ratio. But what if your business depends on understanding exactly how many times each file is downloaded?
If you serve 100,000+ assets, aggregated metrics are not enough. You need precise, per-file visibility to optimize performance, control costs, and make data-driven decisions.
Understanding exactly how often each asset is delivered through your CDN is critical for capacity planning, cost control, monetization, and security. However, most CDN dashboards provide only aggregated metrics, which are insufficient for large-scale infrastructures serving tens or hundreds of thousands of files.
This article explains how to obtain accurate per-file download statistics, the limitations of standard CDN analytics, and how to design a scalable logging and analytics pipeline.
Why Per-File CDN Analytics Matters
Modern content delivery environments operate at massive scale. According to Cisco, video accounted for over 82% of all internet traffic by 2022, and large media platforms routinely serve millions of requests per hour.
For infrastructures serving 100K+ assets, aggregated metrics such as total bandwidth or request count are not enough. You need per-object visibility to:
- Identify hot and cold content
- Optimize cache efficiency
- Detect abuse or scraping
- Calculate ROI per asset (ads, subscriptions, licensing)
- Plan storage and tiering strategies
Aggregated CDN Metrics
Most CDN providers expose analytics such as:
- Total requests
- Bandwidth usage
- Cache hit ratio
- Geographic distribution
However, these are aggregated metrics, not per-object statistics.
Key Limitation
Standard CDN dashboards do not track individual file downloads at scale due to performance and storage constraints.
For example:
- A platform serving 100,000 images
- With 10 million requests/day
Tracking per-file counters in real time would require:
- High-cardinality storage
- Distributed aggregation
- Significant compute overhead
This is why most CDNs provide:
- Sampling
- Aggregation
- Delayed reporting

The Only Reliable Source
To achieve accurate per-file statistics, you must rely on raw access logs generated at the edge.
These logs typically include:
- Timestamp
- Client IP
- Request URI (file path)
- HTTP status code
- Bytes transferred
- Cache status (HIT/MISS)
- User-Agent
Example (Simplified Log Entry)
2026-03-30T12:00:01Z GET /images/product123.jpg 200 53214 HIT
From this, you can derive:
- Exact download count per file
- Bandwidth per asset
- Cache efficiency per object
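As a minimal sketch, assuming the simplified space-delimited format shown above (real CDN log formats vary by provider and usually carry more fields), each line can be parsed into a structured record:

```python
def parse_line(line):
    """Parse one simplified edge-log line of the form:
    timestamp method path status bytes cache_status"""
    ts, method, path, status, nbytes, cache = line.split()
    return {
        "timestamp": ts,
        "method": method,
        "path": path,
        "status": int(status),
        "bytes": int(nbytes),
        "cache": cache,
    }

entry = parse_line("2026-03-30T12:00:01Z GET /images/product123.jpg 200 53214 HIT")
# entry["path"] -> "/images/product123.jpg", entry["bytes"] -> 53214
```

From records like this, per-file counters follow directly by grouping on `path`.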
Log-Based Analytics Architecture
To process per-file statistics at scale, you need a distributed pipeline.
| Layer | Technology Options | Purpose |
| --- | --- | --- |
| Log Collection | CDN edge logs, NGINX access logs | Capture raw request data |
| Transport | Kafka / Fluentd | Stream logs in real time |
| Storage | S3-compatible / object storage | Store raw logs |
| Processing | ClickHouse / Apache Spark | Aggregate per-file metrics |
| Visualization | Grafana / custom dashboards | Query and display insights |
Why High-Cardinality Data Is Challenging
Per-file analytics introduces high-cardinality datasets, where each unique file path is a dimension.
According to Cloudflare, high-cardinality observability data is one of the main challenges in modern distributed systems due to:
- Explosive growth in unique keys (URLs)
- Increased memory consumption
- Query performance degradation
Real-World Example
| Metric | Value |
| --- | --- |
| Total files | 100,000 |
| Daily requests | 10,000,000 |
| Unique file keys | 100,000 |
| Log entries/day | 10M+ |
This scale requires:
- Columnar databases (e.g., ClickHouse)
- Efficient indexing
- Partitioning strategies
Processing Logs for Per-File Statistics
Once logs are collected, you can compute accurate metrics.
Example Aggregation Query (Conceptual)
- Group by: request_uri
- Count: total requests
- Sum: total bytes
- Filter: status = 200
Output Example
| File | Downloads | Bandwidth | Cache Hit Ratio |
| --- | --- | --- | --- |
| /img/a.jpg | 120,000 | 6.4 GB | 98% |
| /img/b.jpg | 12,000 | 0.8 GB | 85% |
| /img/c.jpg | 500 | 20 MB | 40% |
This enables:
- Identifying top-performing assets
- Detecting inefficient caching
- Removing unused files
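The conceptual aggregation above can be sketched in Python. In production this would typically be a SQL query against a columnar store such as ClickHouse; the tuple layout of the parsed records here is an assumption for illustration:

```python
from collections import defaultdict

# Parsed log records: (request_uri, status, bytes, cache_status)
records = [
    ("/img/a.jpg", 200, 53_000, "HIT"),
    ("/img/a.jpg", 200, 53_000, "MISS"),
    ("/img/b.jpg", 200, 70_000, "HIT"),
    ("/img/b.jpg", 404, 0, "MISS"),
]

stats = defaultdict(lambda: {"downloads": 0, "bytes": 0, "hits": 0})

for uri, status, nbytes, cache in records:
    if status != 200:                 # Filter: status = 200
        continue
    s = stats[uri]                    # Group by: request_uri
    s["downloads"] += 1               # Count: total requests
    s["bytes"] += nbytes              # Sum: total bytes
    s["hits"] += (cache == "HIT")     # for per-file cache hit ratio

hit_ratio = {uri: s["hits"] / s["downloads"] for uri, s in stats.items()}
```

The same group-by/filter/sum shape maps one-to-one onto a warehouse query, which is why columnar engines handle it well at 10M+ rows per day.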
Real-Time vs Batch Processing
Batch Processing (Most Common)
- Logs processed every 5–15 minutes
- Lower infrastructure cost
- Suitable for most use cases
Real-Time Processing
- Stream processing (Kafka + Flink)
- Sub-second visibility
- Higher cost and complexity
In practice, batch processing is sufficient for the vast majority of workloads.
CDN Provider Limitations
Even enterprise CDNs:
- Do not expose full raw logs by default
- Charge extra for log delivery
- Limit retention (e.g., 3–7 days)
This makes external log storage mandatory for long-term analytics.
Accuracy Considerations
To ensure accurate per-file statistics:
1. Filter Only Successful Requests
Exclude:
- 4xx errors
- 5xx errors
Also decide deliberately how to treat 206 (Partial Content, e.g., range requests) and 304 (Not Modified) responses, since both can skew download counts if naively included or excluded.
2. Handle Cache Hits and Misses
Both should be counted to measure total downloads.
3. Normalize URLs
Remove:
- Query parameters (if irrelevant)
- Tracking tokens
4. Deduplicate Edge Cases
Some retries or partial downloads may inflate counts.
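A minimal normalization sketch, assuming query strings are irrelevant for counting (which is not true for signed URLs or APIs that encode the object key in parameters, so verify before stripping):

```python
from urllib.parse import urlsplit

def normalize_uri(raw_uri):
    """Strip query parameters and fragments so that variants of the
    same object (tracking tokens, cache busters) aggregate under
    one counter."""
    return urlsplit(raw_uri).path

normalize_uri("/img/a.jpg?token=abc123&utm_source=mail")
# -> "/img/a.jpg"
```

Applied before aggregation, this collapses `/img/a.jpg?v=1` and `/img/a.jpg?v=2` into a single file key.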
Performance Impact
Proper logging has minimal impact when:
- Logs are streamed asynchronously
- Edge nodes are not blocked
- Compression is used
Modern CDNs are designed to handle logging pipelines without degrading delivery performance.
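To illustrate why asynchronous streaming keeps the delivery path unblocked, here is a toy sketch (buffer size and function names are illustrative, not any specific CDN's implementation): requests push log records into a bounded in-memory queue, and a background thread drains and ships them, so request handling never waits on log I/O.

```python
import queue
import threading

log_queue = queue.Queue(maxsize=10_000)
shipped = []  # stand-in for the remote log sink

def shipper():
    # Background thread: drains the queue and "ships" records
    # (in reality: batch, compress, send to Kafka/Fluentd).
    while True:
        record = log_queue.get()
        if record is None:        # shutdown sentinel
            break
        shipped.append(record)

worker = threading.Thread(target=shipper, daemon=True)
worker.start()

def log_request(line):
    # Called on the hot path: never blocks on I/O.
    try:
        log_queue.put_nowait(line)
    except queue.Full:
        pass  # drop (or sample) rather than stall content delivery

log_request("GET /img/a.jpg 200 53214 HIT")
log_queue.put(None)
worker.join()
```

Dropping records on overflow is a deliberate trade-off: slightly undercounted analytics are usually preferable to added delivery latency.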

Key Takeaways
- Per-file download statistics are not available via standard CDN dashboards
- Raw logs are the only reliable source of truth
- High-cardinality data requires specialized storage (e.g., ClickHouse)
- Batch processing is sufficient for most infrastructures
- Accurate analytics requires normalization and filtering
If your project serves tens of thousands of assets or processes millions of requests daily, standard CDN analytics are not enough.
At Advanced Hosting, we design and deploy:
- Log-based CDN analytics pipelines
- High-performance storage (ClickHouse, object storage)
- Custom dashboards for per-file visibility
- Scalable infrastructure for high-load environments
Frequently Asked Questions
Do CDNs provide per-file download statistics out of the box?
No. Most CDN providers only offer aggregated analytics such as total requests, bandwidth usage, and cache hit ratios. Per-file visibility requires access to raw logs and external processing.
What is the most accurate way to track downloads for each file?
The most accurate method is analyzing raw edge logs. These logs record every request, including the exact file path, allowing precise counting of downloads per asset.
Can I track downloads for 100,000+ files without performance issues?
Yes, but only with the right architecture. High-cardinality datasets require:
- Columnar databases (e.g., ClickHouse)
- Efficient partitioning
- Scalable ingestion pipelines (Kafka, Fluentd)
Without this, performance and query speed will degrade.
How often should CDN logs be processed?
For most use cases:
- Batch processing (5–15 minutes) is sufficient
- Real-time processing is only needed for fraud detection or live monitoring
Do cache hits and misses affect download statistics?
Yes. To get accurate numbers:
- Count both cache hits and misses
- Filter only successful responses (HTTP 200)
This ensures you measure actual content delivery, not just origin fetches.
Can CDN logs be stored long-term?
Not by default. Most CDN providers retain logs for only 3 to 7 days. For long-term analytics, logs must be exported to external storage such as S3-compatible systems.
How large can CDN log data grow?
Very quickly. For example:
- 10 million requests/day ≈ several GBs of logs daily
- At scale, this becomes terabytes per month
Efficient storage and compression are essential.
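A back-of-envelope calculation (the ~200-byte average line size is an assumption; real sizes depend on the log format and number of fields):

```python
requests_per_day = 10_000_000
avg_line_bytes = 200  # assumed average size of one uncompressed log line

gb_per_day = requests_per_day * avg_line_bytes / 10**9   # 2.0 GB/day
tb_per_month = gb_per_day * 30 / 1000                    # 0.06 TB/month

# At hundreds of millions of requests/day, the same arithmetic
# reaches terabytes per month, which is why compression and
# columnar storage matter.
```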
What tools are best for analyzing CDN logs?
Common stack:
- Ingestion: Kafka, Fluentd
- Storage: Object storage (S3-compatible)
- Analytics: ClickHouse, Apache Spark
- Visualization: Grafana
Is it possible to track unique downloads (not just total hits)?
Yes, but it requires additional processing:
- IP + User-Agent fingerprinting
- Cookie-based tracking
- Session reconstruction
Note that this introduces complexity and potential privacy considerations.
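One common heuristic can be sketched as follows: hash IP + User-Agent + day into a visitor fingerprint and count distinct fingerprints per file. This is an illustrative approximation, not a privacy-compliant identity system (hashing alone does not fully anonymize IP addresses):

```python
import hashlib
from collections import defaultdict

unique_downloads = defaultdict(set)  # file path -> set of fingerprints

def fingerprint(ip, user_agent, day):
    # Coarse visitor identity: same IP + UA on the same day counts once.
    raw = f"{ip}|{user_agent}|{day}".encode()
    return hashlib.sha256(raw).hexdigest()

def record(path, ip, user_agent, day):
    unique_downloads[path].add(fingerprint(ip, user_agent, day))

record("/img/a.jpg", "203.0.113.7", "Mozilla/5.0", "2026-03-30")
record("/img/a.jpg", "203.0.113.7", "Mozilla/5.0", "2026-03-30")  # repeat hit
record("/img/a.jpg", "198.51.100.2", "Mozilla/5.0", "2026-03-30")
# len(unique_downloads["/img/a.jpg"]) -> 2
```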
How do I avoid inflated download counts?
To improve accuracy:
- Exclude retries and partial requests
- Normalize URLs (remove query strings if irrelevant)
- Filter out bots if necessary
Does enabling logging affect CDN performance?
No, if implemented correctly. Modern CDNs use asynchronous log streaming, which does not impact content delivery latency.
When should I invest in advanced CDN analytics?
You should consider it when:
- You serve 50K–100K+ assets
- You process millions of requests daily
- You need cost attribution or monetization per asset
- You require detailed usage insights for optimization