How to Get File Size in Python with os and pathlib

Updated on September 8, 2025

Introduction

At some point in your Python journey, you’re going to need to figure out how big a file is. Whether you’re building a file uploader, managing assets, or just doing a quick check on disk space, it’s a common task that’ll come your way. The good news is that Python makes this incredibly simple with its built-in file handling capabilities. This article will walk you through the best and most common ways to get a file’s size.

We’ll start with the classic os.path.getsize() function, the go-to for a quick and direct answer. Then, we’ll explore the more modern and elegant pathlib approach, which is a fantastic tool to have in your belt. We’ll also cover how to handle errors gracefully when a file is missing, and finally, how to convert raw byte counts into a clean, human-readable format like “KB” or “MB”.

Tested with: Python 3.12 on Ubuntu 22.04 LTS. The code samples are OS-agnostic and also work on macOS and Windows unless noted.

Prerequisites

  • Python 3.8+ (examples validated on Python 3.12) - Download Python
  • Basic terminal access to run Python scripts - Command line basics
  • A sample data/ directory with at least one file for testing

We recommend running the examples inside a Python virtual environment to avoid dependency conflicts. If you’re new to venv, see How To Use Python Virtual Environments with venv (Ubuntu 22.04).

Key Takeaways

  • Use os.path.getsize('path/to/file') for the most direct, standard way to get a file’s size in bytes.
  • For modern, readable, and cross-platform code, use the object-oriented pathlib module.
  • Use os.stat() when you need file metadata beyond the size, such as the last modification time.
  • Wrap your code in a try...except block to gracefully handle potential FileNotFoundError and PermissionError.
  • Always convert the raw byte count into a human-readable format like KB or MB for a better user experience.

Python os.path.getsize(): The Standard Way to Get File Size

For a quick and direct answer, the os.path.getsize() function is the best choice. It’s part of Python’s standard os module and is the most common way to get a file’s size. It does one thing and does it well: it takes a path to a file and returns its size.

It’s important to remember that the function returns the size as an integer representing the number of bytes.

import os

file_path = 'data/my_document.txt'

file_size = os.path.getsize(file_path)
print(f"The file size is: {file_size} bytes")

Output:

The file size is: 437 bytes

Get File Size with pathlib.Path (Modern, Pythonic Approach)

Introduced in Python 3.4, the pathlib module offers a modern, object-oriented way to handle filesystem paths. If you’re writing new code, this is often the recommended approach because it makes your code more readable and expressive. Instead of working with plain strings, you create a Path object that has its own methods, including one for getting file stats.

To get the size, you call the .stat() method on your Path object, which returns a result object (similar to os.stat()), and then you access its .st_size attribute.

from pathlib import Path

file_path = Path('data/my_document.txt')

file_size = file_path.stat().st_size
print(f"The file size is: {file_size} bytes")

Output:

The file size is: 437 bytes

How to Get File Metadata with os.stat()

When you need more than just the size of a file, the os.stat() function is the tool for the job. While os.path.getsize() is a convenient shortcut, os.stat() is the underlying function that retrieves a full “status” report on the file. This report is an object containing a wealth of metadata.

The file size is available via the st_size attribute of the result object. This method is perfect when you also need things like the file’s last modification time (st_mtime) or st_ctime (creation time on Windows, metadata change time on Unix).

import os
import datetime

file_path = 'data/my_document.txt'

stat_info = os.stat(file_path)

file_size = stat_info.st_size

mod_time_timestamp = stat_info.st_mtime
mod_time = datetime.datetime.fromtimestamp(mod_time_timestamp)

print(f"File Size: {file_size} bytes")
print(f"Last Modified: {mod_time.strftime('%Y-%m-%d %H:%M:%S')}")

Output:

File Size: 437 bytes
Last Modified: 2025-07-16 17:42:05

Make File Sizes Human-Readable (KB, MB, GB)

From the previous examples, you may have noticed that the file sizes are always returned in bytes. While getting the file size in bytes is technically accurate, a number like 1474560 doesn’t mean much to most people at a glance. Is that big? Is it small? For a better user experience, it’s essential to convert this raw byte count into a more familiar format, like kilobytes (KB), megabytes (MB), or gigabytes (GB).

This is easily done with a small helper function. The logic is simple: work out how many times the byte count can be divided by 1024 (the number of bytes in a kilobyte) and pick the matching unit; the helper below does this in a single step using a logarithm.

Here is a function that handles this conversion gracefully and can be integrated directly into your code.

Human-Readable Size Conversion Function (bytes → KB/MB/GB)

This function takes the size in bytes and an optional number of decimal places for formatting.

import math

def format_size(size_bytes, decimals=2):
    if size_bytes == 0:
        return "0 Bytes"

    # Define the units and the factor for conversion (1024)
    power = 1024
    units = ["Bytes", "KB", "MB", "GB", "TB", "PB"]

    # Calculate the appropriate unit
    i = int(math.floor(math.log(size_bytes, power)))

    # Format the result
    return f"{size_bytes / (power ** i):.{decimals}f} {units[i]}"

Let’s use it in an example:

import os

file_path = 'data/large_file.zip'

raw_size = os.path.getsize(file_path) 

readable_size = format_size(raw_size)

print(f"Raw size: {raw_size} bytes")
print(f"Human-readable size: {readable_size}") 

Output:

Raw size: 1474560 bytes
Human-readable size: 1.41 MB

By integrating a simple function like this, you can make your program’s output significantly more intuitive and professional.

Error Handling for File Size Operations (Robust and Safe)

In a perfect world, every file path would be correct and every file accessible. But in reality, things go wrong. Your script might try to access a file that has been moved, or it might not have the permissions to read it. Without proper error handling, these situations will crash your program. A robust script anticipates these issues and handles them gracefully.

Let’s see how to handle the most common errors you’ll encounter when getting a file’s size.

Handle FileNotFoundError (Missing Files)

This is the most common error you’ll face. It occurs when you try to get the size of a file that doesn’t exist at the specified path. Wrapping your code in a try...except FileNotFoundError block is the standard way to manage this. Learn more about Python exception handling for robust error management.

import os

file_path = 'path/to/non_existent_file.txt'

try:
    file_size = os.path.getsize(file_path)
    print(f"File size: {file_size} bytes")

except FileNotFoundError:
    print(f"Error: The file at '{file_path}' was not found.")

How to Handle PermissionError (Access Denied)

Sometimes the file exists, but your script doesn’t have the necessary operating system permissions to read it or its metadata. This will raise a PermissionError. You can catch this error specifically to give a more informative message to the user.

import os

file_path = '/root/secure_file.dat' 

try:
    file_size = os.path.getsize(file_path)
    print(f"File size: {file_size} bytes")
except FileNotFoundError:
    print(f"Error: The file at '{file_path}' was not found.")
except PermissionError:
    print(f"Error: Insufficient permissions to access '{file_path}'.")

Handle Broken Symbolic Links (OSError)

Symbolic links (or symlinks) are pointers to other files. What happens if a symlink points to a file that has been deleted? The link itself exists, but it’s “broken.” Calling os.path.getsize() on a broken symlink will raise an OSError.

A good practice is to catch this error explicitly, or to check whether the path is a link and resolve its target before asking for the size. Either approach keeps broken symbolic links from crashing your script with unexpected errors.

import os

symlink_path = 'data/broken_link.txt' 

try:
    file_size = os.path.getsize(symlink_path)
    print(f"File size: {file_size} bytes")

except FileNotFoundError:
    print(f"Error: The file pointed to by '{symlink_path}' was not found.")

except OSError as e:
    print(f"OS Error: Could not get size for '{symlink_path}'. It may be a broken link. Details: {e}")

Note: A broken symlink might raise FileNotFoundError on some operating systems.
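If you prefer to check the link up front rather than catch the exception, here is a minimal sketch using os.path.islink() and os.path.realpath(); it reuses the illustrative data/broken_link.txt path from above.

import os

symlink_path = 'data/broken_link.txt'

if os.path.islink(symlink_path):
    # Resolve the link to its actual target before asking for the size
    target = os.path.realpath(symlink_path)
    if os.path.exists(target):
        print(f"Target size: {os.path.getsize(target)} bytes")
    else:
        print(f"'{symlink_path}' is a broken symbolic link.")
elif os.path.exists(symlink_path):
    print(f"File size: {os.path.getsize(symlink_path)} bytes")
else:
    print(f"'{symlink_path}' does not exist.")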

By catching these specific exceptions, you make your code more resilient and user-friendly, providing clear feedback when something goes wrong instead of just crashing.

Method Comparison (Quick Reference)

Single-file size methods

Method | Returns | Best for | Notes
os.path.getsize(path) | Integer bytes | Fast, minimal call when you only need size | Thin wrapper over stat(); no extra metadata.
os.stat(path).st_size | Integer bytes (via struct) | Getting size along with other metadata (mtime, mode, etc.) | One system call; exposes full stat_result.
Path(path).stat().st_size | Integer bytes (via struct) | Modern, readable code using pathlib | Negligible overhead; integrates well with Path APIs.

Directory totals (recursive)

Method | Pattern | Best for | Notes
os.scandir() | Imperative loop with queue/stack | Maximum throughput on large trees | Fewer syscalls via DirEntry; typically faster.
Path(root).rglob('*') | Iterator over Path objects | Readable, concise traversal | Slight overhead for object creation; very close in practice.

Performance Benchmarks: os.path.getsize() vs os.stat() vs pathlib

When you only need the size of a single file, all three approaches end up calling the same underlying system stat, so they’re functionally equivalent. The micro‑difference is in the Python overhead: os.path.getsize() is a thin wrapper, os.stat() returns a full struct you then read from, and pathlib.Path.stat() adds a small object‑oriented layer. In practice the gap is tiny (microseconds), but it can matter in tight loops.

When you need the total size of a directory tree, the filesystem traversal dominates. Here the choice between os.scandir() (imperative style) and pathlib.Path.rglob() (iterator style) is more impactful than the choice between getsize() and stat().

Benchmark 1: Repeated single‑file size calls

The snippet below measures each API by calling it many times on the same file (to isolate Python‑level overhead). Run it in a directory that has a data/large_file.bin.

import os
from pathlib import Path
import time

TEST_FILE = Path('data/large_file.bin')
N = 200_000  # increase/decrease based on your machine

# Warm-up (prime filesystem caches)
for _ in range(5_000):
    os.path.getsize(TEST_FILE)

start = time.perf_counter()
for _ in range(N):
    os.path.getsize(TEST_FILE)
getsize_s = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    os.stat(TEST_FILE).st_size
stat_s = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    TEST_FILE.stat().st_size
pathlib_s = time.perf_counter() - start

print(f"getsize()  : {getsize_s:.3f}s for {N:,} calls")
print(f"os.stat()  : {stat_s:.3f}s for {N:,} calls")
print(f"Path.stat(): {pathlib_s:.3f}s for {N:,} calls")

Interpretation: Expect getsize() and os.stat() to be neck‑and‑neck, with Path.stat() close behind. If you’re writing new code, prefer pathlib for readability unless you’re inside a hot loop where the last few microseconds truly matter.

Tip: You can also use the built‑in timeit module for more formal micro‑benchmarks:

import timeit, os
from pathlib import Path
p = Path('data/large_file.bin')
print('getsize :', timeit.timeit(lambda: os.path.getsize(p), number=200_000))
print('os.stat :', timeit.timeit(lambda: os.stat(p).st_size, number=200_000))
print('Path.stat:', timeit.timeit(lambda: p.stat().st_size, number=200_000))

Benchmark 2: Total size of a directory tree

Below are two equivalent implementations that sum sizes for all regular files under a root directory. This is a more realistic scenario where traversal cost dominates.

Using os.scandir() (fast, imperative):

import os
from collections import deque


def du_scandir(root: str) -> int:
    total = 0
    dq = deque([root])
    while dq:
        path = dq.popleft()
        with os.scandir(path) as it:
            for entry in it:
                try:
                    if entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
                    elif entry.is_dir(follow_symlinks=False):
                        dq.append(entry.path)
                except (PermissionError, FileNotFoundError):
                    # Skip unreadable or concurrently-removed entries
                    continue
    return total

Using pathlib (readable, expressive):

from pathlib import Path

def du_pathlib(root: str) -> int:
    p = Path(root)
    total = 0
    for child in p.rglob('*'):
        try:
            if child.is_file():
                total += child.stat().st_size
        except (PermissionError, FileNotFoundError):
            continue
    return total

Timing the directory methods with timeit:

import timeit

print('scandir:', timeit.timeit(lambda: du_scandir('data'), number=10))
print('pathlib:', timeit.timeit(lambda: du_pathlib('data'), number=10))

Interpretation: On most systems, os.scandir() is often a bit faster because it exposes low‑level DirEntry attributes and reduces extra system calls. pathlib typically trades a small amount of speed for clarity. For very large trees or tight SLAs, use os.scandir(); for maintainable application code, pathlib is usually preferred.

When to prefer which

  • os.path.getsize(): Fast, minimal wrapper; fine for quick scripts and tight loops.
  • os.stat(): Use when you also need other metadata (mtime, mode) in the same call.
  • pathlib.Path.stat(): Prefer for new code for readability and cross‑platform path handling; overhead is negligible outside micro‑benchmarks.
  • Directory totals: Prefer os.scandir() for maximum throughput; use pathlib when code clarity and consistency matter more than micro‑optimizations.

Note on caches: Re‑running benchmarks on the same files/directories benefits from OS filesystem caches. If you want to compare cold‑cache behavior, vary the dataset or insert unrelated I/O between runs.

Cross-Platform Nuances (Linux, macOS, Windows)

While the examples in this guide are portable, there are important platform differences in os.stat() metadata and behavior. For comprehensive cross-platform development guidance:

  • st_ctime semantics

    • Windows (NTFS): creation time of the file.
    • Unix (Linux/macOS): inode change time (metadata change), not creation time.
    • Prefer explicit naming in docs/UI (e.g., “created/changed”) or gate logic by platform.
  • Permissions & modes

    • POSIX (Linux/macOS): st_mode encodes rwx bits and file type; stat.S_ISDIR, stat.S_ISREG, etc. are reliable. st_uid/st_gid present.
    • Windows: POSIX bits are best‑effort. Owner/group fields and execute bit are not meaningful in the same way; a read‑only attribute may appear as the absence of write bit. Learn more about file permissions across different operating systems.
  • Symlink handling

    • Windows: creating symlinks may require admin privileges or Developer Mode. Use follow_symlinks=False to avoid resolving targets; use os.lstat()/Path.lstat() to stat the link itself.
    • Unix: symlinks are common; broken links raise FileNotFoundError/OSError when followed. Understanding symbolic link behavior is crucial for cross-platform compatibility.
  • Timestamps & precision

    • Python exposes seconds as floats (st_mtime) and nanoseconds (st_mtime_ns). Filesystems differ: NTFS typically ~100 ns ticks; ext4/APFS often nanosecond resolution; FAT may be coarse.
  • Other practical quirks

    • Path limits: legacy Windows has MAX_PATH (~260 chars) unless long paths are enabled.
    • Case sensitivity: Windows is case‑insensitive by default; macOS is often case‑insensitive; Linux is case‑sensitive.
    • Sparse/compressed files (NTFS, APFS): logical size (st_size) can exceed on‑disk bytes. Use du/platform APIs if you need physical allocation; a small sketch comparing logical and allocated size follows this list. For more details, see Python’s path handling documentation.
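As a minimal sketch of the sparse/compressed-file point above: on POSIX systems (Linux/macOS), the stat result exposes st_blocks, reported in 512-byte units, so you can compare the logical size with the bytes actually allocated on disk. The file path is illustrative, and st_blocks is not available on Windows.

import os

info = os.stat('data/sparse_file.bin')   # illustrative path
logical_bytes = info.st_size             # logical (apparent) size
allocated_bytes = info.st_blocks * 512   # blocks actually allocated (POSIX only)
print(f"logical: {logical_bytes} bytes, allocated: {allocated_bytes} bytes")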

Writing portable code

Use platform checks to interpret metadata correctly and pick safe defaults for symlinks:

import os, sys, stat
from pathlib import Path

p = Path('data/example.txt')
info = p.stat()  # follows symlinks

if sys.platform.startswith('win'):
    created_or_changed = 'created'  # st_ctime is creation time on Windows
else:
    created_or_changed = 'changed'  # inode metadata change time on Unix

print({'size': info.st_size, 'ctime_semantics': created_or_changed})

# If you need to stat a symlink itself (portable):
try:
    link_info = os.lstat('link.txt')  # or Path('link.txt').lstat()
except FileNotFoundError:
    link_info = None

# When traversing trees, avoid following symlinks unless you intend to:
for entry in os.scandir('data'):
    if entry.is_symlink():
        continue  # or handle explicitly
    # Use follow_symlinks=False to be explicit:
    if entry.is_file(follow_symlinks=False):
        size = entry.stat(follow_symlinks=False).st_size

Guideline: Treat st_ctime as creation on Windows and metadata‑change on Unix; document the distinction in user‑facing output, and avoid logic that assumes a universal “created” timestamp across platforms.

Real-World Use Cases

1) File size checks before upload (web apps, APIs)

In upload workflows, validating size before accepting a body prevents wasted bandwidth and broken UX. On the client or a pre‑processing step, read the file size and reject early if it exceeds policy (e.g., 10 MB for images, 100 MB for PDFs). On the server, verify again using os.stat() or Path.stat() after writing to a temp location. Emit precise errors (limit, actual size, allowed types) and log metrics by route to identify abusive clients or misconfigured mobile apps. This approach helps prevent unrestricted file upload vulnerabilities and ensures better security.

from pathlib import Path
MAX_BYTES = 10 * 1024 * 1024  # 10 MB

p = Path('uploads/tmp/user_image.jpg')
size = p.stat().st_size
if size > MAX_BYTES:
    raise ValueError(f"Payload too large: {size} > {MAX_BYTES}")

2) Disk monitoring scripts (cron jobs, storage quotas)

Ops teams routinely track growth of logs, caches, and user‑generated content. A lightweight cron job can calculate the total size of key directories using os.scandir() for throughput, then alert when thresholds are crossed (e.g., 80% and 95% of volume capacity). Include trend deltas (day‑over‑day growth) to distinguish spikes from steady leaks, and exclude ephemeral paths (e.g., sockets, tmp). This guards against outages caused by full disks and gives capacity planning signals. Understanding disk space management is crucial for system administrators.

import shutil
from datetime import datetime, timezone

used = shutil.disk_usage('/')
print({
    'ts': datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated as of Python 3.12
    'total': used.total,
    'used': used.used,
    'free': used.free,
})
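To tie this back to directory sizes, here is a minimal threshold-check sketch that reuses the du_scandir() helper from the benchmark section; the watched path, the 80%/95% thresholds, and the print-based alerting are illustrative assumptions.

import shutil

WATCH_PATH = '/var/log/myapp'   # hypothetical directory to track
WARN, CRIT = 0.80, 0.95         # illustrative alert thresholds

usage = shutil.disk_usage('/')
used_ratio = usage.used / usage.total
dir_bytes = du_scandir(WATCH_PATH)  # helper defined in the benchmark section above

if used_ratio >= CRIT:
    print(f"CRITICAL: volume {used_ratio:.0%} full; {WATCH_PATH} uses {dir_bytes} bytes")
elif used_ratio >= WARN:
    print(f"WARNING: volume {used_ratio:.0%} full; {WATCH_PATH} uses {dir_bytes} bytes")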

3) Preprocessing datasets for ML pipelines (ignore files under a threshold)

In data ingestion, tiny files often signal corrupted shards, incomplete downloads, or unhelpful signal‑to‑noise. Gatekeeping by size speeds up training and reduces IO. Combine a minimum byte threshold with file‑type checks to keep only useful samples. Persist stats (kept vs. skipped counts, total bytes) so pipeline runs are reproducible and auditable. Use Path.rglob() for readability in research code; switch to os.scandir() if throughput becomes a bottleneck.

from pathlib import Path
MIN_BYTES = 8 * 1024  # skip files smaller than 8KB

kept, skipped = 0, 0
for f in Path('data/train').rglob('*.jsonl'):
    try:
        if f.stat().st_size >= MIN_BYTES:
            kept += 1
        else:
            skipped += 1
    except FileNotFoundError:
        continue
print({'kept': kept, 'skipped': skipped})

4) Edge Cases to Consider

Large files on 32-bit systems

On legacy 32-bit systems, especially with older Python versions or C libraries, file size APIs may incorrectly report sizes above 2GB or 4GB due to integer overflows or filesystem limitations. Modern Python builds usually handle this transparently (Python 3 integers are arbitrary precision), but you should verify that st_size reports the full 64-bit value rather than one truncated by the underlying C library or filesystem. Always test large media (e.g., videos or compressed archives) in deployment environments that still run on 32-bit platforms or embedded devices.

import os

size = os.stat('data/huge_video.mkv').st_size
print(f"Size in GB: {size / (1024 ** 3):.2f} GB")

Recursively walking directory size

Getting the total size of a directory (especially with nested folders) is not as simple as getsize(). You must walk the entire directory tree and sum each file. Use os.walk() or pathlib.Path.rglob() depending on whether you want full control or expressive iteration. Consider skipping symbolic links to avoid infinite loops and guard with try/except in case of permission issues.

import os

def get_total_size(path):
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for f in filenames:
            try:
                fp = os.path.join(dirpath, f)
                if not os.path.islink(fp):
                    total += os.path.getsize(fp)
            except (FileNotFoundError, PermissionError):
                continue
    return total
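For readable output, you can combine this helper with the format_size() function defined earlier; a minimal usage sketch, assuming both functions are in scope and a data/ directory exists:

total = get_total_size('data')
print(f"Total size of 'data': {total} bytes ({format_size(total)})")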

Network-mounted files (latency & consistency)

When working with files on NFS, SMB, or cloud-mounted volumes (e.g., Dropbox, EFS, or Google Drive FUSE mounts), file metadata calls like stat() can have much higher latency and looser consistency than local disk. File size values may lag behind actual content or fail on disconnected mounts. To make scripts resilient, cache metadata when possible, retry transient failures, and test the mount type before assuming os.path.getsize() will behave like local filesystems.

import os

try:
    size = os.path.getsize('/mnt/nfs_share/data.csv')
    print(f"Size: {size} bytes")
except (OSError, TimeoutError) as e:
    print(f"NFS access failed: {e}")

Summary Table of Edge Cases

The table below consolidates the edge cases with descriptive notes, providing a quick yet detailed reference. Each cell includes not only the core idea but also the context of why it matters in production environments.

Edge Case | Description
Large files on 32-bit systems | File size reporting may fail on older 32-bit builds due to integer overflow, causing incorrect values for files larger than 2GB or 4GB. Although modern 64-bit Python generally resolves this, teams maintaining embedded devices or legacy environments must test with realistic datasets like large videos or archives. Always validate that st_size returns 64-bit integers and consider explicit error handling for overflow risks.
Recursively walking directory size | Calculating total size for a folder is non-trivial because directories only contain entries, not aggregated file sizes. Scripts must walk the entire tree, summing each file with functions like os.walk() or pathlib.Path.rglob(). Recursive walking must also defend against symlink loops, permission errors, and transient missing files. Properly implemented, this method provides accurate metrics for disk usage reports, backups, or user quota enforcement, even at scale.
Network-mounted files | Files residing on NFS, SMB, or cloud-mounted volumes often show high latency for metadata calls such as stat(). Unlike local disks, results may be inconsistent if synchronization is delayed, and failures may occur if mounts are disconnected. Scripts should be robust by retrying operations, caching metadata when acceptable, and providing clear error feedback. Understanding this distinction is vital when deploying to hybrid environments where both local and remote files coexist.

AI/ML Workflow Integrations

1) Filter dataset files by size before model training

In many ML tasks, extremely small or extremely large files can degrade training quality or slow throughput. For example, corrupted JSONL shards may be only a few bytes, while runaway data exports can be multi‑GB and exceed GPU memory budgets at load time. Add a size gate to your dataset loader so that only samples within an expected range are passed to the training job. Persist counters for kept/skipped files and emit Prometheus‑friendly metrics to correlate model performance with data hygiene. During hyperparameter sweeps or A/B runs, log thresholds alongside experiment IDs so results are reproducible.

from pathlib import Path

MIN_B = 4 * 1024        # 4KB: likely non-empty JSONL row/chunk
MAX_B = 200 * 1024**2   # 200MB: cap to protect RAM/VRAM

kept, skipped = 0, 0
valid_paths = []
for f in Path('datasets/train').rglob('*.jsonl'):
    try:
        s = f.stat().st_size
        if MIN_B <= s <= MAX_B:
            valid_paths.append(f)
            kept += 1
        else:
            skipped += 1
    except (FileNotFoundError, PermissionError):
        skipped += 1

print({'kept': kept, 'skipped': skipped, 'ratio': kept / max(1, kept + skipped)})
# pass valid_paths to your DataLoader / tf.data pipeline

Why this matters: Size filters remove low‑signal noise and protect downstream memory use. They’re fast (metadata only) and complement content‑aware validation (schema checks, row counts) without expensive parsing.

2) Automate log cleanup with an AI scheduler (n8n + Python)

Production systems generate logs, traces, and checkpoints that can balloon storage. Use an automation tool such as n8n to orchestrate periodic scans and decisions. A simple workflow: (1) cron trigger in n8n, (2) run a Python script that enumerates log directories and emits a JSON list of files over a threshold (e.g., 500MB), (3) optional LLM step in n8n to classify files into delete, archive, retain based on filename, age, and service, (4) execute delete/move actions with audit logging. Keep the Python side deterministic; keep the “policy” flexible in n8n so ops can adjust thresholds without redeploying code.

# emit JSON for n8n to consume
import os, json, time
THRESHOLD = 500 * 1024**2  # 500 MB
ROOTS = ['/var/log/myapp', '/var/log/nginx']

candidates = []
now = time.time()
for root in ROOTS:
    for dirpath, _, files in os.walk(root):
        for name in files:
            fp = os.path.join(dirpath, name)
            try:
                st = os.stat(fp)
                if st.st_size >= THRESHOLD:
                    candidates.append({
                        'path': fp,
                        'size_bytes': st.st_size,
                        'mtime': st.st_mtime,
                        'age_days': (now - st.st_mtime) / 86400,
                    })
            except (FileNotFoundError, PermissionError):
                continue
print(json.dumps({'candidates': candidates}))

Why this matters: You separate concerns—Python handles fast file system introspection; n8n handles policy, approvals, and notifications. This reduces toil, prevents full disks, and creates an auditable trail for compliance.

3) Size validation in streaming/batch ingestion pipelines

Ingest pipelines (Apache Kafka consumers, S3 batch pulls, BigQuery exports) benefit from pre‑parse size checks to short‑circuit bad inputs and protect memory. For streaming, apply a size guard per message/blob before deserialization; for batch, annotate manifests with st_size and reject outliers or quarantine them for review. Publish metrics (p50/p95/p99 sizes) to catch regressions when an upstream service starts emitting unexpectedly large payloads. Couple size thresholds with backoff/retry so transient spikes don’t trigger false positives.

# Example: pre-parse guard in a consumer loop
import os

def accept(path: str, min_b=1_024, max_b=512 * 1024**2):
    try:
        s = os.stat(path).st_size
        return min_b <= s <= max_b
    except FileNotFoundError:
        return False

for blob_path in get_next_blobs():  # your iterator
    if not accept(blob_path):
        quarantine(blob_path)  # move aside, alert, and continue
        continue
    process(blob_path)  # safe to parse and load
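To publish the p50/p95/p99 size metrics mentioned above, here is a minimal sketch using the standard statistics module; the observed_sizes sample values and metric names are illustrative.

import statistics

# Byte sizes observed in the current window (illustrative values)
observed_sizes = [12_288, 48_152, 524_288, 1_048_576, 8_192]

cuts = statistics.quantiles(observed_sizes, n=100)  # 99 percentile cut points
metrics = {
    'size_p50': statistics.median(observed_sizes),
    'size_p95': cuts[94],
    'size_p99': cuts[98],
}
print(metrics)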

Why this matters: Early size validation protects parsers, keeps consumer lag under control, and makes capacity predictable. It also produces actionable telemetry so data teams can negotiate contracts and SLAs with upstream producers.

Frequently Asked Questions (FAQs)

1. How do I get the size of a file in Python using the standard library?

The most straightforward way to get the size of a file in Python is by using the os.path.getsize() function. This function is part of the built-in os module and returns the size of the file in bytes. Here’s a quick example:

import os

file_size = os.path.getsize('data/example.txt')
print(f"File size: {file_size} bytes")

This method works well for most use cases where you just need a fast and simple byte count.

2. What’s the difference between os.path.getsize() and os.stat()?

While both functions return file size, they serve different purposes. os.path.getsize() is a convenience function that returns only the size in bytes. In contrast, os.stat() provides a full status object (stat_result) that includes various metadata such as:

  • st_size: file size in bytes
  • st_mtime: last modification time
  • st_ctime: creation time (or metadata change time, depending on the OS)

Example:

import os

stat = os.stat('data/example.txt')
print(f"Size: {stat.st_size} bytes, Last Modified: {stat.st_mtime}")

Use os.stat() when you need more than just the size.

3. Should I use pathlib instead of os.path for file size?

Yes, especially in modern Python code (version 3.4 and above). The pathlib module provides an object-oriented interface for file system operations. It improves readability and is considered more Pythonic.

Instead of working with plain strings, you work with Path objects:

from pathlib import Path

file_path = Path('data/example.txt')
file_size = file_path.stat().st_size

This approach is cross-platform, cleaner, and integrates well with other modern Python features.

4. How can I convert file sizes from bytes to KB, MB, or GB in Python?

Raw byte counts can be hard to interpret, especially for larger files. To display sizes in a human-readable format, you can use a helper function that divides the size by 1024 repeatedly and appends the correct unit:

import math

def format_size(size_bytes, decimals=2):
    if size_bytes == 0:
        return "0 Bytes"

    power = 1024
    units = ["Bytes", "KB", "MB", "GB", "TB", "PB"]

    i = int(math.floor(math.log(size_bytes, power)))

    return f"{size_bytes / (power ** i):.{decimals}f} {units[i]}"

Using this function, 1474560 bytes would become 1.41 MB, which is much more user-friendly.

5. What happens if the file doesn’t exist or can’t be accessed?

If the file path is incorrect or the file doesn’t exist, Python will raise a FileNotFoundError. If the file exists but your script doesn’t have permission to access it, a PermissionError is raised. To prevent your program from crashing, wrap the operation in a try...except block. For comprehensive error handling, refer to Python’s exception hierarchy:

import os

try:
    size = os.path.getsize('some/file.txt')
except FileNotFoundError:
    print("The file does not exist.")
except PermissionError:
    print("You do not have permission to access this file.")

This ensures your program handles errors gracefully and provides helpful feedback. For production applications, consider implementing proper logging to track these errors and monitor system health.
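As a minimal logging sketch (the logger name and file path are illustrative assumptions):

import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("file_size_checks")

path = 'some/file.txt'
try:
    logger.info("Size of %s: %d bytes", path, os.path.getsize(path))
except (FileNotFoundError, PermissionError) as exc:
    logger.error("Could not read size of %s: %s", path, exc)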

6. How does os.path.getsize() behave with symbolic links?

If the symbolic link points to a valid file, os.path.getsize() will return the size of the target file. However, if the symlink is broken (i.e., the target no longer exists), calling this function will raise a FileNotFoundError or OSError, depending on the operating system. Understanding symbolic link behavior is essential for robust file handling.

To avoid this, you can check if the path is a symlink and whether its target exists:

import os

# os.path.exists() follows symlinks and returns False if the target is missing
if os.path.islink('link.txt') and os.path.exists('link.txt'):
    size = os.path.getsize('link.txt')
else:
    print("Broken symbolic link or target not found.")

This way, you can handle broken symlinks gracefully.

Conclusion

You now know how to get a file’s size in Python using the direct os.path.getsize(), the modern pathlib module, or the more detailed os.stat() function. We also covered how to handle errors and convert byte counts into a human-readable format. While the simpler methods work well, remember that pathlib is the recommended standard for writing robust, maintainable code.

To build on these skills, you can explore how to handle plain text files to read and write data or build a complete command-line utility by handling user arguments.


About the author(s)

Pankaj Kumar
Pankaj Kumar
Author
See author profile

Java and Python Developer for 20+ years, Open Source Enthusiast, Founder of https://www.askpython.com/, https://www.linuxfordevices.com/, and JournalDev.com (acquired by DigitalOcean). Passionate about writing technical articles and sharing knowledge with others. Love Java, Python, Unix and related technologies. Follow my X @PankajWebDev

Manikandan Kurup
Manikandan Kurup
Editor
Senior Technical Content Engineer I
See author profile

With over 6 years of experience in tech publishing, Mani has edited and published more than 75 books covering a wide range of data science topics. Known for his strong attention to detail and technical knowledge, Mani specializes in creating clear, concise, and easy-to-understand content tailored for developers.

Vinayak Baranwal
Vinayak Baranwal
Editor
See author profile

Building future-ready infrastructure with Linux, Cloud, and DevOps. Full Stack Developer & System Administrator @ DigitalOcean | GitHub Contributor | Passionate about Docker, PostgreSQL, and Open Source | Exploring NLP & AI-TensorFlow | Nailed over 50+ deployments across production environments.
