4 minute read

You might have seen some buzz about DeepSeek AI’s new open-source tool, Smallpond, and its companion, 3FS. Claims of it being a “Spark killer” might have crossed your feed. While it’s definitely interesting technology, let’s take a more grounded look at what Smallpond is, who it’s for, and whether it truly signals the end for established data processing frameworks.

What is Smallpond?

Smallpond1 is a lightweight, distributed data processing framework designed to extend the capabilities of DuckDB. DuckDB is typically a single-node, in-process analytical database. Smallpond allows DuckDB to handle datasets that are larger than the memory capacity of a single machine by distributing the workload across a cluster of machines. It enables parallel processing and distributed analytics. Key features of Smallpond include:

  • Distributed Analytics: Enables DuckDB to process larger-than-memory datasets by partitioning the data and running analytics tasks in parallel across multiple nodes.
  • Open Source Deployment: Offers the potential for performant storage at a lower cost compared to proprietary alternatives, assuming you can manage the deployment.
  • Manual Partitioning: Requires users to manually partition their data, which Smallpond then distributes across the nodes in the cluster for parallel processing.

    What is 3FS?

    3FS (Fire-Flyer File System) is a high-performance, parallel file system also developed by DeepSeek. It’s designed and optimised for AI and HPC (High Performance Computing) workloads. It aims to provide extremely high throughput and low latency by using SSDs (Solid State Drives) and RDMA (Remote Direct Memory Access) networking technology. 3FS functions as the high-speed, distributed storage backend that Smallpond uses to achieve its performance.

According to DeepSeek, 3FS can achieve a read throughput of 6.6 TiB/s on a 180-node cluster, which is significantly higher than many traditional distributed file systems2.

How Can I Use It?

As with most Python packages, you can install Smallpond using:

uv pip install smallpond

(Feel free to replace uv with pip if you prefer, but be warned, you might prefer uv for its speed).

However, unlocking the real benefits of Smallpond requires more effort and depends heavily on the size of your data and your existing infrastructure. Here’s a rough guide:

  • Under 10TB: Smallpond is likely overkill unless you have very specific and unusual distributed computing requirements. A single-node DuckDB instance, or other simpler storage solutions, will likely be easier to manage and potentially more performant. Frankly, using Smallpond at this scale (without Ray/3FS) is likely slower and more complex than using DuckDB on its own.
  • 10TB to 1PB: This is where Smallpond starts to become more compelling. You’d need to set up a cluster of machines (see below) and leverage 3FS (or another fast storage backend) to enable rapid, parallel processing.
  • Over 1PB (Petabyte-Scale): Smallpond and 3FS are explicitly designed for handling these massive datasets. At this scale, deploying a larger cluster with significant infrastructure investment is necessary.

Deployment typically involves:

  1. Setting up a compute cluster (e.g., using AWS EC2, Google Compute Engine, or on-premise hardware).
  2. Deploying 3FS on nodes equipped with high-performance SSDs and RDMA networking. This is the trickiest step, as 3FS is new and documentation may be limited.
  3. Installing Smallpond via pip to run distributed DuckDB tasks across your cluster.

Is Smallpond For Me?

The short answer is: probably not, at least not yet.

Whether Smallpond is the right tool for you depends on several factors:

  • Data Scale: If your dataset is below 10TB, Smallpond will likely add unnecessary complexity. For larger datasets, it can offer significant performance improvements.
  • Infrastructure Capability: Smallpond and 3FS demand substantial infrastructure and DevOps expertise. Without a team experienced in cluster management, deployment and maintenance will be challenging.
  • Analytical Complexity: Smallpond is well-suited for partition-level parallelism but less optimised for complex joins across partitions. If your workloads require intricate joins, you might not see the performance gains you expect.

How Smallpond Works (Under the Hood)

Smallpond uses a lazy Directed Acyclic Graph (DAG) execution strategy. Operations such as map(), filter(), and partial_sql() aren’t executed immediately. Instead, Smallpond constructs a logical execution plan as a DAG, where each operation is represented as a node (e.g., SqlEngineNode, HashPartitionNode, DataSourceNode).

Execution is triggered by actions like:

  • write_parquet() – Writes data to disk.
  • to_pandas() – Converts results to a Pandas DataFrame.
  • compute() – Explicitly forces computation.
  • count() – Counts rows.
  • take() – Retrieves a subset of rows.

When an action is triggered, the logical plan is transformed into an execution plan comprising specific tasks (e.g., SqlEngineTask, HashPartitionTask). These tasks are distributed and executed by Ray. Smallpond utilises Ray Core for distribution, leveraging partitions for scalability. Partitions can be created manually using hash partitioning, even partitioning, or random shuffle partitioning.

Conclusion

Smallpond and 3FS offer interesting and potentially powerful capabilities for scaling DuckDB analytics across large datasets. However, their complexity and infrastructure requirements mean they are best suited for organisations with large datasets and the necessary DevOps expertise. For simpler use cases, sticking with a single-node DuckDB instance or managed solutions is likely the more practical option.

So, is Spark dead? Almost certainly not. Smallpond is a promising technology, but it’s not a drop-in replacement for existing data processing frameworks, and it requires significant investment to deploy and manage effectively.

  1. https://github.com/deepseek-ai/smallpond 

  2. https://jurnals.net/deepseek-unveils-3fs-file-system-revolutionizing-ai-data-management/