Skip to contents

Overview

In this R package, dfStats provides highly optimized functions for computing descriptive summaries and aggregations on data.frame objects in R. Similar to matrixStats, but specifically designed for R data.frame objects with focus on both elegance and computational efficiency.

Please note that this project is in its early developmental stage, so the functionality and performance may be limited compared to future versions.

Why dfStats?

Working with data.frame objects often involves computing various summaries across rows or columns. While base R’s apply function can handle these operations, it comes with several limitations:

  • Unintuitive syntax with the MARGIN parameter (1 for rows, 2 for columns).
  • Poor computational performance from repeated function calls.
  • Memory inefficiency from creating multiple hard copies.
  • Requires manual handling of mixed column types.

The dfStats package provides a collection of functions for performing efficient row-wise and column-wise computations on data.frame objects with improved performance compared to base R operations. This package address these issues by providing intuitive and consistently named functions using C++ implementations for core operations to minimize hard copies.

Installation

# Install latest stable version from CRAN
install.packages("dfStats")

# Or install the development version from GitHub
if (!"remotes" %in% as.data.frame(installed.packages())$Package)
  install.packages("remotes")
remotes::install_github("randef1ned/dfStats", upgrade = "always",
                        build_manual = TRUE, build_vignettes = TRUE)

Quick start

library(dfStats)

# Create sample data
df <- data.frame(
  a = 1:5,
  b = rnorm(5),
  c = letters[1:5]
)

# Column operations
colMeans3(df)      # Fast column means

# Row operations
rowSums3(df)       # Efficient row sums

Benchmarks

# Create a large data frame for benchmark
large_df <- data.frame(
  matrix(rnorm(1000000), ncol = 100)
)

# Compare with base R
microbenchmark::microbenchmark(
  apply    = apply(large_df, 1, mean),
  rowMeans = rowMeans(large_df),
  dfStats  = rowMeans3(large_df)
)

Benchmark on Windows (unit: milliseconds):

expr min lq mean median uq max neval
apply 62.2120 70.81215 80.056904 72.9547 79.81565 199.0736 100
rowMeans 4.1759 4.39515 11.023450 4.6652 4.90150 140.4091 100
dfStats 1.5329 1.88900 2.157443 2.2555 2.36300 2.7162 100

Benchmark on Fedora Linux 41 (unit: milliseconds):

expr min lq mean median uq max neval
apply 51.582414 63.18536 78.625475 65.956271 71.267663 258.0433 100
rowMeans 4.805873 4.96690 9.031157 5.085617 6.756312 194.5995 100
dfStats 12.767709 13.10204 13.543744 13.253293 13.481091 26.6272 100

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under GPL v2 - see the LICENSE file for details.

Other information

  • Author: Yinchun Su