A resampled set of runners from all marathons with more than 50 runners.

Each marathon has 100 runners (50 male, 50 female) per year, so any marathon with fewer than 50 runners in a gender group will have some runners resampled more than once. This data set has over 500k runners; the original data had close to 10 million runners and a few more columns. The NYT had a good article: https://www.nytimes.com/2014/04/23/upshot/what-good-marathons-and-bad-investments-have-in-common.html?rref=upshot&_r=1
marathon
Author

DS 150

Published

January 25, 2024

Data details

There are 608,650 rows and 20 columns. The data source is used to create our data, which is stored in our pins table. You can access this pin from a connection to posit.byui.edu using hathawayj/marathon_sample.

This data is available to all.

Variable description

  • age The age of the runner
  • gender The gender of the runner (M/F)
  • chiptime The runner's chip time in minutes
  • year The year of the marathon
  • marathon The name of the marathon
  • country The country where the marathon was held
  • finishers The number of finishers at the marathon
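
Because chiptime is stored in decimal minutes, it can be easier to read once converted to hours, minutes, and seconds. The following is a minimal sketch of that arithmetic; the helper name minutes_to_hms is hypothetical and not part of the data set or its tooling.

```python
def minutes_to_hms(minutes: float) -> str:
    """Format a duration given in decimal minutes as H:MM:SS."""
    total_seconds = round(minutes * 60)
    hours, remainder = divmod(total_seconds, 3600)
    mins, secs = divmod(remainder, 60)
    return f"{hours}:{mins:02d}:{secs:02d}"


# 263.73 is the median (p50) chiptime listed in the variable summary.
print(minutes_to_hms(263.73))  # 4:23:44
```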

Variable summary

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 258256 0.58 40.37 11.28 7.00 32.00 40.00 48.00 95.00 ▁▇▆▁▁
split_half 550610 0.10 125.26 26.20 59.02 109.03 121.47 137.52 954.00 ▇▁▁▁▁
clocktime 372674 0.39 269.54 55.77 11.02 231.53 261.77 298.50 792.35 ▁▇▂▁▁
chiptime 0 1.00 273.14 60.21 122.15 231.95 263.73 303.28 1212.00 ▇▁▁▁▁
year 0 1.00 2007.60 4.41 1970.00 2005.00 2008.00 2011.00 2013.00 ▁▁▁▂▇
split_10k 580137 0.05 61.64 23.64 26.43 52.12 58.32 66.25 838.01 ▇▁▁▁▁
split_30k 593599 0.02 186.90 43.00 82.52 159.23 181.17 207.83 949.01 ▇▁▁▁▁
split_40k 602490 0.01 260.97 61.96 121.45 218.98 251.68 292.54 574.63 ▃▇▂▁▁
finishers 0 1.00 1570.65 4330.49 51.00 126.00 312.00 1031.00 50062.00 ▇▁▁▁▁
meantime 0 1.00 269.36 32.89 139.07 251.21 262.36 277.65 614.50 ▁▇▁▁▁
female 53700 0.91 0.50 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇
us 0 1.00 0.78 0.42 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
canada 0 1.00 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
europe 0 1.00 0.09 0.29 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
other 0 1.00 0.06 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
age_gender 0 1.00 0.57 0.50 0.00 0.00 1.00 1.00 1.00 ▆▁▁▁▇

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 53700 0.91 1 1 0 2 0
marathon 0 1.00 6 62 0 988 0
country 0 1.00 2 14 0 43 0
marathon2 0 1.00 11 67 0 5968 0
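
The n_missing and complete_rate columns above appear to be related through the row count given in the data details (608,650): complete_rate is the share of rows that are not missing. The sketch below checks that reading against a few rows copied from the tables; the function name complete_rate and the constant TOTAL_ROWS are illustrative assumptions.

```python
TOTAL_ROWS = 608_650  # row count stated in the data details


def complete_rate(n_missing: int, n_rows: int = TOTAL_ROWS) -> float:
    """Share of rows that hold a non-missing value."""
    return (n_rows - n_missing) / n_rows


# n_missing values copied from the summary tables above.
n_missing_by_variable = {"age": 258_256, "clocktime": 372_674, "gender": 53_700}

for name, n_missing in n_missing_by_variable.items():
    print(f"{name}: {complete_rate(n_missing):.2f}")
```

Under this reading, the printed rates match the two-decimal values in the tables (0.58, 0.39, and 0.91).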
Explore generating code using R
pacman::p_load(pins, tidyverse, downloader, fs, glue, rvest, googledrive, connectapi)


# Data is from master_marathon on the byuids_data shared drive.
sdrive <- shared_drive_find("byuids_data") # This will ask for authentication.
google_file <- drive_ls(sdrive) %>%
  filter(stringr::str_detect(name, "master_marathon"))
# Download to a temporary file, then read it in.
tempf <- tempfile()
drive_download(google_file, tempf)
dat <- read_csv(tempf)



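# Wrangle: keep marathons with more than 50 finishers, then draw 50 runners
# with replacement from each marathon-year-gender group.
marathon_sample <- dat %>%
  filter(finishers > 50) %>%
  group_by(marathon, year, gender) %>%
  sample_n(50, replace = TRUE) %>%
  ungroup() %>%
  mutate(finishers = as.integer(finishers), year = as.integer(year))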



# Publish the sample as a parquet pin on the Connect board.
pin_name <- "marathon_sample"
board <- board_connect()
pin_write(board, marathon_sample, name = pin_name, type = "parquet")

# Give the pin a vanity URL (data/marathon_sample).
meta <- pin_meta(board, paste0("hathawayj/", pin_name))
client <- connect()
my_app <- content_item(client, meta$local$content_id)
set_vanity_url(my_app, paste0("data/", pin_name))
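
The wrangling step in the R code above draws 50 runners with replacement from every marathon, year, and gender group, which is why small groups contain repeated runners. The following is a rough Python sketch of the same idea using only the standard library. The function resample_groups and the toy rows are hypothetical and are not the code used to build the pin.

```python
import random
from collections import defaultdict


def resample_groups(rows, keys, n=50, seed=None):
    """Draw n rows with replacement from every group defined by keys."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    sample = []
    for members in groups.values():
        # choices() samples with replacement, so small groups repeat rows.
        sample.extend(rng.choices(members, k=n))
    return sample


# Toy data (made up): one marathon-year with 3 women and 60 men.
rows = [{"marathon": "Demo", "year": 2010, "gender": "F", "chiptime": 240 + i}
        for i in range(3)]
rows += [{"marathon": "Demo", "year": 2010, "gender": "M", "chiptime": 200 + i}
         for i in range(60)]

sample = resample_groups(rows, keys=("marathon", "year", "gender"), seed=1)
print(len(sample))  # 100: 50 per gender group
```

In pandas, a similar result should be available through groupby(...).sample(n=50, replace=True). Note that pandas drops groups with a missing key by default, whereas dplyr keeps them as their own group, so row counts may differ from the R output.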

Access data

This data is available to all.

Direct Download: marathon_sample.parquet

R and Python Download:

URL Connections:

For public data, any user can connect and read the data using pins::board_connect_url() in R.

library(pins)
url_data <- "https://posit.byui.edu/data/marathon_sample/"
board_url <- board_connect_url(c("dat" = url_data))
dat <- pin_read(board_url, "dat")

Use this custom function in Python to load the data into a pandas DataFrame.

import pandas as pd
import requests
from io import BytesIO

def read_url_pin(name):
    """Download a public pin's parquet file and return it as a pandas DataFrame."""
    url = "https://posit.byui.edu/data/" + name + "/" + name + ".parquet"
    response = requests.get(url)
    if response.status_code == 200:
        parquet_content = BytesIO(response.content)
        pandas_dataframe = pd.read_parquet(parquet_content)
        return pandas_dataframe
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")
        return None

# Example usage:
pandas_df = read_url_pin("marathon_sample")

Authenticated Connection:

Our Connect server is https://posit.byui.edu; assign that URL to your CONNECT_SERVER environment variable. You must also create an API key and store it in your environment as CONNECT_API_KEY.

Read more about environment variables and the pins package to understand how these environment variables are stored and accessed in R and Python with pins.

library(pins)
board <- board_connect(auth = "auto")
dat <- pin_read(board, "hathawayj/marathon_sample")

The equivalent authenticated connection in Python:

import os
from pins import board_rsconnect
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv('CONNECT_API_KEY')
SERVER = os.getenv('CONNECT_SERVER')

board = board_rsconnect(server_url=SERVER, api_key=API_KEY)
dat = board.pin_read("hathawayj/marathon_sample")
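
Since the authenticated connection depends on both environment variables being set, it may help to check for them before creating the board. This is a small sketch using only the standard library; the helper missing_connect_vars is a hypothetical name and is not part of pins.

```python
import os

REQUIRED = ("CONNECT_SERVER", "CONNECT_API_KEY")


def missing_connect_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_connect_vars()
if missing:
    print("Set these before connecting:", ", ".join(missing))
```

If you keep the variables in a .env file, run this check after load_dotenv() so the file's values are visible.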