CSV Statistical Summary
How it works
A statistical summary of a CSV dataset reports the core distribution statistics (count, mean, median, standard deviation, min, max, quartiles) for every numeric column in a single pass. The output mirrors df.describe() in pandas or summary() in R, so you can profile a dataset quickly without writing any code.
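The statistics listed above can be sketched for a single column using only Python's standard library. This is an illustration, not the tool's actual implementation: the function name `summarize` is hypothetical, missing cells are assumed to arrive as `None`, the standard deviation is the sample version, and quartiles use linear interpolation (`method="inclusive"`).

```python
import statistics

def summarize(values):
    """Return describe()-style statistics for one column (hypothetical helper).

    Assumption: missing cells are represented as None.
    """
    data = sorted(v for v in values if v is not None)
    if not data:
        return {"count": 0}
    if len(data) > 1:
        # "inclusive" interpolates linearly between the sorted data points.
        p25, p50, p75 = statistics.quantiles(data, n=4, method="inclusive")
        std = statistics.stdev(data)  # sample standard deviation (n - 1)
    else:
        # A single value has no spread to measure.
        p25 = p50 = p75 = data[0]
        std = 0.0
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "median": p50,
        "std": std,
        "min": data[0],
        "p25": p25,
        "p75": p75,
        "max": data[-1],
    }

column = [10, 20, 30, 40, 50, None]
stats = summarize(column)
print(stats["count"], stats["mean"], stats["median"])  # 5 30 30.0
```

The count of 5 against 6 input cells is what exposes the one missing value.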
**Key statistics explained**
- Count: number of non-null values; a count below the row total reveals missing data.
- Mean: arithmetic average (sensitive to outliers).
- Median (P50): middle value when sorted; more robust than the mean for skewed distributions.
- Standard deviation: typical distance of values from the mean (the square root of the average squared deviation); a large std relative to the mean suggests high variability.
- P25/P75 (quartiles): 25th and 75th percentiles.
- IQR = P75 − P25: interquartile range, used for outlier detection; outliers are values outside the range P25 − 1.5×IQR to P75 + 1.5×IQR.
- Min/Max: extreme values, often revealing data entry errors (age of 999, salary of 0).
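The IQR outlier rule can be written as a small sketch. The function names `tukey_fences` and `find_outliers` are illustrative, not part of the tool, and the quartiles are assumed to be already computed.

```python
def tukey_fences(p25, p75, k=1.5):
    """Return (lower, upper) fences; k=1.5 is the usual multiplier."""
    iqr = p75 - p25
    return p25 - k * iqr, p75 + k * iqr

def find_outliers(values, p25, p75, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = tukey_fences(p25, p75, k)
    return [v for v in values if v < low or v > high]

low, high = tukey_fences(10, 20)
print(low, high)  # -5.0 35.0

outliers = find_outliers([12, 15, 18, 40, -7], p25=10, p75=20)
print(outliers)   # [40, -7]
```

Passing `k=3` widens the fences so that only more extreme values are reported.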
**Detecting data quality issues**
- Mean ≫ median: right-skewed distribution (income data, purchase amounts), where a few very large values pull the mean up.
- Large max relative to P75: likely outliers.
- Count < total rows: missing values; flag columns with a null rate above 5% for imputation decisions.
- Std = 0: constant column with no predictive value; usually an artifact.
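These checks could be automated roughly as follows. This is a sketch under stated assumptions: `quality_flags` is a hypothetical helper, the input is a dictionary of already-computed statistics, the 5% null threshold comes from the text, and the `skew_ratio` cutoff of 1.5 for "mean much larger than median" is an arbitrary illustrative choice. "Large max relative to P75" is interpreted here as the max exceeding the upper fence P75 + 1.5×IQR.

```python
def quality_flags(stats, total_rows, null_threshold=0.05, skew_ratio=1.5):
    """Return a list of data-quality warnings for one numeric column.

    skew_ratio is an illustrative cutoff, not an established standard.
    """
    flags = []
    null_rate = (total_rows - stats["count"]) / total_rows
    if null_rate > null_threshold:
        flags.append("high null rate")
    if stats["std"] == 0:
        flags.append("constant column")
    # Only compare mean and median when the median is positive.
    if stats["median"] > 0 and stats["mean"] > skew_ratio * stats["median"]:
        flags.append("right-skewed")
    upper_fence = stats["p75"] + 1.5 * (stats["p75"] - stats["p25"])
    if stats["max"] > upper_fence:
        flags.append("possible outliers")
    return flags

purchase_stats = {
    "count": 90, "mean": 120.0, "median": 40.0, "std": 300.0,
    "p25": 20.0, "p75": 60.0, "max": 5000.0,
}
flags = quality_flags(purchase_stats, total_rows=100)
print(flags)  # ['high null rate', 'right-skewed', 'possible outliers']
```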
**Column type detection** Columns whose values all parse as numbers (including signs and decimal points) should be treated as numeric. Columns with mixed types (mostly numbers but some "N/A" or "—" strings) need those strings coerced to null before the statistics are meaningful. This tool auto-detects numeric columns and reports non-numeric columns separately with value frequency counts.
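A minimal sketch of this kind of detection is shown below. It is not the tool's code: `parse_number` and `classify_column` are hypothetical names, only empty cells are assumed to count as null, and the 90% threshold is the one quoted in the FAQ below.

```python
import math
from collections import Counter

def parse_number(cell):
    """Return the cell as a finite float, or None if it does not parse."""
    try:
        number = float(cell)
    except ValueError:
        return None
    # Reject "nan" and "inf", which float() would otherwise accept.
    return number if math.isfinite(number) else None

def classify_column(cells, threshold=0.9):
    """Classify a column of strings as numeric or categorical."""
    non_null = [c.strip() for c in cells if c is not None and c.strip() != ""]
    parsed = [parse_number(c) for c in non_null]
    valid = [n for n in parsed if n is not None]
    if non_null and len(valid) / len(non_null) > threshold:
        return {
            "type": "numeric",
            "values": valid,
            "invalid_count": len(non_null) - len(valid),
        }
    return {"type": "categorical", "frequencies": Counter(non_null)}

ages = ["31", "45", "", "28", "N/A", "52", "39", "47", "33", "60", "41", "36", "29"]
age_result = classify_column(ages)
print(age_result["type"], age_result["invalid_count"])  # numeric 1

city_result = classify_column(["Oslo", "Rome", "Oslo"])
print(city_result["type"])  # categorical
```

In the `ages` example, 11 of the 12 non-empty cells parse as numbers, so the column clears the threshold and "N/A" is reported as one invalid value.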
Frequently Asked Questions
- When mean >> median, the distribution is right-skewed: a few very large values are pulling the mean upward. Common examples: income distributions (median US household income ~$74K, mean ~$102K due to high earners), purchase amounts (most transactions are small, occasional large purchases), response times (most requests are fast, occasional slow outliers). For skewed data, the median is a more representative 'typical value' than the mean.
- Use the IQR (interquartile range = P75 − P25) method: outliers are values below P25 − 1.5×IQR or above P75 + 1.5×IQR (Tukey's fences). For a column with P25=10, P75=20: IQR=10, lower fence = 10 − 15 = −5, upper fence = 20 + 15 = 35. Any value outside [−5, 35] is a mild outlier. For extreme outliers, use P25 − 3×IQR and P75 + 3×IQR. Also check whether the min and max are plausible for the field: values such as age=999, salary=0, or temperature=500 usually indicate data entry errors.
- The tool auto-detects: columns where >90% of non-null values parse as numbers are treated as numeric and receive full statistical summaries. Columns with mostly non-numeric values are treated as categorical and receive value frequency counts instead. Edge cases: a column with '1', '2', 'N/A', 'unknown' — the non-numeric strings are noted as 'invalid count' and statistics are computed on the valid numeric subset. Inspect the invalid count to decide if data cleaning is needed.
- High standard deviation relative to the mean (coefficient of variation CV = std/mean > 1) suggests high variability: check for data entry errors, or for multiple populations mixed in one column. Std = 0 means all values are identical, which is often an artifact of a constant field or a failed data export. A very small std with a mean near zero might indicate the column is effectively useless for modeling. Compare std across similar datasets to flag anomalous variability.
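The coefficient-of-variation check from the last answer can be sketched as follows. The function names and the label strings are illustrative assumptions; only the CV > 1 cutoff and the std = 0 case come from the text.

```python
def coefficient_of_variation(mean, std):
    """Return std / |mean|, or None when the mean is zero (ratio undefined)."""
    if mean == 0:
        return None
    return std / abs(mean)

def variability_label(mean, std, high_cv=1.0):
    """Label a column's variability using the CV > 1 rule of thumb."""
    if std == 0:
        return "constant"
    cv = coefficient_of_variation(mean, std)
    if cv is None:
        return "undefined"
    return "high" if cv > high_cv else "moderate"

print(variability_label(30, 15.8))  # moderate (CV is about 0.53)
print(variability_label(40, 300))   # high (CV = 7.5)
print(variability_label(7, 0))      # constant
```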