sample
This page explains how to use the sample operator in APL.
The sample operator in APL pseudo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm is not statistically rigorous, but it provides a quick way to explore and understand a dataset. For statistically rigorous analysis, use summarize instead.
The sample operator is most useful with large datasets, where processing every row is resource-intensive or unnecessary. It’s ideal for scenarios like log analysis, performance monitoring, or sampling for data quality checks.
For users of other query languages
If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.
Usage
Syntax
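The operator takes a single argument after the pipe. A minimal sketch of the general form, assuming the usual APL pipe syntax and using the parameter name documented under Parameters:

```kusto
| sample ProportionOfRows
```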
Parameters
ProportionOfRows: A float greater than 0 and less than 1 that specifies the proportion of rows to return from the dataset. The rows are selected pseudo-randomly.
Returns
The operator returns a table containing a subset of the input rows, selected pseudo-randomly at the specified proportion.
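Because the parameter is a proportion rather than a row count, the size of the result scales with the size of the input. One way to check how many rows a given rate yields is to count the sampled rows. The sketch below assumes a dataset named `['sample-http-logs']`, which is a placeholder; substitute your own dataset name.

```kusto
// Dataset name is a placeholder for illustration
['sample-http-logs']
| sample 0.1          // keep roughly 10% of the rows
| summarize count()   // count the rows that survived sampling
```

Since the selection is pseudo-random, the count may differ between runs.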
Use case examples
In this use case, you sample a small proportion of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.
Query
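A sketch of the query described here, assuming the HTTP logs live in a dataset named `['sample-http-logs']` (a placeholder; substitute your own dataset name):

```kusto
// Dataset name is a placeholder for illustration
['sample-http-logs']
| sample 0.05   // keep about 5% of the rows
```

Replace `0.05` with any value greater than 0 and less than 1 to change the sampling rate.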
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 12:45:00 | 234 | user1 | 200 | /index | GET | New York | US |
2023-10-16 12:47:00 | 120 | user2 | 404 | /login | POST | Paris | FR |
2023-10-16 12:48:00 | 543 | user3 | 500 | /checkout | POST | Tokyo | JP |
This query returns a pseudo-random subset of 5% of all rows from the HTTP logs, helping you quickly identify potential issues or patterns without analyzing the entire dataset.