Read a document as Markdown — ragnar

ragnar_read() uses markitdown to convert a document to markdown and returns the result as a data frame. With no further arguments, the data frame has a single row holding the whole document. If frame_by_tags or split_by_tags is provided, the converted markdown content is split into multiple rows.

Usage

ragnar_read(x, ..., split_by_tags = NULL, frame_by_tags = NULL)

Arguments

x: File path or URL.
...: Passed on to markitdown.convert.
split_by_tags: Character vector of HTML tag names used to split the returned text.
frame_by_tags: Character vector of HTML tag names used to create a data frame of the returned content.

Value

Always returns a data frame with the columns:

origin: the file path or URL
hash: a hash of the text content
text: the markdown content

If split_by_tags is not NULL, then a tag column is also included containing the corresponding tag for each text chunk. Text chunks that are not associated with a tag have an empty string ("") in this column.
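
As a minimal sketch of how the tag column might be used, the data frame below is hand-built to mimic the documented columns. It is not real ragnar_read() output, and the origin, hash, and text values are placeholders. Filtering on tag separates the chunks that matched a tag from the untagged text between them:

```r
# Hypothetical stand-in for the result of
# ragnar_read(split_by_tags = c("h1", "h2")); all values are made up.
chunks <- data.frame(
  origin = "example.html",
  hash   = "placeholder-hash",
  text   = c("# Title", "Intro text.", "## Section", "Body text."),
  tag    = c("h1", "", "h2", ""),
  stringsAsFactors = FALSE
)

# Chunks that matched one of the split tags (here, the headings)
headings <- chunks[chunks$tag != "", ]

# Chunks not associated with any tag carry an empty string in `tag`
untagged <- chunks[chunks$tag == "", ]

headings$text
untagged$text
```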

If frame_by_tags is not NULL, then additional columns are included for each tag in frame_by_tags. The text chunks are associated with the tags in the order they appear in the markdown content.
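
The heading columns make it fairly easy to build a context label for each chunk. The sketch below again uses a hand-built data frame that mimics the documented shape (placeholder values, only the relevant columns) and a hypothetical helper, heading_path(), which is not part of ragnar. It joins whichever headings are present for a row and skips the NA levels:

```r
# Hypothetical stand-in for the result of
# ragnar_read(frame_by_tags = c("h1", "h2", "h3")); values are made up.
framed <- data.frame(
  text = c("Intro text.", "Body text.", "Detail text."),
  h1   = c("# Title", "# Title", "# Title"),
  h2   = c(NA, "## Section", "## Section"),
  h3   = c(NA, NA, "### Subsection"),
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not a ragnar function):
# join the headings that are present, dropping missing levels.
heading_path <- function(h1, h2, h3) {
  parts <- c(h1, h2, h3)
  paste(parts[!is.na(parts)], collapse = " > ")
}

# Apply the helper row by row to get one label per chunk
framed$path <- mapply(heading_path, framed$h1, framed$h2, framed$h3,
                      USE.NAMES = FALSE)
framed$path
```

A label like this can be prepended to each chunk's text so that the chunk keeps its place in the document hierarchy when read in isolation.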

Examples

file <- tempfile(fileext = ".html")
download.file("https://r4ds.hadley.nz/base-R.html", file, quiet = TRUE)

# with no arguments, returns a single-row data frame.
# the markdown content is in the `text` column.
file |> ragnar_read() |> str()
#> tibble [1 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ origin: chr "/tmp/RtmpgEcLxB/file1c94186b8141.html"
#>  $ hash  : chr "d45481e5676c8dfebb6f111a5ae2dada"
#>  $ text  : 'glue' chr "# 27  A field guide to base R – R for Data Science (2e)\n\n1. [Program](./program.html)\n2. [27  A field guide "| __truncated__

# use `split_by_tags` to get a data frame where the text is split by the
# specified tags (e.g., "h1", "h2", "h3")
file |>
  ragnar_read(split_by_tags = c("h1", "h2", "h3"))
#> # A tibble: 37 × 4
#>    origin                                hash                        text  tag  
#>    <chr>                                 <chr>                       <chr> <chr>
#>  1 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "# 2… "h1" 
#>  2 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "1. … ""   
#>  3 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "## … "h2" 
#>  4 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "* [… ""   
#>  5 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "# 2… "h1" 
#>  6 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "## … "h2" 
#>  7 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "To … ""   
#>  8 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "###… "h3" 
#>  9 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "Thi… ""   
#> 10 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8dfebb6f111a5a… "## … "h2" 
#> # ℹ 27 more rows

# use `frame_by_tags` to get a data frame where the
# headings associated with each text chunk are easily accessible
file |>
  ragnar_read(frame_by_tags = c("h1", "h2", "h3"))
#> # A tibble: 18 × 6
#>    origin                                hash            text  h1    h2    h3   
#>    <chr>                                 <chr>           <chr> <chr> <chr> <chr>
#>  1 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "1. … # 27… NA    NA   
#>  2 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "* [… # 27… ## T… NA   
#>  3 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "To … # 27… ## 2… NA   
#>  4 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "Thi… # 27… ## 2… ### …
#>  5 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "`[`… # 27… ## 2… NA   
#>  6 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "The… # 27… ## 2… ### …
#>  7 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "The… # 27… ## 2… ### …
#>  8 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "Sev… # 27… ## 2… ### …
#>  9 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "1. … # 27… ## 2… ### …
#> 10 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "`[`… # 27… ## 2… NA   
#> 11 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "`[[… # 27… ## 2… ### …
#> 12 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "The… # 27… ## 2… ### …
#> 13 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "`[[… # 27… ## 2… ### …
#> 14 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "1. … # 27… ## 2… ### …
#> 15 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "In … # 27… ## 2… NA   
#> 16 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "`fo… # 27… ## 2… NA   
#> 17 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "Man… # 27… ## 2… NA   
#> 18 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5676c8d… "In … # 27… ## 2… NA   

# use `split_by_tags` and `frame_by_tags` together to further break up `text`.
file |>
  ragnar_read(
    split_by_tags = c("p"),
    frame_by_tags = c("h1", "h2", "h3")
  )
#> # A tibble: 163 × 7
#>    origin                                hash      text  h1    h2    h3    tag  
#>    <chr>                                 <chr>     <chr> <chr> <chr> <chr> <chr>
#>  1 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "1. … # 27… NA    NA    ""   
#>  2 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "[R … # 27… NA    NA    "p"  
#>  3 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "*"   # 27… NA    NA    ""   
#>  4 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "[We… # 27… NA    NA    "p"  
#>  5 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "*"   # 27… NA    NA    ""   
#>  6 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "[Pr… # 27… NA    NA    "p"  
#>  7 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "*"   # 27… NA    NA    ""   
#>  8 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "[In… # 27… NA    NA    "p"  
#>  9 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "*"   # 27… NA    NA    ""   
#> 10 /tmp/RtmpgEcLxB/file1c94186b8141.html d45481e5… "[Wh… # 27… NA    NA    "p"  
#> # ℹ 153 more rows

# Example workflow adding context to each chunk
file |>
  ragnar_read(frame_by_tags = c("h1", "h2", "h3")) |>
  glue::glue_data(r"--(
    ## Excerpt from the book "R for Data Science (2e)"
    chapter: {h1}
    section: {h2}
    content: {text}

    )--") |>
  # inspect
  _[6:7] |> cat(sep = "\n~~~~~~~~~~~\n")
#> ## Excerpt from the book "R for Data Science (2e)"
#> chapter: # 27  A field guide to base R
#> section: ## 27.2 Selecting multiple elements with `[`
#> content: There are five main types of things that you can subset a vector with, i.e., that can be the `i` in `x[i]`:
#> 
#> 1. **A vector of positive integers**. Subsetting with positive integers keeps the elements at those positions:
#> 
#>    ```
#>    x <- c("one", "two", "three", "four", "five")
#>    x[c(3, 2, 5)]
#>    #> [1] "three" "two"   "five"
#>    ```
#> 
#>    By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.
#> 
#>    ```
#>    x[c(1, 1, 5, 5, 5, 2)]
#>    #> [1] "one"  "one"  "five" "five" "five" "two"
#>    ```
#> 2. **A vector of negative integers**. Negative values drop the elements at the specified positions:
#> 
#>    ```
#>    x[c(-1, -3, -5)]
#>    #> [1] "two"  "four"
#>    ```
#> 3. **A logical vector**. Subsetting with a logical vector keeps all values corresponding to a `TRUE` value. This is most often useful in conjunction with the comparison functions.
#> 
#>    ```
#>    x <- c(10, 3, NA, 5, 8, 1, NA)
#> 
#>    # All non-missing values of x
#>    x[!is.na(x)]
#>    #> [1] 10  3  5  8  1
#> 
#>    # All even (or missing!) values of x
#>    x[x %% 2 == 0]
#>    #> [1] 10 NA  8 NA
#>    ```
#> 
#>    Unlike `[filter()](https://dplyr.tidyverse.org/reference/filter.html)`, `NA` indices will be included in the output as `NA`s.
#> 4. **A character vector**. If you have a named vector, you can subset it with a character vector:
#> 
#>    ```
#>    x <- c(abc = 1, def = 2, xyz = 5)
#>    x[c("xyz", "def")]
#>    #> xyz def
#>    #>   5   2
#>    ```
#> 
#>    As with subsetting with positive integers, you can use a character vector to duplicate individual entries.
#> 5. **Nothing**. The final type of subsetting is nothing, `x[]`, which returns the complete `x`. This is not useful for subsetting vectors, but as we’ll see shortly, it is useful when subsetting 2d structures like tibbles.
#> 
#> ~~~~~~~~~~~
#> ## Excerpt from the book "R for Data Science (2e)"
#> chapter: # 27  A field guide to base R
#> section: ## 27.2 Selecting multiple elements with `[`
#> content: There are quite a few different ways[1](#fn1) that you can use `[` with a data frame, but the most important way is to select rows and columns independently with `df[rows, cols]`. Here `rows` and `cols` are vectors as described above. For example, `df[rows, ]` and `df[, cols]` select just rows or just columns, using the empty subset to preserve the other dimension.
#> 
#> Here are a couple of examples:
#> 
#> ```
#> df <- tibble(
#>   x = 1:3,
#>   y = c("a", "e", "f"),
#>   z = runif(3)
#> )
#> 
#> # Select first row and second column
#> df[1, 2]
#> #> # A tibble: 1 × 1
#> #>   y
#> #>   <chr>
#> #> 1 a
#> 
#> # Select all rows and columns x and y
#> df[, c("x" , "y")]
#> #> # A tibble: 3 × 2
#> #>       x y
#> #>   <int> <chr>
#> #> 1     1 a
#> #> 2     2 e
#> #> 3     3 f
#> 
#> # Select rows where `x` is greater than 1 and all columns
#> df[df$x > 1, ]
#> #> # A tibble: 2 × 3
#> #>       x y         z
#> #>   <int> <chr> <dbl>
#> #> 1     2 e     0.834
#> #> 2     3 f     0.601
#> ```
#> 
#> We’ll come back to `$` shortly, but you should be able to guess what `df$x` does from the context: it extracts the `x` variable from `df`. We need to use it here because `[` doesn’t use tidy evaluation, so you need to be explicit about the source of the `x` variable.
#> 
#> There’s an important difference between tibbles and data frames when it comes to `[`. In this book, we’ve mainly used tibbles, which *are* data frames, but they tweak some behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write `data.frame`. If `df` is a `data.frame`, then `df[, cols]` will return a vector if `col` selects a single column and a data frame if it selects more than one column. If `df` is a tibble, then `[` will always return a tibble.
#> 
#> ```
#> df1 <- data.frame(x = 1:3)
#> df1[, "x"]
#> #> [1] 1 2 3
#> 
#> df2 <- tibble(x = 1:3)
#> df2[, "x"]
#> #> # A tibble: 3 × 1
#> #>       x
#> #>   <int>
#> #> 1     1
#> #> 2     2
#> #> 3     3
#> ```
#> 
#> One way to avoid this ambiguity with `data.frame`s is to explicitly specify `drop = FALSE`:
#> 
#> ```
#> df1[, "x" , drop = FALSE]
#> #>   x
#> #> 1 1
#> #> 2 2
#> #> 3 3
#> ```
#> 

# Advanced example of postprocessing the output of ragnar_read()
# to add language to code blocks, markdown style
library(dplyr, warn.conflicts = FALSE)
library(stringr)
library(rvest)
library(xml2)
file |>
  ragnar_read(frame_by_tags = c("h1", "h2", "h3"),
              split_by_tags = c("p", "pre")) |>
  mutate(
    is_code = tag == "pre",
    text = ifelse(is_code, str_replace(text, "```", "```r"), text)
  ) |>
  group_by(h1, h2, h3) |>
  summarise(text = str_flatten(text, "\n\n"), .groups = "drop") |>
  glue::glue_data(r"--(
    # Excerpt from the book "R for Data Science (2e)"
    chapter: {h1}
    section: {h2}
    content: {text}

    )--") |>
  # inspect
  _[9:10] |> cat(sep = "\n~~~~~~~~~~~\n")
#> # Excerpt from the book "R for Data Science (2e)"
#> chapter: # 27  A field guide to base R
#> section: ## 27.3 Selecting a single element with `$` and `[[`
#> content: There are a couple of important differences between tibbles and base `data.frame`s when it comes to `$`. Data frames match the prefix of any variable names (so-called **partial matching**) and don’t complain if a column doesn’t exist:
#> 
#> ```r
#> df <- data.frame(x1 = 1)
#> df$x
#> #> [1] 1
#> df$z
#> #> NULL
#> ```
#> 
#> Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:
#> 
#> ```r
#> tb <- tibble(x1 = 1)
#> 
#> tb$x
#> #> Warning: Unknown or uninitialised column: `x`.
#> #> NULL
#> tb$z
#> #> Warning: Unknown or uninitialised column: `z`.
#> #> NULL
#> ```
#> 
#> For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
#> 
#> ~~~~~~~~~~~
#> # Excerpt from the book "R for Data Science (2e)"
#> chapter: # 27  A field guide to base R
#> section: ## 27.3 Selecting a single element with `$` and `[[`
#> content: `[[` and `$` are also really important for working with lists, and it’s important to understand how they differ from `[`. Let’s illustrate the differences with a list named `l`:
#> 
#> ```r
#> l <- list(
#>   a = 1:3,
#>   b = "a string",
#>   c = pi,
#>   d = list(-1, -5)
#> )
#> ```
#> 
#> *
#> 
#> `[` extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.
#> 
#> ```r
#>   str(l[1:2])
#>   #> List of 2
#>   #>  $ a: int [1:3] 1 2 3
#>   #>  $ b: chr "a string"
#> 
#>   str(l[1])
#>   #> List of 1
#>   #>  $ a: int [1:3] 1 2 3
#> 
#>   str(l[4])
#>   #> List of 1
#>   #>  $ d:List of 2
#>   #>   ..$ : num -1
#>   #>   ..$ : num -5
#>   ```
#> 
#> Like with vectors, you can subset with a logical, integer, or character vector.
#> 
#> *
#> 
#> `[[` and `$` extract a single component from a list. They remove a level of hierarchy from the list.
#> 
#> ```r
#>   str(l[[1]])
#>   #>  int [1:3] 1 2 3
#> 
#>   str(l[[4]])
#>   #> List of 2
#>   #>  $ : num -1
#>   #>  $ : num -5
#> 
#>   str(l$a)
#>   #>  int [1:3] 1 2 3
#>   ```
#> 
#> The difference between `[` and `[[` is particularly important for lists because `[[` drills down into the list while `[` returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in [Figure 27.1](#fig-pepper). If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet. `pepper[2]` would look the same, but would contain the second packet. `pepper[1:2]` would be a pepper shaker containing two pepper packets. `pepper[[1]]` would extract the pepper packet itself.
#> 
#> ![Three photos. On the left is a photo of a glass pepper shaker. Instead of  the pepper shaker containing pepper, it contains a single packet of pepper. In the middle is a photo of a single packet of pepper. On the right is a  photo of the contents of a packet of pepper.](diagrams/pepper.png)
#> 
#> Figure 27.1: (Left) A pepper shaker that Hadley once found in his hotel room. (Middle) `pepper[1]`. (Right) `pepper[[1]]`
#> 
#> This same principle applies when you use 1d `[` with a data frame: `df["x"]` returns a one-column data frame and `df[["x"]]` returns a vector.
#>