dj(1) - disk jockey #15

Closed
opened 2023-12-26 09:53:58 -07:00 by trinity · 14 comments
Owner

This is a proposal for a Bonsai analogue to [dd(1p)](https://www.man7.org/linux/man-pages/man1/dd.1p.html)'s input/output redirection and control functionality, or disk jockeying, and pertains to #23.

Much like how a deejay / DJ / disc jockey queues up certain parts of records and plays them at certain times during their performance, dj(1) queues up certain parts of files and outputs them to certain parts of other files.

Usage: dj
    (-i [input file]) (-b [input buffer size]) (-s [input start])
    (-o [output file]) (-B [output buffer size]) (-S [output start])
    (-c [count])

For example, to copy 4 blocks of size 64B, starting at 0x80, from in, to out at 0x80, with a 512B write buffer:

dj -i in -b 64 -s 128 -o out -B 512 -S 128 -c 4

The syntax seems intuitive to me except for -b, -B, -s, and -S, which could easily be confused with one another.

A - given as the input or output file would refer to /dev/stdin or /dev/stdout respectively. Input and output files have to be specifiable because neither standard input nor standard output is guaranteed to be seekable. With standard input and a nonzero -s, [input seek] bytes would be read and discarded. With standard output and a nonzero -S, [output seek] nul bytes would be written before anything from the input. It may be better to error out when standard input is combined with -s, or standard output with -S, so as to avoid unintended behavior.

Invoked without arguments, the following sane defaults would be assumed:

dj -i - -b 1024 -s 0 -o - -B 1024 -S 0 -c 0

A count of 0 would copy the entirety of the input file.

Pseudocode for the program would be something like this:

if the input is stdin, read [input seek] bytes from the stream
otherwise, skip [input seek] bytes in the file

if the output is stdout, print [output seek] nul bytes
otherwise, skip [output seek] bytes in the file

allocate [input buffer] bytes for the input buffer
allocate [output buffer] bytes for the output buffer

repeat the following
    fill the input buffer from the input file
    if the input buffer isn't saturated, count becomes 1
    while the input buffer isn't empty
        fill the output buffer from the input buffer
        write the output buffer
    if count isn't 0
        decrement count; if it's 0, exit

dj(1) would be as useful as dd(1) for this purpose but not an improvement per se. However, its invocation syntax would make sense, unlike dd(1)'s, which would be nice.
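For illustration, the copy loop in the pseudocode above might be sketched in C like this. It is not the actual dj(1) implementation: the names dj_copy and write_all are hypothetical, and instead of a separate output buffer the sketch simply writes the input buffer out in chunks of at most obs bytes.

```c
#include <stdlib.h>
#include <unistd.h>

/* Write exactly n bytes, retrying on short writes; 0 on success, -1 on error. */
static int
write_all(int fd, const char *buf, size_t n){
    while(n > 0){
        ssize_t w = write(fd, buf, n);
        if(w < 0) return -1;
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Copy `count` input blocks of `ibs` bytes from `in` to `out`, writing in
 * chunks of at most `obs` bytes. A count of 0 copies until the input ends.
 * Returns the total number of bytes copied, or -1 on error. */
static long
dj_copy(int in, int out, size_t ibs, size_t obs, size_t count){
    char *buf;
    long total = 0;

    if(ibs == 0 || obs == 0 || (buf = malloc(ibs)) == NULL) return -1;

    for(;;){
        ssize_t r = read(in, buf, ibs);
        if(r < 0){ free(buf); return -1; }
        if((size_t)r < ibs) count = 1; /* buffer not saturated: last block */

        for(size_t off = 0; off < (size_t)r; ){
            size_t chunk = (size_t)r - off;
            if(chunk > obs) chunk = obs;
            if(write_all(out, buf + off, chunk) != 0){ free(buf); return -1; }
            off += chunk;
        }
        total += r;

        if(count != 0 && --count == 0) break;
    }
    free(buf);
    return total;
}
```

As in the pseudocode, a short read sets count to 1, so the loop stops after the current block whether or not a count was given.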

Owner

thoughts on removing -i and -o?

dj [-b input_size] [-B output_size] [-s input_start] [-S output_start] file1 [file2]

if file2 is not specified, use stdout.

emma added the enhancement label 2023-12-26 11:21:48 -07:00
Author
Owner

The number of scenarios where one would want an input file and standard output, and the number where one would want standard input and an output file, seem roughly equal to me. Requiring an input argument even if standard input would favor the former scenario ergonomically (dj in versus dj - out).

Owner

> Requiring an input argument even if standard input would favor the former scenario ergonomically (dj in versus dj - out).

please rephrase this, i have no idea what this is saying

Author
Owner

> > Requiring an input argument even if standard input would favor the former scenario ergonomically (dj in versus dj - out).
>
> please rephrase this, i have no idea what this is saying

For example, using your proposed argument format.

Using dj(1) to write a file directly to a block device (this is a weird use case but a friend and I did this yesterday):

$ tar cz book.pdf | tee book.pdf.tar.gz | doas dj -B 4096 - /dev/sdc # dj's argc == 5
$ wc -c <book.pdf.tar.gz
3333

(Perhaps dj(1) should output to standard error a summary of operations, like dd(1p).)

Using dj(1) to read a file directly from a block device:

$ doas dj -c 3333 /dev/sdc | tar -C Downloads -zx # dj's argc == 4

There's an asymmetry. dj(1) needs argc == 5 to write but argc == 4 to read, each with one option and one option argument specified. It might be a little silly but I really don't like the imbalance.

Owner

you get free rein on how this works :)

Author
Owner

Some thoughts about dj(1) I have today.

Some dd(1p) behavior I find interesting:

  • If a partial block is read (e.g. read(fd, buf, ibs) < ibs), dd(1p) operates on that partial block as-is. This means dd if=/dev/zero bs=10 count=10 | wc -c can output any value from 0 to bs * count. Many would assume this could only output 100.
  • If a partial block is read and conv=sync is specified, dd(1p) will fill the unused bytes of the input buffer with nuls (or spaces if conv=block or conv=unblock is also specified).

Character conversions, including ascii, ebcdic, ibm, block, unblock, lcase, ucase, and swab, aren't relevant to dj(1), but noerror, notrunc, and sync probably are and should be implemented as well.
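To make the partial-block point concrete, here is a sketch of the behavior many people expect but dd(1p) does not provide by default: keep calling read(2) until a whole block is buffered or the input ends. The helper name read_full is hypothetical; GNU dd offers something similar in spirit through iflag=fullblock.

```c
#include <unistd.h>

/* Hypothetical helper: read until n bytes are buffered or the input ends.
 * Returns the number of bytes read (less than n only at end of input),
 * or -1 on error. */
static ssize_t
read_full(int fd, char *buf, size_t n){
    size_t got = 0;

    while(got < n){
        ssize_t r = read(fd, buf + got, n - got);
        if(r < 0) return -1;
        if(r == 0) break; /* end of input */
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```

A single read(2), by contrast, may return fewer bytes than requested even when more input will arrive later, which is where the unexpected byte counts come from.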

Author
Owner

Actually, not truncating the output file should be the default. Truncation can be achieved easily:

dj -i in > out # through stdout and a shell redirect
>out; dj -i in -o out # the same but with dj handling out as a file
Author
Owner

There are some nuances I'm trying to figure out.

One is specifically at the last dd(1p) invocation here:

$ </dev/zero dd bs=10 count=10 | dd skip=1000

POSIX says the following:

> skip=n
> Skip n input blocks (using the specified input block
> size) before starting to copy. On seekable files, the
> implementation shall read the blocks or seek past them;
> on non-seekable files, the blocks shall be read and the
> data shall be discarded.

Because only 100 bytes will be written to the pipe, from which 1000 bytes are to be skipped, Busybox dd(1) exits with 0+0 records read. In strace and ltrace it seems its dd(1) gives read(2) a second try if it initially returns 0 (due to an unavailable device or something), so I implemented this:

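#include <unistd.h>

/* read(2) wrapper that retries once if the first call returns 0 */
static ssize_t
_read(int fd, void *buf, size_t bs){
    ssize_t retval; /* read(2) returns ssize_t, not int */

    return ((retval = read(fd, buf, bs)) == 0)
        ? read(fd, buf, bs) /* second chance */
        : retval;
}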

Busybox dd(1) exits saying 0+0 because the reading for the purpose of skipping failed. This behavior is the same whether or not conv=noerror is specified.

I find the conv=noerror description to be a little confusing:

> noerror
> Do not stop processing on an input error. When
> an input error occurs, a diagnostic message
> shall be written on standard error, followed
> by the current input and output block counts
> in the same format as used at completion (see
> the STDERR section). If the sync conversion is
> specified, the missing input shall be replaced
> with null bytes and processed normally;
> otherwise, the input block shall be omitted
> from the output.

The end of a file apparently doesn't count as an ignorable error, which makes sense. I wonder how to simulate an ignorable error though.

Author
Owner

I just noticed dd(1p) specifies that seek=100 and skip=100 seek/skip 100 blocks, not bytes. I'm not replicating that behavior because it sucks.

Author
Owner

Busybox dd(1) fails to seek on streams because lseek(2) returns -1 and sets errno to 29 (ESPIPE), "Invalid seek". I don't think this follows POSIX, which says the following:

> seek=n
> Skip n blocks (using the specified output block size)
> from the beginning of the output file before copying.
> On non-seekable files, existing blocks shall be read
> and space from the current end-of-file to the specified
> offset, if any, filled with null bytes; on seekable
> files, the implementation shall seek to the specified
> offset or read the blocks as described for non-seekable
> files.

I'll be following POSIX for this behavior of dj(1) (though with bytes, not blocks).

Author
Owner

Busybox dd(1) doesn't count skipped records written as nuls in its statistics output. Given the use of the phrasing "before copying" I think this is fine.

Author
Owner

POSIX dd(1p) does this:

> sync
> Pad every input block to the size of the ibs=
> buffer, appending null bytes. (If either block
> or unblock is also specified, append <space>
> characters, rather than null bytes.)

dj(1) will have a -a option ("Align") to follow this behavior.

trinity self-assigned this 2024-01-04 07:00:53 -07:00
Author
Owner

-A will sync using nuls as the padding; -a will take a single-byte argument and use that as the padding instead.
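The padding step itself is small. A minimal sketch, assuming a hypothetical helper named pad_block that is called after a short read, with pad set to 0 for -A or to the byte given to -a:

```c
#include <string.h>

/* Hypothetical helper: given a block of size bs holding `got` valid bytes,
 * fill the remaining bytes with `pad`. Returns the number of bytes the
 * caller should now treat as valid. */
static size_t
pad_block(char *buf, size_t got, size_t bs, char pad){
    if(got < bs){
        memset(buf + got, pad, bs - got);
        return bs;
    }
    return got;
}
```

Whether a read of zero bytes (end of input) should be padded at all is left to the caller in this sketch.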

Owner

Make sure to close issues when you’re done with them!

emma closed this issue 2024-01-23 15:03:08 -07:00
Reference: bonsai/harakit#15