

How to extract string from file, run filter, and replace in file with new value?


TASK

I am coding up ebooks to a specific standard, and have a script that converts a string into the correct titlecase for this publisher. When working with some public domain source files, one often gets this for a chapter title string:

<p>HERE IS MY TITLE</p>

Using VSCodium (FOSS VS Code alternative), I can open each file, select the string between the p tags, then run the titlecase script with a hotkey that I've assigned it to. I end up with

<p>Here Is My Title</p>

(VSCodium's native titlecase filter isn't up to this job.) I save the file, and go on to the next one.

If you only have a few of these to do, that's fine. But sometimes there can be dozens, and it gets very tedious.

QUESTION

Is there a way that I can script this? I have scratched my head over both awk and sed, thinking that these are my prime options. But (as a rank amateur) I cannot work out how to:

1. iterate through all chapter-*.xhtml files in a directory,
2. extract my string (ALWAYS line 12 in the file, only string on line, between <p>...</p> tags),
3. run my "external" titlecase filter on that string,
4. substitute the new string for the original one in the source file,

for all those files. :)

(The step in bold is the one that is my biggest stumbling block.)

UPDATE: Note that for my titlecase filter, ONLY the string between the tags can be used, so that step #2 (extracting the string) is mandatory. Both the answers so far look very promising, but is it possible to do something like e.g. a regex on sed -n '12p' in one answer?

The other answer suggests using pup although it would be helpful not to need extra packages if a simple regex would do.
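On the regex question, here is a minimal sketch of pulling out only the text between the tags with sed alone. It assumes line 12 holds exactly one `<p>...</p>` pair; the sample file and its contents are made up for illustration.

```shell
# Build a throwaway 12-line sample file (contents are made up for illustration).
sample=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10 11; do
    printf 'line %s\n' "$i"
done > "$sample"
printf '    <p>HERE IS MY TITLE</p>\n' >> "$sample"

# On line 12 only: keep what sits between <p> and </p>, print it,
# and suppress every other line (-n).
title=$(sed -n '12s#.*<p>\(.*\)</p>.*#\1#p' "$sample")
printf '%s\n' "$title"    # prints: HERE IS MY TITLE
rm -f "$sample"
```

Because `.*` is greedy, a line with more than one `<p>` element would not be split the way this sketch expects.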

UPDATE 2: for "real" data, one could download the ZIP of this commit in a GitHub repo - the files in question are found at: /src/epub/text/chapter-*.xhtml = the 12th line of every "chapter-nn.xhtml" file.

scripting sed text-processing

posted over 1 year ago

CC BY-SA 4.0


Post


Both replies, at different points, provided the basis for this working script. Assuming that the 12th line of each file contains something like:

    <p>HERE IS MY TITLE</p>

where HERE... begins at column 8 (I need to omit the opening <p> tag, as noted in the original post), then:

for filename in chapter-*.xhtml; do
    new12=$(sed -n '12p' "$filename" | cut -b 8- | se titlecase -n)
    sed -i -e '12s#^\(.*<p>\).*#'"\1$new12"'#g' "$filename"
done

The two middle lines work this way:

Line 2:

sed -n '12p' "$filename" = print the 12th line of the file
cut -b 8- = keep everything from the 8th byte (column) onward, so in this example, passing the string HERE IS MY TITLE</p> to the pipe
se titlecase -n = run the titlecase script (-n prevents it from generating a "newline")
all that assigned to $new12.
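To see what the `cut` step hands to the filter, here is a small sketch run on the sample line from above. It assumes the line is indented with exactly four spaces, as in the example.

```shell
# Sample of what line 12 looks like: four spaces, then the <p> element.
line12='    <p>HERE IS MY TITLE</p>'

# cut -b 8- keeps everything from byte 8 onward,
# dropping the 4 spaces and the 3 bytes of "<p>".
tail_part=$(printf '%s\n' "$line12" | cut -b 8-)
printf '%s\n' "$tail_part"    # prints: HERE IS MY TITLE</p>
```

Note that the offset is fixed: a file indented differently (a tab, say, or two spaces) would need a different byte position, and the closing `</p>` travels through the filter along with the title.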

Line 3:

sed -i -e '12s#^\(.*<p>\).*#'"\1$new12"'#g' = replace the original line 12 in place: everything up to and including the opening <p> is captured in a backreference group, so the replacement is \1 followed by the value of $new12.

Produces:

    <p>Here Is My Title</p>

in "$filename". Done. :)
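One caveat with the script above: `$new12` is pasted straight into the sed replacement, so a title containing `&` (for example `&amp;` in XHTML), `\` or `#` would be misread by sed. Below is a hedged sketch of a variant that escapes those characters and extracts only the text between the tags. `titlecase_filter` is a hypothetical stand-in so the example runs without the real titlecase script, the sample file is made up, and the temporary-file rewrite is used only to avoid depending on GNU sed's `-i`.

```shell
# Hypothetical stand-in for the real titlecase filter (`se titlecase` above):
# lowercases the input, then capitalizes the first letter of each word.
titlecase_filter() {
    awk '{
        n = split(tolower($0), w, " ")
        out = ""
        for (i = 1; i <= n; i++) {
            out = out (i > 1 ? " " : "") toupper(substr(w[i], 1, 1)) substr(w[i], 2)
        }
        print out
    }'
}

workdir=$(mktemp -d)
# Made-up sample: 11 filler lines, then the title on line 12.
{
    for i in 1 2 3 4 5 6 7 8 9 10 11; do printf 'line %s\n' "$i"; done
    printf '    <p>TOM &amp; JERRY</p>\n'
} > "$workdir/chapter-1.xhtml"

for filename in "$workdir"/chapter-*.xhtml; do
    # Extract only the text between the tags on line 12.
    old=$(sed -n '12s#.*<p>\(.*\)</p>.*#\1#p' "$filename")
    new=$(printf '%s\n' "$old" | titlecase_filter)
    # Escape \, & and # so sed inserts the new title literally.
    esc=$(printf '%s\n' "$new" | sed 's/[\\&#]/\\&/g')
    # Rewrite line 12 via a temporary file, keeping the indentation and both tags.
    sed '12s#\(.*<p>\).*\(</p>.*\)#\1'"$esc"'\2#' "$filename" > "$filename.tmp" &&
        mv "$filename.tmp" "$filename"
done

result=$(sed -n '12p' "$workdir/chapter-1.xhtml")
printf '%s\n' "$result"    # prints:     <p>Tom &amp; Jerry</p>
rm -rf "$workdir"
```

In real use, the `titlecase_filter` call would be swapped for the actual filter command, and the loop would run in the directory holding the chapter files.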


1 comment thread

Command substitution `$(...)` already strips the trailing newline. (1 comment)
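The comment's point can be checked with a short sketch: trailing newlines produced inside `$(...)` do not reach the variable. The string `Title` is just an illustrative value.

```shell
# printf emits "Title" followed by two newlines.
with_newlines=$(printf 'Title\n\n')

# Count the bytes that survived the substitution
# (tr strips the padding some wc implementations add).
len=$(printf '%s' "$with_newlines" | wc -c | tr -d ' ')
printf '%s\n' "$len"    # prints: 5
```

So inside a command substitution the `-n` flag is harmless but not needed; it would only matter if the filter's output went somewhere else, such as directly into a file.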