Comments on How to extract string from file, run filter, and replace in file with new value?
Parent
How to extract string from file, run filter, and replace in file with new value?
TASK
I am coding up ebooks to a specific standard, and have a script that converts a string into the correct titlecase for this publisher. When working with some public domain source files, one often gets this for a chapter title string:
<p>HERE IS MY TITLE</p>
Using VSCodium (FOSS VS Code alternative), I can open each file, select the string between the p tags, then run the titlecase script with a hotkey that I've assigned it to. I end up with
<p>Here Is My Title</p>
(VSCodium's native titlecase filter isn't up to this job.) I save the file, and go on to the next one.
If you only have a few of these to do, that's fine. But sometimes there can be dozens, and it gets very tedious.
QUESTION
Is there a way that I can script this? I have scratched my head over both awk and sed, thinking that these are my prime options. But (as a rank amateur) I cannot work out how to:
- iterate through all chapter-*.xhtml files in a directory,
- extract my string (ALWAYS line 12 in the file, only string on line, between <p>...</p> tags),
- **run my "external" titlecase filter on that string**,
- replace the original string with the new one in the source file,
- for all those files. :)
(The step in bold is the one that is my biggest stumbling block.)
UPDATE: Note that for my titlecase filter, ONLY the string between the tags can be used, so that step #2 (extracting the string) is mandatory. Both the answers so far look very promising, but is it possible to do something like e.g. a regex on sed -n '12p' in one answer?
The other answer suggests using pup although it would be helpful not to need extra packages if a simple regex would do.
UPDATE 2: for "real" data, one could download the ZIP of this commit in a GitHub repo - the files in question are found at: /src/epub/text/chapter-*.xhtml = the 12th line of every "chapter-nn.xhtml" file.
Post
The following users marked this post as Works for me:
User | Comment | Date |
---|---|---|
alx | Thread: Works for me Nice sed(1) regex! It looks obvious after seeing it. | Dec 6, 2023 at 00:34 |
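for file in chapter-*.xhtml
do
    sed -i -r "12s/\b([A-Z])([A-Z]+)/\1\L\2/g" "$file"
done
The -i -r options tell GNU sed to alter the file in place (-i) and to use extended regular expressions (-r). Keep the two options separate: written together as -ir, GNU sed takes the r as a backup suffix for -i and never switches extended regular expressions on.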
On line 12, match, starting at a word boundary (\b), one uppercase letter (group 1) followed by one or more uppercase letters (group 2). Group 1 is left untouched, group 2 is converted to lowercase (\L), and the substitution is repeated for every match on the line (/g).
Note that this will turn USB to Usb, USA to Usa, UNO to Uno and so on.
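To check the substitution before touching any files, the same expression can be run on a sample line read from stdin, without -i and without the line address. This is only a sketch: the sample titles are made up, the function name lower_tail is chosen here for illustration, and GNU sed is assumed because \b and \L are GNU extensions.

```shell
#!/bin/sh
# Same substitution as above, applied to every line of stdin
# instead of line 12 of a file. Assumes GNU sed.
lower_tail() {
    sed -r 's/\b([A-Z])([A-Z]+)/\1\L\2/g'
}

# A plain all-caps title.
printf '%s\n' '<p>HERE IS MY TITLE</p>' | lower_tail
# prints: <p>Here Is My Title</p>

# An acronym gets the same treatment, which may not be wanted.
printf '%s\n' '<p>THE USB PORT</p>' | lower_tail
# prints: <p>The Usb Port</p>
```

The lowercase p in the tags is left alone because the pattern only matches runs of two or more uppercase letters.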
To reach into subdirectories:
find -name "chapter-*.xhtml" -exec sed -i -r "12s/\b([A-Z])([A-Z]+)/\1\L\2/g" {} ";"
Again, GNU tools (find, sed) are assumed (the default on most Linux systems).
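If the title has to go through the external filter rather than \L, the extract, filter and replace steps from the question could be sketched roughly as follows. Several things here are assumptions, not part of the original answer: the filter is taken to read the title on stdin and print the result on stdout, TITLECASE is a placeholder variable for its command name, and retitle is a helper name made up for this sketch. GNU sed is still assumed.

```shell
#!/bin/bash
# Sketch: run an external filter on the text between <p>...</p> on line 12.
# TITLECASE is a placeholder; point it at your own script (assumed to read
# the title on stdin and print the converted title on stdout).
TITLECASE=${TITLECASE:-titlecase}

retitle() {
    local file=$1 old new
    # Extract only the text between the tags on line 12.
    old=$(sed -n -r '12s|.*<p>(.*)</p>.*|\1|p' "$file")
    [ -n "$old" ] || return 0    # no <p>...</p> on line 12: leave file alone
    # Run the external filter on the bare string.
    new=$(printf '%s\n' "$old" | $TITLECASE) || return 1
    # Escape characters that are special in a sed replacement:
    # backslash, ampersand and the | used as delimiter below.
    new=$(printf '%s\n' "$new" | sed 's/[\\&|]/\\&/g')
    # Put the converted string back between the tags on line 12.
    sed -i -r "12s|(<p>).*(</p>)|\1${new}\2|" "$file"
}

for file in chapter-*.xhtml
do
    [ -e "$file" ] || continue   # glob matched nothing
    retitle "$file"
done
```

Working on copies first is advisable, since -i overwrites each file without keeping a backup.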