Comments on How to extract string from file, run filter, and replace in file with new value?

Parent

How to extract string from file, run filter, and replace in file with new value?

+4
−0

TASK

I am coding up ebooks to a specific standard, and have a script that converts a string into the correct titlecase for this publisher. When working with some public domain source files, one often gets this for a chapter title string:

   <p>HERE IS MY TITLE</p>

Using VSCodium (FOSS VS Code alternative), I can open each file, select the string between the p tags, then run the titlecase script with a hotkey that I've assigned it to. I end up with

   <p>Here Is My Title</p>

(VSCodium's native titlecase filter isn't up to this job.) I save the file, and go on to the next one.

If you only have a few of these to do, that's fine. But sometimes there can be dozens, and it gets very tedious.

QUESTION

Is there a way that I can script this? I have scratched my head over both awk and sed, thinking that these are my prime options. But (as a rank amateur) I cannot work out how to:

  1. iterate through all chapter-*.xhtml files in a directory,
  2. extract my string (ALWAYS line 12 in the file, only string on line, between <p>...</p> tags),
  3. run my "external" titlecase filter on that string,
  4. replace the new string for the original one in the source file,
  5. for all those files. :)

(The step in bold is the one that is my biggest stumbling block.)

UPDATE: Note that for my titlecase filter, ONLY the string between the tags can be used, so that step #2 (extracting the string) is mandatory. Both the answers so far look very promising, but is it possible to do something like e.g. a regex on sed -n '12p' in one answer?

The other answer suggests using pup although it would be helpful not to need extra packages if a simple regex would do.

UPDATE 2: for "real" data, one could download the ZIP of this commit in a Github repo - the files in question are found at: /src/epub/text/chapter-*.xhtml = the 12th line of every "chapter-nn.xhtml" file.


0 comment threads

Post
+2
−0
  • iterate through all chapter-*.xhtml files in a directory

Assuming bash, and assuming that at least one such file exists in the current directory (otherwise adjust the path and/or shopt -s nullglob), you can use a simple for loop to do this.

for filename in chapter-*.xhtml; do
    ...
done
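
As a minimal sketch of the nullglob variant mentioned above: with the option set, a pattern that matches no files expands to nothing, so the loop body is skipped rather than run once with the literal pattern. The `count` variable and the messages are only for illustration.

```bash
#!/usr/bin/env bash
# With nullglob set, an unmatched glob expands to an empty list instead
# of being left as the literal string "chapter-*.xhtml".
shopt -s nullglob

count=0
for filename in chapter-*.xhtml; do
    printf 'would process: %s\n' "$filename"
    count=$((count + 1))
done
printf '%d file(s) matched\n' "$count"
```
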
  • extract my string (ALWAYS line 12 in the file, only string on line, between <p>...</p> tags)

Since you know that this will always be line 12, the easiest-to-read way to do this is probably awk, in which it becomes:

line12=$(awk 'NR==12{print;exit;}' "$filename")

Do note that the resulting $line12 will include whitespace and tags in addition to the textual content of the tag. If this is a problem, you can use pup to extract only the text from within the <p> tag:

line12=$(awk 'NR==12{print;exit;}' "$filename" | pup 'p text{}')

in which case of course you will need to adjust the replacement step accordingly.

(pup is a tool to parse HTML and extract portions of it based on CSS selectors.)
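
If installing an extra package is not desirable, a plain sed substitution may be enough for this specific case. The following is a sketch under the question's stated assumptions (line 12 holds a single `<p>...</p>` element and nothing else apart from whitespace); `extract_title` is a name made up for this example.

```bash
#!/usr/bin/env bash
# Print only the text between <p> and </p> on line 12 of the given file.
# -n suppresses sed's default output; the trailing p flag prints line 12
# only if the substitution matched.
extract_title() {
    sed -n '12s#^[[:space:]]*<p>\(.*\)</p>[[:space:]]*$#\1#p' "$1"
}
```

It could then be used as `line12=$(extract_title "$filename")`. If line 12 does not have the expected shape, nothing is printed, which makes a mismatch easy to detect.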

  • run my "external" titlecase filter on that string

Assuming that titlecase is executable, accepts the old title on standard input, and emits the new title on standard output, you can pipe the output from awk above into titlecase, as in:

newline12=$(awk 'NR==12{print;exit;}' "$filename" | titlecase)
  • replace the new string for the original one in the source file

There are many ways to do this, but assuming that the replacement doesn't contain characters that sed treats specially in a replacement (here the # delimiter, &, or a backslash), you can do something similar to:

sed -i '12s#^.*$#'"$newline12"'#' "$filename"

This will replace the entirety of line 12 in the file with the contents of the $newline12 shell variable. Adjust the 12 if you need to replace a differently numbered line. I use # as delimiters here because the traditional / will conflict with the end-tag marker in </p>.

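-i is in-place editing mode (as written, this is the GNU sed form; BSD/macOS sed expects a backup-suffix argument, e.g. -i ''). If you omit it, sed will print the result on standard output, which you can redirect to another file: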

sed '12s#^.*$#'"$newline12"'#' "$filename" >"$filename".new
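
If a title might contain `&`, `#`, or a backslash, the replacement can be escaped before it is handed to sed. This is a sketch, assuming a single-line title and `#` as the delimiter; `escape_sed_replacement` is a helper name invented for this example, and the demo runs on one line of piped input rather than a real file.

```bash
#!/usr/bin/env bash
# Prefix each backslash, ampersand and '#' with a backslash so that sed
# treats them literally in the replacement part of s#...#...#.
escape_sed_replacement() {
    printf '%s' "$1" | sed -e 's/[\\&#]/\\&/g'
}

newline12='  <p>Tom & Jerry #1</p>'
escaped=$(escape_sed_replacement "$newline12")

# Same substitution as above, applied to a single line of input.
printf 'OLD LINE\n' | sed 's#^.*$#'"$escaped"'#'
```

Without the escaping step, the `&` would be expanded to the matched text and the `#` would end the replacement early.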

Putting it all together:

for filename in chapter-*.xhtml; do
    newline12=$(awk 'NR==12{print;exit;}' "$filename" | titlecase)
    sed -i '12s#^.*$#'"$newline12"'#' "$filename"
done

Example:

Input chapter-1.xhtml

<html>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
  <p>HERE IS MY TITLE</p>

</html>

Execution

I use an alias in place of your likely actual titlecase here, but the principle is exactly the same:

$ alias titlecase='tr A-Z a-z'
$ for filename in chapter-*.xhtml; do
    newline12=$(awk 'NR==12{print;exit;}' "$filename" | titlecase)
    sed -i '12s#^.*$#'"$newline12"'#' "$filename"
  done

Output chapter-1.xhtml

<html>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
  <p>here is my title</p>

</html>
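
One caveat when moving this from an interactive session into a script file: bash does not expand aliases in non-interactive shells unless `shopt -s expand_aliases` is set, so a shell function is a more dependable stand-in there. A minimal sketch, with `titlecase` again being only a placeholder for the real filter:

```bash
#!/usr/bin/env bash
# Placeholder for the real filter: reads standard input and lowercases it.
titlecase() {
    tr 'A-Z' 'a-z'
}

printf '  <p>HERE IS MY TITLE</p>\n' | titlecase
```
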

1 comment thread

Thanks @#8049 - helpful! I see from your answer (and even more the other one) that I need to edit my ... (4 comments)
David‭ wrote about 1 year ago

Thanks Canina‭ - helpful! I see from your answer (and even more the other one) that I need to edit my original post to clarify one point. But there is a pending edit which I don't have rep to approve, and it's blocking me being able to edit my own post. Could you (as mod) approve, then I can update? Thanks again!

Canina‭ wrote about 1 year ago · edited about 1 year ago

David‭ Done. However, please keep in mind that if an edit would invalidate existing answers, it is generally better to post a new question incorporating what you have learned and highlighting how the new is different, than to edit an original question in such a way that existing answers no longer apply to the new version. Answer-invalidating question edits may be rolled back. It's fine to link back to a previous question for context.

David‭ wrote about 1 year ago

Thanks for that - in this case (I think you'll see) the update was just to foreground something already in the existing question, not to alter it in any way. The existing answers still very much apply, and your pup suggestion already addressed it to some degree.

Thanks for the answer, and for the follow-up. I'm trying to apply what I'm learning!

David‭ wrote about 1 year ago

I think I got it! This actually works (using your approach and structure):

for filename in chapter-*.xhtml; do
    new12=$(sed -n '12p' "$filename" | cut -b 8- | se titlecase -n)
    sed -i -e '12s#^\(.*<p>\).*#'"\1$new12"'#g' "$filename"
  done

(Hopefully that will show as "code" with Markdown.) The se titlecase -n is my bespoke "filter".