0

How to mass edit XML?

asked 8 hours ago by @qa-jgskzsf9qum0inpv8mcg 0 rep · 75 views

xml

I have a bunch of XML files that need some corrections.

<body>
  <text>
    <text>
       second level text before
    </text>
    main text
    <text>
       second level text after
    </text>
  </text>
</body>

I need to rename every "second level" text tag to comment, so that the final XML becomes:

<body>
  <text>
    <comment>
       second level text before
    </comment>
    main text
    <comment>
       second level text after
    </comment>
  </text>
</body>

Is there a tool which can help with this?


7 answers

1

In xsh, you can write

rename comment /body/text/text ;

to rename every text element that is a child of a text element in body to comment.

To process several files, you can write a longer xsh script:

for $file in { glob '*.xml' } {
    open $file ;
    rename comment /body/text/text ;
    save :b ;  # Leave a backup.
}

Notice: I am the current maintainer of the tool.

Jamie Reed · 0 rep · 8 hours ago

1

You can use XSLT for this task. It is the transformation language native to the XML technology stack.

Jordan Lopez · 0 rep · 8 hours ago

1

An XSLT 3.0 solution:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:template name="xsl:initial-template">
    <xsl:for-each select="collection('dir?select=*.xml')">
      <xsl:result-document href="out/{tokenize(base-uri(.), '/')[last()]}">
        <xsl:apply-templates/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
  <xsl:template match="text/text">
    <comment><xsl:apply-templates/></comment>
  </xsl:template>
</xsl:transform>

The input directory and the output file names will need some adjustment for your setup.

Alex Brooks · 0 rep · 8 hours ago

1

If you're going to use Python, I'd use the ElementTree API in the standard library over the text-based approach you describe here. That includes basic XPath support you can use to find /body/text/text child elements.
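A minimal sketch of that idea, assuming the files look like the sample in the question (no namespace, body as the root element). The function name rename_second_level_text is made up for this example. Note that ElementTree's XPath subset is evaluated relative to an element, so the path is written as ./text/text from the body root rather than /body/text/text.

```python
import xml.etree.ElementTree as ET

# Sample copied from the question; real files would be loaded with ET.parse(path).
SAMPLE = """<body>
  <text>
    <text>
       second level text before
    </text>
    main text
    <text>
       second level text after
    </text>
  </text>
</body>"""


def rename_second_level_text(root):
    """Rename each <text> directly inside a top-level <text> to <comment>.

    `root` is assumed to be the <body> element. Returns the number of
    elements that were renamed.
    """
    matches = root.findall("./text/text")  # findall returns a list, safe to mutate
    for element in matches:
        element.tag = "comment"
    return len(matches)


root = ET.fromstring(SAMPLE)
renamed = rename_second_level_text(root)
result = ET.tostring(root, encoding="unicode")
print(renamed)
print(result)
```

For files on disk, the same function can be applied to ET.parse(path).getroot() and the tree written back with tree.write(path, encoding="utf-8", xml_declaration=True). Be aware that the standard parser discards XML comments by default, so keep a backup before overwriting anything.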

Jordan Garcia · 0 rep · 8 hours ago

0

I realize this is not really sophisticated, but if, and only if, your XML is formatted exactly as you've shown (with the spaces and newlines exactly like that), then you can just use this script I made for you:

import os
from sys import exit

file_to_fix = input("file to fix: ")

if not os.path.exists(file_to_fix):
    print("File doesn't exist.")
    exit(1)

if not os.path.isfile(file_to_fix):
    print("Path is not file.")
    exit(1)

dry_run = None
while dry_run is None:
    dry_run_read = input("Dry run (y/n)? ").lower()
    if dry_run_read == "y":
        dry_run = True
    elif dry_run_read == "n":
        dry_run = False
    else:
        print("Invalid input, either y or n.")

file = open(file_to_fix, "r")
file_contents = file.read()

lines = file_contents.split("\n")

# Exact second-level tag lines (four-space indent) and their replacements.
# Both the opening and the closing tag must be renamed to keep the XML well-formed.
replacements = {
    "    <text>": "    <comment>",
    "    </text>": "    </comment>",
}

for index, line in enumerate(lines):
    if line in replacements:
        if dry_run:
            print("Would've fixed line " + str(index + 1) + ".")
        else:
            print("Fixed line " + str(index + 1) + ".")
            lines[index] = replacements[line]

file.close()

if not dry_run:
    file = open(file_to_fix, "w")
    print("Writing file...")
    file.write("\n".join(lines))
    file.close()
    print("Wrote file.")
    print("Fixed file '" + file_to_fix + "'.")
else:
    print("Would've written file.")
    print("would've fixed file '" + file_to_fix + "'.")

Proof that it works (my terminal output):

/tmp/tmp.2MoTo46mcK via 🐍 v3.14.3 
❯ cat file_to_fix.xml
<body>
  <text>
    <text>
       second level text before
    </text>
    main text
    <text>
       second level text after
    </text>
  </text>
</body>

/tmp/tmp.2MoTo46mcK via 🐍 v3.14.3 
❯ python patcher.py
file to fix: file_to_fix.xml
Dry run (y/n)? y
Would've fixed line 3.
Would've fixed line 5.
Would've fixed line 7.
Would've fixed line 9.
Would've written file.
would've fixed file 'file_to_fix.xml'.

/tmp/tmp.2MoTo46mcK via 🐍 v3.14.3 took 3s 
❯ python patcher.py
file to fix: file_to_fix.xml
Dry run (y/n)? n
Fixed line 3.
Fixed line 5.
Fixed line 7.
Fixed line 9.
Writing file...
Wrote file.
Fixed file 'file_to_fix.xml'.

/tmp/tmp.2MoTo46mcK via 🐍 v3.14.3 took 4s 
❯ cat file_to_fix.xml
<body>
  <text>
    <comment>
       second level text before
    </comment>
    main text
    <comment>
       second level text after
    </comment>
  </text>
</body>

/tmp/tmp.2MoTo46mcK via 🐍 v3.14.3 
❯ echo "yay"
yay

Note that this will work if and only if the file is formatted EXACTLY as you've shown; I cannot stress this enough.

But yeah as you can see, works with your example.

Best of luck and good day to you :)

Reese Reed · 0 rep · 8 hours ago

0

How to work with namespaces in xsh?

In my actual document I have

<?xml version="1.0" encoding="utf-8"?>
<doc xmlns="http://www.examle.con/xml/document/2.0" xmlns:xlink="http://www.w3.org/1999/xlink">
<body>
  <text>
    <text>
       second level text before
    </text>
    main text
    <text>
       second level text after
    </text>
  </text>
</body>
</doc>

I am running xsh -I sample.xml list //text and get:

parsing sample.xml
done.

Found 0 node(s).

Without the namespace in the document, I get the expected list of three nodes.

And yes, the URL in xmlns has a proper xsd file and both variations text/text and text/comment are allowed.
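This is not xsh, but the same behaviour can be reproduced with Python's ElementTree, which may help show what is going on: with a default namespace on doc, every element name is namespace-qualified, so an unprefixed path such as //text matches nothing. The sketch below is only an illustration under that assumption; the prefix d is an arbitrary choice for the example, and the namespace URI is copied verbatim from the sample above.

```python
import xml.etree.ElementTree as ET

DOC_NS = "http://www.examle.con/xml/document/2.0"  # copied from the sample document
NS = {"d": DOC_NS}  # "d" is an arbitrary prefix chosen for this sketch

SAMPLE = """<doc xmlns="http://www.examle.con/xml/document/2.0" xmlns:xlink="http://www.w3.org/1999/xlink">
<body>
  <text>
    <text>
       second level text before
    </text>
    main text
    <text>
       second level text after
    </text>
  </text>
</body>
</doc>"""

root = ET.fromstring(SAMPLE)

# Unprefixed names only match elements in no namespace, so nothing is found.
unprefixed = root.findall(".//text")

# Mapping a prefix to the document's namespace makes the same query match.
prefixed = root.findall(".//d:text", NS)

# Rename only the second-level elements, keeping them in the same namespace.
second_level = root.findall("./d:body/d:text/d:text", NS)
for element in second_level:
    element.tag = "{" + DOC_NS + "}comment"

# Serialize with the document namespace as the default (no ns0: prefixes).
ET.register_namespace("", DOC_NS)
result = ET.tostring(root, encoding="unicode")

print(len(unprefixed), len(prefixed), len(second_level))
print(result)
```

One caveat of this sketch: ElementTree only writes namespace declarations that are actually used, so the unused xmlns:xlink declaration is not reproduced in the output.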

Drew Diaz · 0 rep · 8 hours ago

0

xmlstarlet edit accepts multiple pathnames as input, for example:

# shellcheck shell=sh disable=SC2154,SC2086

export infileglob='./file*.xml abc/file*.xml def/*.xml'
test "$bettersorrythansafe" || zip keep.zip $infileglob

xmlstarlet edit -P -L \
  -r '_:doc/_:body/_:text/_:text' -v 'comment' \
$infileglob

  • using Info-Zip to create a backup (which edit cannot do)
  • -P preserves input's original formatting
  • -L edits files in-place (NB: timestamps are always updated)
  • xmlstarlet's edit and select subcommands use the namespace definitions in the outermost element of the first input file
  • the _ (underscore) is xmlstarlet's default namespace shortcut

To undo the edits, run unzip -o keep.zip $infileglob (-o overwrites existing files without prompting).

Alternatively, use {} + with find's -exec operand to fill the command line:

find . -type f -name '*.xml' -exec \
  xmlstarlet edit -P -L \
    -r '_:doc/_:body/_:text/_:text' -v 'comment' \
  {} +

Skyler Chen · 0 rep · 8 hours ago
