Lord.DragonFly.of.Dawn Advanced Member

Joined: 18 Jul 2025 Posts: 607 Location: South Portland, Maine, USA, Earth, Sol System
|
Posted: Tue Mar 16, 2025 7:14 pm Post subject: Command Line XPath selection |
Web Scrapers
We've all written them. We all hate them.
What do we hate about them?
Well, often we end up selecting a link to download like this:
Code: | NEXT="http://target.server"`cat ${THIS}|grep next|tr \" '\n' |grep albums|head -n1` |
What's ugly about that? Well, for one, it's at least partially positional. If the website changes its style even slightly, your script will fail and you will have to figure out the new positions. Also, if you are looking for multiple things, the command string gets more complex with every addition and the pipe chain gets correspondingly harder to understand!
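To make the "positional" complaint concrete, here is a small sketch of my own (not part of the utility). The two markup snippets are made up, `positional()` only imitates the grep/tr/grep/head chain, and `structural()` uses the standard library's ElementTree as a stand-in for a real XPath engine, so it needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same page before and after the site adds a "previous" link.
OLD = '<div><a title="Next Image" href="/albums/3">next</a></div>'
NEW = ('<div><a title="Previous Image" href="/albums/1">prev</a> '
       '<a title="Next Image" href="/albums/3">next</a></div>')

def positional(markup):
    """Imitate: grep next | tr '"' '\\n' | grep albums | head -n1"""
    for line in markup.splitlines():
        if "next" not in line:
            continue
        for piece in line.split('"'):
            if "albums" in piece:
                return piece  # first hit wins, wherever it happens to sit
    return None

def structural(markup):
    """Ask for the link by what it is, not by where it is."""
    link = ET.fromstring(markup).find(".//a[@title='Next Image']")
    return None if link is None else link.get("href")

for page in (OLD, NEW):
    print(positional(page), structural(page))
```

The positional version silently starts returning the "previous" link once the layout changes, while the structural one keeps returning the same href.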
What we'd rather do is issue a command like this:
Code: | NEXT="http://target.server$(XPath -f "${THIS}" -x "//a[@title='Next Image']/@href"|head -n1)" |
See how much cleaner that is? Well-formed HTML looks a lot like XML, and there are a few XML parsers out there that have an HTML-compatible mode. But using those parsers usually requires some significant coding, which is a bit disproportionate to the problem at hand.
But we do have the advantage of a general solution. The site can change its styling and you are less likely to break your script, and sites with a less rigid style can be searched as well. Even better, multiple searches can be built into a single command without it getting needlessly complicated!
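As a rough illustration of the "multiple searches in a single command" idea, the sketch below parses a page once and runs several named queries against the same tree. Everything in it is an assumption for demonstration purposes: the page content, the `select_all` helper and the query names are invented, and it uses the standard library's ElementTree, which only understands a subset of XPath (no `/@href` step, hence the separate attribute name).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this demonstration.
PAGE = """<html><body>
<a title="Previous Image" href="/albums/1">prev</a>
<a title="Next Image" href="/albums/3">next</a>
<img id="main" src="/img/2.jpg"/>
</body></html>"""

def select_all(markup, queries):
    """Parse once, then evaluate every (path, attribute) pair on the same tree."""
    root = ET.fromstring(markup)
    results = {}
    for name, (xpath, attribute) in queries.items():
        results[name] = [node.get(attribute) for node in root.findall(xpath)]
    return results

QUERIES = {
    "next": (".//a[@title='Next Image']", "href"),
    "image": (".//img[@id='main']", "src"),
}

found = select_all(PAGE, QUERIES)
print(found)
```

Adding a third thing to look for is one more dictionary entry, not another stage in a pipe chain.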
But who shall free us from our private h*** of grep, sed, and awk? Not me, that's for certain! But I will add to your collection of utilities a new command....
I give to you XPath!
Code: |
#!/usr/bin/python
# Filename: XPath
#
# Copyright (c) 2025 Patrick Libby <[email protected]>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from io import BytesIO
from optparse import OptionParser
from os import path
import sys

from lxml import etree


class XPathSelector(object):
    def __init__(self, data, asHTML=False):
        """XPathSelector(data) -> a new XPathSelector ready to be used.
        data is the raw document as bytes.
        Raises errors as defined by lxml.etree.parse()
        """
        if asHTML:
            parser = etree.HTMLParser()
        else:
            parser = etree.XMLParser()
        # Parse from bytes so lxml can honour the document's own encoding.
        self.tree = etree.parse(BytesIO(data), parser)

    def GetValues(self, path):
        """GetValues(xpath) -> the result of evaluating the XPath.
        Usually a list of matches; a bare string, number or boolean for
        expressions such as string(...) or count(...).
        """
        return self.tree.xpath(path)


if __name__ == "__main__":
    def getCMDParser():
        """ getCMDParser() -> Build the OptionParser that shall be used to parse
        the command line.
        """
        parser = OptionParser()
        parser.add_option("-x", "--xpath", dest="xpaths", default=[],
                          help="Select using this XPath", action="append")
        parser.add_option("-X", "--XML", dest="html", default=None,
                          help="Interpret file as XML", action="store_false")
        parser.add_option("-H", "--HTML", dest="html", default=None,
                          help="Interpret file as HTML", action="store_true")
        parser.add_option("-f", "--file", dest="filename",
                          help="Read file from FILE. (default: stdin)",
                          metavar="FILE", default=None)
        parser.add_option("-v", "--verbose", dest="verbose",
                          help="Output additional information about processing",
                          action="count", default=0)
        parser.add_option("-q", "--quiet", dest="silent",
                          help="Output nothing but results (overrides --verbose)",
                          action="store_true", default=False)
        parser.add_option("-s", "--silent", dest="silent",
                          help="Output nothing but results (overrides --verbose)",
                          action="store_true", default=False)
        parser.add_option("-n", "--new-lines", dest="newLines",
                          help="Suppress new lines in match results, replace with specified character(s)",
                          metavar="CHARACTER", default=None)
        parser.add_option("-r", "--carriage-returns", dest="carriageReturns",
                          help="Suppress carriage returns in match results, replace with specified character(s)",
                          metavar="CHARACTER", default=None)
        return parser

    def getFileContents(name):
        """ getFileContents(name) -> read file and return contents as bytes
        """
        if name:
            # Expand ~user and ${variable} constructs
            with open(path.expandvars(path.expanduser(name)), "rb") as f:
                return f.read()
        return sys.stdin.buffer.read()

    def guessType(name):
        """guessType(name) -> True if file appears to be HTML, else False
        Based off filename only. If this is resulting in false readings
        look into using the --HTML and --XML options.
        """
        if not name:
            return False
        isHTML = ['.xhtml', '.htm', '.html', '.cgi', '.asp']
        ext = path.splitext(name)[1]
        return ext.lower() in isHTML

    # BEGIN: Parse command line
    p = getCMDParser()
    (options, args) = p.parse_args()
    # An explicit --HTML or --XML always wins; only guess when neither was given.
    if options.html is not None:
        asHTML = options.html
    else:
        asHTML = guessType(options.filename)
    data = getFileContents(options.filename)
    # Always emit UTF-8, whatever the locale says (needs Python 3.7+).
    sys.stdout.reconfigure(encoding="utf-8")
    if not options.silent and options.verbose >= 3:
        print("xpaths: '{0}'".format(options.xpaths))
        print("asHTML: '{0}'".format(asHTML))
        print("filename: '{0}'".format(options.filename))
        print("verbosity: '{0}'".format(options.verbose))
        print("silent: '{0}'".format(options.silent))
        print("new lines: '{0}'".format(options.newLines))
        print("carriage returns: '{0}'".format(options.carriageReturns))
    if not options.silent and options.verbose >= 4:
        print('data: """{0}"""'.format(data.decode("utf-8", "replace")))
    # END: Parse command line
    # Create the Selector
    xps = XPathSelector(data, asHTML)
    # Perform selects sequentially
    for xpath in options.xpaths:
        if not options.silent and options.verbose >= 1:
            # verbose feedback
            print("Searching: '{0}'".format(xpath))
        matches = xps.GetValues(xpath)
        if not isinstance(matches, list):
            # count(...), string(...) and friends return a single value
            matches = [matches]
        for match in matches:
            # interpret results
            if isinstance(match, etree._Element):
                # Print the whole matched element. This can cause unexpected
                # output for nested results. To avoid, select only text elements
                # through the use of the text() selector or through attribute
                # selectors or craft your select so that matching elements are
                # not nested within other matching elements.
                result = etree.tostring(match, encoding="unicode")
            else:
                # did not match an element node. result is text or numeric.
                result = str(match)
            if options.newLines is not None:
                # suppress new lines, replace with newLines
                result = result.replace("\n", options.newLines)
            if options.carriageReturns is not None:
                # suppress carriage returns, replace with carriageReturns
                result = result.replace("\r", options.carriageReturns)
            # Print the result
            print(result)
|
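One detail worth spelling out is how the -H/-X switches and the file-name guess are meant to interact: an explicit switch should always win, and the extension is only consulted when neither switch was given. The sketch below isolates just that decision. It is an illustration, not the script itself: it swaps in `argparse` for `optparse`, and `build_parser` and `resolve_as_html` are names I made up.

```python
import argparse
from os import path

HTML_EXTENSIONS = {".xhtml", ".htm", ".html", ".cgi", ".asp"}

def build_parser():
    """Three-state switch: None (not given), True (-H) or False (-X)."""
    parser = argparse.ArgumentParser(prog="XPath")
    parser.add_argument("-H", "--HTML", dest="html", action="store_true", default=None)
    parser.add_argument("-X", "--XML", dest="html", action="store_false", default=None)
    parser.add_argument("-f", "--file", dest="filename", default=None)
    return parser

def resolve_as_html(options):
    """Explicit switch first, then the file extension, then default to XML."""
    if options.html is not None:
        return options.html
    if not options.filename:
        return False
    return path.splitext(options.filename)[1].lower() in HTML_EXTENSIONS

for argv in (["-f", "page.html"], ["-X", "-f", "page.html"], ["-H"], []):
    print(argv, "->", resolve_as_html(build_parser().parse_args(argv)))
```

Testing the switch against `None` rather than for truthiness is what keeps `-X` from being quietly overridden by a `.html` file name.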
Let's see some examples!
First a quick warm-up that grabs the search box from Google, then let's select the hyperlinks for all the forums on usalug.
Code: | [dragonfly@Ito ~]$ curl http://www.google.com 2>/dev/null|XPath -H -x "//input[@name='q']"
<input autocomplete="off" maxlength="2048" name="q" size="55" class="lst" title="Google Search" value=""/>
[dragonfly@Ito ~]$ curl http://usalug.org/phpBB2/ 2>/dev/null|XPath -H -x "//a[@class='forumlink']"
<a href="viewforum.html?f=29&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Application News and Releases</a>
<a href="viewforum.html?f=114&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Job Classifieds - Job Opportunities</a>
<a href="viewforum.html?f=2&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Installation and Bootloaders</a>
<a href="viewforum.html?f=16&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Networking</a>
<a href="viewforum.html?f=3&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Servers and Server Administration</a>
<a href="viewforum.html?f=8&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Hardware</a>
<a href="viewforum.html?f=7&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">System Administration and Security</a>
<a href="viewforum.html?f=82&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Linux Education and Certification</a>
<a href="viewforum.html?f=10&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Distributions</a>
<a href="viewforum.html?f=18&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">OTHER</a>
<a href="viewforum.html?f=120&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Software for Business</a>
<a href="viewforum.html?f=127&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">K12LTSP and Educational Applications</a>
<a href="viewforum.html?f=4&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Window Managers</a>
<a href="viewforum.html?f=62&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Graphics Applications</a>
<a href="viewforum.html?f=63&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Audio and Video Applications</a>
<a href="viewforum.html?f=64&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Web Browsers & Email Clients</a>
<a href="viewforum.html?f=133&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">All other software.</a>
<a href="viewforum.html?f=126&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Command Line Commands</a>
<a href="viewforum.html?f=15&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Shell Scripting and Programming</a>
<a href="viewforum.html?f=134&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Member Blogs</a>
<a href="viewforum.html?f=98&sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Reviews and Interviews</a>
|
Hmm... yes, but we can do better. We want just the link part....
Code: | [dragonfly@Ito ~]$ curl http://usalug.org/phpBB2/ 2>/dev/null|XPath -H -x "//a[@class='forumlink']/@href"
viewforum.html?f=29&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=114&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=2&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=16&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=3&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=8&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=7&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=82&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=10&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=18&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=120&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=127&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=4&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=62&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=63&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=64&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=133&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=126&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=15&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=134&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=98&sid=b5d3ae3abc42c218d588ea049441bce9
|
Ah. That's better.
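Those hrefs are relative to the page they came from, so before handing them to a downloader you will probably want full URLs. Gluing a host name on the front works until the site mixes relative and absolute links. Here is a minimal sketch using the standard library's `urljoin`; the `absolutize` helper is my own name, and the second base URL in the usage note is made up.

```python
from urllib.parse import urljoin

# URL of the page the hrefs were scraped from (taken from the example above).
PAGE_URL = "http://usalug.org/phpBB2/"

def absolutize(page_url, hrefs):
    """Resolve each scraped href against the URL of the page it came from."""
    return [urljoin(page_url, href.strip()) for href in hrefs if href.strip()]

scraped = [
    "viewforum.html?f=29&sid=b5d3ae3abc42c218d588ea049441bce9",
    "viewforum.html?f=114&sid=b5d3ae3abc42c218d588ea049441bce9",
]

for url in absolutize(PAGE_URL, scraped):
    print(url)
```

The same helper copes with root-relative links: resolving "/albums/3" against a hypothetical "http://target.server/gallery/page2.html" gives "http://target.server/albums/3".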
See how easy it is? And because this utility is designed to work with scripts, we can ensure that new lines and carriage returns do not appear inside a single result. The helpful -n and -r options suppress new lines and carriage returns and replace them with custom character sequences! FWEEET!
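Why that matters: anything downstream that reads line by line treats every line break as the start of a new result. Here is a small sketch of what the two options are meant to do to a single match. The `flatten` function and the sample string are mine, for illustration only.

```python
def flatten(result, new_line=None, carriage_return=None):
    """Replace line breaks inside one result; None means leave them alone."""
    if new_line is not None:
        result = result.replace("\n", new_line)
    if carriage_return is not None:
        result = result.replace("\r", carriage_return)
    return result

# A hypothetical match whose text wraps over two lines with Windows line endings.
match = "Shell Scripting\r\nand Programming"

print(len(match.splitlines()))  # how many records a line-based reader would see
print(flatten(match, new_line=" ", carriage_return=""))
```

With the replacement applied, the match comes out as one line, so one result is one record for `head`, `while read` and friends.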
Now... Isn't that helpful?
[edit]
Noticed some issues when piping the output to other commands. Output will now always be in UTF-8; input will be read in the default encoding and coerced to UTF-8.
[/edit]
Last edited by Lord.DragonFly.of.Dawn on Tue Mar 16, 2025 10:14 pm; edited 1 time in total |
|