USA Linux Users Group Forum Index

Command Line XPath selection

 
Lord.DragonFly.of.Dawn
Advanced Member


Joined: 18 Jul 2024
Posts: 607
Location: South Portland, Maine, USA, Earth, Sol System

Posted: Tue Mar 16, 2024 7:14 pm    Post subject: Command Line XPath selection

Web Scrapers

We've all written them. We all hate them.

What do we hate about them?

Well, often we end up selecting a link to download like this:
Code:
NEXT="http://target.server"`cat ${THIS}|grep next|tr \" '\n' |grep albums|head -n1`

What's ugly about that? Well, it's at least partially positional, for one. If the website changes style even slightly, your script will fail and you will have to figure out the new positions. Also, if you are looking for multiple things, the complexity of the command string grows exponentially and the understandability of the pipe chain decreases at an exponential rate as well!

What we'd rather do is issue a command like this:
Code:
NEXT="http://target.server$(XPath -f "${THIS}" -x "//a[@title='Next Image']/@href"|head -n1)"

See how much cleaner that is? Well-formed HTML looks a lot like XML, and there are a few XML parsers out there that have an HTML-compatible mode. But using those parsers usually requires some significant coding and is a bit disproportionate to the problem at hand.

But we do gain the advantage of a general solution. The site can change its styling and you are less likely to break your script, and sites with a less rigid style can be searched as well. Even better, multiple searches can be built into a single command without it getting needlessly complicated!
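To make the point about styling changes concrete, here is a minimal Python 3 sketch that uses only the standard library's html.parser rather than lxml. The class name, the helper name and both sample snippets are made up for illustration. It picks the link by its title attribute, so attribute order and surrounding markup stop mattering.

```python
from html.parser import HTMLParser


class HrefByTitle(HTMLParser):
    """Collect href values of <a> tags whose title matches exactly."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)  # attribute order no longer matters
        if attr_map.get("title") == self.title and "href" in attr_map:
            self.hrefs.append(attr_map["href"])


def next_links(html, title="Next Image"):
    """Return every matching href in document order."""
    finder = HrefByTitle(title)
    finder.feed(html)
    finder.close()
    return finder.hrefs


# Two hypothetical renderings of the same link: attribute order and
# surrounding markup differ, but the selection is unaffected.
old_style = '<p><a href="/albums/42" title="Next Image">next</a></p>'
new_style = ('<div class="nav"><a title="Next Image" class="btn" '
             'href="/albums/42">next</a></div>')

print(next_links(old_style))
print(next_links(new_style))
```

Both calls find the same link, whereas a pipeline that counts quote-delimited fields would need rewriting after the redesign.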

But who shall free us from our private h*** of grep, sed, and awk? Not me, that's for certain! But I will add to your collection of utilities a new command....

I give to you XPath!

Code:

#!/usr/bin/python
# Filename:  XPath
#
# Copyright (c) 2024 Patrick Libby <patrick.libby@maine.edu>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

from lxml import etree
from StringIO import StringIO
from optparse import OptionParser
from os import path
import sys
import re
import codecs

class XPathSelector():
   def __init__(self, data, asHTML=False):
      """XPathSelector(data) -> a new XPathSelector ready to be used.

      Raises errors as defined by lxml.etree.parse()
      """
      parser=None
      if asHTML:
         parser = etree.HTMLParser()
      else:
         parser = etree.XMLParser()
      d = StringIO(data)
      self.tree = etree.parse(d, parser)

   def GetValues(self, path):
      """GetValues(xpath) -> a list of elements that match the XPath.
      """
      return self.tree.xpath(path)

if __name__ == "__main__":
   def getCMDParser():
      """ getCMDParser() -> Build the OptionParser that shall be used to parse
      the command line.
      """
      parser = OptionParser()
      parser.add_option ("-x",  "--xpath",  dest="xpaths", default=[],
                         help="Select using this XPath",  action="append")
      parser.add_option ("-X",  "--XML",  dest="html", default=None,
                         help="Interpret file as XML",  action="store_false")
      parser.add_option ("-H",  "--HTML",  dest="html", default=None,
                         help="Interpret file as HTML",  action="store_true")
      parser.add_option ("-f",  "--file",  dest="filename",
                         help="Read file from FILE. (default: stdin)",
                                    metavar="FILE",  default=None )
      parser.add_option ("-v",  "--verbose",  dest= "verbose",
                         help="Output additional information about processing",
                     action="count",  default=0)
      parser.add_option ("-q",  "--quiet",  dest="silent",
                         help="Output nothing but results (overrides --verbose)",
                     action="store_true",  default=False)
      parser.add_option ("-s",  "--silent",  dest="silent",
                         help="Output nothing but results (overrides --verbose)",
                     action="store_true",  default=False)
      parser.add_option ("-n",  "--new-lines",  dest="newLines",
                         help="Suppress new lines in match results, replace with specified character(s)",
                     metavar="CHARACTER",  default=None)
      parser.add_option ("-r",  "--carriage-returns",  dest="carriageReturns",
                         help="Suppress carriage returns in match results, replace with specified character(s)",
                     metavar="CHARACTER",  default=None)
      return parser
   def getFileContents(name):
      """ getFileContents(name) -> read file and return contents
      """
      if name:
         # Expand ~user and ${variable} constructs
         file = open(path.expandvars(path.expanduser(name)))
      else:
         file = sys.stdin
      return file.read()

   def guessType(name):
      """guessType(name) -> True if file appears to be HTML, else False

      Based off filename only. if this is resulting in false readings
      look into using the --HTML and --XML options.
      """
      if not name:
         return False
      isHTML = ['.xhtml',  '.htm', '.html',  '.cgi',  '.asp']
      ext = path.splitext(name)[1]
      # An empty extension is never in the list, so this always returns a bool
      return ext.lower() in isHTML

   #BEGIN: Parse command line
   p = getCMDParser()
   (options,  args)= p.parse_args()

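   # Compare against None so that an explicit --XML (False) is honoured
   # instead of falling through to the filename guess
   asHTML = options.html if options.html is not None else guessType(options.filename)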
   file = options.filename
   data = getFileContents(options.filename)
   if not options.silent and options.verbose >=3:
      print "xpaths: \'{0}\'".format(options.xpaths)
      print "asHTML: \'{0}\'".format(asHTML)
      print "filename: \'{0}\'".format(file)
      print "verbosity: \'{0}\'".format(options.verbose)
      print "silent: \'{0}\'".format(options.silent)
      print "new lines: \'{0}\'".format(options.newLines)
      print "carriage returns: \'{0}\'".format(options.carriageReturns)
   if not options.silent and options.verbose >=4:
      print "data: \"\"\"{0}\"\"\"".format(data)
   #END: Parse command line

   # Create the Selector
   xps = XPathSelector(data, asHTML)

   #set the codec for the printer.
   sys.stdout = codecs.getwriter('utf8')(sys.stdout)
   # Perform selects sequentially
   for xpath in options.xpaths:
      if not options.silent and options.verbose >=1:
         #verbose feedback
         print "Searching: \'{0}\'".format(xpath)
      for match in xps.GetValues(xpath):
         #print results
         result = None

         #interpret results
         if isinstance(match, etree._Element):
            # Print the whole matched element. This can cause unexpected
            # output for nested results. To avoid, select only text elements
            # through the use of the text() selector or through attribute
            # selectors or craft your select so that matching elements are
            # not nested within other matching elements.
            result = etree.tostring(match)
         else:
            # did not match an element node. result is text or numeric.
            # Use raw.
            result = match

         if options.newLines:
            # suppress new lines, replace with the -n argument
            result = result.replace("\n", options.newLines)

         if options.carriageReturns:
            # suppress carriage returns, replace with the -r argument
            result = result.replace("\r", options.carriageReturns)

         #Print the result
         print result


Let's see some examples!

Let's select the hyperlinks for all the forums on usalug.
Code:
[dragonfly@Ito ~]$ curl http://www.google.com 2>/dev/null|XPath -H -x "//input[@name='q']"
<input autocomplete="off" maxlength="2048" name="q" size="55" class="lst" title="Google Search" value=""/>
[dragonfly@Ito ~]$ curl http://usalug.org/phpBB2/ 2>/dev/null|XPath -H -x "//a[@class='forumlink']"
<a href="viewforum.html?f=29&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Application News and Releases</a>
<a href="viewforum.html?f=114&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Job Classifieds - Job Opportunities</a>
<a href="viewforum.html?f=2&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Installation and Bootloaders</a>
<a href="viewforum.html?f=16&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Networking</a>
<a href="viewforum.html?f=3&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Servers and Server Administration</a>
<a href="viewforum.html?f=8&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Hardware</a>
<a href="viewforum.html?f=7&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">System Administration and Security</a>
<a href="viewforum.html?f=82&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Linux Education and Certification</a>
<a href="viewforum.html?f=10&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Distributions</a>
<a href="viewforum.html?f=18&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">OTHER</a>
<a href="viewforum.html?f=120&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Software for Business</a>
<a href="viewforum.html?f=127&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">K12LTSP and Educational Applications</a>
<a href="viewforum.html?f=4&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Window Managers</a>
<a href="viewforum.html?f=62&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Graphics Applications</a>
<a href="viewforum.html?f=63&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Audio and Video Applications</a>
<a href="viewforum.html?f=64&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Web Browsers &amp; Email Clients</a>
<a href="viewforum.html?f=133&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">All other software.</a>
<a href="viewforum.html?f=126&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Command Line Commands</a>
<a href="viewforum.html?f=15&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Shell Scripting and Programming</a>
<a href="viewforum.html?f=134&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Member Blogs</a>
<a href="viewforum.html?f=98&amp;sid=a73e1c225cf04d5dc51a7a6320904aee" class="forumlink">Reviews and Interviews</a>


Hmm... yes, but we can do better. We want just the link part....
Code:
[dragonfly@Ito ~]$ curl http://usalug.org/phpBB2/ 2>/dev/null|XPath -H -x "//a[@class='forumlink']/@href"
viewforum.html?f=29&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=114&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=2&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=16&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=3&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=8&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=7&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=82&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=10&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=18&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=120&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=127&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=4&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=62&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=63&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=64&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=133&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=126&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=15&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=134&sid=b5d3ae3abc42c218d588ea049441bce9
viewforum.html?f=98&sid=b5d3ae3abc42c218d588ea049441bce9

Ah. That's better.

See how easy it is? And because this utility is designed to work with scripts, we can ensure that new lines and carriage returns do not appear inside a single result. The helpful -n and -r options suppress new lines and carriage returns and replace them with custom character sequences! FWEEET!

Now... Isn't that helpful?
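For anyone wondering what -n and -r amount to, the idea can be sketched in a few lines of Python 3. The function name is hypothetical and this is not the script's own code. Note one deliberate difference: the sketch checks against None, so an empty replacement string deletes the character, while the script above tests truthiness and would ignore an empty argument.

```python
def flatten_result(result, newline=None, carriage_return=None):
    """Replace embedded line breaks so each match stays on one line.

    A replacement of None leaves that character untouched, loosely
    mirroring an omitted -n or -r option.
    """
    if newline is not None:
        result = result.replace("\n", newline)
    if carriage_return is not None:
        result = result.replace("\r", carriage_return)
    return result


# A match containing a Windows-style line break.
match = "first line\r\nsecond line"
print(flatten_result(match, newline=" ", carriage_return=""))
```

With every match on one line, tools like head, cut or a while-read loop can treat each line of output as exactly one result.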

[edit]
Noticed some issues when piping the output to other commands. Output will now always be in UTF-8; input will be read in the default encoding and coerced to UTF-8.
[/edit]
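The script does this for Python 2 with codecs.getwriter. As a rough Python 3 equivalent (an assumption about how one might port it, not part of the posted script), a binary stream can be wrapped so that everything written is encoded as UTF-8 regardless of the terminal or pipe on the other end. The helper name is made up.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that all text written to it is UTF-8 encoded."""
    # newline="\n" keeps line endings untouched on every platform.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstrated on an in-memory buffer; for real output the argument
# would be sys.stdout.buffer.
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write("Web Browsers & Email Clients: caf\u00e9\n")
out.flush()
print(buffer.getvalue())
```

Fixing the encoding explicitly matters most when stdout is a pipe, where Python cannot always pick a sensible default from the locale.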



_________________
ArchLinux x86_64 - Custom Built Desktop
ArchLinux x86_64 - Compaq CQ50 Laptop
ArchLinux i686 - Acer Aspire One Netbook
ArchLinux i686 - Dell Presario ze2000 (w/ shattered LCD)

PuppyLinux, CloneZilla, PartedMagic, DBAN - rescue thumbdrives
Windows 7 (x86_64 desktop alternate boot)


Last edited by Lord.DragonFly.of.Dawn on Tue Mar 16, 2024 10:14 pm; edited 1 time in total
VHockey86
Advanced Member


Joined: 12 Dec 2024
Posts: 988
Location: Rochester

Posted: Tue Mar 16, 2024 7:34 pm

Give me a scripting language and a DOM-based HTML parser over GNU utilities any day :)
Have you used BeautifulSoup before?

You can specify 'default' in OptionParser.add_option, which would avoid the need to have all those ternary operations in the main block.



_________________
Main Desktops : Kubuntu 10.4. ArchLinux 64-bit. Windows7 64-bit. Windows XP 32-bit.

MacBook: OS X Snow Leopard (10.6)
Lord.DragonFly.of.Dawn
Advanced Member


Joined: 18 Jul 2024
Posts: 607
Location: South Portland, Maine, USA, Earth, Sol System

Posted: Tue Mar 16, 2024 7:48 pm

Not yet.

Ran across it while doing research for this utility.

I'll have to try it at some point.

I really wrote this because three sites that I routinely scrape updated their style and broke my scripts, and I wanted something better than positional cuts, greps and seds to figure out which links I was interested in. The example I listed before the script was the absolute simplest select I used in those scripts. The others were much more complicated... In one case the pipe contained thirty commands, and it was quite impossible to figure out what the end result should be.

XPath seemed to be the best answer.

edit:
Hmm. I should update that with the defaults. That won't fix the asHTML one, though, as that needs to be ternary (true, false, unspecified).
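The three-state case can be sketched with optparse in Python 3 as follows. This is an illustration under assumptions, not the updated script: the function names are hypothetical and only the flags relevant to the HTML/XML decision are included. The key is to leave the default at None and compare against None, so an explicit -X is not confused with "nothing specified".

```python
from optparse import OptionParser
from os import path

HTML_EXTENSIONS = {".xhtml", ".htm", ".html", ".cgi", ".asp"}


def guess_is_html(filename):
    """Guess from the extension alone; no filename means 'not HTML'."""
    if not filename:
        return False
    return path.splitext(filename)[1].lower() in HTML_EXTENSIONS


def resolve_as_html(argv):
    """Return True for HTML parsing, False for XML parsing."""
    parser = OptionParser()
    # Both flags share one destination; None means neither was given.
    parser.add_option("-X", "--XML", dest="html", action="store_false",
                      default=None)
    parser.add_option("-H", "--HTML", dest="html", action="store_true")
    parser.add_option("-f", "--file", dest="filename", default=None)
    options, _ = parser.parse_args(argv)
    # Compare against None so an explicit -X (False) is not mistaken
    # for "unspecified" and overridden by the filename guess.
    if options.html is not None:
        return options.html
    return guess_is_html(options.filename)


print(resolve_as_html(["-f", "page.html"]))        # guessed from extension
print(resolve_as_html(["-X", "-f", "page.html"]))  # explicit XML wins
print(resolve_as_html(["-H"]))                     # explicit HTML, no file
```

A plain truthiness test such as "options.html if options.html else guess" would send the -X case through the guess and parse page.html as HTML anyway.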



Lord.DragonFly.of.Dawn
Advanced Member


Joined: 18 Jul 2024
Posts: 607
Location: South Portland, Maine, USA, Earth, Sol System

Posted: Tue Mar 16, 2024 10:16 pm

Noticed some issues when piping the output to other commands. Output will now always be in UTF-8; input will be read in the default encoding and coerced to UTF-8.

Also updated the parser with defaults.



crouse
Site Admin


Joined: 17 Apr 2024
Posts: 11833
Location: Iowa

Posted: Wed Mar 17, 2024 4:30 pm

What, you don't like lynx/wget/curl/awk/sed/grep/egrep ????? lol.

Sadly, I can write the stuff in those much more easily...

Code:

> lynx --dump http://usalug.org/phpBB2/ | grep viewforum | awk '{print $2}'
http://usalug.org/phpBB2/viewforum.html?f=29&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=114&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=2&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=16&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=3&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=8&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=7&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=82&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=10&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=18&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=120&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=127&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=4&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=62&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=63&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=64&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=133&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=126&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=15&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=134&sid=12c22e85840e72c470a7bafc895b3493
http://usalug.org/phpBB2/viewforum.html?f=98&sid=12c22e85840e72c470a7bafc895b343



But even better... removing the session IDs:

Code:

> lynx --dump http://usalug.org/phpBB2/ | grep viewforum | awk '{print $2}' | sed 's/&.*//'   
http://usalug.org/phpBB2/viewforum.html?f=29
http://usalug.org/phpBB2/viewforum.html?f=114
http://usalug.org/phpBB2/viewforum.html?f=2
http://usalug.org/phpBB2/viewforum.html?f=16
http://usalug.org/phpBB2/viewforum.html?f=3
http://usalug.org/phpBB2/viewforum.html?f=8
http://usalug.org/phpBB2/viewforum.html?f=7
http://usalug.org/phpBB2/viewforum.html?f=82
http://usalug.org/phpBB2/viewforum.html?f=10
http://usalug.org/phpBB2/viewforum.html?f=18
http://usalug.org/phpBB2/viewforum.html?f=120
http://usalug.org/phpBB2/viewforum.html?f=127
http://usalug.org/phpBB2/viewforum.html?f=4
http://usalug.org/phpBB2/viewforum.html?f=62
http://usalug.org/phpBB2/viewforum.html?f=63
http://usalug.org/phpBB2/viewforum.html?f=64
http://usalug.org/phpBB2/viewforum.html?f=133
http://usalug.org/phpBB2/viewforum.html?f=126
http://usalug.org/phpBB2/viewforum.html?f=15
http://usalug.org/phpBB2/viewforum.html?f=134
http://usalug.org/phpBB2/viewforum.html?f=98


Gotta love the shell ;)
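One caveat with the sed expression above: it drops everything after the first ampersand, which only works while sid is the last parameter. A small Python 3 sketch using urllib.parse removes just that one parameter. The function name and the second sample URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="sid"):
    """Drop one query parameter and keep the rest of the URL unchanged."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = ("http://usalug.org/phpBB2/viewforum.html"
       "?f=29&sid=12c22e85840e72c470a7bafc895b3493")
print(strip_session_id(url))

# Hypothetical URL where sid comes first and other parameters follow.
reordered = "http://usalug.org/phpBB2/viewforum.html?sid=abc&f=2&start=10"
print(strip_session_id(reordered))
```

For the forum index shown here the result matches the sed output; the difference only shows up once the parameter order changes.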



_________________
Veronica - Arch Linux 64-bit -- Kernel 2.6.33.4-1
Archie/Jughead - Arch Linux 32-bit -- Kernel 2.6.33.4-1
Betty/Reggie - Arch Linux (VBox) 32-bit -- Kernel 2.6.33.4-1
BumbleBee - OpenSolaris-SunOS 5.11
Lord.DragonFly.of.Dawn
Advanced Member


Joined: 18 Jul 2024
Posts: 607
Location: South Portland, Maine, USA, Earth, Sol System

Posted: Wed Mar 17, 2024 5:18 pm

Well yes, that is a simple one... trivial, almost.

But what about this one?

Code:
[ -e "${THIS}" ] && grep '<a href="/post/show/' "${THIS}"  -A1|tr -d '\n' |sed -e "s/>--/\n/g"|cut -d'"' -f 2|cut -d'/' -f 4|sed -e "s/ /%20/g"


When it is so much easier to replace it with this:
Code:
[ -e "${THIS}" ] && XPath -f "${THIS}" -x "//div[@class='post']/span[@class='thumb']/a/@href"|sed -e "s/ /%20/g"


It is still a fairly simple select, but the pure bash method takes 6 commands in the chain and requires that the website never adds attributes to the tags in question, never rearranges the attributes, and keeps the same formatting, including a double dash directly after a tag close with no space in between. And that is not to mention that it is difficult to look at the chain and see what it is trying to accomplish. The XPath solution only requires that the basic structure of a div, a span and a hyperlink remains constant.

Far more readable, and less prone to error as website styles change.
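The same structural select can be tried without installing anything, as a sketch under a few assumptions: the markup below is invented and well formed, and the standard library's ElementTree stands in for lxml. ElementTree only supports a subset of XPath and cannot select attributes directly, so the href is read from each matched element afterwards.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical, well-formed markup standing in for the scraped page.
PAGE = """
<html><body>
  <div class="post" id="p1">
    <span class="thumb"><a class="preview" href="/post/show/101/red fox">--</a></span>
  </div>
  <div class="sidebar">
    <span class="thumb"><a href="/ads/ignore me">ad</a></span>
  </div>
  <div class="post">
    <span class="thumb"><a href="/post/show/102/owl" rel="nofollow">--</a></span>
  </div>
</body></html>
"""


def post_links(markup):
    """Return percent-encoded hrefs of links inside post thumbnails."""
    root = ET.fromstring(markup)
    # Match the <a> elements by structure, then read href from each one.
    anchors = root.findall(".//div[@class='post']/span[@class='thumb']/a")
    # quote() encodes spaces as %20 (and other unsafe characters too,
    # which goes a little further than the sed step in the pipeline).
    return [quote(anchor.get("href"), safe="/") for anchor in anchors]


for link in post_links(PAGE):
    print(link)
```

Extra attributes, reordered attributes and the unrelated sidebar link make no difference to the result, which is the robustness argued for above.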



All content © 2024-2009 - Usa Linux Users Group