Command Line XPath selection

Lord.DragonFly.of.Dawn · Posted: Tue Mar 16, 2025 7:14 pm Post subject: Command Line XPath selection

Web Scrapers

We've all written them. We all hate them.

What do we hate about them?

Well, often we end up selecting a link to download like this:

VHockey86 · Posted: Tue Mar 16, 2025 7:34 pm Post subject:

Give me a scripting language and a DOM-based HTML parser over GNU utilities any day :)

Have you used BeautifulSoup before?

You can specify 'default' in OptionParser.add_option, which would avoid the need for all those ternary operations in the main block.
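
As a rough sketch of the 'default' suggestion (the option names here are made up for illustration and are not taken from the posted script), optparse fills in the fallback itself when the flag is absent:

```python
from optparse import OptionParser

def build_parser():
    # Hypothetical options; the real script's flags are not shown in this thread.
    parser = OptionParser(usage="%prog [options] XPATH")
    # 'default' supplies the value when the flag is absent, so the main
    # block needs no "value if value else fallback" checks afterwards.
    parser.add_option("-e", "--encoding", dest="encoding", default="utf-8",
                      help="output encoding [default: %default]")
    parser.add_option("-s", "--separator", dest="separator", default="\n",
                      help="string printed between results")
    return parser

# --encoding is omitted on purpose, so its default is used.
options, args = build_parser().parse_args(["--separator", ", ", "//a/@href"])
print(options.encoding, repr(options.separator), args)
```

Here options.encoding comes back as "utf-8" without any extra code in the main block.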

_________________
Main Desktops : Kubuntu 10.4. ArchLinux 64-bit. Windows7 64-bit. Windows XP 32-bit.

MacBook: OS X Snow Leopard (10.6)

Lord.DragonFly.of.Dawn · Posted: Tue Mar 16, 2025 7:48 pm Post subject:

Not yet.

I ran across it while doing research for this utility.

I'll have to try it at some point.

I really wrote this because three sites that I routinely scrape updated their styling and broke my scripts, and I wanted something better than positional cuts, greps, and seds to figure out which links I was interested in. The example I listed before the script was the absolute simplest select I used in those scripts. The others were much more complicated. In one case the pipe contained thirty commands, and it was quite impossible to figure out what the end result should be.

XPath seemed to be the best answer.
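
To illustrate the idea (this is not the posted utility, just a minimal sketch), here is a selection by structure rather than by position. It assumes the page is well-formed XHTML so the standard library's ElementTree can parse it, and it only uses the limited XPath subset that ElementTree supports; the sample page and the "download" class are invented. Real pages usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Invented sample page; assumed to be well-formed XHTML.
PAGE = """<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <div id="content">
    <a class="download" href="/files/tool-1.2.tar.gz">Download</a>
    <a href="/about">About</a>
  </div>
</body></html>"""

def select_hrefs(markup, path):
    """Return the href of every element matched by an ElementTree path."""
    root = ET.fromstring(markup)
    return [element.get("href") for element in root.findall(path)]

# Pick the link by what it is, not by which line or column it sits on.
links = select_hrefs(PAGE, ".//div[@id='content']/a[@class='download']")
print(links)
```

If the site moves the link to a different line or adds markup around it, the expression still matches as long as the div id and the class stay the same.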

edit:
Hmm. I should update that with the defaults. That won't fix the asHTML one, though, as that needs to be ternary (true, false, unspecified).
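
One possible way to get a three-state option out of optparse, as a sketch only: point two flags at the same destination and leave the default as None. The flag names --html and --text and the as_html destination are hypothetical stand-ins for the script's asHTML option.

```python
from optparse import OptionParser

parser = OptionParser()
# Both flags write to the same destination. default=None means "unspecified".
parser.add_option("--html", action="store_true", dest="as_html", default=None,
                  help="force HTML output")
parser.add_option("--text", action="store_false", dest="as_html",
                  help="force plain-text output")

def describe(argv):
    """Report which of the three states the given arguments select."""
    options, _ = parser.parse_args(argv)
    if options.as_html is None:
        return "unspecified"
    return "html" if options.as_html else "text"

for argv in ([], ["--html"], ["--text"]):
    print(argv, describe(argv))
```

The main block then only needs a single "is None" check to decide whether to fall back to automatic behaviour.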

_________________
ArchLinux x86_64 - Custom Built Desktop
ArchLinux x86_64 - Compaq CQ50 Laptop
ArchLinux i686 - Acer Aspire One Netbook
ArchLinux i686 - Dell Presario ze2000 (w/ shattered LCD)

PuppyLinux, CloneZilla, PartedMagic, DBAN - rescue thumbdrives
Windows 7 (x86_64 desktop alternate boot)

Lord.DragonFly.of.Dawn · Posted: Tue Mar 16, 2025 10:16 pm Post subject:

Noticed some issues when piping the output to other commands. Output will now always be in UTF-8; input will be read in the default encoding and coerced to UTF-8.

Also updated the parser with defaults.
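
For anyone curious, the encoding change amounts to something like the following. This is a simplified Python 3 sketch, not the script's actual code; the helper name to_utf8 and the choice to replace undecodable bytes are my assumptions here.

```python
import locale

def to_utf8(raw, source_encoding=None):
    """Decode bytes from the given (or locale default) encoding, re-encode as UTF-8."""
    source_encoding = source_encoding or locale.getpreferredencoding(False)
    # errors="replace" is an assumption: bad bytes become U+FFFD instead of raising.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")

# Small demonstration with an explicit source encoding.
sample = "caf\u00e9".encode("latin-1")
converted = to_utf8(sample, "latin-1")
print(converted)

# In a pipeline this would be wired up roughly as:
#   sys.stdout.buffer.write(to_utf8(sys.stdin.buffer.read()))
```

Writing bytes to sys.stdout.buffer keeps the output UTF-8 no matter what the terminal or the next command in the pipe expects.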


crouse · Site Admin Joined: 17 Apr 2025 Posts: 11833 Location: Iowa

What, you don't like lynx/wget/curl/awk/sed/grep/egrep? lol.

Sadly, I can write the stuff in those much more easily...

Lord.DragonFly.of.Dawn · Posted: Wed Mar 17, 2025 5:18 pm Post subject:

Well, yes, that is a simple one... almost trivial.

But what about this one?
