<div class="post-2-1">
<div>
<p>So if you actually do read this site often, you may have noticed that there
is now an RSS feed. It's on the main posts page, up at the top right. RSS is a
very interesting technology. It seems to have been designed to connect the
whole internet through one syndication protocol that is easy to understand and
use. And it really delivered on that! It just never caught on as much as people
once expected. However, I still use <a href="https://newsboat.org">
newsboat</a> for most of my YouTube and other feeds, and it works great.
Despite being 'dead', most things support it (or you can find a
<a href="https://github.com/RSS-Bridge/rss-bridge">tool</a> to make it
work).</p>

<p>With that said, it seems like it would be great to automate stuff with it,
Unix style. Pipes, bash scripting, the whole deal. However, I couldn't
find anything that fit my needs. I just wanted a light, simple-to-use program
that could extract things from an RSS feed and spit them out, to be further
processed by something like awk. Alas, my searching turned up
nothing. So I pulled up a tmux session, put on some music at full volume, and one
weekend later we now have rss-cli!</p>
</div>
<div class="col-md">
<figure>
<a href="https://validator.w3.org/feed/check.cgi?url=https%3A//www.clortox.com/feed/"><img src="../img/valid-rss-rogers.png"
class="fig-img"></img></a>
<figcaption>This site hosts a <a href="https://www.clortox.com/feed/">
valid rss</a> feed.</figcaption>
</figure>
</div>
</div>

<div class="post-1-2">
<div class="col-md">
<figure>
<a href="https://github.com/Clortox/rss-cli">
<img src="../img/github.png"
class="fig-img"></img></a>
<figcaption>You can browse the code I'm describing on GitHub
(above) and on <a href="https://git.clortox.com/?p=rss-cli.git;a=summary">
this site</a>.</figcaption>
</figure>
</div>
<div>
<h4>How does it work?</h4>

<p>rss-cli uses the very fast C++ library <a href="http://rapidxml.sourceforge.net">
rapidxml</a> to parse the RSS feed. Total execution time tends to be around 3ms
for a fairly large RSS feed (~30 items) on my i7-9750H. I was
getting about 10ms on a Raspberry Pi 4 for the same feed.</p>

<p>rss-cli parses an RSS file, which is identified by a URI. A URI is
used because the program fetches RSS feeds off the internet with libcurl;
file:///some/rss/feed.rss is also valid for local files. Once the file
is grabbed, it is parsed by rapidxml and kept in memory. A specific
element is only looked up when it is requested. This lazy approach keeps
execution times low, since you will often not need the entire feed; you will
probably only be extracting key bits of information for your next program to
parse.</p>

<p>All of the meat of rss-cli is in the rss_utils namespace. I placed it
there, along with an rss_utils::rss object for interacting with the RSS feed, so
that moving rss.cpp and rss.hpp to your own project is as easy as possible.
rss_utils also contains an rss_utils::item class, which represents the &lt;item&gt; tags.
These item objects are stored in a std::vector, so that your program can easily
iterate through them.</p>

<p>Both rss_utils::rss and rss_utils::item contain clone functions, the big three
(destructor, copy constructor, and copy assignment operator), and accessor
functions for all of the possible associated elements. For
example, if you want to access an RSS feed's title, you would call:</p>

<blockquote>std::string rss_utils::rss::getTitle() const</blockquote>

<p>All responses are given as std::string, to allow for the widest
compatibility possible. Each time one of these functions is called, it
searches the document for the requested element and returns an empty string
(std::string("")) if nothing is found. Neither class ever throws
exceptions. rss_utils::rss also provides an isOk() function for checking if the
RSS feed was valid. If isOk() returns false, all accessor functions return
empty strings. When attempting to get items while isOk() is false, an empty
std::vector&lt;rss_utils::item&gt; is returned.</p>
</div>
</div>

<div class="row no-gutters">
<div class="col-sm">
<h4>How do you use rss-cli?</h4>

<p>rss-cli provides the --help flag to display all of the options it
accepts. There are a lot of options, but this is because each option corresponds
to a field in the RSS 2.0 spec. Here is the full help menu (as of
7-26-21):</p>

<blockquote class="code-block">Usage: rss-cli [-u FEED_URI] [CHANNEL FLAGS] [-i ITEM_INDEX] [ITEM FLAGS]
Options:
Required Options:
  [-u, --uri] URI        URI of the rss stream

Channel information:
  [-t, --title]          Get title of channel
  [-l, --link]           Get link to channel
  [-d, --description]    Get description of channel
  [-L, --language]       Get language code of channel
  [-m, --webmaster]      Get webMaster's email
  [-c, --copyright]      Get copyright
  [-p, --pubdate]        Get publishing date
  [-e, --managingeditor] Get managing editor
  [-g, --generator]      Get generator of this feed
  [-o, --docs]           Get link to RSS documentation
  [-w, --ttl]            Get ttl, time that channel can be
                         cached before being updated
  [-b, --builddate]      Get last time the channel's
                         content changed
  [-Q, --imageurl]       Get channel image URL
  [-I, --imagetitle]     Get image title, same as ALT in html
  [-E, --imagelink]      Get link to site, image will act as a link
  [-W, --imagewidth]     Get width of image
  [-H, --imageheight]    Get height of image
  [-D, --clouddomain]    Get domain of feed update service
  [-P, --cloudport]      Get port of feed update service
  [-A, --cloudpath]      Get path to access for feed update service
  [-R, --cloudregister]  Get register procedure for feed update service
  [-O, --cloudprotocol]  Get protocol feed update service uses
  [-i, --item] INDEX     Provide index of item to display
                         If no index is provided, assume the first
                         item in the feed. All following flags will
                         be parsed as item options, till another
                         item is provided

Item options:
  [-t, --title]          Get title of item
  [-l, --link]           Get link
  [-d, --description]    Get description
  [-a, --author]         Get author
  [-C, --category]       Get category list
  [-f, --comments]       Get link to comments
  [-G, --guid]           Get GUID
  [-p, --pubdate]        Get publishing date
  [-s, --source]         Get source of item
  [-U, --enclosureurl]   Get enclosure URL
  [-T, --enclosuretype]  Get enclosure MIME type
  [-K, --enclosurelength]Get enclosure length, in bytes

General options:
  [-h, --help]           Show this message

For more information, refer to the RSS 2.0 documentation
https://validator.w3.org/feed/docs/rss2.html
</blockquote>

<p>Breaking this down, we first need the -u flag to say where to get the RSS
feed. Once we have that, we can pass flags to grab everything we need. The
channel information flags have to be passed <i>before</i> the item options.
Once the -i flag has been passed, all following options must be item options,
and will be applied to that item until another -i is given. If -h is passed
anywhere, the program will
display the help message and quit.</p>
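<p>As a rough illustration of that ordering, the repeated "-iN followed by item flags" groups can be generated instead of typed out. This is only a sketch: item_flags is a hypothetical helper name, not part of rss-cli, and it assumes item indices start at 0.</p>

```shell
#!/bin/sh
# Hypothetical helper (not part of rss-cli): print an "-iN" selector followed
# by the same item flags for the first COUNT items, one argument per line.
# Assumes item indices start at 0.
item_flags() (
    count=$1
    flags=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "-i$i" "$flags"
        i=$((i + 1))
    done
)

# Example (left unquoted on purpose so the shell splits it into arguments):
#   rss-cli -u file:///my/local/rss.rss -t $(item_flags 3 -td)
```

<p>Channel flags such as -t still go before the generated part, since everything after the first -i is treated as an item option.</p>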

<p>The slowest part of the program is fetching the file with libcurl.
If you plan to do several operations on the same feed, I recommend
downloading the file first and using file:// to tell rss-cli where the file
is.</p>
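<p>One way to follow that advice is to wrap the download in a small function. The sketch below is an assumption-laden example, not something rss-cli ships: cache_feed is a made-up name, it relies on the curl command being installed, and the cache path must be absolute so that the resulting file:// URI is valid.</p>

```shell
#!/bin/sh
# Hypothetical helper (not part of rss-cli): download a feed once and print
# a file:// URI pointing at the cached copy.
#   $1 = feed URL, $2 = absolute path of the cache file
cache_feed() (
    url=$1
    cache=$2
    # Only hit the network when there is no non-empty cached copy yet.
    if [ ! -s "$cache" ]; then
        curl --silent --fail --location --output "$cache" "$url" || return 1
    fi
    printf 'file://%s\n' "$cache"
)

# Example:
#   uri=$(cache_feed https://www.clortox.com/feed/ /tmp/clortox.rss)
#   rss-cli -u "$uri" --title
#   rss-cli -u "$uri" -i0 --title --link
```

<p>Nothing here expires the cache, so delete the file (or add your own age check) when you want a fresh copy.</p>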

<p>Output is always printed in the order the options are listed in --help,
regardless of the order in which the flags were passed. This
means that even if you run rss-cli with:</p>

<blockquote>rss-cli -u file:///my/local/rss.rss --description --link --title</blockquote>

<p>The output will still be:</p>

<blockquote class="code-block">RSS Feed Title
Feed link
Feed description
</blockquote>

<p>This makes output predictable and easy for other programs to understand. If
an empty line is encountered, it can be assumed that the requested tag is not in
the feed. The same concept applies to each item.</p>
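<p>Because the order is fixed, a consuming script can read the lines positionally. Here is a minimal sketch of that idea; summarize_channel is a hypothetical name, and it assumes the title, link, and description were requested and that each value fits on a single line.</p>

```shell
#!/bin/sh
# Hypothetical consumer (not part of rss-cli): read title, link and
# description from stdin in --help order and label them. An empty line is
# treated as "tag not in the feed".
summarize_channel() (
    IFS= read -r title
    IFS= read -r link
    IFS= read -r description
    printf 'title=%s\n' "${title:-(missing)}"
    printf 'link=%s\n' "${link:-(missing)}"
    printf 'description=%s\n' "${description:-(missing)}"
)

# Example:
#   rss-cli -u file:///my/local/rss.rss --description --link --title | summarize_channel
```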
</div>
</div>

<div class="row no-gutters">
<div class="col-sm-10">
<h4>Possible use cases</h4>

<h5>Get quick headlines in bashrc</h5>

<p>Grab headlines from the BBC and show the top three from your .bashrc:</p>

<blockquote>
echo $(rss-cli -u http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml \
-i0 -td -i1 -td -i2 -td)
</blockquote>
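<p>If you would rather keep each headline on its own line than have echo flatten everything, the output can be piped through a small formatter. This is a sketch under assumptions: format_headlines is a hypothetical name, and it expects exactly two lines per item (title, then description), which is what the fixed output order gives for -td as long as a description does not itself contain a newline.</p>

```shell
#!/bin/sh
# Hypothetical formatter (not part of rss-cli): read title/description pairs
# from stdin and print each headline as a bullet with an indented summary.
format_headlines() {
    while IFS= read -r title && IFS= read -r description; do
        printf '* %s\n    %s\n' "$title" "$description"
    done
}

# Example:
#   rss-cli -u http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml \
#       -i0 -td -i1 -td -i2 -td | format_headlines
```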

<h5>Get weather and place into a file</h5>

<p>Grab today's weather and append it to a file, for logging:</p>

<blockquote>
rss-cli -u http://www.rssweather.com/zipcode/10001/rss.php -i -d >> "weather/$(date +%F).txt"
</blockquote>

<h5>Get new posts from archive.org and automatically download them</h5>

<p>This example uses the opensource_audio collection from archive.org. It could
be run from a cron job.</p>

<blockquote>
wget "$(rss-cli -u 'https://archive.org/services/collection-rss.php?collection=opensource_audio' -i0 --enclosureurl)" -P ~/archive_audio
</blockquote>
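<p>When run from cron, the command above fetches the newest enclosure every time, even if nothing changed. One possible way around that is to remember which URLs have been seen. The sketch below assumes a plain text state file with one URL per line; is_new_url and the state file path are hypothetical, not part of rss-cli.</p>

```shell
#!/bin/sh
# Hypothetical helper (not part of rss-cli): succeed only the first time a
# URL is seen, recording it in a state file (one URL per line).
#   $1 = URL, $2 = path of the state file
is_new_url() (
    url=$1
    state=$2
    # An empty string means the feed had no enclosure URL for that item.
    [ -n "$url" ] || return 1
    if [ -f "$state" ] && grep -qxF -- "$url" "$state"; then
        return 1
    fi
    printf '%s\n' "$url" >> "$state"
)

# Example:
#   url=$(rss-cli -u 'https://archive.org/services/collection-rss.php?collection=opensource_audio' -i0 --enclosureurl)
#   if is_new_url "$url" ~/archive_audio/.seen; then
#       wget "$url" -P ~/archive_audio
#   fi
```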
</div>
</div>