site-posts/rss-cli.html
2022-12-26 19:09:09 -05:00

<div class="post-2-1">
<div>
<p>If you read this site often, you may have noticed that there is now an RSS
feed. It's on the main posts page, up at the top right. RSS is a very
interesting technology. It seems to have been designed to connect the whole
internet through one syndication protocol that was easy to understand and use,
and it really delivered on that! It just never caught on as widely as people
once expected. I still use <a href="https://newsboat.org">newsboat</a> for most
of my YouTube and other feeds, and it works great. Despite being 'dead', most
things support RSS (or you can find a
<a href="https://github.com/RSS-Bridge/rss-bridge">tool</a> to make it
work).</p>
<p>With that said, it seems like it would be great to automate stuff with it,
unix style. Pipes, bash scripting, the whole deal. However, I didn't really
find anything that fit my needs. I just wanted a light, simple-to-use program
that could extract things from an RSS feed and spit them out, to be further
processed by something like awk. Alas, my searching turned up nothing. So I
pulled up a tmux session, put on some music at full volume, and one weekend
later we now have rss-cli!</p>
</div>
<div class="col-md">
<figure>
<a href="https://validator.w3.org/feed/check.cgi?url=https%3A//www.clortox.com/feed/"><img src="../img/valid-rss-rogers.png"
class="fig-img"></img></a>
<figcaption>This site hosts a <a href="https://www.clortox.com/feed/">
valid rss</a> feed.</figcaption>
</figure>
</div>
</div>
<div class="post-1-2">
<div class="col-md">
<figure>
<a href="https://github.com/Clortox/rss-cli">
<img src="../img/github.png"
class="fig-img"></img></a>
<figcaption>You can view the code I'm describing on GitHub
(above) and on <a href="https://git.clortox.com/?p=rss-cli.git;a=summary">
this site</a>.</figcaption>
</figure>
</div>
<div>
<h4>How does it work?</h4>
<p>rss-cli uses the very fast C++ library <a href="http://rapidxml.sourceforge.net">
rapidxml</a> to parse the RSS feed. Total execution time tends to be around 3ms
for very large RSS feeds (~30 items) on my i7-9750H. I was getting about 10ms
on a Raspberry Pi 4 for the same feed.</p>
<p>rss-cli parses an RSS file that is identified by a URI. A URI is used
because the program fetches feeds off the internet with libcurl, but
file:///some/rss/feed.rss is also valid for local files. Once the file is
grabbed, it is parsed by rapidxml and kept in memory. Individual elements are
then looked up only when they are requested. This lazy-loading approach keeps
execution times low: often you will not need the entire feed, only a few key
bits of information for your next program to parse.</p>
<p>All of the meat of rss-cli is in the rss_utils namespace. Everything lives
there, including an rss_utils::rss object for interacting with the RSS feed, so
that moving rss.cpp and rss.hpp to your own project is as easy as possible.
rss_utils also contains rss_utils::item, which represents the &lt;item&gt; tags.
These item objects are stored in a std::vector, so that your program can easily
iterate through them.</p>
<p>Both rss_utils::rss and rss_utils::item contain clone functions, the big 3
(copy constructor, copy assignment operator, and destructor), and accessor
functions for all of the possible associated elements. For example, if you want
to access an RSS feed's title, you would call:</p>
<blockquote>std::string rss_utils::rss::getTitle() const</blockquote>
<p>All responses are given as std::string, to allow for the widest
compatibility possible. Each time one of these functions is called, it
searches the document for the requested element and returns an empty string
(std::string("")) if nothing is found. Neither class ever throws
exceptions. rss_utils::rss also provides an isOk() function for checking whether
the RSS feed was valid. If isOk() returns false, all accessor functions return
empty strings, and asking for the items returns an empty
std::vector&lt;rss_utils::item&gt;.</p>
</div>
</div>
<div class="row no-gutters">
<div class="col-sm">
<h4>How do you use rss-cli?</h4>
<p>rss-cli provides the --help flag to display all of the options it will
accept. There are a lot of options, but this is because each option corresponds
to a field in the RSS 2.0 spec. Here is the full help menu (as of
7-26-21):</p>
<blockquote class="code-block">Usage: rss-cli [-u FEED_URI] [CHANNEL FLAGS] [-i ITEM_INDEX] [ITEM FLAGS]
Options:
Required Options:
[-u, --uri] URI URI of the rss stream
Channel information:
[-t, --title] Get title of channel
[-l, --link] Get link to channel
[-d, --description] Get description of channel
[-L, --language] Get language code of channel
[-m, --webmaster] Get webMaster's email
[-c, --copyright] Get copyright
[-p, --pubdate] Get publishing date
[-e, --managingeditor] Get managing editor
[-g, --generator] Get generator of this feed
[-o, --docs] Get link to RSS documentation
[-w, --ttl] Get ttl, time that channel can be
cached before being updated
[-b, --builddate] Get last time the channel's
content changed
[-Q, --imageurl] Get channel image URL
[-I, --imagetitle] Get image title, same as ALT in html
[-E, --imagelink] Get link to site, image will act as a link
[-W, --imagewidth] Get width of image
[-H, --imageheight] Get height of image
[-D, --clouddomain] Get domain of feed update service
[-P, --cloudport] Get port of feed update service
[-A, --cloudpath] Get path to access for feed update service
[-R, --cloudregister] Get register procedure for feed update service
[-O, --cloudprotocol] Get protocol feed update service uses
[-i, --item] INDEX Provide index of item to display
If no index is provided, assume the first
item in the feed. All following flags will
be parsed as item options, till another
item is provided
Item options:
[-t, --title] Get title of item
[-l, --link] Get link
[-d, --description] Get description
[-a, --author] Get author
[-C, --category] Get category list
[-f, --comments] Get link to comments
[-G, --guid] Get GUID
[-p, --pubdate] Get publishing date
[-s, --source] Get source of item
[-U, --enclosureurl] Get enclosure URL
[-T, --enclosuretype] Get enclosure MIME type
[-K, --enclosurelength]Get enclosure length, in bytes
General options:
[-h, --help] Show this message
For more information, refer to the RSS 2.0 documentation
https://validator.w3.org/feed/docs/rss2.html
</blockquote>
<p>Breaking this down, we first need the -u flag to say where to get the RSS
feed. Once we have that, we can pass flags to grab everything we need. The
Channel information flags have to be passed <i>before</i> the item options.
Once the -i flag has been passed, all following options must be item options,
and will be applied to that item. If -h is passed anywhere, the program will
display the help message and quit.</p>
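<p>Because every item needs its own -i flag followed by its item options, the
argument list gets repetitive when you want the same fields from several
items. Here is a minimal sketch of a shell helper that builds that list. The
name item_flags is made up for this post and is not part of rss-cli. It only
prints text, so it works without rss-cli installed.</p>

```shell
# Hypothetical helper (not part of rss-cli): print "-i0 FLAGS -i1 FLAGS ..."
# for the first COUNT items. FLAGS defaults to -td (title and description).
item_flags() {
    count=$1
    flags=${2:--td}
    i=0
    out=
    while [ "$i" -lt "$count" ]; do
        # Add a separating space only when something is already in $out.
        out="$out${out:+ }-i$i $flags"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}
```

<p>Channel flags still have to come first, so a call could look like the line
below. The substitution is left unquoted on purpose, so that the shell splits
it into separate arguments.</p>
<blockquote>rss-cli -u file:///my/local/rss.rss --title $(item_flags 3)</blockquote>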
<p>The slowest part of the program will be fetching the file using libcurl.
If you plan to do several operations on the same feed, I recommend
downloading the file first and using file:// to tell rss-cli where the file
is.</p>
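<p>One way to do the download step is sketched below. It assumes curl and
mktemp are installed. The function name cache_feed is made up for this post.
It saves the feed to a temporary file and prints a file:// URI that can be
handed to -u.</p>

```shell
# Hypothetical helper (not part of rss-cli): download a feed once and print
# a file:// URI pointing at the local copy. Assumes curl and mktemp exist.
cache_feed() {
    url=$1
    cache_file=$(mktemp) || return 1
    # -f: fail on HTTP errors, -sS: quiet but still report errors,
    # -L: follow redirects, -o: write the body to the given file.
    if ! curl -fsSL -o "$cache_file" "$url"; then
        rm -f "$cache_file"
        return 1
    fi
    # mktemp normally returns an absolute path, giving file:///tmp/...
    printf 'file://%s\n' "$cache_file"
}
```

<p>You would then store the result once, for example with
feed=$(cache_feed SOME_FEED_URL), and pass "$feed" to -u on every rss-cli call.
The temporary file is not cleaned up automatically, so remove it when you are
done.</p>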
<p>Output is always printed in the order the options are listed in --help, not
the order you pass them. This means that even if you run rss-cli with:</p>
<blockquote>rss-cli -u file:///my/local/rss.rss --description --link --title</blockquote>
<p>The output will still be:</p>
<blockquote class="code-block">RSS Feed Title
Feed link
Feed description
</blockquote>
<p>This makes output predictable and easy for other programs to understand. If
an empty line is encountered, it can be assumed the requested tag is not in
the feed. The same concept applies to each item.</p>
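<p>A consuming script can lean on that fixed order. The sketch below reads the
three lines produced by --title --link --description and labels them, treating
an empty line as a missing tag. The function name label_channel_fields is made
up for this post, and it assumes each field fits on a single line.</p>

```shell
# Hypothetical helper (not part of rss-cli): read title, link and description
# from stdin, one per line and in that order, then print them with labels.
# An empty line is reported as "(missing)".
label_channel_fields() {
    # "|| true" keeps going if the input ends early or lacks a final newline.
    IFS= read -r title || true
    IFS= read -r link || true
    IFS= read -r description || true
    printf 'title: %s\n' "${title:-(missing)}"
    printf 'link: %s\n' "${link:-(missing)}"
    printf 'description: %s\n' "${description:-(missing)}"
}
```

<p>It would sit at the end of a pipe, like so:</p>
<blockquote>rss-cli -u file:///my/local/rss.rss --title --link --description | label_channel_fields</blockquote>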
</div>
</div>
<div class="row no-gutters">
<div class="col-sm-10">
<h4>Possible use cases</h4>
<h5>Get quick headlines in bashrc</h5>
<p>Grab headlines from the BBC and show the top three from your bashrc</p>
<blockquote>
rss-cli -u http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml \
-i0 -td -i1 -td -i2 -td
</blockquote>
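<p>The command above prints a title line and a description line for each item.
If you want something a little easier to read in a terminal, a small filter
can number the pairs. This is only a sketch: format_headlines is a made-up
name, and it assumes every title and every description is exactly one
line.</p>

```shell
# Hypothetical helper (not part of rss-cli): read alternating title and
# description lines from stdin and print them as a numbered list.
format_headlines() {
    n=1
    while IFS= read -r title && IFS= read -r description; do
        printf '%d. %s\n   %s\n' "$n" \
            "${title:-(no title)}" "${description:-(no description)}"
        n=$((n + 1))
    done
}
```

<p>To use it, pipe the rss-cli command above into format_headlines.</p>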
<h5>Get weather and place into a file</h5>
<p>Grab today's weather and put it in a file, for logging</p>
<blockquote>
rss-cli -u http://www.rssweather.com/zipcode/10001/rss.php -i -d &gt;&gt;
"weather/$(date +%F).txt"
</blockquote>
<h5>Get new posts from archive.org and automatically download them</h5>
<p>This example uses the opensource_audio collection from archive.org. It could
be run from a cron job.</p>
<blockquote>
wget "$(rss-cli -u 'https://archive.org/services/collection-rss.php?collection=opensource_audio' -i0 --enclosureurl)" -P ~/archive_audio
</blockquote>
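<p>If this runs from cron, it will see the same newest item on most runs. One
way to skip the repeat work is to remember the last enclosure URL in a small
state file, sketched below. The function name url_is_new and the state file
are made up for this post.</p>

```shell
# Hypothetical helper (not part of rss-cli): succeed only if URL is non-empty
# and differs from the one stored in STATE_FILE, then record it there.
url_is_new() {
    url=$1
    state_file=$2
    # An empty URL means the item had no enclosure, so there is nothing to do.
    [ -n "$url" ] || return 1
    if [ -f "$state_file" ] && [ "$(cat "$state_file")" = "$url" ]; then
        return 1
    fi
    printf '%s\n' "$url" > "$state_file"
}
```

<p>The cron script would capture the rss-cli output in a variable, call
url_is_new with that variable and a state file path, and only run wget when it
succeeds.</p>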
</div>
</div>