Dynamically extract the download URL from the feed #47

Closed
mbridak wants to merge 3 commits from patch-1

Conversation


@mbridak commented Jun 4, 2023

Description

Use an lxml XPath query to pull the download link off of the web page, resolving issue #10.

Fixes #10
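
For context, the core of the change amounts to something like the sketch below. The names follow the diff; the update_url value here is only an assumed placeholder for the page being scraped, not taken from the PR.

from lxml import html  # new dependency introduced by this PR
import requests

session = requests.Session()
update_url = "http://www.country-files.com/bigcty/"  # assumed page URL, for illustration only

page = session.get(update_url)
tree = html.fromstring(page.content)
# Take the first link on the page whose href points at a .zip archive.
dl_url = tree.xpath("//a[contains(@href,'zip')]/@href")[0]
rq = session.get(dl_url)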

Type of change

  • Bug fix

How has this been tested?

By running it.

Checklist

  • Issue exists for PR
  • Code reviewed by the author
  • Code documented (comments or other documentation)
  • Changes tested
  • All tests pass (see DEVELOPING.md, if it exists)
  • CHANGELOG.md updated if needed
  • Informative commit messages
  • Descriptive PR title

Extract the download link.
dl_url = f'http://www.country-files.com/bigcty/download/{update_date[:4]}/bigcty-{update_date}.zip' # TODO: Issue #10
page = session.get(update_url)
tree = html.fromstring(page.content)
dl_url = tree.xpath("//a[contains(@href,'zip')]/@href")[0]
rq = session.get(dl_url)
if rq.status_code == 404:
Member

is this codepath still necessary? I think it exists because there was a change in the feed bafa05d

Member

I think that fallback is needed, but very differently, and only if the download URL extraction fails (see other comment)

dl_url = f'http://www.country-files.com/bigcty/download/{update_date[:4]}/bigcty-{update_date}.zip' # TODO: Issue #10
page = session.get(update_url)
tree = html.fromstring(page.content)
dl_url = tree.xpath("//a[contains(@href,'zip')]/@href")[0]
Member

what if tree.xpath() doesn't find anything? would [0] IndexError? we should handle that, perhaps

Author

Agreed
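
One way to handle that, sketched here rather than taken from the PR:

links = tree.xpath("//a[contains(@href,'zip')]/@href")
if not links:
    # No .zip link found on the page; raise a clear error (or fall back to
    # the old constructed URL) instead of letting links[0] raise IndexError.
    raise ValueError("no .zip download link found on the update page")
dl_url = links[0]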

@@ -172,7 +173,9 @@ def update(self) -> bool:

with tempfile.TemporaryDirectory() as temp:
path = pathlib.PurePath(temp)
dl_url = f'http://www.country-files.com/bigcty/download/{update_date[:4]}/bigcty-{update_date}.zip' # TODO: Issue #10
page = session.get(update_url)
Member

should probably check for status before trying to parse the content

Author

Agreed
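
A sketch of that check, using the session and update_url names from the diff:

page = session.get(update_url)
# Abort on a non-2xx response rather than feeding an error page to lxml.
page.raise_for_status()
tree = html.fromstring(page.content)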

@0x5c changed the title from "Fix for issue #10" to "Dynamically extract the download URL from the feed" on Jun 4, 2023
@0x5c (Member) left a comment

CHANGELOG.md needs to be updated

Your commits will also need to be squashed together, and the resulting commit should describe what it does

@@ -172,7 +173,9 @@ def update(self) -> bool:

with tempfile.TemporaryDirectory() as temp:
path = pathlib.PurePath(temp)
dl_url = f'http://www.country-files.com/bigcty/download/{update_date[:4]}/bigcty-{update_date}.zip' # TODO: Issue #10
Member

It would probably be good to keep that URL as a fallback, like the other URL format already is
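
Sketched, that fallback could look like this (not part of the PR as submitted; names follow the diff):

try:
    dl_url = tree.xpath("//a[contains(@href,'zip')]/@href")[0]
except IndexError:
    # Extraction found no link; fall back to the date-based URL the code
    # currently constructs.
    dl_url = (f'http://www.country-files.com/bigcty/download/'
              f'{update_date[:4]}/bigcty-{update_date}.zip')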


@@ -7,3 +7,4 @@ sphinx
# Dependencies
feedparser
requests
lxml
Member

Is lxml not needed as a normal dependency? This file is only used for development, so lxml should also be added to setup.py (setup.py being what defines the dependencies to install when someone installs ctyparser).
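
For illustration only, a hypothetical setup.py excerpt with lxml added to the install-time dependencies (the real file's other arguments are omitted):

from setuptools import setup

setup(
    name="ctyparser",
    # Runtime dependencies pulled in when the package itself is installed.
    install_requires=[
        "feedparser",
        "requests",
        "lxml",
    ],
)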


mbridak commented Jun 4, 2023

Sorry, I think I'm creating more of a problem for you than what I'm solving.

@mbridak closed this on Jun 4, 2023
@mbridak deleted the patch-1 branch on Jun 4, 2023, 18:47