This repository has been archived by the owner on May 5, 2024. It is now read-only.

blacklist being ignored #2

Open
pkissman opened this issue Oct 26, 2017 · 2 comments

Comments

@pkissman

This is probably a problem with my setup rather than your plugin.

I'm running nutch-2.3.1 and have installed your plugin to strip navigation elements, breadcrumbs, and footer components from my standard web pages.

I've set the plugin.includes property in my nutch-site.xml as follows (I'm also using Tika for PDF and Word documents):

name: plugin.includes
value: protocol-httpclient|urlfilter-regex|element-selector|index-(basic|more)|query-(basic|site|url)|indexer-solr|nutch-extensionpoints|parse-(text|html|tika|js)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

And here are the elements I am trying to remove with the blacklist, also set in nutch-site.xml:

name: parser.html.selector.blacklist
value: header,footer,div.breadcrumbs,div.searchForm,table.jobTable
description: A comma-delimited....

I've tried with parse-html both included and excluded. I read that if you enable Tika, you shouldn't use parse-html, but I thought parse-html had to be enabled for your plugin to work.
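For readers following along, here is a minimal sketch of how the two settings quoted above would typically be laid out in nutch-site.xml. It assumes the standard Hadoop-style `<configuration>`/`<property>` structure that Nutch configuration files use; the values are copied from the post, and the description text is left out because it is truncated in the original. It only illustrates the layout and is not a confirmed fix for the problem described.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumes the standard Hadoop-style property layout of nutch-site.xml.
     Values are copied from the post above; descriptions are omitted. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|element-selector|index-(basic|more)|query-(basic|site|url)|indexer-solr|nutch-extensionpoints|parse-(text|html|tika|js)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>parser.html.selector.blacklist</name>
    <value>header,footer,div.breadcrumbs,div.searchForm,table.jobTable</value>
  </property>
</configuration>
```

Each `<property>` pairs one `<name>` with one `<value>`, and entries in nutch-site.xml override the defaults of the same name in nutch-default.xml.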

Any guidance would be helpful.

Thanks very much in advance.

@kaqqao
Owner

kaqqao commented Dec 20, 2017

Sorry for taking so long to respond...
I no longer maintain or use this plugin, and unfortunately I don't remember how it works :(
I hope you've managed to find your answer in the meantime. In either case, please post back your current status, and the answer if you've got it.

Cheers!

@pkissman
Author

pkissman commented Dec 20, 2017 via email
