Most of the scheduled jobs I run are set up in django-chronograph. It makes jobs easier to manage and lets my clients handle the scheduling themselves. It also lets them check the logs in a user-friendly environment.
So after creating my scraper with Scrapy, which runs inside a Django management command (as described here), I tried to deploy it using chronograph.
The first problem is that the run_from_argv method is bypassed by chronograph. So I modified my management command like this:
    from django.core.management.base import BaseCommand


    class Command(BaseCommand):

        def run_from_argv(self, argv):
            # Keep the raw command line so handle() can pass it on to Scrapy.
            self._argv = argv
            self.execute()

        def handle(self, *args, **options):
            from scrapy.cmdline import execute
            try:
                execute(self._argv[1:-1])
            except AttributeError:
                # when running from django-chronograph
                execute(list(args))
Then, in the job's arguments in chronograph, I put the name of the Django management command first. In my case, the management command is "scrape".
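To see why the command name goes first: as far as I can tell, Scrapy's execute treats the first element of the list like sys.argv[0] and looks for its own subcommand after it. The sketch below is a simplified stand-in for that lookup, not Scrapy's actual code. The function name find_scrapy_command is made up for illustration, and "crawl" and "myspider" are example values.

```python
def find_scrapy_command(argv):
    """Return the first non-option argument after argv[0], or None."""
    for arg in argv[1:]:
        if not arg.startswith("-"):
            return arg
    return None


# Arguments as chronograph would pass them to handle(*args).
chronograph_args = ("scrape", "crawl", "myspider")
command = find_scrapy_command(list(chronograph_args))

# Without the leading "scrape", the spider name is mistaken for the command.
missing_prefix = find_scrapy_command(["crawl", "myspider"])
```

With the management command name in front, "crawl" is picked up as the Scrapy command. Without it, everything shifts by one position.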
The second problem is that Scrapy's execute function calls sys.exit, which stops execution. This means django-chronograph also stops and cannot finish its own tasks, such as saving the logs to the database and changing the status of the job from "running" to "not running".
The first workaround I tried was to run Scrapy's execute function in a separate thread. However, Scrapy threw this error: signal only works in main thread.
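The error itself comes from Python rather than from Scrapy: signal handlers can only be installed from the main thread. My assumption is that Scrapy installs its shutdown handlers on start-up, which is why running it in a worker thread fails. Here is a minimal reproduction without Scrapy. The names install_handler and errors are just for this example.

```python
import signal
import threading

errors = []


def install_handler():
    # Registering a signal handler outside the main thread raises ValueError.
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
    except ValueError as exc:
        errors.append(str(exc))


worker = threading.Thread(target=install_handler)
worker.start()
worker.join()
print(errors)
```

The exact wording of the message varies a little between Python versions, but it always mentions the main thread.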
After some reading, I found that, according to the Python docs, sys.exit simply raises a SystemExit exception. That means we can intercept it and do some cleanup using a finally block, like this:
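    from django.core.management.base import BaseCommand


    class Command(BaseCommand):

        def run_from_argv(self, argv):
            # Keep the raw command line so handle() can pass it on to Scrapy.
            self._argv = argv
            self.execute()

        def handle(self, *args, **options):
            from scrapy.cmdline import execute
            try:
                execute(self._argv[1:-1])
            except AttributeError:
                # when running from django-chronograph
                execute(list(args))
            finally:
                # Returning here discards the pending SystemExit,
                # which lets django-chronograph do some cleanup.
                return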
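If you want to check this behaviour without Scrapy or Django, here is a small standalone sketch. fake_execute is a made-up stand-in for scrapy.cmdline.execute that does nothing except call sys.exit, and run_and_survive is a hypothetical helper. It catches SystemExit explicitly instead of returning from finally, which is an alternative I find a bit easier to read because other exceptions still propagate.

```python
import sys


def fake_execute(argv):
    # Stand-in for scrapy.cmdline.execute: it ends by calling sys.exit.
    sys.exit(0)


def run_and_survive(func, argv):
    """Call func(argv); if it raises SystemExit, return the exit code."""
    try:
        func(argv)
    except SystemExit as exc:
        return exc.code
    return None


exit_code = run_and_survive(fake_execute, ["scrape", "crawl", "myspider"])
still_running = True  # reached only because SystemExit was caught
```

The caller keeps running after the call, which is what chronograph needs in order to save its logs and reset the job status.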
And if you're running django-chronograph's cron job as root, you have to create a symlink to scrapy.cfg in the root directory, pointing to the file in your project folder. This enables Scrapy to locate your crawler settings.
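On the server this is normally a one-off ln -s, but the same thing can be sketched in Python. Everything here is illustrative: link_scrapy_cfg is a hypothetical helper, the temporary directories stand in for your project folder and for the directory the cron job starts in, and myproject.settings is a placeholder module name.

```python
import os
import tempfile


def link_scrapy_cfg(project_dir, run_dir):
    """Create run_dir/scrapy.cfg as a symlink to project_dir/scrapy.cfg."""
    source = os.path.join(project_dir, "scrapy.cfg")
    link = os.path.join(run_dir, "scrapy.cfg")
    if not os.path.lexists(link):
        os.symlink(source, link)
    return link


# Temporary directories stand in for the real paths.
project_dir = tempfile.mkdtemp()
run_dir = tempfile.mkdtemp()
with open(os.path.join(project_dir, "scrapy.cfg"), "w") as cfg:
    cfg.write("[settings]\ndefault = myproject.settings\n")

link = link_scrapy_cfg(project_dir, run_dir)
```

Reading the file through the link returns the project's settings entry, so a process started in run_dir sees the same scrapy.cfg as one started in the project folder.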