Algolia integration by gabfr · Pull Request #7 · gabfr/work-around-the-world · GitHub

Algolia integration #7


Status: Draft · wants to merge 3 commits into base: master
1 change: 1 addition & 0 deletions README.md
@@ -245,6 +245,7 @@ this is done with a few clicks on the AWS dashboard.
- [X] Fix the angel.co parser; the salary_* fields are not being parsed as they should be: they come with the currency symbol,
which has to be stripped so the value can be cast to double
- [X] Somehow identify the currency that is being used to describe the salary of the job (maybe with the currency symbol itself?)
- [X] Create a DAG to load the crawled `jobs` to the Algolia Search Provider, a free API to query those jobs
- [ ] Test and make the Algolia DAG work
- [ ] Create a simple web application to navigate/search in the data of these crawled jobs
- [ ] Normalize the job location information
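The Algolia-loading DAG in this PR maps each crawled `jobs` row to a search record. Algolia requires a unique `objectID` per record, and date attributes are easiest to filter on as Unix epoch numbers, which is why the diff below wraps `published_at`/`expires_at` in `EXTRACT(epoch ...)`. A minimal sketch of the same record preparation on the Python side (field names here are illustrative, mirroring the `jobs` schema):

```python
from datetime import datetime

def prepare_job_record(row):
    """Turn a DB row (as a dict) into an Algolia-ready record:
    a string objectID plus epoch timestamps for date filtering.
    Assumes timezone-aware datetimes."""
    record = dict(row)
    record["objectID"] = str(record["id"])  # Algolia's required unique key
    for field in ("published_at", "expires_at"):
        value = record.get(field)
        if isinstance(value, datetime):
            record[field] = int(value.timestamp())
    return record
```

Doing this in SQL (as the DAG does) or in Python is equivalent; the point is that Algolia records should carry numeric timestamps rather than `datetime` objects.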
23 changes: 11 additions & 12 deletions dags/algoliasearch_index_jobs_dag.py
@@ -7,20 +7,20 @@
from airflow.hooks.postgres_hook import PostgresHook
from algoliasearch.search_client import SearchClient
from airflow.sensors.external_task_sensor import ExternalTaskSensor
import psycopg2.extras  # RealDictCursor lives in psycopg2.extras

def index_jobs(**context):
global algolia_conn_id

pgsql = PostgresHook(postgres_conn_id="pgsql")
cur = pgsql.get_cursor()
pgsqlHook = PostgresHook(postgres_conn_id="pgsql")
pgsql = pgsqlHook.get_conn()
cur = pgsql.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

algolia_conn = BaseHook.get_connection('algolia')
client = SearchClient.create(algolia_conn.login, algolia_conn.password)
index = client.init_index('jobs')

jobs_sql_query = """
SELECT
j.id AS objectID,
j.id AS "objectID",
j.provider_id AS provider_id,
j.remote_id_on_provider AS remote_id_on_provider,
j.remote_url AS remote_url,
@@ -35,17 +35,16 @@ def index_jobs(**context):
j.salary_max AS salary_max,
j.salary_frequency AS salary_frequency,
j.has_relocation_package AS has_relocation_package,
j.expires_at AS expires_at,
j.published_at AS published_at,
EXTRACT(epoch from j.expires_at) AS expires_at,
EXTRACT(epoch from j.published_at) AS published_at,
c.id AS child_company_id,
c.name AS child_company_name,
c.remote_url AS child_company_remote_url,
c.remote_url AS child_company_remote_url
FROM job_vacancies j
LEFT JOIN companies c ON (c.id = j.company_id)
WHERE
CAST(j.published_at AS DATE) = '{}'::DATE
""".format(context['execution_date'])

"""#.format(context['execution_date'])
# WHERE
# CAST(j.published_at AS DATE) = '{}'::DATE
cur.execute(jobs_sql_query)
rows = cur.fetchall()
index.save_objects(rows)
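The switch above from `get_cursor()` to `cursor(cursor_factory=psycopg2.extras.RealDictCursor)` matters because `index.save_objects` expects a list of dicts, while a default cursor's `fetchall()` returns plain tuples. What `RealDictCursor` does can be sketched manually without a database (column names here are illustrative; with a real cursor they come from `cur.description`):

```python
def rows_to_dicts(column_names, tuple_rows):
    """Mimic psycopg2's RealDictCursor: pair each tuple row with
    the column names so every row becomes a dict keyed by column."""
    return [dict(zip(column_names, row)) for row in tuple_rows]

# With a real psycopg2 cursor the names would be:
#   columns = [col.name for col in cur.description]
```

For example, `rows_to_dicts(["objectID", "title"], [("1", "Backend Dev")])` yields `[{"objectID": "1", "title": "Backend Dev"}]`, which is the shape Algolia's `save_objects` consumes; using `RealDictCursor` simply gets that shape straight from the database.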