@ziadhany
Collaborator

@ziadhany ziadhany commented Jan 8, 2026

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
@ziadhany
Collaborator Author

ziadhany commented Jan 8, 2026

INFO 2026-01-26 19:15:30.575619 UTC Pipeline [AlpineLinuxImporterPipeline] starting
INFO 2026-01-26 19:15:30.575748 UTC Step [collect_and_store_advisories] starting
Importing data using alpine_linux_importer_v2
INFO 2026-01-26 22:39:08.084020 UTC Successfully collected 108,252 advisories
INFO 2026-01-26 22:39:08.084139 UTC Step [collect_and_store_advisories] completed in 12218 seconds (3.4 hours)
INFO 2026-01-26 22:39:08.084171 UTC Pipeline completed in 12218 seconds (3.4 hours)

from vulnerabilities.models import AdvisoryV2
from django.db.models import Count
duplicates = (
    AdvisoryV2.objects
    .values('avid')
    .annotate(count=Count('id'))
    .filter(count__gt=1)
)
len(duplicates)
Out[2]: 0
AdvisoryV2.objects.count()
Out[3]: 108252

…aseImporterPipelineV2

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
@ziadhany
Collaborator Author

ziadhany commented Jan 15, 2026

@TG1999 @pombredanne I have a question about the Alpine migration. Currently we fetch one URL at a time and process its data without grouping by CVE.

The problem is that each URL lists package versions together with the CVEs they fix, so the same CVE appears under several URLs. How can we obtain a unique identifier for this importer? Would it be a good idea to restructure the data into one large mapping, using the CVE as the unique identifier?

Proposed structure:
CVE: [purl_1, purl_2, ...]

Example:
Package: aom

Sources:
https://secdb.alpinelinux.org/v3.22/main.json -> CVEs: "CVE-2021-30473", "CVE-2021-30474", "CVE-2021-30475"
https://secdb.alpinelinux.org/v3.21/main.json -> CVEs: "CVE-2021-30473", "CVE-2021-30474", "CVE-2021-30475"
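
To make the proposed structure concrete, here is a minimal sketch of the CVE-keyed mapping built from the two sources in the example. The helper name `group_purls_by_cve` and the versionless purl strings are assumptions for illustration only; the fixed package version is left out because the example above does not list it, and this is not the importer's actual code.

```python
from collections import defaultdict

# Records taken from the example above: (package, distroversion, CVE ids).
# The fixed package version is omitted because the example does not give it.
records = [
    ("aom", "v3.22", ["CVE-2021-30473", "CVE-2021-30474", "CVE-2021-30475"]),
    ("aom", "v3.21", ["CVE-2021-30473", "CVE-2021-30474", "CVE-2021-30475"]),
]


def group_purls_by_cve(records):
    """Return a mapping of CVE id -> list of versionless purl strings.

    Hypothetical helper: the purl strings are assembled by hand here
    rather than with a purl library, purely for illustration.
    """
    purls_by_cve = defaultdict(list)
    for name, distroversion, cves in records:
        purl = f"pkg:apk/alpine/{name}?distroversion={distroversion}"
        for cve in cves:
            purls_by_cve[cve].append(purl)
    return dict(purls_by_cve)


purls_by_cve = group_purls_by_cve(records)
for cve, purls in purls_by_cve.items():
    print(cve, purls)
```

With this shape each CVE appears once as a key, at the cost of holding every source file in memory before any advisory can be emitted.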

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
)

for cve in aliases:
    advisory_id = f"{pkg_infos['name']}/{qualifiers['distroversion']}/{cve}"
Collaborator Author

@ziadhany ziadhany Jan 26, 2026

Example:

apache2/v3.20/2.4.26-r0/CVE-2017-7668
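
As a rough sketch of how an id in this four-part shape could be assembled and checked for collisions before storing, consider the following. The function names `build_advisory_id` and `find_duplicate_ids` are hypothetical, and the `v3.21` entry is illustrative only; the sketch assumes the id is package name, distro version, fixed version, and CVE joined with `/`, as in the example above.

```python
from collections import Counter


def build_advisory_id(name, distroversion, version, cve):
    """Join the four parts with "/" (hypothetical helper)."""
    return f"{name}/{distroversion}/{version}/{cve}"


def find_duplicate_ids(advisory_ids):
    """Return the sorted ids that occur more than once."""
    counts = Counter(advisory_ids)
    return sorted(aid for aid, count in counts.items() if count > 1)


ids = [
    build_advisory_id("apache2", "v3.20", "2.4.26-r0", "CVE-2017-7668"),
    # Illustrative only: the same fix reported for another distro version.
    build_advisory_id("apache2", "v3.21", "2.4.26-r0", "CVE-2017-7668"),
]
print(ids[0])
print(find_duplicate_ids(ids))
```

Because the distro version is part of the id, the same CVE reported for two Alpine releases yields two distinct ids rather than a collision.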

Fix duplication on advisory_id

Signed-off-by: ziad hany <ziadhany2016@gmail.com>
@ziadhany
Collaborator Author

ziadhany commented Jan 28, 2026

The logs in debug mode:

alpine.zip

@ziadhany ziadhany requested a review from keshav-space January 28, 2026 13:50
Member

@keshav-space keshav-space left a comment

Thanks @ziadhany, see comments below.

Comment on lines +74 to +121
def fetch_advisory_directory_links(
    page_response_content: str,
    base_url: str,
    logger: callable = None,
) -> List[str]:
    """
    Return a list of advisory directory links present in `page_response_content` html string
    """
    index_page = BeautifulSoup(page_response_content, features="lxml")
    alpine_versions = [
        link.text
        for link in index_page.find_all("a")
        if link.text.startswith("v") or link.text.startswith("edge")
    ]

    if not alpine_versions:
        if logger:
            logger(
                f"No versions found in {base_url!r}",
                level=logging.DEBUG,
            )
        return []

    advisory_directory_links = [urljoin(base_url, version) for version in alpine_versions]

    return advisory_directory_links


def fetch_advisory_links(
    advisory_directory_page: str,
    advisory_directory_link: str,
    logger: callable = None,
) -> Iterable[str]:
    """
    Yield json file urls present in `advisory_directory_page`
    """
    advisory_directory_page = BeautifulSoup(advisory_directory_page, features="lxml")
    anchor_tags = advisory_directory_page.find_all("a")
    if not anchor_tags:
        if logger:
            logger(
                f"No anchor tags found in {advisory_directory_link!r}",
                level=logging.DEBUG,
            )
        return iter([])
    for anchor_tag in anchor_tags:
        if anchor_tag.text.endswith("json"):
            yield urljoin(advisory_directory_link, anchor_tag.text)
Member

@ziadhany this is a bit brittle. I've created a mirror for Alpine secdb here: https://github.com/aboutcode-org/aboutcode-mirror-alpine-secdb. Let's use this instead.

Collaborator Author

OK, I’ll update the code. I didn’t notice that we have a mirror.

        return (cls.collect_and_store_advisories,)

    def advisories_count(self) -> int:
        return 0
Member

Let's return the count based on the `packages` key.
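
One possible reading of this suggestion, as a minimal sketch: count the CVE entries listed under each package in an already-parsed secdb document. The JSON layout assumed here (`packages` -> `pkg` -> `secfixes`, mapping a fixed version to a list of CVE ids) and the function name `count_advisories` are assumptions, not confirmed by this thread; the sample entry reuses the apache2 example from the earlier comment.

```python
def count_advisories(secdb: dict) -> int:
    """Count CVE entries under the "packages" key of a parsed secdb file.

    Assumed layout (not confirmed in this thread):
    {"packages": [{"pkg": {"name": ..., "secfixes": {version: [cve, ...]}}}]}
    """
    total = 0
    for package in secdb.get("packages", []):
        secfixes = package.get("pkg", {}).get("secfixes", {})
        for fixed_cves in secfixes.values():
            total += len(fixed_cves)
    return total


# Sample document built from the apache2 example earlier in the thread.
sample = {
    "distroversion": "v3.20",
    "packages": [
        {"pkg": {"name": "apache2", "secfixes": {"2.4.26-r0": ["CVE-2017-7668"]}}},
    ],
}
print(count_advisories(sample))
```

Counting one advisory per (package, fixed version, CVE) matches the four-part advisory id shown earlier; if the pipeline ends up grouping differently, the inner loop would need to change accordingly.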
