| Jobs | Files | Run time |
|---|---|---|
| 3 | 19 | 3s |
push
github
Add load from pdf component (#765)

Fixes https://github.com/ml6team/fondant-use-cases/issues/54

This PR adds the functionality to load PDF documents from different local and remote storage. The implementation differs from the solution suggested in [#54](https://github.com/ml6team/fondant-use-cases/issues/54) because:

* Accumulating different loaders and loading each document individually seems inefficient, since it would require initializing a client, temp storage, ... on every invocation ([link](https://github.com/langchain-ai/langchain/blob/04caf07de/libs/community/langchain_community/document_loaders/gcs_file.py#L62))
* The langchain cloud loaders don't have a unified interface
* Each would require specific arguments to be passed (in contrast, fsspec is much simpler)
* Only the Google loader allows defining a custom loader class; the rest use the `Unstructured` loader, which requires many system and CUDA dependencies to be installed (a lot of overhead for just loading PDFs)

The current implementation copies the PDFs to a temporary local storage and then loads them lazily using the `PyPDFDirectoryLoader`. The assumption for now is that the loaded docs won't exceed the storage of the device, which should be valid for most use cases. Later on, we can think about how to optimize this further.
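The copy-then-lazy-load pattern described in the commit message can be illustrated with a minimal, standard-library-only sketch. This is not the component's actual code: `lazy_load_pdfs` and `read_document` are hypothetical names, `shutil.copy` stands in for the fsspec transfer from local or remote storage, and the `read_document` callable stands in for the parsing done by `PyPDFDirectoryLoader`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator


def lazy_load_pdfs(
    sources: Iterable[Path],
    read_document: Callable[[Path], str],
) -> Iterator[str]:
    """Copy all PDFs into one temporary directory, then yield documents lazily.

    Hypothetical helper: `shutil.copy` is a stand-in for an fsspec download,
    and `read_document` is a stand-in for a PDF parser.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        # Stage every file once, so setup cost is paid once, not per document.
        # Assumes file names are unique; same-named sources would overwrite.
        for source in sources:
            shutil.copy(source, tmp_path / source.name)
        # Parse one file per iteration, in file-name order, so only the
        # document currently being consumed is held in memory.
        for local_file in sorted(tmp_path.glob("*.pdf")):
            yield read_document(local_file)
```

Because the function is a generator, nothing is copied until the first document is requested, and the temporary directory is removed once the generator is exhausted or closed. As in the commit message, the sketch assumes all staged files fit on local disk.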
1768 of 1923 relevant lines covered (91.94%)
2.75 hits per line
| ID | Job ID | Ran | Files | Coverage (%) |
|---|---|---|---|---|
| 1 | test-3.10 - 7486960489.1 | | 0 | 91.84 |
| 2 | test-3.9 - 7486960489.2 | | 0 | 91.83 |
| 3 | test-3.8 - 7486960489.3 | | 0 | 91.93 |