IOError: [Errno 24] Too many open files: '/tmp/ftrack_api_schema_cache.json'
3 3

32 posts in this topic

Hi,

we have a python script which runs in a loop and creates a new ftrack session every few minutes. After a few hours the script crashes with the error: IOError: [Errno 24] Too many open files: '/tmp/ftrack_api_schema_cache.json'.

Is it because the session is not clouded properly? 

If we don't create a new session and the ftrack server is restarted, it will recconect the session to the server?

Can we keep a session open for days/weeks/months?

Share this post


Link to post
Share on other sites

Hi,

That is odd. I haven't seen anything like this before.

When the API session is started we load cached schemas from disk. The file should be closed after reading its contents. We also catch IOError if there is any issues with reading the file, so I am not sure what is causing the issue. Do you have any more logging statements or stack trace when the error occurs? You can send them to support@ftrack.com if you don't want to paste them here.

It should in general be possible to keep a session for a long period of time. You might experience issues with cached values with run-longing sessions depending on how it is being used.

Regards,
Lucas

Share this post


Link to post
Share on other sites

Hi Lucas,

thanks for the answer.

The stack trace when the error occurs:

Exception in thread Thread-1005:
Traceback (most recent call last):
File ".../python/2.7/centos-7_x86-64/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File ".../python/2.7/centos-7_x86-64/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File ".../python-packages/ftrack-python-api/1.0.3/centos-7_x86-64/lib/python2.7/site-packages/ftrack_python_api-1.0.3-py2.7.egg/ftrack_api/event/hub.py", line 178, in connect
.format(self.get_server_url())
EventHubConnectionError: Failed to connect to event server at http://ftrack
ftrack_api.session.Session> ERROR | 2017/02/01 11:41:15.738 (MainThread)|Schema cache could not be loaded from '/tmp/ftrack_api_schema_cache.json'
File ".../python/2.7/centos-7_x86-64/lib/python2.7/logging/__init__.py", line 1192, in exception
File ".../python-packages/ftrack-python-api/1.0.3/centos-7_x86-64/lib/python2.7/site-packages/ftrack_python_api-1.0.3-py2.7.egg/ftrack_api/session.py", line 1316, in _load_schemas
File ".../python-packages/ftrack-python-api/1.0.3/centos-7_x86-64/lib/python2.7/site-packages/ftrack_python_api-1.0.3-py2.7.egg/ftrack_api/session.py", line 254, in __init__
File ".../event.py", line 192, in run
Traceback (most recent call last):
File ".../python-packages/ftrack-python-api/1.0.3/centos-7_x86-64/lib/python2.7/site-packages/ftrack_python_api-1.0.3-py2.7.egg/ftrack_api/session.py", line 1309, in _load_schemas
File ".../python-packages/ftrack-python-api/1.0.3/centos-7_x86-64/lib/python2.7/site-packages/ftrack_python_api-1.0.3-py2.7.egg/ftrack_api/session.py", line 1270, in _read_schemas_from_cache
IOError: [Errno 24] Too many open files: '/tmp/ftrack_api_schema_cache.json'
ftrack_api.session.Session> ERROR | 2017/02/01 11:41:15.915 (MainThread)|Failed to update schema cache '/tmp/ftrack_api_schema_cache.json'.
 

At the moment we set in the session schema_cache_path key argument to False.

Share this post


Link to post
Share on other sites

Hi Lucas,

I found the real problem at the end.

All the time when a new ftrack session is created, a new sub-process is created. Even if we delete the session object with the 'del()' python command, the sub-process still exists until the running script terminates. If the script is used as a daemon, at some point there are hundreds of sub-processes and it crashes with "too many open files". Even if we don't use the cache file, at some point we are running in other problems. 

#>pstree -p | grep python
307:        |                      |-bash(27567)---python(30190)-+-{python}(30217)
308:        |                      |                                                     |-{python}(30226)
309:        |                      |                                                     |-{python}(30233)
310:        |                      |                                                     |-{python}(30609)
311:        |                      |                                                     |-{python}(30611)
312:        |                      |                                                    |-{python}(30613)
313:        |                      |                                                    |-{python}(30616)
314:        |                      |                                                     |-{python}(30620)
315:        |                      |                                                    |-{python}(30773)
316:        |                      |                                                    |-{python}(30775)
317:        |                      |                                                     |-{python}(30778)
318:        |                      |                                                     |-{python}(30784)
319:        |                      |                                                     |-{python}(30908)
320:        |                      |                                                     |-{python}(31190)
321:        |                      |                                                    |-{python}(31271)
322:        |                      |                                                     |-{python}(31275)
323:        |                      |                                                    |-{python}(31277)
324:        |                      |                                                     |-{python}(31324)
325:        |                      |                                                    |-{python}(31365)
326:        |                      |                                                     |-{python}(31368)
327:        |                      |                                                     |-{python}(31370)
328:        |                      |                                                     |-{python}(31372)
329:        |                      |                                                     |-{python}(31374)
330:        |                      |                                                    |-{python}(31376)
331:        |                      |                                                    |-{python}(31381)
332:        |                      |                                                     |-{python}(31446)
333:        |                      |                                                    |-{python}(31448)
334:        |                      |                                                     |-{python}(31450)
335:        |                      |                                                     |-{python}(31453)
336:        |                      |                                                     |-{python}(31455)
337:        |                      |                                                     |-{python}(31457)
338:        |                      |                                                     |-{python}(31568)
339:        |                      |                                                    |-{python}(31570)
340:        |                      |                                                     |-{python}(31572)
341:        |                      |                                                     |-{python}(31575)
342:        |                      |                                                     |-{python}(31577)
343:        |                      |                                                     |-{python}(31581)
344:        |                      |                                                     |-{python}(31950)
345:        |                      |                                                    |-{python}(31953)
346:        |                      |                                                     |-{python}(31956)
347:        |                      |                                                     |-{python}(31958)
348:        |                      |                                                    |-{python}(31961)
349:        |                      |                                                     |-{python}(31968)
350:        |                      |                                                     |-{python}(32051)
351:        |                      |                                                    |-{python}(32053)
352:        |                      |                                                     `-{python}(32083)

Share this post


Link to post
Share on other sites

Hi Lucas,

now we have a different problem. If we don't recreate the session, instead we clear the cache with session.cache.clear() and query the object again, we got the following error message randomly:

int() argument must be a string or a number, not 'Symbol'

Do you know way this could happen?

Unfortunately, we can not check everywhere where we fetch the data if the return value is not a Symbol.

Best,

Remus

Share this post


Link to post
Share on other sites
On 14/02/2017 at 3:45 PM, Remus Avram said:

now we have a different problem. If we don't recreate the session, instead we clear the cache with session.cache.clear() and query the object again, we got the following error message randomly:

int() argument must be a string or a number, not 'Symbol'

Do you know way this could happen?

Unfortunately, we can not check everywhere where we fetch the data if the return value is not a Symbol.

Could you come up with a small script that reproduces this issue - or is it completely random? It always helps when debugging things. 

Share this post


Link to post
Share on other sites

Unfortunately it's completely random. It's hard to debug it from our side as well.

I will describe a bit in which context we are getting the issue. Fortunately, our Event System is up and running. We have a listener action which spool all the events in an internal DB, and a dispatcher which gets event by event and process it. The process is happening almost instant.

Workflow example:

create a shot in Ftrack -> Ftrack event generated -> listener spool event to custom DB -> dispatcher gets event -> if conditions pass run action -> ftrack_session.query("Shot where id is <ID>".one()

Share this post


Link to post
Share on other sites
32 minutes ago, Lucas Correia said:

Are you using components or locations anywhere in your code? They need to be rediscovered if the cache is cleared. This is done automatically by Session.reset().

We don't use components or locations in the script where the error is raising. But I will reset the session after clearing the cache as well. We clean the cache everytime before processing the event.

 

39 minutes ago, Mattias Lagergren said:

@Remus Avram are you doing any threading where you are sharing same the Session between threads?

No threading at the moment. The session is not share.

Is it possible that the event is generated before the changes in the UI are fully saved?  

Share this post


Link to post
Share on other sites
23 hours ago, Remus Avram said:

Is it possible that the event is generated before the changes in the UI are fully saved?  

No, those events are emitted after the database transaction has completed. So should not be an issue if you do not have a some master/slave replication setup where slave may have old data.

Quote

Unfortunately it's completely random. It's hard to debug it from our side as well.

Sorry if I've already asked this but are there are any particular attributes or relations where this happens? E.g. custom attributes has recently been mentioned by another client as problematic

Share this post


Link to post
Share on other sites
25 minutes ago, Mattias Lagergren said:

Sorry if I've already asked this but are there are any particular attributes or relations where this happens? E.g. custom attributes has recently been mentioned by another client as problematic

Yes, this happens with the custom attributes.

Share this post


Link to post
Share on other sites
On 17/02/2017 at 11:13 AM, Remus Avram said:

Yes, this happens with the custom attributes.

We're just about to release a major update on the backend for custom attributes that among other things will allow you to filter on custom attribute and also should improve the performance on them. As part of this we will change some of the implementation on the python client side.

I'm not 100% but I suspect that these changes will solve this as well, as we change from the "special" ftrack_api.attribute.CustomAttributeCollectionAttribute to the one we use for metadata (ftrack_api.attribute.KeyValueMappedCollectionAttribute).

Share this post


Link to post
Share on other sites

We did more investigations in order to find what exactly the problem is with the Symbol(NOT_SET).

When it returns a ftrack_api.symbol.Symbol type instead of the right value of the custom attribute of a ftrack entity,  we tried:

 - clearing the cache:  session.cache.clear()

 - resetting the session: session.reset()

 - getting the ftrack entity again: ftrack_entity = ftrack_session.get('TypedContext', ftrack_entity['id'])

 - querying again the DB: ftrack_entity = ftrack_session.query('{0} where id is {1}'.format(ftrack_entity.entity_type, ftrack_entity['id'])

None of the above commands fix the problem. The only solution is to create a new session. 

If we recreate a new session, then the old session is not closed until the daemon (python scrip) finish running. In this case, at some point is crashing because of hundred of python sub-processes still running.

Is there any workaround for this?

 

Share this post


Link to post
Share on other sites

I see, there are some substantial changes in 3.5.0 to custom attributes - hopefully the issue will be resolved then. I've contacted you in support-case #30782 about this.

31 minutes ago, Remus Avram said:

If we recreate a new session, then the old session is not closed until the daemon (python scrip) finish running. In this case, at some point is crashing because of hundred of python sub-processes still running.

This would be interesting to solve though. We keep the requests session as _request on the ftrack session object:

session._request

Perhaps calling close on it before destroying the session helps? We also have the event hub, maybe worth trying to disconnect it.

session.event_hub.disconnect()

If those helps and you can confirm it we could introduce something like: session.close() to disconnect everything.

Share this post


Link to post
Share on other sites
16 minutes ago, Mattias Lagergren said:

Perhaps calling close on it before destroying the session helps? We also have the event hub, maybe worth trying to disconnect it.


session.event_hub.disconnect()

 

This helps! Thanks!

Still it would be awesome to have session.close() fore disconnection everything; it's more common.

 

Share this post


Link to post
Share on other sites

For reference, in our local wrapper of the API we have introduced a `close` method that disconnects the event hub and does some cleanup. We have also made the `Session` a context manager so you can do:

with session:
    ...

Seems to work well.

If desired, we can put together a pull request.

Share this post


Link to post
Share on other sites
1 hour ago, Martin Pengelly-Phillips said:

For reference, in our local wrapper of the API we have introduced a `close` method that disconnects the event hub and does some cleanup.

Nice! Thanks! 

1 hour ago, Martin Pengelly-Phillips said:

We have also made the `Session` a context manager so you can do:


with session:
    ...

Seems to work well.

I think this can be used for small scripts. At the moment we are using a session globally.

Share this post


Link to post
Share on other sites

I just rechecked session.event_hub.disconnect() in production and it seems that it doesn't fix the problem with closing the session properly. All the time when I recreate the session (disconnect and create a new session) it start a python subprocess. At the moment it's ok as we receive an Symbol Type Error around three times per day, and we restart the dispatcher ones per week. But it would be nice to be fixed in the future.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
3 3